chronix poster for the poster session fast 2017

1
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data Florian Lautenschlager, 1 Michael Philippsen, 2 Andreas Kumlehn, 2 and Josef Adersberger 1 1 QAware GmbH, Munich, Germany 2 University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data Florian Lautenschlager, 1 Michael Philippsen, 2 Andreas Kumlehn, 2 and Josef Adersberger 1 1 QAware GmbH, Munich, Germany 2 University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen Abstract Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to lo- cate. To detect and correct such anomalies one has to automati- cally collect, store, and analyze the operational data of the run- time behavior, often represented as time series. There are efficient means both to collect and analyze the runtime behavior. But general-purpose time series databases do not focus on the specific needs of anomaly detection. Chronix is a domain specific time series database targeted at anomaly detection in operational data. Detecting Anomalies in Running Software matters Resource consumption: anomalous memory consumption, high CPU usage, . . . Sporadic failure: blocking state, deadlock, dirty read, . . . Security: port scanning activity, short frequent login attempts, . . . Economic or reputation loss. Anomaly Detection Tool Chain for Operational Data Collection Framework Analysis Framework Time Series Database Collects operational data from a running application Asks the database for data and analyzes the data Stores the time series data General Purpose TSDB Brake shoe Resource hog Productivity obstacle Chronix: Domain specific TSDB Domain specific sensors and adaptors Domain specific analysis algorithms and tools Application’s Operational Data Types of operational data: Metrics: scalar values, e.g., rates, runtimes, total hits, counters, … Events: single occurrences, e.g., a user’s login, product order, … Traces: sequences within a software system, e.g., the called methods, … General-purpose TSDBs in Anomaly Detection Requirements Graphite InfluxDB OpenTSDB KairosDB Prometheus Generic data model # G # # G # # Analysis support G # G # # G # G # Lossless long term storage # G # No support for data types = Productivity obstacle No support for analyses = Productivity obstacle + Brake shoe High memory footprint = Performance hog High storage demands = Performance hog Loss of historical data = Brake shoe What makes Chronix domain specific? Option to pre-compute an extra representation of the data Optional timestamp compression for almost-periodic time series Records that meet the needs of the domain Compression technique that suits the domain’s data Underlying multi-dimensional storage Domain specific query language with server-side evaluation Domain specific commissioning of configuration parameters How it works! Example: Almost-periodic time series Timestamp Value Metric Process Host 25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC 25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC 25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC 25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC Optional Pre-compute Extras Timestamp Value Metric Process Host SAX 25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC A 25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC B 25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC C 25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC B B A C B C D C A B Lossless storage that keeps all details as analyses may need them. Programming interface to add extra domain specific "columns". These "columns" speed up anomaly detection queries. Optional Timestamp Compaction Timestamp 25.10 … :01.546 25.10 … :06.718 25.10 … :11.891 25.10 … :16.964 Timestamp 25.10 … :01.546 5.172 5.173 5.073 Timestamp 25.10 … :01.546 5.172 0.001 0.1 Timestamp 25.10 … :01.546 5.172 - - Timestamp 25.10 … :01.546 5.172 space saved space saved Drop diffs below threshold Calculate deltas Compute diffs between them If accumulated drift > threshold store delta Date-Delta-Compaction for almost-periodic time series. Functionally lossless as all relevant details are kept. Degree of inaccuracy is a configuration parameter of Domain Specifc Records Process Host SAX SmartHub QAMUC A SmartHub QAMUC B SmartHub QAMUC C SmartHub QAMUC B 1 Record metric: ingester\time process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: Timestamp Value SAX 25.10.2016 00:00:01.546 218.34 A 5.172 218.37 B - 218.49 C - 218.35 B 1 BLOB chunk & convert 2 Timestamp Value Metric 25….:01.546 218.34 ingester\time 5.172 218.37 ingester\time - 218.49 ingester\time - 218.35 ingester\time 1 1 1 2 2 2 Exploit repetitiveness and bundle "lines" into data chunks. Programming interface for a specifc time series record encoding. Chunk size is a configuration parameter of Domain Specific Compression Record metric: ingester\time process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: 00105e0 e6b0 343b 9 07bc 0804 e7d508040 Record metric: ingester\time process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: Timestamp Value SAX 25.10.2016 00:00:01.546 218.34 A 5.172 218.37 B - 218.49 C - 218.35 B Compressed BLOB serialize & compress Lossless compression techniques minimizes the record size. Domain data often has small increments, recurring patterns, etc. Choice of compression technqiue is a configuration parameter of Multi-Dimensional Storage Timestamp Value Metric Process Host 25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC 25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC 25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC 25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC q=host:QAMUC AND metric:ingester* AND type:[metric OR trace] AND end:NOW-7MONTH Explorative: Users can use the attributes to find a record. Correlating: Queries can use and combine all types. Query Language & Server-Side eval. Basic Graphite InfluxDB OpenTSDB KairosDB Prometheus Chronix distinct × X × × × X integral X × × × × X min/max/sum X X X X X X bottom/top × X × × X X first/last X X × X × X ... ... ... ... ... ... ... nnderivative X X × × × X movavg X X × × × X divide/scale X X × X X X High-level sax [33] × × × × × X fastdtw [38] × × × × × X outlier × × × × × X trend × × × × × X frequency × × × × × X grpsize × × × × × X split × × × × × X query read result process q=metric:ingester* & cf=outlier Basic functions & high-level built-in domain specific functions. Plug-in interface to add functions for server-side evalution. Domain Specific Commissioning 0 10 20 30 40 50 60 70 80 90 0 5 10 50 100 200 1000 DDC threshold in ms Rates in % - Average in ms Inaccuracy Rate Average Deviation Space Reduction 32 34 36 38 40 42 44 46 48 32 64 128 256 512 1024 Chunk size in kBytes Total access time in sec gzip LZ4 Snappy DDC-Threshold: 200 ms. Compression & Chunk Size: gzip + 128 kByte. Easily detect pattern! Fast! Best Values! Select the ideal Compression! Projects 1–3 Remove Jitter! Evaluation Benchmark Client Ubuntu 16.04.1 x64, 12 core, 32 GB Ram, 512 GB SSD Java Benchmark Benchmark Server Ubuntu 16.04.1 x64, 12 core, 32 GB Ram, 512 GB SSD Docker InfluxDB KairosDB Chronix OpenTSDB Queries Time Series HTTP Data of 5 Industry Projects. Project 1 2 3 4 5 total time series 1,080 8,567 4,538 500 24,055 38,740 pairs (mio) metric 2.4 331.4 162.6 3.9 3,762.3 4,262.6 lsof 0.0 0.0 0.0 0.4 0.0 0.4 strace 0.0 0.0 0.0 12.1 0.0 12.1 (a) Pairs and time series per project. Project r 0.5 1 7 14 21 28 56 91 180 1–3 q 15 30 30 10 5 3 1 2 0 96 q 2 11 15 8 12 5 1 2 2 58 4&5 b 1 6 5 7 2 4 4 1 2 32 h 2 6 10 8 6 6 3 2 0 43 (b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval, and of queries with basic (b) and high-level (h) functions. Memory footprint (in MBytes) InfluxDB OpenTSDB KairosDB Chronix Initially 33 2,726 8,763 446 Import (max) 10,336 10,111 18,905 7,002 Query (max) 8,269 9,712 11,230 4,792 Chronix has a 34%–69% smaller memory footprint. Storage demands (in GBytes) Project Raw Data InfluxDB OpenTSDB KairosDB Chronix 4 1.2 0.2 0.2 0.3 0.1 5 107.0 10.7 16.9 26.5 8.6 total 108.2 10.9 17.1 26.8 8.7 Chronix saves 20%–68% of the storage space. Data retrieval times (in s) r q InfluxDB OpenTSDB KairosDB Chronix 0.5 2 4.3 2.8 4.4 0.9 1 11 5.2 5.6 6.6 5.3 7 15 34.1 17.4 26.8 7.0 14 8 36.2 14.2 25.5 4.0 21 12 76.5 29.8 55.0 6.0 28 5 7.9 3.9 5.6 0.5 56 1 35.4 12.4 24.1 1.2 91 2 47.5 15.5 33.8 1.1 180 2 96.7 36.7 66.6 1.1 total 343.8 138.3 248.4 27.1 Chronix saves 80%–92% on data retrieval time. Times for b- and h-queries (in s) Basic (b) InfluxDB OpenTSDB KairosDB Chronix 4 Avg. 0.9 6.1 9.8 4.4 5 Max. 1.3 8.4 9.1 6.0 3 Min. 0.7 2.7 5.3 2.8 3 Dev. 6.7 16.7 21.1 2.3 5 Sum 0.7 6.0 12.0 2.0 4 Count 0.8 5.5 10.5 1.0 8 Perc. 10.2 25.8 34.5 8.6 High-level (h) 12 Outlier 30.7 29.1 117.6 18.9 14 Trend 162.7 50.4 100.6 30.2 11 Freq. 47.3 23.9 45.7 16.3 3 GroupSize 218.9 2927.8 206.3 29.6 3 Split 123.1 2893.9 47.9 37.2 75 total 604.0 5996.3 620.4 159.3 Chronix saves 73%–97% on analysis times. Typical scenario. Important! Needed for exploration! Evaluation: Projects 4&5 Conclusion Chronix exploits the characteristics of the domain in many ways and thus achieves better storage and query results. Chronix is open source. www.chronix.io Acknowledgements This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology as an IuK-grant for the project DfD – Design for Diagnosability.

Upload: florian-lautenschlager

Post on 11-Apr-2017

84 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Chronix Poster for the Poster Session FAST 2017

Chronix: Long Term Storage and Retrieval Technologyfor Anomaly Detection in Operational Data

Florian Lautenschlager,1 Michael Philippsen,2 Andreas Kumlehn,2 and Josef Adersberger11QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen

Chronix: Long Term Storage and Retrieval Technologyfor Anomaly Detection in Operational Data

Florian Lautenschlager,1 Michael Philippsen,2 Andreas Kumlehn,2 and Josef Adersberger11QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen

AbstractAnomalies in the runtime behavior of software systems, especiallyin distributed systems, are inevitable, expensive, and hard to lo-cate. To detect and correct such anomalies one has to automati-cally collect, store, and analyze the operational data of the run-time behavior, often represented as time series. There are efficientmeans both to collect and analyze the runtime behavior. Butgeneral-purpose time series databases do not focuson the specific needs of anomaly detection. Chronixis a domain specific time series database targeted atanomaly detection in operational data.

Detecting Anomalies in Running Software matters

•Resource consumption: anomalousmemory consumption, high CPU usage, . . .•Sporadic failure: blocking state,deadlock, dirty read, . . .•Security: port scanning activity,short frequent login attempts, . . .

↪→ Economic or reputation loss.

Anomaly Detection Tool Chain for Operational Data

Collection

Framework

Analysis

FrameworkTime Series Database

Collects operational data

from a running application

Asks the database for data

and analyzes the dataStores the time series data

General Purpose TSDB

• Brake shoe

• Resource hog

• Productivity obstacle

Chronix:

Domain specific TSDB

Domain specific sensors

and adaptors

Domain specific analysis

algorithms and tools

Application’s

Operational

Data

Types of operational data:

• Metrics: scalar values, e.g.,

rates, runtimes, total hits,

counters, …

• Events: single occurrences,

e.g., a user’s login, product

order, …

• Traces: sequences within a

software system, e.g., the

called methods, …

General-purpose TSDBs in Anomaly Detection

Requirements

Graphite

InfluxD

BOpenT

SDB

KairosDB

Prom

etheus

Genericdata model # G# # G# #

Analysissupport G# G# # G# G#

Lossless longterm storage # G#

No support for data types= Productivity obstacle

No support for analyses= Productivity obstacle + Brake shoe

High memory footprint= Performance hogHigh storage demands= Performance hogLoss of historical data= Brake shoe

What makes Chronix domain specific?

Option to pre-compute an extra representation of the data

Optional timestamp compression for almost-periodic time series

Records that meet the needs of the domain

Compression technique that suits the domain’s data

Underlying multi-dimensional storage

Domain specific query language with server-side evaluation

Domain specific commissioning of configuration parameters

How it works!

Example: Almost-periodic time series

Timestamp Value Metric Process Host

25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC

25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC

25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC

25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC

… … … … …

Optional Pre-compute Extras

Timestamp Value Metric Process Host SAX

25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC A

25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC B

25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC C

25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC B

… … … … … …

B

A

C

B

C

D

C

A

B

• Lossless storage that keeps all details as analyses may need them.• Programming interface to add extra domain specific "columns".• These "columns" speed up anomaly detection queries.

Optional Timestamp Compaction

Timestamp

25.10 … :01.546

25.10 … :06.718

25.10 … :11.891

25.10 … :16.964

Timestamp

25.10 … :01.546

5.172

5.173

5.073

Timestamp

25.10 … :01.546

5.172

0.001

0.1

Timestamp

25.10 … :01.546

5.172

-

-

Timestamp

25.10 … :01.546

5.172

space saved

space

saved

Drop diffs

below threshold

Calculate

deltas

Compute

diffs

between

them

If accumulated

drift > threshold store delta

•Date-Delta-Compaction for almost-periodic time series.• Functionally lossless as all relevant details are kept.

• Degree of inaccuracy is a configuration parameter of

Domain Specifc Records

Process Host SAX

SmartHub QAMUC A

SmartHub QAMUC B

SmartHub QAMUC C

SmartHub QAMUC B

… … …

1Record

metric: ingester\time

process: SmartHub

host: QAMUC

start: 25.10.2016 00:00:01.546

end: …

type: metric

data: Timestamp Value SAX

25.10.2016 00:00:01.546 218.34 A

5.172 218.37 B

- 218.49 C

- 218.35 B

1

BLOB

chunk

& convert

2

Timestamp Value Metric

25….:01.546 218.34 ingester\time

5.172 218.37 ingester\time

- 218.49 ingester\time

- 218.35 ingester\time

… … …

1

1 1 2

22

• Exploit repetitiveness and bundle "lines" into data chunks.• Programming interface for a specifc time series record encoding.

• Chunk size is a configuration parameter of

Domain Specific Compression

Record

metric: ingester\time

process: SmartHub

host: QAMUC

start: 25.10.2016 00:00:01.546

end: …

type: metric

data: 00105e0 e6b0 343b 9

07bc 0804 e7d508040

Record

metric: ingester\time

process: SmartHub

host: QAMUC

start: 25.10.2016 00:00:01.546

end: …

type: metric

data: Timestamp Value SAX

25.10.2016 00:00:01.546 218.34 A

5.172 218.37 B

- 218.49 C

- 218.35 B Compressed BLOB

serialize

& compress

• Lossless compression techniques minimizes the record size.• Domain data often has small increments, recurring patterns, etc.

• Choice of compression technqiue is a configuration parameter of

Multi-Dimensional Storage

Timestamp Value Metric Process Host

25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC

25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC

25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC

25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC

… … … … …

q=host:QAMUC AND metric:ingester*AND type:[metric OR trace] AND end:NOW-7MONTH

• Explorative: Users can use the attributes to find a record.•Correlating: Queries can use and combine all types.

Query Language & Server-Side eval.

Basic GraphiteInfluxDB

OpenTSDBKairosDB

PrometheusChronix

distinct × X × × × Xintegral X × × × × X

min/max/sum X X X X X Xbottom/top × X × × X X

first/last X X × X × X. . . . . . . . . . . . . . . . . . . . .

nnderivative X X × × × Xmovavg X X × × × X

divide/scale X X × X X X

High-levelsax [33] × × × × × X

fastdtw [38] × × × × × Xoutlier × × × × × X

trend × × × × × Xfrequency × × × × × X

grpsize × × × × × Xsplit × × × × × X

queryread

resultprocess

q=metric:ingester* & cf=outlier

• Basic functions & high-level built-in domain specific functions.• Plug-in interface to add functions for server-side evalution.

Domain Specific Commissioning

0102030405060708090

0 5 10 50 100 200 1000DDC threshold in ms

Rat

es in

% −

Ave

rage

in m

s

Inaccuracy RateAverage DeviationSpace Reduction

323436384042444648

32 64 128 256 512 1024Chunk size in kBytes

Tota

l acc

ess

time

in s

ec

gzipLZ4Snappy

•DDC-Threshold: 200 ms.•Compression & Chunk Size: gzip + 128 kByte.

Easily detectpattern!

Fast!

Best Values!

Select the

ideal

Compression!

Projects 1–3

Remove

Jitter!

EvaluationBenchmark ClientUbuntu 16.04.1 x64, 12 core, 32 GB Ram,

512 GB SSD

Java

Benchmark

Benchmark ServerUbuntu 16.04.1 x64, 12 core, 32 GB Ram,

512 GB SSD

Docker InfluxDB KairosDB

ChronixOpenTSDB

Queries

Time Series

HTTP

Data of 5 Industry Projects.Project 1 2 3 4 5 total

time series 1,080 8,567 4,538 500 24,055 38,740

pairs

(mio) metric 2.4 331.4 162.6 3.9 3,762.3 4,262.6

lsof 0.0 0.0 0.0 0.4 0.0 0.4strace 0.0 0.0 0.0 12.1 0.0 12.1

(a) Pairs and time series per project.

Project r 0.5 1 7 14 21 28 56 91 1801 – 3 q 15 30 30 10 5 3 1 2 0 96

q 2 11 15 8 12 5 1 2 2 584 & 5 b 1 6 5 7 2 4 4 1 2 32

h 2 6 10 8 6 6 3 2 0 43

(b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval,and of queries with basic (b) and high-level (h) functions.

Memory footprint (in MBytes)

InfluxDB

OpenTSDB

KairosD

B

Chronix

Initially 33 2,726 8,763 446Import (max) 10,336 10,111 18,905 7,002Query (max) 8,269 9,712 11,230 4,792

Chronix has a 34%–69% smaller memory footprint.

Storage demands (in GBytes)

Project RawData

InfluxDB

OpenTSDB

KairosD

B

Chronix

4 1.2 0.2 0.2 0.3 0.15 107.0 10.7 16.9 26.5 8.6

total 108.2 10.9 17.1 26.8 8.7

Chronix saves 20%–68% of the storage space.

Data retrieval times (in s)

r q InfluxDB

OpenTSDB

KairosD

B

Chronix

0.5 2 4.3 2.8 4.4 0.91 11 5.2 5.6 6.6 5.37 15 34.1 17.4 26.8 7.0

14 8 36.2 14.2 25.5 4.021 12 76.5 29.8 55.0 6.028 5 7.9 3.9 5.6 0.556 1 35.4 12.4 24.1 1.291 2 47.5 15.5 33.8 1.1

180 2 96.7 36.7 66.6 1.1total 343.8 138.3 248.4 27.1

Chronix saves 80%–92% on data retrieval time.

Times for b- and h-queries (in s)

Basic (b) InfluxDB

OpenTSDB

KairosD

B

Chronix

4 Avg. 0.9 6.1 9.8 4.45 Max. 1.3 8.4 9.1 6.03 Min. 0.7 2.7 5.3 2.83 Dev. 6.7 16.7 21.1 2.35 Sum 0.7 6.0 12.0 2.04 Count 0.8 5.5 10.5 1.08 Perc. 10.2 25.8 34.5 8.6High-level (h)

12 Outlier 30.7 29.1 117.6 18.914 Trend 162.7 50.4 100.6 30.211 Freq. 47.3 23.9 45.7 16.33 GroupSize 218.9 2927.8 206.3 29.63 Split 123.1 2893.9 47.9 37.2

75 total 604.0 5996.3 620.4 159.3

Chronix saves 73%–97% on analysis times.

Typical sce

nario.

Impo

rtan

t!

Neede

dfo

r

exp

lora

tio

n!

Evaluatio

n:

Projects

4 & 5

ConclusionChronix exploits the characteristics of the domain in many ways andthus achieves better storage and query results. Chronix is open source. www.chronix.io

AcknowledgementsThis research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy andTechnology as an IuK-grant for the project DfD – Design for Diagnosability.