Target Holding - Big Dikes and Big Data
DESCRIPTION
Slides used during the November 2014 Big Data Groningen meetup (http://meetup.com/big-data-groningen/) about Big Dikes and Big Data.
TRANSCRIPT
Big Dikes and Big Data
12 November 2014, Big Data Groningen Meetup
Frens Jan Rumph, Michiel van der Ree
Target Holding
Big Data to Intelligence
Big Data Analytics is our key competence
– using machine learning and pattern recognition techniques to extract value from large data sets
Founded in 2009 and founding partner of Target
– Dutch Public-Private Cooperation on Big Data, partners including IBM, Oracle, Astron/Lofar, RUG, UMCG
Developed various innovative algorithms and technologies which we apply across multiple markets:
– Energy & Water management
– Media & Entertainment
– Healthy Ageing
– High Tech Systems
Target Holding
Big Data to Intelligence
Collect big data
– Domain specific data
– Web / public data & Social Media
– Sensor data
Enrich the data
– Feature extraction & machine learning
– Classification, ranking, forecasting, segmentation, clustering, natural language processing
Present & visualize
3S timeseries representation
IJkdijk
Field lab Livedijk (XL)
Stichting IJkdijk
● Dijkgraaf: I see something weird at sensor X..
– .. have I seen it before at sensor X? (my talk)
– .. have I seen it before at other sensors? (Frens Jan's talk)
● Query sensor's history by example
● Time series might be..
– .. too big to store
– .. too big to analyze
● Solution: reduced representation
● Seminal techniques:
– Piecewise linear approximation
– Symbolic aggregate approximation
Reduced Time Series Representation
Basic idea:
– Represent time series as a sequence of straight lines
– Lines can be connected (two numbers per line, so N/2 lines from N values) or disconnected (three numbers per line, N/3 lines)
– High compression rates
– Segment as you like, dynamic lengths
Each line segment has: • length • left_height • right_height
Piecewise Linear Approximation
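The piecewise linear idea can be sketched in a few lines of Python. This is a minimal illustration, not the IJkdijk implementation: the fixed-width windows and least-squares line fits are my assumptions; the slides note that segmentation can also use dynamic lengths.

```python
# Sketch: piecewise linear approximation with fixed-length segments.
# Each segment is stored as (length, left_height, right_height),
# matching the disconnected-lines variant (3 numbers per line).

def pla(series, seg_len):
    """Approximate `series` by straight lines over windows of `seg_len` points."""
    segments = []
    for start in range(0, len(series), seg_len):
        window = series[start:start + seg_len]
        n = len(window)
        if n == 1:
            segments.append((1, window[0], window[0]))
            continue
        # least-squares fit of a straight line y = a*x + b over the window
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(window) / n
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, window)) / \
            sum((x - mean_x) ** 2 for x in xs)
        b = mean_y - a * mean_x
        segments.append((n, b, a * (n - 1) + b))  # length, left, right height
    return segments
```

A series of N points compresses to roughly 3 * N / seg_len numbers, which is where the high compression rates come from.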
Basic idea:
– Segment using fixed frame width
– Converts numerical time series into an equivalent symbolic representation
– String analysis techniques can be used for analyzing time series
baabccbc
Symbolic Aggregate Approximation
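The fixed-frame, numeric-to-symbolic conversion can be sketched as follows. This is a toy SAX variant: the 4-letter alphabet and the breakpoints (standard N(0,1) quantiles) are common SAX defaults, not necessarily the settings that produced the "baabccbc" example on the slide.

```python
import statistics

# Sketch of SAX: z-normalize, average over fixed frames (PAA),
# then map each frame average to an alphabet symbol via breakpoints.

def sax(series, frame, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    # z-normalize so breakpoints on the N(0,1) quantiles make sense
    mu = statistics.mean(series)
    sd = statistics.pstdev(series) or 1.0
    norm = [(v - mu) / sd for v in series]
    symbols = []
    for start in range(0, len(norm), frame):
        window = norm[start:start + frame]
        avg = sum(window) / len(window)                # PAA value of the frame
        idx = sum(1 for b in breakpoints if avg > b)   # which alphabet bucket?
        symbols.append(alphabet[idx])
    return "".join(symbols)
```

Once a series is a string, string tooling (substring search, suffix structures, edit distance) applies directly.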
Basic idea:
– A time series is decomposed into monotonic segments of variable lengths
– Each segment is fitted to a monotonic shape and therefore represented as a symbol of an alphabet
– Symbolic (like SAX), but the symbols also capture shape and direction
3S Representation
Segment Symbolic Shape-Based Representation
Storing more than one symbol:
– “String Matching” (Levenshtein, Hamming, etc.) → INFORMATION LOSS!
– Each segment is approximated by:
● μ + σ · f(xₙ) + θ · xₙ
– Physical meaning:
● μ → offset
● σ → amplitude
● θ → linear drift with regard to ...
● f → shape
● N → length, # of data points
3S Representation
(f,μ,σ,θ,N)
Segment Symbolic Shape-Based Representation
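One way to read the (f, μ, σ, θ, N) tuple is as a recipe for rebuilding N data points as μ + θ·x + σ·f(x). The sketch below illustrates that reading with a hypothetical shape library; Target Holding's actual shape set and exact formula are not fully given in the slides.

```python
import math

# Hypothetical shape library: the real 3S shapes are not specified here.
SHAPES = {
    0: lambda x: 0.0,                     # flat
    1: lambda x: math.sin(math.pi * x),   # bump
}

def reconstruct(f, mu, sigma, theta, n):
    """Rebuild the N data points described by one 3S tuple (f, mu, sigma, theta, N)."""
    return [mu + theta * i + sigma * SHAPES[f](i / (n - 1)) for i in range(n)]
```

Storing one such tuple per segment keeps the physical meaning (offset, drift, amplitude, shape) available to queries, which plain string symbols would lose.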
Fast and accurate matching:
– Euclidean distance between segments
– In constant time, i.e. independent of segment length N
● summation of polynomials
– Allows for different invariances:
3S Representation
[(f,μ,σ,θ,N)i]
● μ (offset) ● σ (amplitude) ● N (length) ● θ (drift)
Time Series Retrieval
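The constant-time claim holds because the squared Euclidean distance between two parameterized segments expands into polynomial sums over x plus "shape moments" (Σf, Σf², Σfg, ...) that can be precomputed once per shape pair. A sketch under the μ + θ·x + σ·f(x) reading of the tuple (the exact 3S formula may differ), checked against the direct O(N) computation:

```python
# Squared distance between a(x) = mu_a + th_a*x + si_a*f(x) and
# b(x) = mu_b + th_b*x + si_b*g(x), summed over x = 0..n-1, via
# precomputed shape moments: O(1) per comparison instead of O(n).

def shape_moments(f, g, n):
    """Precompute once per shape pair and length; amortized away in practice."""
    fs = [f(i / (n - 1)) for i in range(n)]
    gs = [g(i / (n - 1)) for i in range(n)]
    return {
        "Sf": sum(fs), "Sg": sum(gs),
        "Sff": sum(v * v for v in fs), "Sgg": sum(v * v for v in gs),
        "Sfg": sum(a * b for a, b in zip(fs, gs)),
        "Sxf": sum(i * v for i, v in enumerate(fs)),
        "Sxg": sum(i * v for i, v in enumerate(gs)),
    }

def sq_dist_closed_form(pa, pb, n, m):
    """pa, pb: (mu, theta, sigma) of the two segments; m: shape moments."""
    (mu_a, th_a, si_a), (mu_b, th_b, si_b) = pa, pb
    dmu, dth = mu_a - mu_b, th_a - th_b
    Sx = n * (n - 1) // 2                  # sum of x
    Sxx = (n - 1) * n * (2 * n - 1) // 6   # sum of x^2 (summation of polynomials)
    return (n * dmu * dmu + 2 * dmu * dth * Sx + dth * dth * Sxx
            + si_a * si_a * m["Sff"] + si_b * si_b * m["Sgg"]
            - 2 * si_a * si_b * m["Sfg"]
            + 2 * si_a * (dmu * m["Sf"] + dth * m["Sxf"])
            - 2 * si_b * (dmu * m["Sg"] + dth * m["Sxg"]))
```

Dropping individual parameters from the expansion is also what makes the different invariances (to μ, σ, N, θ) cheap to offer.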
Fast, flexible and accurate matching using the 3S representation:
But what if you want to search in the history of multiple sensors?
Storage and processing with 3S representation
Use case: search time series by example over the time series of many sensors
● Storage and processing of sensor data hits the limits of 'traditional' database or file system based approaches, at least for 'enough' sensors.
● Technical dive into a distributed architecture:
– with distributed storage: Apache Cassandra
– with distributed processing: Apache Spark
(no guarantees, the ideal architecture highly depends on use case specifics ...)
Distributed Storage
● Distributed storage advantages:
– scalability: more nodes → more storage and I/O capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
● Many solutions available; at Target Holding we extensively use Apache Cassandra (C*) for high volume data
– because, among other things, it scales well, is easy to operate, performs OK on disks (don't need SSDs per se), and allows easier access to data than a file system
– also the storage system of choice within DDSC
Distributed Processing
● Distributed processing advantages:
– scalability: more nodes → more compute capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
– with local processing being CPU bound instead of IO bound
● With C* as a starting point, options are to build our own, use Hadoop M/R, or Apache Spark, which we are investigating
– because of its integration with C*, high level abstraction, rich tool set, stream processing capability
– and it's getting a lot of traction in the Hadoop ecosystem
– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
Spark with Cassandra
● A typical Spark with Cassandra deployment collocates Spark Workers with Cassandra nodes:
image courtesy of DataStax
● Allows data locality: push down filtering, transformations, etc.
Storage and processing with 3S representation
● Distribution based on timeseries identifier, e.g. the sensor id– or something which identifies the location of measurement, …
● Store the tuple <f, μ, σ, θ, N> together with a timestamp– the full 3s timeseries for each sensor must fit completely on one node
● Goal: Find series of segments which are closest to the example (simplified for presentation)
● Approach: Produce a global top-k out of local top-k's (applies also without simplification)
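The local-to-global top-k step can be sketched in plain Python; `heapq` stands in here for what Spark's `takeOrdered` does in the deck's code.

```python
import heapq

# Each node computes a local top-k of (distance, match) pairs; only those
# k-sized lists travel over the network, and the driver merges them into
# the global top-k. Smaller distance is better.

def local_top_k(matches, k):
    """matches: iterable of (distance, match_id) produced on one node."""
    return heapq.nsmallest(k, matches)

def global_top_k(local_tops, k):
    """Merge the per-node top-k lists into one global top-k."""
    return heapq.nsmallest(k, (m for top in local_tops for m in top))
```

The global top-k over all data is always contained in the union of the local top-k's, so no correct match can be lost by this two-phase approach.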
Storage and processing with 3S representation
● Locally find the best matches, then repeat globally– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Group by sensor id
Read segments
possibly restricted by sensor ids and time range
Select best local matches
Take top k ordered by distance
Create sliding window over time series
Calculate distance for each window
Take top k ordered by distance
Zip with example
Calculate Euclidean distance per segment pair
Sum
Parallel distributed execution
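The per-partition part of this flow can be illustrated in plain Python; the `distance` stand-in below replaces the real per-segment 3S distance, and the sensor-id/value tuples are a simplification of the stored segments.

```python
from itertools import groupby

def distance(pair):
    a, b = pair
    return abs(a - b)  # stand-in: real code compares (f, mu, sigma, theta, N) tuples

def best_local_matches(segments, example, k):
    """segments: iterable of (sensor_id, value) sorted by sensor and time.

    Group by sensor id, slide a window of len(example) over each sensor's
    series, score each window against the example, keep the best k.
    """
    scored = []
    for sensor, group in groupby(segments, key=lambda s: s[0]):
        series = [v for _, v in group]
        for start in range(len(series) - len(example) + 1):
            window = series[start:start + len(example)]
            score = sum(distance(p) for p in zip(window, example))
            scored.append((score, sensor, start))
    return sorted(scored)[:k]
```

Because each sensor's full series lives on one node, this whole computation runs locally; only the k best (score, sensor, offset) triples leave the node.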
Storage and processing with 3S representation
[Diagram: parallel distributed execution — Spark Workers co-located with C* nodes; an Application (aka driver) and a Master coordinate the cluster; workers read segments and select best local matches; the driver takes the top k ordered by distance]
Apache Cassandra
● Key value store (with some enhancements)
● Based on Dynamo distribution and Big Table local storage
● Partitioned (distributed) map of
– sorted maps of
● primitives, structs
● maps, lists, sets
● counters (CRDT)
warning …
● personal mental model …
● the truth is in the code …
● caveat emptor
CQL
● The Cassandra Query Language makes working with C* easier
/* 3s timeseries in CQL */
CREATE TABLE symbolic (
    s text,      -- sensor identifier
    t timestamp, -- start of segment
    o float,     -- offset
    a float,     -- amplitude
    d float,     -- drift
    f int,       -- function / shape
    l int,       -- length (# of data points)
    /* partition by sensor identifier, order by timestamp */
    PRIMARY KEY ((s), t)
)
Distribution & Consistency
● Partitioning based on hashed key in conjunction with positions of node tokens.
image courtesy of DataStax
● Consistent replication when R + W > N
(R = # nodes read from, W = # nodes written to, N = replication factor)
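The R + W > N rule says a read quorum must overlap every write quorum, so a read always touches at least one node that saw the latest acknowledged write. A tiny illustration (my sketch, not a Cassandra API):

```python
# Quorum consistency check: any read set of r nodes and write set of
# w nodes out of n replicas must intersect, which holds iff r + w > n.

def is_consistent(r, w, n):
    """r nodes read from, w nodes written to, n = replication factor."""
    return r + w > n

print(is_consistent(2, 2, 3))  # QUORUM reads and writes at N=3 -> True
```

With N=3 this is the classic QUORUM/QUORUM setup; ONE/ONE (1 + 1 = 2, not > 3) trades that guarantee for latency.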
Apache Spark
● Spark is a distributed computing platform with fairly rich primitives operating on distributed data sets.
● Spark can be used with data from different data sources
– HDFS, Cassandra, Elasticsearch to name a few
● Spark has libraries for: SQL, graph processing and machine learning
Operator graphs
● It allows execution of DAGs of operators
– without using disk for intermediary results
– employs pipelining if possible
– (cyclic / iterative data flows are not actually cyclic ...)
Architecture
● Applications allocate CPUs and memory on Worker Nodes, coordinated by a Master
● Applications schedule Jobs which are DAGs of Tasks
● Tasks consume & produce Resilient Distributed Datasets
Expressive 'language'
● Spark is developed in Scala, supports Java and Python.
● I consider Spark expressive
val wordCount = sc
.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
Storage and processing with 3S representation
● Locally find the best matches, then repeat globally– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Group by sensor id
Read segments
possibly restricted by sensor ids and time range
Select best local matches
Take top k ordered by distance
Create sliding window over time series
Calculate distance for each window
Take top k ordered by distance
Zip with example
Calculate Euclidean distance per segment pair
Sum
Parallel distributed execution
Algorithm in Spark
val conf = new SparkConf()
.setAAA(...).setBBB(...) ...setZZZ(...)
val sc = new SparkContext(conf)
val example = sc.broadcast(Array(
new Segment(...), ..., new Segment(...)
))
val k = 10
val segments = sc
.cassandraTable(keyspace, table)
.map(fromRow)
Read segments
Select best local matches
Distance for each window
Algorithm in Spark
val matches = segments.mapPartitions(
_.groupBy(seg => seg.s)
.flatMap({
case (s, segs) =>
segs
.sliding(example.value.length)
.map(w => (
s, w,
w.zip(example.value)
.map(distance)
.map(math.abs)
.sum
))
}))
Group by sensor id
Create sliding window over time series
Zip with example
Calculate euclidean distance per segment pair
Sum
Select best local matches
Algorithm in Spark
val top = matches
.takeOrdered(k)(Ordering.by(_._3))
Take top k ordered by distance
Questions?