Target Holding - Big Dikes and Big Data

Big Dikes and Big Data, 12 November 2014, Big Data Groningen Meetup, Frens Jan Rumph and Michiel van der Ree

Upload: frens-jan-rumph

Post on 05-Jul-2015


DESCRIPTION

Slides used during November 2014 Big Data Groningen meetup (http://meetup.com/big-data-groningen/) about Big Dikes and Big Data.

TRANSCRIPT

Page 1: Target Holding - Big Dikes and Big Data

Big Dikes and Big Data

12 November 2014, Big Data Groningen Meetup

Frens Jan Rumph, Michiel van der Ree

Page 2: Target Holding - Big Dikes and Big Data

Target Holding

Big Data to Intelligence

Big Data Analytics is our key competence

– using machine learning and pattern recognition techniques to extract value from large data sets

Founded in 2009 and founding partner of Target

– Dutch Public-Private Cooperation on Big Data, partners including IBM, Oracle, Astron/Lofar, RUG, UMCG

Developed various innovative algorithms and technologies, which we apply across multiple markets:

– Energy & Water management

– Media & Entertainment

– Healthy Ageing

– High Tech Systems

Page 3: Target Holding - Big Dikes and Big Data

Target Holding

Big Data to Intelligence

Collect big data

– Domain specific data

– Web / public data & Social Media

– Sensor data

Enrich the data

– Feature extraction & machine learning

– Classification, ranking, forecasting, segmentation, clustering, natural language processing

Present & visualize

Page 4: Target Holding - Big Dikes and Big Data

3S timeseries representation

Page 5: Target Holding - Big Dikes and Big Data

IJkdijk

Field lab Livedijk (XL)

Big Dikes and Big Data: Stichting IJkdijk

Page 6: Target Holding - Big Dikes and Big Data

● Dijkgraaf: I see something weird at sensor X..

– .. have I seen it before at sensor X? (my talk)

– .. have I seen it before at other sensors? (Frens Jan's talk)

● Query sensor's history by example

● Time series might be..

– .. too big to store

– .. too big to analyze

● Solution: reduced representation

● Seminal techniques:

– Piecewise linear approximation

– Symbolic aggregate approximation

Big Dikes and Big Data: Reduced Time Series Representation

Page 7: Target Holding - Big Dikes and Big Data

Basic idea:

– Represent time series as a sequence of straight lines

– Lines can be connected (N/2 lines) or disconnected (N/3 lines)

– High compression rates

– Segment as you like, dynamic lengths

Connected: each line segment has • length • left_height

Disconnected: each line segment has • length • left_height • right_height

Big Dikes and Big Data: Piecewise Linear Approximation
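The idea can be sketched in a few lines of Scala. This is a minimal illustration, not the method used in the talk: it assumes disconnected lines, a fixed chunk width instead of dynamic lengths, and it simply keeps the end points of each chunk rather than fitting a line. The names `PlaSketch`, `Line`, `approximate` and `reconstruct` are made up for this example.

```scala
object PlaSketch {
  // One disconnected line segment: number of points covered plus the heights at both ends.
  final case class Line(length: Int, leftHeight: Double, rightHeight: Double)

  // Split the series into chunks of at most `width` points and keep only the end points of each chunk.
  def approximate(series: Seq[Double], width: Int): Seq[Line] = {
    require(width >= 2, "a chunk needs at least two points to form a line")
    series.grouped(width).map(c => Line(c.length, c.head, c.last)).toVector
  }

  // Rebuild an approximate series by linear interpolation within each line.
  def reconstruct(lines: Seq[Line]): Seq[Double] =
    lines.flatMap { l =>
      if (l.length == 1) Seq(l.leftHeight)
      else (0 until l.length).map { i =>
        l.leftHeight + (l.rightHeight - l.leftHeight) * i / (l.length - 1)
      }
    }
}
```

Each line costs three numbers regardless of how many points it covers, so the compression rate grows with the chunk width, at the price of a coarser reconstruction.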

Page 8: Target Holding - Big Dikes and Big Data

Basic idea:

– Segment using fixed frame width

– Converts numerical time series into an equivalent symbolic representation

– String analysis techniques can be used for analyzing time series

baabccbc

Big Dikes and Big Data: Symbolic Aggregate Approximation
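As a rough sketch of how a word such as the one above could be produced: normalise the series, average each fixed-width frame, and map each average to a letter. The three-letter alphabet with breakpoints at ±0.43 follows the commonly used SAX breakpoint table; the object and method names are assumptions made for this example, and the slides do not show the implementation used in the talk.

```scala
object SaxSketch {
  // Breakpoints that cut a standard normal distribution into three roughly equiprobable regions.
  private val breakpoints = Seq(-0.43, 0.43)

  // Z-normalise so that the breakpoints apply regardless of the sensor's scale.
  def normalise(series: Seq[Double]): Seq[Double] = {
    val mean = series.sum / series.length
    val std = math.sqrt(series.map(x => (x - mean) * (x - mean)).sum / series.length)
    if (std == 0.0) series.map(_ => 0.0) else series.map(x => (x - mean) / std)
  }

  // Average each fixed-width frame and map the average to a letter (a = low, c = high).
  def toWord(series: Seq[Double], frameWidth: Int): String =
    normalise(series)
      .grouped(frameWidth)
      .map(frame => frame.sum / frame.length)
      .map(avg => ('a' + breakpoints.count(_ <= avg)).toChar)
      .mkString
}
```

For instance, `SaxSketch.toWord(Seq(1.0, 1.0, 5.0, 5.0, 9.0, 9.0), 2)` yields `"abc"`: a low, a middle and a high frame.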

Page 9: Target Holding - Big Dikes and Big Data

Big Dikes and Big Data: Symbolic Aggregate Approximation

Page 10: Target Holding - Big Dikes and Big Data

Big Dikes and Big Data: Symbolic Aggregate Approximation

Page 11: Target Holding - Big Dikes and Big Data

Basic idea:

– A time series is decomposed into monotonic segments of variable lengths

– Each segment is fitted to a monotonic shape and therefore represented as a symbol of an alphabet

– Symbolic like SAX, but the symbols also capture shape and direction

3S Representation

Big Dikes and Big Data: Segment Symbolic Shape-Based Representation

Page 12: Target Holding - Big Dikes and Big Data

Storing more than one symbol:

– “String Matching” (Levenshtein, Hamming, etc.) → INFORMATION LOSS!

– Each segment is approximated by:

● μ + σ θ f(xn)

– Physical meaning:

● μ → offset,

● σ → amplitude,

● θ → linear drift with regard to...

● f → shape

● N → longitude, # of data points

3S Representation

(f,μ,σ,θ,N)

Big Dikes and Big Data: Segment Symbolic Shape-Based Representation

Page 13: Target Holding - Big Dikes and Big Data

Fast and accurate matching:

– Euclidean distance between segments

– In constant time, i.e. independent of segment length N

● summation of polynomials

– Allows for different invariances:

3S Representation

[(f,μ,σ,θ,N)i]

→ μ → σ → N → θ

Big Dikes and Big Data: Time Series Retrieval
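The slides do not spell out the 3S distance formula, so the following only illustrates the principle behind "constant time through summation of polynomials". It assumes two purely linear segments of equal length (a hypothetical `Linear` type, not the 3S tuple): the squared Euclidean distance is then a sum of a quadratic in the point index, which has a closed form and needs no loop over the N points.

```scala
object LinearSegmentDistance {
  // A hypothetical linear segment: the value at point i is offset + slope * i, for i = 0 until n.
  final case class Linear(offset: Double, slope: Double)

  // Reference implementation: O(n) loop over all points.
  def squaredDistanceNaive(a: Linear, b: Linear, n: Int): Double =
    (0 until n).map { i =>
      val d = (a.offset - b.offset) + (a.slope - b.slope) * i
      d * d
    }.sum

  // Closed form: expand (dO + dS * i)^2 and use the formulas for sum(i) and sum(i^2).
  def squaredDistanceClosedForm(a: Linear, b: Linear, n: Int): Double = {
    val dO = a.offset - b.offset
    val dS = a.slope - b.slope
    val sumI = n.toDouble * (n - 1) / 2                 // 0 + 1 + ... + (n - 1)
    val sumI2 = n.toDouble * (n - 1) * (2 * n - 1) / 6  // 0^2 + 1^2 + ... + (n - 1)^2
    n * dO * dO + 2 * dO * dS * sumI + dS * dS * sumI2
  }
}
```

Both functions return the same value, but the closed form costs the same whether a segment covers ten points or ten million.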

Page 14: Target Holding - Big Dikes and Big Data

Fast, flexible and accurate matching using the 3S representation:

Big Dikes and Big Data: Time Series Retrieval

Page 15: Target Holding - Big Dikes and Big Data

Big Dikes and Big Data: Time Series Retrieval

But what if you want to search in the history of multiple sensors?

Page 16: Target Holding - Big Dikes and Big Data

Storage and processing with 3S representation

Page 17: Target Holding - Big Dikes and Big Data

Storage and processing with 3S representation

Use case: search time series by example, across the time series of many sensors

● Storage and processing of sensor data hits the limits of 'traditional' database or file system based approaches, given 'enough' sensors.

● Technical dive into a distributed architecture:

– with distributed storage: Apache Cassandra

– with distributed processing: Apache Spark

(no guarantees, the ideal architecture highly depends on use case specifics ...)

Page 18: Target Holding - Big Dikes and Big Data

Distributed Storage

● Distributed storage advantages:

– scalability → more nodes, more storage and I/O capacity

– availability → more nodes make progress during failure

– reliability → more nodes don't lose data on failure

● Many solutions are available; at Target Holding we extensively use Apache Cassandra (C*) for high volume data

– because, among other things, it scales well, is easy to operate, performs OK on disks (no need for SSDs per se) and allows easier access to data than a file system

– it is also the storage system of choice within DDSC

Page 19: Target Holding - Big Dikes and Big Data

Distributed Processing

● Distributed processing advantages:

– scalability → more nodes, more compute capacity

– availability → more nodes make progress during failure

– reliability → more nodes don't lose data on failure

– with local processing being CPU bound instead of IO bound

● With C* as a starting point, the options are to build our own, use Hadoop M/R, or use Apache Spark, which we are investigating

– because of its integration with C*, high level abstraction, rich tool set and stream processing capability

– and it's getting a lot of traction in the Hadoop ecosystem

– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)

Page 20: Target Holding - Big Dikes and Big Data

Spark with Cassandra

● A typical Spark with Cassandra deployment collocates Spark Workers with Cassandra nodes:

image courtesy of DataStax

● Allows data locality: push down filtering, transformations, etc.

Page 21: Target Holding - Big Dikes and Big Data

Storage and processing with 3S representation

● Distribution based on time series identifier, e.g. the sensor id

– or something which identifies the location of measurement, …

● Store the tuple <f, μ, σ, θ, N> together with a timestamp

– the full 3S time series for each sensor must fit completely on one node

● Goal: find the series of segments which are closest to the example (simplified for presentation)

● Approach: produce a global top-k out of local top-k's (also applies without the simplification)
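The top-k-of-top-k's approach works because any match in the global top k must also be in the top k of its own partition. A minimal sketch on plain Scala collections, with a hypothetical `Match` type and each inner `Seq` standing in for the data of one node:

```scala
object TopKSketch {
  // A candidate match: which sensor, where the window starts, and how far it is from the example.
  final case class Match(sensor: String, start: Int, distance: Double)

  // Top k of one partition (one node's data).
  def localTopK(matches: Seq[Match], k: Int): Seq[Match] =
    matches.sortBy(_.distance).take(k)

  // Global top k: merge the local winners and take the top k again.
  def globalTopK(partitions: Seq[Seq[Match]], k: Int): Seq[Match] =
    localTopK(partitions.flatMap(localTopK(_, k)), k)
}
```

Only k candidates per partition cross the network, no matter how many windows each node evaluated.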

Page 22: Target Holding - Big Dikes and Big Data

Storage and processing with 3S representation

● Locally find the best matches, then repeat globally

– Parallelizes and distributes most of the computation

– Limits IO to the communication of the local best matches

Processing steps (diagram):

● Read segments, possibly restricted by sensor ids and time range

● Select best local matches

– Group by sensor id

– Create sliding window over time series

– Calculate distance for each window: zip with example, calculate Euclidean distance per segment pair, sum

– Take top k ordered by distance

● Take top k ordered by distance

Diagram label: Parallel distributed execution
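Before distributing anything, the steps can be tried on plain Scala collections. The sketch below makes strong simplifications: the `Seg` type carries a single height instead of the 3S tuple, and `distance` is a placeholder absolute difference, not the 3S distance. All names are made up for this example.

```scala
object WindowMatchSketch {
  // Hypothetical, simplified segment: sensor id, start time and a single height value.
  final case class Seg(sensor: String, t: Long, height: Double)

  // Placeholder distance between two segments; the real 3S distance is not reproduced here.
  def distance(a: Seg, b: Seg): Double = math.abs(a.height - b.height)

  // Returns (sensor, start time of window, summed distance) for the k closest windows.
  def bestMatches(segments: Seq[Seg], example: Seq[Seg], k: Int): Seq[(String, Long, Double)] = {
    require(example.nonEmpty, "the example needs at least one segment")
    segments
      .groupBy(_.sensor)                        // group by sensor id
      .toSeq
      .flatMap { case (sensor, segs) =>
        segs.sortBy(_.t)
          .sliding(example.length)              // sliding window over the time series
          .filter(_.length == example.length)   // drop a window that is shorter than the example
          .map { w =>
            val d = w.zip(example).map { case (a, b) => distance(a, b) }.sum
            (sensor, w.head.t, d)               // summed distance for this window
          }
      }
      .sortBy(_._3)                             // order by distance
      .take(k)                                  // top k
  }
}
```

The windows of one sensor never depend on another sensor's data, which is what makes the per-sensor part safe to run in parallel.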

Page 23: Target Holding - Big Dikes and Big Data

Storage and processing with 3S representation

[Diagram: Workers, each paired with a C* node (c* / Worker, c* / Worker, c* / Worker, ...), a Master and the Application (aka driver). Labels: Read segments; Select best local matches; Take top k ordered by distance; Parallel distributed execution; Coordinate cluster.]

Page 24: Target Holding - Big Dikes and Big Data

Apache Cassandra

Page 25: Target Holding - Big Dikes and Big Data

Apache Cassandra

● Key value store (with some enhancements)

● Based on Dynamo distribution and Big Table local storage

● Partitioned (distributed) map of

– sorted maps of

● primitives, structs

● maps, lists, sets

● counters (CRDT)

warning …

● personal mental model …

● the truth is in the code …

● caveat emptor

Page 26: Target Holding - Big Dikes and Big Data

CQL

● Cassandra Query Language helps with working with C*

/* 3S time series in CQL */
CREATE TABLE symbolic (
    s text,       -- sensor identifier
    t timestamp,  -- start of segment
    o float,      -- offset
    a float,      -- amplitude
    d float,      -- drift
    f int,        -- function / shape
    l int,        -- longitude

    /* partition by sensor identifier, order by timestamp */
    PRIMARY KEY ((s), t)
);

Page 27: Target Holding - Big Dikes and Big Data

Distribution & Consistency

● Partitioning based on hashed key in conjunction with positions of node tokens.

image courtesy of DataStax
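A toy version of token-based partitioning, to make the idea concrete. The ring positions, node names and the `hashCode`-based token are invented for this sketch; Cassandra uses its own partitioner and a far larger token range. The rule shown is that a key belongs to the first node whose token is greater than or equal to the key's token, wrapping around at the end of the ring.

```scala
object TokenRingSketch {
  // Node name by token position on the ring; the values are made up for illustration.
  val ring: Seq[(Int, String)] =
    Seq(20 -> "node-a", 45 -> "node-b", 70 -> "node-c", 95 -> "node-d")

  // Stand-in hash that maps a partition key to a token in 0..99.
  def token(partitionKey: String): Int =
    java.lang.Math.floorMod(partitionKey.hashCode, 100)

  // A token belongs to the first node whose token is >= it, wrapping around to the first node.
  def ownerOf(t: Int): String = {
    val sorted = ring.sortBy(_._1)
    sorted.find(_._1 >= t).getOrElse(sorted.head)._2
  }
}
```

With the table definition used in this talk, the partition key is the sensor identifier, so `ownerOf(token(sensorId))` would name the node holding the whole series of that sensor.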

● Consistent replication when R + W > N, where R = # nodes read from, W = # nodes written to, N = replication factor
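The rule is easy to check for a given configuration. A small sketch (the helper names are made up): when R + W > N, the set of replicas read from must overlap the set written to, so at least one replica in every read has seen the latest write.

```scala
object ConsistencySketch {
  // Reads are guaranteed to overlap with the latest write when R + W > N.
  def overlaps(r: Int, w: Int, n: Int): Boolean = r + w > n

  // Quorum size for a given replication factor: a majority of the replicas.
  def quorum(n: Int): Int = n / 2 + 1
}
```

With a replication factor of 3, quorum reads and quorum writes (2 + 2 > 3) satisfy the rule, while reading and writing a single replica (1 + 1) does not.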

Page 28: Target Holding - Big Dikes and Big Data

Apache Spark

Page 29: Target Holding - Big Dikes and Big Data

Apache Spark

● Spark is a distributed computing platform with fairly rich primitives operating on distributed data sets.

● Spark can be used with data from different data sources

– HDFS, Cassandra and Elasticsearch, to name a few

● Spark has libraries for: SQL, graph processing and machine learning

Page 30: Target Holding - Big Dikes and Big Data

Operator graphs

● It allows execution of DAGs of operators

– without using disk for intermediary results

– employs pipelining if possible

– (cyclic / iterative data flows are not cyclic ...)

Page 31: Target Holding - Big Dikes and Big Data

Architecture

● Applications allocate CPUs and memory on Worker Nodes, coordinated by a Master

● Applications schedule Jobs, which are DAGs of Tasks

● Tasks consume & produce Resilient Distributed Datasets

Page 32: Target Holding - Big Dikes and Big Data

Expressive 'language'

● Spark is developed in Scala, supports Java and Python.

● I consider Spark expressive

val wordCount = sc
  .textFile("hdfs://...")
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)                // sum the counts per word


Page 34: Target Holding - Big Dikes and Big Data

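Algorithm in Spark

Diagram labels: Read segments; Select best local matches

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

val conf = new SparkConf()
  .setAAA(...).setBBB(...) ...setZZZ(...)
val sc = new SparkContext(conf)

// the example to search for, shipped once to every worker
val example = sc.broadcast(Array(
  new Segment(...), ..., new Segment(...)
))
val k = 10

// read segments from Cassandra and convert each row to a Segment
val segments = sc
  .cassandraTable(keyspace, table)
  .map(fromRow)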

Page 35: Target Holding - Big Dikes and Big Data

Algorithm in Spark

Diagram labels: Select best local matches; Distance for each window

val matches = segments.mapPartitions(
  _.toSeq                                 // Iterator has no groupBy, so materialize the partition
    .groupBy(seg => seg.s)                // group by sensor id
    .iterator                             // mapPartitions must return an Iterator
    .flatMap({
      case (s, segs) =>
        segs
          .sliding(example.value.length)  // create sliding window over time series
          .map(w => (
            s, w,
            w.zip(example.value)          // zip with example
              .map(distance)              // calculate Euclidean distance per segment pair
              .map(math.abs)
              .sum                        // sum
          ))
    }))

Page 36: Target Holding - Big Dikes and Big Data

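Algorithm in Spark

Diagram labels: Select best local matches; Take top k ordered by distance (locally, then globally)

// take top k ordered by distance: each partition yields its own top k, which are then merged
val top = matches
  .takeOrdered(k)(Ordering.by(_._3))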

Page 37: Target Holding - Big Dikes and Big Data

Questions?