Target Holding - Big Dikes and Big Data
DESCRIPTION
Slides used during the November 2014 Big Data Groningen meetup (http://meetup.com/big-data-groningen/) about Big Dikes and Big Data.
TRANSCRIPT
Big Dikes and Big Data
12 November 2014, Big Data Groningen Meetup
Frens Jan Rumph, Michiel van der Ree
Target Holding
Big Data to Intelligence
Big Data Analytics is our key competence
– using machine learning and pattern recognition techniques to extract value from large data sets
Founded in 2009 and founding partner of Target
– Dutch Public-Private Cooperation on Big Data, partners including IBM, Oracle, Astron/Lofar, RUG, UMCG
Developed various innovative algorithms and technologies which we apply across multiple markets:
– Energy & Water management
– Media & Entertainment
– Healthy Ageing
– High Tech Systems
Target Holding
Big Data to Intelligence
Collect big data
– Domain specific data
– Web / public data & Social Media
– Sensor data
Enrich the data
– Feature extraction & machine learning
– Classification, ranking, forecasting, segmentation, clustering, natural language processing
Present & visualize
3S timeseries representation
IJkdijk
Field lab Livedijk (XL)
Stichting IJkdijk
● Dijkgraaf: I see something weird at sensor X..
– .. have I seen it before at sensor X? (my talk)
– .. have I seen it before at other sensors? (Frens Jan's talk)
● Query sensor's history by example
● Time series might be..
– .. too big to store
– .. too big to analyze
● Solution: reduced representation
● Seminal techniques:
– Piecewise linear approximation
– Symbolic aggregate approximation
Reduced Time Series Representation
Basic idea:
– Represent time series as a sequence of straight lines
– Lines can be connected (two numbers per line, so N/2 lines from N values) or disconnected (three numbers per line, N/3 lines)
– High compression rates
– Segment as you like, dynamic lengths
Each line segment has: • length • left_height • right_height
Piecewise Linear Approximation
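The piecewise linear idea can be sketched in a few lines of Python. This is a minimal illustration, not the IJkdijk implementation: the fixed-width windows and least-squares line fits are my assumptions; the slides note that segmentation can also use dynamic lengths.

```python
# Sketch: piecewise linear approximation with fixed-length segments.
# Each segment is stored as (length, left_height, right_height),
# matching the disconnected-lines variant (3 numbers per line).

def pla(series, seg_len):
    """Approximate `series` by straight lines over windows of `seg_len` points."""
    segments = []
    for start in range(0, len(series), seg_len):
        window = series[start:start + seg_len]
        n = len(window)
        if n == 1:
            segments.append((1, window[0], window[0]))
            continue
        # least-squares fit of a straight line y = a*x + b over the window
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(window) / n
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, window)) / \
            sum((x - mean_x) ** 2 for x in xs)
        b = mean_y - a * mean_x
        segments.append((n, b, a * (n - 1) + b))  # length, left, right height
    return segments
```

A series of N points compresses to roughly 3 * N / seg_len numbers, which is where the high compression rates come from.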
Basic idea:
– Segment using fixed frame width
– Converts numerical time series into an equivalent symbolic representation
– String analysis techniques can be used for analyzing time series
baabccbc
Symbolic Aggregate Approximation
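The fixed-frame, numeric-to-symbolic conversion can be sketched as follows. This is a toy SAX variant: the 4-letter alphabet and the breakpoints (standard N(0,1) quantiles) are common SAX defaults, not necessarily the settings that produced the "baabccbc" example on the slide.

```python
import statistics

# Sketch of SAX: z-normalize, average over fixed frames (PAA),
# then map each frame average to an alphabet symbol via breakpoints.

def sax(series, frame, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    # z-normalize so breakpoints on the N(0,1) quantiles make sense
    mu = statistics.mean(series)
    sd = statistics.pstdev(series) or 1.0
    norm = [(v - mu) / sd for v in series]
    symbols = []
    for start in range(0, len(norm), frame):
        window = norm[start:start + frame]
        avg = sum(window) / len(window)                # PAA value of the frame
        idx = sum(1 for b in breakpoints if avg > b)   # which alphabet bucket?
        symbols.append(alphabet[idx])
    return "".join(symbols)
```

Once a series is a string, string tooling (substring search, suffix structures, edit distance) applies directly.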
Basic idea:
– A time series is decomposed into monotonic segments of variable lengths
– Each segment is fitted to a monotonic shape and therefore represented as a symbol of an alphabet
– Symbolic (like SAX), but the symbols also capture shape and direction
3S Representation
Segment Symbolic Shape-Based Representation
Storing more than one symbol:
– “String Matching” (Levenshtein, Hamming, etc.) → INFORMATION LOSS!
– Each segment is approximated by:
● μ + σ · f(xₙ) + θ · xₙ
– Physical meaning:
● μ → offset
● σ → amplitude
● θ → linear drift with regard to ...
● f → shape
● N → length, # of data points
3S Representation
(f,μ,σ,θ,N)
Segment Symbolic Shape-Based Representation
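One way to read the (f, μ, σ, θ, N) tuple is as a recipe for rebuilding N data points as μ + θ·x + σ·f(x). The sketch below illustrates that reading with a hypothetical shape library; Target Holding's actual shape set and exact formula are not fully given in the slides.

```python
import math

# Hypothetical shape library: the real 3S shapes are not specified here.
SHAPES = {
    0: lambda x: 0.0,                     # flat
    1: lambda x: math.sin(math.pi * x),   # bump
}

def reconstruct(f, mu, sigma, theta, n):
    """Rebuild the N data points described by one 3S tuple (f, mu, sigma, theta, N)."""
    return [mu + theta * i + sigma * SHAPES[f](i / (n - 1)) for i in range(n)]
```

Storing one such tuple per segment keeps the physical meaning (offset, drift, amplitude, shape) available to queries, which plain string symbols would lose.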
Fast and accurate matching:
– Euclidean distance between segments
– In constant time, i.e. independent of segment length N
● summation of polynomials
– Allows for different invariances:
3S Representation
[(f,μ,σ,θ,N)i]
● μ (offset) ● σ (amplitude) ● N (length) ● θ (drift)
Time Series Retrieval
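The constant-time claim holds because the squared Euclidean distance between two parameterized segments expands into polynomial sums over x plus "shape moments" (Σf, Σf², Σfg, ...) that can be precomputed once per shape pair. A sketch under the μ + θ·x + σ·f(x) reading of the tuple (the exact 3S formula may differ), checked against the direct O(N) computation:

```python
# Squared distance between a(x) = mu_a + th_a*x + si_a*f(x) and
# b(x) = mu_b + th_b*x + si_b*g(x), summed over x = 0..n-1, via
# precomputed shape moments: O(1) per comparison instead of O(n).

def shape_moments(f, g, n):
    """Precompute once per shape pair and length; amortized away in practice."""
    fs = [f(i / (n - 1)) for i in range(n)]
    gs = [g(i / (n - 1)) for i in range(n)]
    return {
        "Sf": sum(fs), "Sg": sum(gs),
        "Sff": sum(v * v for v in fs), "Sgg": sum(v * v for v in gs),
        "Sfg": sum(a * b for a, b in zip(fs, gs)),
        "Sxf": sum(i * v for i, v in enumerate(fs)),
        "Sxg": sum(i * v for i, v in enumerate(gs)),
    }

def sq_dist_closed_form(pa, pb, n, m):
    """pa, pb: (mu, theta, sigma) of the two segments; m: shape moments."""
    (mu_a, th_a, si_a), (mu_b, th_b, si_b) = pa, pb
    dmu, dth = mu_a - mu_b, th_a - th_b
    Sx = n * (n - 1) // 2                  # sum of x
    Sxx = (n - 1) * n * (2 * n - 1) // 6   # sum of x^2 (summation of polynomials)
    return (n * dmu * dmu + 2 * dmu * dth * Sx + dth * dth * Sxx
            + si_a * si_a * m["Sff"] + si_b * si_b * m["Sgg"]
            - 2 * si_a * si_b * m["Sfg"]
            + 2 * si_a * (dmu * m["Sf"] + dth * m["Sxf"])
            - 2 * si_b * (dmu * m["Sg"] + dth * m["Sxg"]))
```

Dropping individual parameters from the expansion is also what makes the different invariances (to μ, σ, N, θ) cheap to offer.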
Fast, flexible and accurate matching using the 3S representation:
But what if you want to search in the history of multiple sensors?
Storage and processing with 3S representation
Use case: search time series by example over the time series of many sensors
● Storage and processing of sensor data hits the limits of 'traditional' database or file system based approaches, at least for 'enough' sensors.
● Technical dive into a distributed architecture:
– with distributed storage: Apache Cassandra
– with distributed processing: Apache Spark
(no guarantees, the ideal architecture highly depends on use case specifics ...)
Distributed Storage
● Distributed storage advantages:
– scalability: more nodes → more storage and I/O capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
● Many solutions available; at Target Holding we extensively use Apache Cassandra (C*) for high volume data
– because, among other things, it scales well, is easy to operate, performs OK on disks (don't need SSDs per se), and allows easier access to data than a file system
– also the storage system of choice within DDSC
Distributed Processing
● Distributed processing advantages:
– scalability: more nodes → more compute capacity
– availability: more nodes → make progress during failure
– reliability: more nodes → don't lose data on failure
– with local processing being CPU bound instead of IO bound
● With C* as a starting point, options are to build our own, use Hadoop M/R, or Apache Spark, which we are investigating
– because of its integration with C*, high level abstraction, rich tool set, stream processing capability
– and it's getting a lot of traction in the Hadoop ecosystem
– (adoption by Cloudera, Hortonworks, MapR, Apache Mahout, etc.)
Spark with Cassandra
● A typical Spark with Cassandra deployment collocates Spark Workers with Cassandra nodes:
image courtesy of DataStax
● Allows data locality: push down filtering, transformations, etc.
Storage and processing with 3S representation
● Distribution based on timeseries identifier, e.g. the sensor id– or something which identifies the location of measurement, …
● Store the tuple <f, μ, σ, θ, N> together with a timestamp– the full 3s timeseries for each sensor must fit completely on one node
● Goal: Find series of segments which are closest to the example (simplified for presentation)
● Approach: Produce a global top-k out of local top-k's (applies also without simplification)
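The local-to-global top-k step can be sketched in plain Python; `heapq` stands in here for what Spark's `takeOrdered` does in the deck's code.

```python
import heapq

# Each node computes a local top-k of (distance, match) pairs; only those
# k-sized lists travel over the network, and the driver merges them into
# the global top-k. Smaller distance is better.

def local_top_k(matches, k):
    """matches: iterable of (distance, match_id) produced on one node."""
    return heapq.nsmallest(k, matches)

def global_top_k(local_tops, k):
    """Merge the per-node top-k lists into one global top-k."""
    return heapq.nsmallest(k, (m for top in local_tops for m in top))
```

The global top-k over all data is always contained in the union of the local top-k's, so no correct match can be lost by this two-phase approach.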
Storage and processing with 3S representation
● Locally find the best matches, then repeat globally– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Group by sensor id
Read segments
possibly restricted by sensor ids and time range
Select best local matches
Take top k ordered by distance
Create sliding window over time series
Calculate distance for each window
Take top k ordered by distance
Zip with example
Calculate Euclidean distance per segment pair
Sum
Parallel distributed execution
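The per-partition part of this flow can be illustrated in plain Python; the `distance` stand-in below replaces the real per-segment 3S distance, and the sensor-id/value tuples are a simplification of the stored segments.

```python
from itertools import groupby

def distance(pair):
    a, b = pair
    return abs(a - b)  # stand-in: real code compares (f, mu, sigma, theta, N) tuples

def best_local_matches(segments, example, k):
    """segments: iterable of (sensor_id, value) sorted by sensor and time.

    Group by sensor id, slide a window of len(example) over each sensor's
    series, score each window against the example, keep the best k.
    """
    scored = []
    for sensor, group in groupby(segments, key=lambda s: s[0]):
        series = [v for _, v in group]
        for start in range(len(series) - len(example) + 1):
            window = series[start:start + len(example)]
            score = sum(distance(p) for p in zip(window, example))
            scored.append((score, sensor, start))
    return sorted(scored)[:k]
```

Because each sensor's full series lives on one node, this whole computation runs locally; only the k best (score, sensor, offset) triples leave the node.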
Storage and processing with 3S representation
[Diagram: parallel distributed execution — Spark Workers co-located with C* nodes; an Application (aka driver) and a Master coordinate the cluster; workers read segments and select best local matches; the driver takes the top k ordered by distance]
Apache Cassandra
● Key value store (with some enhancements)
● Based on Dynamo distribution and Big Table local storage
● Partitioned (distributed) map of
– sorted maps of
● primitives, structs
● maps, lists, sets
● counters (CRDT)
warning …
● personal mental model …
● the truth is in the code …
● caveat emptor
CQL
● The Cassandra Query Language makes working with C* easier
/* 3s timeseries in CQL */
CREATE TABLE symbolic (
    s text,      -- sensor identifier
    t timestamp, -- start of segment
    o float,     -- offset
    a float,     -- amplitude
    d float,     -- drift
    f int,       -- function / shape
    l int,       -- length (# of data points)
    /* partition by sensor identifier, order by timestamp */
    PRIMARY KEY ((s), t)
)
Distribution & Consistency
● Partitioning based on hashed key in conjunction with positions of node tokens.
image courtesy of DataStax
● Consistent replication when R + W > N
(R = # nodes read from, W = # nodes written to, N = replication factor)
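The R + W > N rule says a read quorum must overlap every write quorum, so a read always touches at least one node that saw the latest acknowledged write. A tiny illustration (my sketch, not a Cassandra API):

```python
# Quorum consistency check: any read set of r nodes and write set of
# w nodes out of n replicas must intersect, which holds iff r + w > n.

def is_consistent(r, w, n):
    """r nodes read from, w nodes written to, n = replication factor."""
    return r + w > n

print(is_consistent(2, 2, 3))  # QUORUM reads and writes at N=3 -> True
```

With N=3 this is the classic QUORUM/QUORUM setup; ONE/ONE (1 + 1 = 2, not > 3) trades that guarantee for latency.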
Apache Spark
● Spark is a distributed computing platform with fairly rich primitives operating on distributed data sets.
● Spark can be used with data from different data sources
– HDFS, Cassandra, Elasticsearch to name a few
● Spark has libraries for: SQL, graph processing and machine learning
Operator graphs
● It allows execution of DAGs of operators
– without using disk for intermediary results
– employs pipelining if possible
– (cyclic / iterative data flows are not actually cyclic ...)
Architecture
● Applications allocate CPUs and memory on Worker Nodes, coordinated by a Master
● Applications schedule Jobs which are DAGs of Tasks
● Tasks consume & produce Resilient Distributed Datasets
Expressive 'language'
● Spark is developed in Scala, supports Java and Python.
● I consider Spark expressive
val wordCount = sc
.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
Storage and processing with 3S representation
● Locally find the best matches, then repeat globally– Parallelizes and distributes most of the computation
– Limits IO to the communication of the local best matches
Group by sensor id
Read segments
possibly restricted by sensor ids and time range
Select best local matches
Take top k ordered by distance
Create sliding window over time series
Calculate distance for each window
Take top k ordered by distance
Zip with example
Calculate Euclidean distance per segment pair
Sum
Parallel distributed execution
Algorithm in Spark
val conf = new SparkConf()
.setAAA(...).setBBB(...) ...setZZZ(...)
val sc = new SparkContext(conf)
val example = sc.broadcast(Array(
new Segment(...), ..., new Segment(...)
))
val k = 10
val segments = sc
.cassandraTable(keyspace, table)
.map(fromRow)
Read segments
Select best local matches
Distance for each window
Algorithm in Spark
val matches = segments.mapPartitions(
_.groupBy(seg => seg.s)
.flatMap({
case (s, segs) =>
segs
.sliding(example.value.length)
.map(w => (
s, w,
w.zip(example.value)
.map(distance)
.map(math.abs)
.sum
))
}))
Group by sensor id
Create sliding window over time series
Zip with example
Calculate euclidean distance per segment pair
Sum
Select best local matches
Algorithm in Spark
val top = matches
.takeOrdered(k)(Ordering.by(_._3))
Take top k ordered by distance
Questions?