data stream processing and analytics

Data Stream Processing and Analytics. Summer schools – CEA – EDF – INRIA, 06-20-2014. Alexis Bondu, EDF R&D, [email protected]. Thanks to Vincent Lemaire, OrangeLab.



TRANSCRIPT

Page 1: Data Stream Processing and Analytics

Data Stream Processing and Analytics

Summer schools – CEA – EDF – INRIA, 06-20-2014

Alexis Bondu, EDF R&D, [email protected]

Thanks to Vincent Lemaire, OrangeLab

Page 2: Data Stream Processing and Analytics

Outline

• Introduction on data streams
• Part 1: Querying
• Part 2: Unsupervised Learning
• Part 3: Supervised Learning
• Tutorial: Let’s try a CEP!


Page 4: Data Stream Processing and Analytics

VLDB vs. CEP

• Very Large Database (VLDB), typically more than 10 TB:
– Static data
– Storage: distributed over several computers
– Query & analysis: distributed and parallel processing

• Complex Event Processing (CEP), typically more than 1000 operations per second:
– Data in motion
– Storage: none (only a buffer in memory)
– Query & analysis: processing on the fly (and in parallel)

Two complementary approaches

Page 5: Data Stream Processing and Analytics

Application Areas

• Finance: high-frequency trading
– Find correlations between the prices of stocks within the historical data;
– Evaluate the stationarity of these correlations over time;
– Give more weight to recent data.

• Banking: detection of credit card fraud
– Automatically monitor a large number of transactions;
– Detect patterns of events that indicate a likelihood of fraud;
– Stop the processing and send an alert for human adjudication.

• Medicine: health monitoring
– Perform automatic medical analysis to reduce the workload on nurses;
– Analyze device measurements to detect early signs of disease;
– Help doctors to make a diagnosis in real time.

• Smart cities & smart grid:
– Optimization of public transport;
– Management of the local production of electricity;
– Flattening of the peaks of consumption.

Page 6: Data Stream Processing and Analytics

An example of data stream

[Figure: input data stream of tuples.]

A tuple: (1,1);(1,2);(2,2);(1,3)

All tuples can be coded by 4 pairs of integers.

Online processing: rotate and combine tuples in a compact way.

Page 7: Data Stream Processing and Analytics

Specific constraints of stream processing

What is a data stream?
• A data stream continuously emits tuples
• The order of tuples is not controlled
• The emission rate of tuples is not controlled
• Stream processing is an online process

What is a tuple?
• A small piece of information in motion
• Composed of several variables
• All tuples share the same structure (i.e. the same variables)

In the end, the quality of the processing is the adjustment variable.

Page 8: Data Stream Processing and Analytics

How to manage time?

• A timestamp is associated with each tuple:
– Explicit timestamp: defined as a variable within the structure of the data stream
– Implicit timestamp: assigned by the system when tuples are processed

• Two ways of representing time:
– Logical time: only the order of the processed tuples is considered
– Physical time: characterizes the time when the tuple was emitted

• Buffer issues:
– The tuples are not necessarily received in order
– How long should the system wait for a missing tuple?
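The buffer issue above can be illustrated with a small re-ordering buffer. This is only a sketch under stated assumptions: the class `ReorderBuffer` and its `max_delay` parameter are hypothetical names, tuples carry an explicit timestamp, and a tuple is released once it is older than the latest timestamp seen minus `max_delay`. It does not describe how any particular CEP engine handles late tuples.

```python
import heapq


class ReorderBuffer:
    """Re-orders tuples by explicit timestamp, waiting at most `max_delay` (hypothetical sketch)."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.heap = []        # min-heap of (timestamp, arrival index, tuple)
        self.latest = None    # largest timestamp seen so far
        self.arrivals = 0     # tie-breaker so that tuples themselves are never compared

    def push(self, timestamp, tuple_):
        """Store one tuple and return the tuples that can now be released, in timestamp order."""
        heapq.heappush(self.heap, (timestamp, self.arrivals, tuple_))
        self.arrivals += 1
        if self.latest is None or timestamp > self.latest:
            self.latest = timestamp
        released = []
        # A buffered tuple is released once it is older than latest - max_delay.
        while self.heap and self.heap[0][0] <= self.latest - self.max_delay:
            ts, _, tup = heapq.heappop(self.heap)
            released.append((ts, tup))
        return released

    def flush(self):
        """Release everything that is still buffered (end of stream)."""
        released = []
        while self.heap:
            ts, _, tup = heapq.heappop(self.heap)
            released.append((ts, tup))
        return released


buffer = ReorderBuffer(max_delay=2)
ordered = []
for ts in [1, 3, 2, 5, 4, 8]:          # tuples received out of order
    ordered += buffer.push(ts, f"tuple@{ts}")
ordered += buffer.flush()
print([ts for ts, _ in ordered])
```

A larger `max_delay` tolerates more disorder but increases latency and memory use; a tuple that arrives later than `max_delay` is still emitted, but out of order.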

Page 9: Data Stream Processing and Analytics

Complex Events Processing (CEP)

• An operator implements a query or a more complex analysis
• An operator processes data in motion with a low latency
• Several operators run at the same time, parallelized on several CPUs and/or computers
• The graph of operators is defined before the processing of data streams
• Connectors allow interaction with external data streams, static data in a DBMS, and visualization tools

[Figure: a graph of operators between input data streams (E-mail, Twitter, RSS, Stocks, XML) and output data streams (Visualization, Database).]

Page 10: Data Stream Processing and Analytics

Complex Events Processing (CEP)

Main features:
• High frequency processing
• Parallel computing
• Fault-tolerant
• Robust to imperfect and asynchronous data
• Extensible (implementation of new operators)

Notable products:
• StreamBase (Tibco)
• InfoSphere Streams (IBM)
• STORM (Open source – Twitter)
• KINESIS (API – Amazon)
• SQLstream
• Apama

Page 11: Data Stream Processing and Analytics

Outline

• Introduction on data streams
• Part 1: Querying
• Part 2: Unsupervised Learning
• Part 3: Supervised Learning
• Tutorial: Let’s try a CEP!

Page 12: Data Stream Processing and Analytics

Time-window

• A query is performed on a finite part of the past tuples

[Figure: time axis showing Past, Current time and Future.]

Define a time-window:
• Fixed window: “June 2000”
• Sliding window: “last week”
• Landmark window: “since 1 January 2000”

Update interval:
• A result is produced at each update
• The type of window depends on the update interval

[Figure: three diagrams showing the successive windows at update times t1, t2 and t3, for a sliding window, a jumping window and a tumbling window.]
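How the update interval shapes the windows can be sketched in a few lines. The function `windows` and its parameters `size` and `advance` are hypothetical names chosen for illustration, and the sketch works on a finite, already ordered list of `(timestamp, value)` pairs rather than on a live stream. When `advance` equals `size`, the windows do not overlap (tumbling); when `advance` is smaller than `size`, consecutive windows overlap.

```python
def windows(tuples, size, advance):
    """Yield (start, end, values) for the time-windows [start, start + size).

    tuples  : list of (timestamp, value), sorted by timestamp
    size    : length of each window
    advance : update interval, i.e. shift between two consecutive windows
    """
    if not tuples:
        return
    start = tuples[0][0]
    last = tuples[-1][0]
    while start <= last:
        end = start + size
        values = [value for ts, value in tuples if start <= ts < end]
        yield start, end, values
        start += advance


stream = [(t, t) for t in range(6)]    # one tuple per time unit, value = timestamp

# advance == size: non-overlapping (tumbling) windows
print(list(windows(stream, size=2, advance=2)))

# advance < size: overlapping windows, one result per update
print(list(windows(stream, size=3, advance=1)))
```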

Page 13: Data Stream Processing and Analytics

SQL-like language

• Most CEP engines provide a SQL-like language
• Few CEP engines provide a user-friendly interface
• Each software publisher proposes its own language (not standardized)
• Main features:
– Define the structure and the connection of the data streams
– Define time-windows on data streams
– Extend the SQL language (able to run SQL queries on relational databases)
– Run queries on data streams within time-windows
• Additional functions:
– Statistics (min, max, mean, standard deviation, etc.)
– Math (trigonometry, logarithm, exponential, etc.)
– String (regular expression, trim, substring, etc.)
– Date (getDayType, getSecond, now, etc.)

Page 14: Data Stream Processing and Analytics

SQL-like language

A simple example with StreamBase (geek zone):

    CREATE INPUT STREAM InputStream (
        Compteur string(12),
        Type string(12),
        Souscription int,
        C_index int,
        Date timestamp
    );

    CREATE WINDOW OneMinuteWindow (SIZE 60 ADVANCE 60 ON Date);

Page 15: Data Stream Processing and Analytics

SQL-like language

A simple example with StreamBase (geek zone):

    CREATE OUTPUT STREAM OutputStream;

    SELECT firstval(Compteur) AS Compteur,
           lastval(C_index) - firstval(C_index) AS Conso,
           openval(Date) AS StartTime,
           closeval(Date) AS EndTime
    FROM InputStream[OneMinuteWindow]
    INTO OutputStream;

Page 16: Data Stream Processing and Analytics

SQL-like language

A simple example with StreamBase: user-friendly interface

[Figure: the same example built with the graphical interface.]

Page 17: Data Stream Processing and Analytics

Outline

• Introduction on data streams
• Part 1: Querying
• Part 2: Unsupervised Learning
• Part 3: Supervised Learning
• Tutorial: Let’s try a CEP!

Page 18: Data Stream Processing and Analytics

Online vs. Batch mode

What is unsupervised learning?
• Mining data in order to find new knowledge
• No idea about the expected result

Batch mode:
• An entire dataset is available
• The examples can be processed several times
• Weak constraint on the computing time
• The distribution of data does not change over time

Online processing:
• Tuples are emitted one by one
• Tuples are processed on the fly due to their high rate (one-pass algorithms)
• Real-time computing (low latency)
• The distribution of tuples changes over time (drift)

Page 19: Data Stream Processing and Analytics

Summarizing data streams

Why do we need to summarize data streams?
• The number of tuples is infinite …
• Their emission rate is potentially very high …
• The hardware resources are limited (CPU, RAM & I/O)

What is a summary?
• A compact representation of the past tuples
• With a controlled memory space, accuracy and latency
• Which allows the history of the stream to be queried (or analyzed) in an approximate way

The objective is to maximize the accuracy of the queries, given technical constraints (stream rate, CPU, RAM & I/O).

Page 20: Data Stream Processing and Analytics

Two types of summary

Specific summaries: dedicated to a single query (or a few)
• Flajolet-Martin Sketch: approximates the number of unique objects in a stream;
• Bloom Filter: efficiently tests if an element is a member of a set;
• Count-Sketch: efficiently finds the k most frequent elements of a set;
• Count-Min Sketch: estimates the number of elements with a particular value, or within an interval of values.

Generic summaries: allow a large range of queries on any past period
• StreamSamp: based on successive windowing and sampling;
• CluStream: based on micro-clustering;
• DenStream: based on evolving micro-clustering.

Detailed in this talk: the Flajolet-Martin Sketch, StreamSamp and DenStream.

Page 21: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

Problem statement:
• S is a collection of N elements: S = {s1, s2, …, sN}
• Two elements of S may be identical
• S includes only F distinct elements
• The objective is to efficiently estimate F in terms of:
– Time complexity (one-pass algorithm)
– Space complexity
– Probabilistic guarantee

Page 22: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

Hash function: h(.)
• Associates an element si with a random binary value
• h(.) is a deterministic function
• w is the length of the binary values (number of bits)
• w is an integer such that [condition not recovered from the slide]
• Random values are uniformly drawn within [0, 2^w − 1]

Intuition: given a large set of random binary values,
• 1/2 of them begin with “1”
• 1/4 of them begin with “11”
• 1/8 of them begin with “111”
• 1/2^k of them begin with k “1”

Example of binary values:
10010110
00100011
10001010
01011011
00011001
01010011

Page 23: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

t(.) is the function which keeps only the first “1” (counting from the left); the other bits are set to “0”.

Example: h(si) = 01001111011010, t(h(si)) = 01000000000000

Location of the first “1” within h(.), for the six example values:
10000000
00100000
10000000
01000000
00010000
01000000

Fusion of binary words: B is the fusion of all the binary words t(h(si)), using the bitwise OR operator.

B = 11110000

R is the rank of the first “0” (counting from the left) within B. It is a random variable related to F!

R = 5

Page 24: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), initial state, before a is merged:

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000

B = 00000000000000

R = 1

Page 25: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), after a has been merged:

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000

B = 01000000000000

R = 1

Page 26: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), after a and b have been merged:

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000
h(a) = 01001111011010   t(h(a)) = 01000000000000

B = 11000000000000

R = 3

Page 27: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), after the second occurrence of a (B is unchanged):

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(c) = 00010110010110   t(h(c)) = 00010000000000

B = 11000000000000

R = 3

Page 28: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), after c has been merged:

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(c) = 00010110010110   t(h(c)) = 00010000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000

B = 11010000000000

R = 3

Page 29: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

A single-pass algorithm (FM Sketch), final state:

Input data stream:
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000
h(a) = 01001111011010   t(h(a)) = 01000000000000
h(c) = 00010110010110   t(h(c)) = 00010000000000
h(b) = 10001010011011   t(h(b)) = 10000000000000

B = 11010000000000

R = 3

• This single-pass algorithm is adapted to data streams
• Few pieces of information need to be stored in RAM
• R is a random variable such that: [relation not recovered from the slide]
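The single-pass procedure can be written down as a short sketch. Several choices here are assumptions made for illustration and do not come from the slides: the hash h(.) is built from SHA-256 truncated to w bits, and the function names `h`, `t`, `rank_of_first_zero` and `fm_sketch` are hypothetical. R is computed as defined earlier, i.e. the rank of the first “0” counting from the left, starting at 1.

```python
import hashlib


def h(element, w=32):
    """Deterministic hash of an element to a w-bit integer (assumption: truncated SHA-256, w <= 64)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - w)


def t(value):
    """Keep only the first "1" counting from the left; all other bits are set to 0."""
    return 1 << (value.bit_length() - 1) if value else 0


def rank_of_first_zero(b, w):
    """Rank (starting at 1, counting from the left) of the first "0" in the w-bit word b."""
    for rank in range(1, w + 1):
        if not (b >> (w - rank)) & 1:
            return rank
    return w + 1


def fm_sketch(stream, w=32):
    """Single pass over the stream: B is the OR-fusion of all t(h(s)); returns (B, R)."""
    b = 0
    for element in stream:
        b |= t(h(element, w))
    return b, rank_of_first_zero(b, w)


# Fusion of the six 8-bit example values used in the slides
example = [0b10010110, 0b00100011, 0b10001010, 0b01011011, 0b00011001, 0b01010011]
B = 0
for value in example:
    B |= t(value)
print(format(B, "08b"), rank_of_first_zero(B, 8))

# Repeated elements do not change the sketch
print(fm_sketch(["a", "b", "a", "c", "b"]) == fm_sketch(["a", "b", "c"]))
```

Only the word B has to be kept in memory, which is why the memory footprint does not depend on the number of tuples.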

Page 30: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

How to estimate E(R)?

[Figure: the input data stream feeds a collection of m FM sketches through a deterministic routing of h(si); the m-th sketch holds Bm = 11010000000000 and Rm = 3.]

• The first bits of h(si) are used to assign each element to a sketch: if m = 16, the first 4 bits of h(si) represent the ID of the corresponding sketch.
  h(s1) = 0011001000101 -> 0011 = 3 -> s1 is assigned to the 3rd sketch
• Each sketch counts approximately F/m distinct elements.

Page 31: Data Stream Processing and Analytics

Flajolet-Martin Sketch [1]

How to estimate E(R)?

[Figure: the input data stream feeds a collection of m FM sketches through a deterministic routing of h(si); the m-th sketch holds Bm = 11010000000000 and Rm = 3. The values of the m sketches are combined by a stochastic average.]
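A possible reading of the stochastic average is sketched below. It relies on assumptions that are not in the slides: the hash is a truncated SHA-256, m is a power of two, and the final estimate uses the estimator of the original paper [1], F ≈ (m / φ) · 2^(mean R) with φ ≈ 0.77351, where R is counted from 0 (the number of leading “1” bits of each B). The names `hash64` and `estimate_distinct` are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from the original paper [1]


def hash64(element):
    """Deterministic 64-bit hash (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_distinct(stream, m=64):
    """Estimate the number of distinct elements with m FM sketches (m must be a power of two)."""
    route_bits = m.bit_length() - 1          # first bits: ID of the sketch
    rest_bits = 64 - route_bits              # remaining bits: value stored in the sketch
    sketches = [0] * m
    for element in stream:
        value = hash64(element)
        sketch_id = value >> rest_bits
        rest = value & ((1 << rest_bits) - 1)
        if rest:
            # keep only the first "1" from the left and fuse it with OR
            sketches[sketch_id] |= 1 << (rest.bit_length() - 1)
    total = 0
    for b in sketches:
        r = 0                                # number of leading "1" bits of B (R counted from 0)
        while r < rest_bits and (b >> (rest_bits - 1 - r)) & 1:
            r += 1
        total += r
    return (m / PHI) * 2 ** (total / m)


# 60000 tuples but only 20000 distinct values
stream = (f"user-{i % 20000}" for i in range(60000))
print(round(estimate_distinct(stream, m=64)))
```

With m sketches the relative error shrinks roughly as 1/sqrt(m), at the cost of m words of memory instead of one.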


Page 33: Data Stream Processing and Analytics

Sampling based summaries

Objectives of a generic summary:
• Summarizes the entire history of the data stream
• Requires a bounded memory space
• Allows a large range of queries, including supervised and unsupervised analysis

Summarize by sampling the tuples:
• The sampling techniques are adapted to incremental processing
• A limited number of tuples are stored
• The stored tuples constitute a representative sample
• The recent past can be favored in terms of accuracy (i.e. sampling rate)

Page 34: Data Stream Processing and Analytics

Sampling based summaries: Reservoir sampling [2]

• The reservoir is a uniform sample;
• The sampling rate decreases over time;
• The probability that a tuple is included in the reservoir is: k / Nb_Emitted_Tuples

[Figure: over time, the reservoir holds k tuples (Tuple 1, Tuple 2, …, Tuple k). Between t and t + 1, the i-th tuple is inserted by uniform sampling (sampling rate = k/i), and the tuple to delete is selected by uniform sampling.]
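The reservoir update described above can be sketched as follows. The function name `reservoir_sample` and the optional `rng` argument are hypothetical; the logic follows the classical reservoir scheme [2]: the first k tuples fill the reservoir, then the i-th tuple is inserted with probability k/i and replaces a uniformly chosen stored tuple.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform sample of at most k tuples from a stream read in a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, tuple_ in enumerate(stream, start=1):
        if i <= k:
            # the first k tuples fill the reservoir
            reservoir.append(tuple_)
        elif rng.random() < k / i:
            # the i-th tuple is inserted with probability k/i ...
            # ... and replaces a stored tuple selected uniformly
            reservoir[rng.randrange(k)] = tuple_
    return reservoir


sample = reservoir_sample(range(1, 10001), k=10, rng=random.Random(42))
print(len(sample), all(1 <= x <= 10000 for x in sample))
```

The memory footprint stays at k tuples whatever the length of the stream, which is the bounded-memory property expected from a generic summary.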

Page 35: Data Stream Processing and Analytics

Sampling based summaries: StreamSamp [3]

[Figure sequence (pages 35 to 41): the input data stream is uniformly sampled into samples of 4 tuples, which are stored in collections of order 0, order 1 and order 2.]

1) Tuples 1 to 4 form Sample 1 (order 0).
2) Tuples 5 to 8 form Sample 2 (order 0).
3) Tuples 9 to 12 form Sample 3. Samples 1 and 2 are fused by uniform sampling into Sample 1-2 (Tuples 2, 3, 5 and 8), which is stored at order 1.
4) Tuples 13 to 16 form Sample 4 (order 0). Sample 1-2 stays at order 1.
5) Tuples 17 to 20 form Sample 5. Samples 3 and 4 are fused by uniform sampling into Sample 3-4 (Tuples 9, 11, 14 and 16), which is stored at order 1.
6) The same fusion mechanism moves samples from order 1 to order 2.

Page 42: Data Stream Processing and Analytics

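Sampling based summaries: StreamSamp [3]

• A sample gathers k uniformly drawn tuples
• A collection of samples gathers h samples
• Each collection has an order o
• The sampling rate of the samples is equal to [formula not recovered from the slide]

[Figure: time axis from the past to the present; the collections of order 2, order 1 and order 0 cover the stream from the oldest to the most recent tuples.]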

Page 43: Data Stream Processing and Analytics

Sampling based summaries: StreamSamp [3]

How to exploit this summary offline?

1) Fusion of all samples (orders 0, 1, 2, …)
2) Weighting of the tuples to keep their representativeness
3) Use of any data mining approach able to process a weighted training set
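The mechanism illustrated on the previous pages can be approximated by the sketch below. It is a simplified illustration, not the reference implementation of [3]: the class name `StreamSampSketch` is hypothetical, every tuple is kept at order 0 (sampling rate 1), a fusion draws k/2 tuples from each of the two oldest samples of a collection (as the figures suggest), and a tuple stored at order o receives the weight 2^o when the summary is exploited offline.

```python
import random


class StreamSampSketch:
    """Simplified StreamSamp-like summary: samples of k tuples, at most h samples per order."""

    def __init__(self, k, h, seed=None):
        self.k = k
        self.h = h
        self.rng = random.Random(seed)
        self.current = []      # sample being filled with the incoming tuples
        self.orders = [[]]     # orders[o] = samples of order o, oldest first

    def add(self, tuple_):
        self.current.append(tuple_)
        if len(self.current) == self.k:
            self._insert(0, self.current)
            self.current = []

    def _insert(self, order, sample):
        while len(self.orders) <= order:
            self.orders.append([])
        self.orders[order].append(sample)
        if len(self.orders[order]) > self.h:
            # fusion of the two oldest samples: half of the tuples come from each one
            first = self.orders[order].pop(0)
            second = self.orders[order].pop(0)
            half = self.k // 2
            fused = self.rng.sample(first, half) + self.rng.sample(second, self.k - half)
            self._insert(order + 1, fused)

    def weighted_tuples(self):
        """Offline exploitation: fusion of all samples, each tuple weighted by 2**order."""
        for order, samples in enumerate(self.orders):
            for sample in samples:
                for tuple_ in sample:
                    yield tuple_, 2 ** order
        for tuple_ in self.current:
            yield tuple_, 1


summary = StreamSampSketch(k=4, h=2, seed=0)
for i in range(1, 21):                       # Tuple 1 ... Tuple 20, as in the figures
    summary.add(i)
print([len(samples) for samples in summary.orders])
print(sum(weight for _, weight in summary.weighted_tuples()))
```

In this setting the weights always sum to the number of tuples received, which is the sense in which the weighting keeps the fused samples representative.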


Page 45: Data Stream Processing and Analytics

Micro-clustering based summaries

What is micro-clustering?
• Micro-clusters (MC) are small groups of tuples,
• MC are represented by features which locally describe the density of tuples,
• Micro-clustering approaches handle evolving data,
• MC are maintained in RAM within a bounded memory space,
• MC summarize the density of the input data stream, while giving more importance to the recent past.

Page 46: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

• Density based micro-clustering
• Weighting of the tuples over time

Page 47: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

Initialization of the micro-clusters with DBSCAN; each micro-cluster is described by MC(cj, rj, wj).

Parameters:
- Minimum weight of a micro-cluster
- Maximum variance of a micro-cluster
- Fading factor

[Figure: tuples and micro-clusters plotted on the axes Var 1 and Var 2.]

Page 48: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

[Figure sequence (pages 48 to 53), axes Var 1 and Var 2: processing of a first received tuple.]

1) A tuple is received.
2) The micro-clusters are faded.
3) The closest micro-cluster is found.
4) The new variance of this micro-cluster, including the tuple, is tested.
5) Here, the new variance is greater than the maximum variance.
6) An “outlier” micro-cluster is created.

Page 54: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

[Figure sequence (pages 54 to 59), axes Var 1 and Var 2: processing of a second received tuple.]

1) A tuple is received.
2) The micro-clusters are faded.
3) The closest micro-cluster is found.
4) The new variance of this micro-cluster, including the tuple, is tested.
5) Here, the new variance is less than the maximum variance.
6) The tuple is assigned to the micro-cluster; the weight, the variance and the mean are updated.

Page 60: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

[Figure sequence (pages 60 to 68), axes Var 1 and Var 2: processing of a third received tuple.]

1) A tuple is received.
2) The micro-clusters are faded.
3) The closest micro-cluster is found.
4) The new variance of this micro-cluster is tested: it is too large.
5) The closest “outlier” micro-cluster is found.
6) The new variance of this “outlier” micro-cluster is tested.
7) In this case, the tuple is assigned to the “outlier” micro-cluster.
8) The weight of the “outlier” micro-cluster is now greater than the minimum weight.
9) The “outlier” micro-cluster becomes a regular micro-cluster.
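The sequence of steps shown on these pages can be condensed into a sketch. It is a simplified illustration under stated assumptions, not the DenStream algorithm of [4] in full: the names `MicroCluster` and `DenStreamSketch` are hypothetical, the fading multiplies every micro-cluster by 2^(-fading) at each received tuple, the variance is the sum of the per-variable variances, the DBSCAN initialization is replaced by a manually seeded micro-cluster, and faded micro-clusters are never deleted.

```python
import math


class MicroCluster:
    """Micro-cluster summarized by its weight and its weighted sums."""

    def __init__(self, point, outlier):
        self.w = 1.0
        self.ls = [float(x) for x in point]        # weighted linear sum
        self.ss = [float(x) * x for x in point]    # weighted sum of squares
        self.outlier = outlier

    def fade(self, factor):
        self.w *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]

    def center(self):
        return [v / self.w for v in self.ls]

    def variance_with(self, point):
        """Variance the micro-cluster would have if the point were added."""
        w = self.w + 1.0
        var = 0.0
        for ls, ss, x in zip(self.ls, self.ss, point):
            mean = (ls + x) / w
            var += (ss + x * x) / w - mean * mean
        return max(var, 0.0)

    def insert(self, point):
        self.w += 1.0
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]


class DenStreamSketch:
    def __init__(self, max_variance, min_weight, fading=0.01):
        self.max_variance = max_variance
        self.min_weight = min_weight
        self.fading = fading
        self.clusters = []

    def _closest(self, point, outlier):
        candidates = [mc for mc in self.clusters if mc.outlier == outlier]
        if not candidates:
            return None
        return min(candidates, key=lambda mc: math.dist(mc.center(), point))

    def process(self, point):
        factor = 2.0 ** (-self.fading)
        for mc in self.clusters:                   # fading of the micro-clusters
            mc.fade(factor)
        for outlier in (False, True):              # regular micro-clusters first, then outliers
            mc = self._closest(point, outlier)
            if mc is not None and mc.variance_with(point) <= self.max_variance:
                mc.insert(point)
                if mc.outlier and mc.w >= self.min_weight:
                    mc.outlier = False             # the outlier becomes a regular micro-cluster
                return mc
        mc = MicroCluster(point, outlier=True)     # otherwise a new outlier is created
        self.clusters.append(mc)
        return mc


model = DenStreamSketch(max_variance=1.0, min_weight=1.5, fading=0.01)
model.clusters.append(MicroCluster((0.0, 0.0), outlier=False))   # stands in for the initialization
for point in [(10.0, 10.0), (0.1, 0.0), (10.1, 10.0)]:
    model.process(point)
print([(round(mc.w, 2), mc.outlier) for mc in model.clusters])
```

The three points reproduce the three cases of the figures: creation of an outlier micro-cluster, assignment to a regular micro-cluster, and promotion of an outlier micro-cluster.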

Page 69: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

How to exploit this summary to estimate the density of the data stream?

… the example of the Parzen window estimator [5] …

Page 70: Data Stream Processing and Analytics

Micro-clustering based summaries: DenStream [4]

Adapted Parzen window [5]: [formula not recovered from the slide]

• W: total weight of the data stream
• C: number of micro-clusters
• wj: weight of the j-th micro-cluster
• rj: standard deviation of the j-th micro-cluster
• [symbol not recovered]: smoothing parameter

Law of total variance

Hypothesis: each tuple represents a set of non-observed tuples, with a fixed count and a standard deviation equal to [symbol not recovered].

Page 71: Data Stream Processing and Analytics

Conclusion

Main ideas to retain:
• Summaries make it possible to process data streams with a very high emission rate,
• while using limited hardware resources (CPU, RAM).
• In most cases, a trade-off must be reached between the accuracy and the available memory.
• There are two types of summary (specific and generic).
• Limitation: most generic summaries involve user parameters.

Page 72: Data Stream Processing and Analytics

References

[1] P. Flajolet and G. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.

[2] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.

[3] B. Csernel, F. Clerot, and G. Hebrail. StreamSamp: Datastream clustering over tilted windows through sampling. In ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams, 2006.

[4] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SIAM Conference on Data Mining, pp. 328–339, 2006.

[5] A. Bondu, B. Grossin, and M.L. Picard. Density estimation on data streams: an application to change detection. In EGC (Extraction et Gestion de la connaissance), 2010.

Related documents:

Thesis of Gasbi, N. Extension et interrogation des resumes de flux de donnees, 2011.

Page 73: Data Stream Processing and Analytics

Outline

• Introduction on data streams
• Part 1: Querying
• Part 2: Unsupervised Learning
• Part 3: Supervised Learning
• Tutorial: Let’s try a CEP!

Page 74: Data Stream Processing and Analytics

From Batch mode to Online Learning

What is supervised learning?
• Output: prediction of a target variable for new observations
• Data: a supervised model is learned from labeled examples
• Objective: learn regularities from the training set and generalize them (with parsimony)

Several types of supervised models:
• Categorical target variable -> Classifier (in this talk …)
• Numeric target variable -> Regression
• Time series -> Forecasting

Page 75: Data Stream Processing and Analytics

From Batch mode to Online Learning

Training set:

Var 1 | Var 2 | … | Class
O     | 12    | … | A
Y     | 98    | … | B
Y     | 4     | … | A

[Figure: the classifier takes Var 1, Var 2, …, Var k as inputs and outputs the class A / B.]

A learning algorithm exploits the training set to automatically adjust the classifier.

Page 76: Data Stream Processing and Analytics

From Batch mode to Online Learning

Batch mode learning:
• An entire dataset is available
• The examples can be processed several times
• Weak constraint on the computing time
• The distribution of data does not change

Anytime learning algorithm:
• Can be interrupted before its end
• Returns a valid classifier at any time
• Is expected to find better and better classifiers
• Relevant for time-critical applications

Page 77: Data Stream Processing and Analytics

From Batch mode to Online Learning

Incremental learning algorithm:
• Only a single pass on the training examples is required.
• The classifier is updated at each example.
• Avoids the exhaustive storage of the examples in RAM.
• Relevant for time-critical applications and for progressively recorded data.

Online learning algorithm:
• The training set is replaced by an input data stream.
• The classifier is continually updated over time,
• by exploiting the current tuple,
• with a very low latency.
• The distribution of data can change over time (concept drift).

Page 78: Data Stream Processing and Analytics

Implementation of online classifiers

[Figure sequence (pages 78 to 81): an online classifier placed between an input stream and an output stream.]

1) Input stream: explanatory variables. Output stream: predicted labels.
2) Second input stream: real labels. The real and predicted labels are compared, and the classifier is updated.
3) The comparison also feeds an evaluation of the performance over time.
4) In practice, the stream of real labels may be delayed: an online classifier predicts the class label of tuples before receiving the true label.

Page 82: Data Stream Processing and Analytics

Implementation of online classifiers

Example: online advertising targeting

[Figure sequence (pages 82 to 84): an online classifier fed with User and Ad icons, estimating a conditional probability P( | ).]

1) Input tuples: “User – Ad” pairs. Output tuples: estimated probability that a User clicks on an Ad.
2) The Ad selected by ArgMax(Ads) is sent to the browser, and the system waits for a click.
3) After a fixed delay, the real label is known ($ if clicked) and is used to update the classifier.

Page 85: Data Stream Processing and Analytics

Evaluation of online classifiers

A – Holdout Evaluation

• The stream of labeled tuples is split between the update of the classifier and its evaluation
• Evaluation on the recent past: sliding window [t − k, t]
• Use of standard evaluation criteria (Accuracy, BER, Lift curve, AUC, etc.)
• Unbiased evaluation

Page 86: Data Stream Processing and Analytics

Evaluation of online classifiers

B – Prequential Evaluation

• Each labeled tuple is used twice: 1 - online evaluation of the prediction, 2 - update of the classifier.
• The losses are accumulated: $S = \sum_{i=1}^{n} L(y_i, \hat{y}_i)$
• S is computed either from the beginning of the stream, or on the recent past (buffer on a sliding window).
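The test-then-train loop of the prequential evaluation can be sketched as follows. Everything specific here is an assumption made for illustration: the toy `MajorityClassClassifier`, the 0/1 loss used for L, and the function name `prequential_loss`, which returns the mean loss either from the beginning of the stream (`window=None`) or over the most recent `window` tuples.

```python
from collections import Counter, deque


class MajorityClassClassifier:
    """Toy online classifier: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1


def prequential_loss(stream, classifier, window=None):
    """Mean 0/1 loss; each labeled tuple is first used for evaluation, then for the update."""
    losses = deque(maxlen=window)     # window=None keeps every loss since the beginning
    for x, y in stream:
        y_hat = classifier.predict(x)             # 1 - evaluation
        losses.append(0 if y_hat == y else 1)
        classifier.update(x, y)                   # 2 - update
    return sum(losses) / len(losses) if losses else 0.0


stream = [((i,), "A") for i in range(4)]
print(prequential_loss(stream, MajorityClassClassifier()))
print(prequential_loss(stream, MajorityClassClassifier(), window=2))
```

The sliding-window variant forgets the early mistakes of the classifier, which makes it more responsive when the distribution of the stream changes.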

Page 87: Data Stream Processing and Analytics

Evaluation of online classifiers

C – Kappa Statistic

K = (p0 − pc) / (1 − pc)

• p0: prequential accuracy of the classifier
• pc: probability that a random classifier makes a correct prediction
• K = 1 if the classifier is always correct
• K = 0 if the predictions coincide with the correct ones as often as those of the random classifier
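A minimal computation of the Kappa statistic is sketched below. The slide does not specify the random classifier; this sketch assumes, as in Cohen's kappa, that it predicts each class with the same frequency as the evaluated classifier, so that pc is the sum over classes of (frequency of the true label) × (frequency of the predicted label). The function name `kappa` is hypothetical.

```python
from collections import Counter


def kappa(y_true, y_pred):
    """Kappa statistic K = (p0 - pc) / (1 - pc) for two label sequences of equal length."""
    n = len(y_true)
    # p0: observed accuracy of the classifier
    p0 = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred) / n
    # pc: accuracy of a random classifier with the same distribution of predictions (assumption)
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    pc = sum((true_freq[c] / n) * (pred_freq[c] / n) for c in true_freq)
    return (p0 - pc) / (1 - pc)


print(kappa(["A", "B", "A", "B"], ["A", "B", "A", "B"]))   # classifier always correct
print(kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"]))   # no better than chance
```

The statistic is undefined when pc equals 1, i.e. when a single class is both always observed and always predicted.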

Page 88: Data Stream Processing and Analytics

Concept drift

What does it mean?
• The input stream is not stationary
• The distribution of data changes over time
• Two strategies: adaptive learning or drift detection
• Several types of concept drift, based on the decomposition P(x,y) = P(x) . P(y|x):
– Virtual drift (or covariate shift) [B]: P(x) changes
– Concept shift [A]: P(y|x) changes

[Figure: the original data compared with the data after each type of drift.]

Page 89: Data Stream Processing and Analytics

Concept drift

What kinds of drift can be expected [C]?
• Abrupt
• Gradual
• Incremental
• Reoccurring

Two families of approaches: online adaptive learning; drift detection & model management.

Page 90: Data Stream Processing and Analytics

Concept drift

Some specific constraints to manage:
• Adapt to concept drift as soon as possible
• Distinguish noise from changes (robust to noise, adaptive to changes)
• Recognize and react to reoccurring contexts
• Adapt with limited hardware resources (CPU, RAM, I/O)

Page 91: Data Stream Processing and Analytics

Incremental classifier: the example of the Hoeffding Trees [D]

Recall on decision trees:
• Each node contains a test on an attribute
• Each branch corresponds to a possible outcome of the test
• Each leaf contains a class prediction

A decision tree is progressively learned by replacing leaves with test nodes, starting at the root.

[Figure: example tree with the test nodes “Age<30?” and “Car Type=Sports Car?”, Yes/No branches, and leaves predicting the classes A and B.]

Page 92: Data Stream Processing and Analytics

Incremental classifier: the example of the Hoeffding Trees [D]

How is an incremental decision tree learned?
• Single-pass algorithm,
• with a low latency,
• which avoids the exhaustive storage of training examples in RAM.
• The drift is not managed.

Input stream of labeled examples (training examples are processed one by one):

Var 1 | Var 2 | … | Class
O     | 12    | … | A
Y     | 98    | … | B
Y     | 4     | … | A

Page 93: Data Stream Processing and Analytics

Incremental classifier

Key ideas :

The best attribute at a node is found by exploiting a small subset of the labeled examples that pass through that node :•The first examples are exploited to choose the root attribute•Then, the other examples are passed down to the corresponding leaves•The attributes to be split are recursively chosen …

The Hoeffding bound answers the question : How many examples are required to split an attribute ?

[Figure : the input stream reaches the root test "Age<30?" and is split into a "Yes" sub-stream and a "No" sub-stream, which feed the tests "Car Type=Sports Car?" and "Status=Married?"]

The example of the Hoeffding Trees [D]


Page 94: Data Stream Processing and Analytics

• Consider a random variable a whose range is R
• Suppose we have n observations of a
• The mean value is denoted by $\bar{a}$
• The Hoeffding bound states :

With probability $1 - \delta$, the true mean of a is at least $\bar{a} - \epsilon$, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
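To get a feeling for the orders of magnitude, the bound can be evaluated numerically. The sketch below assumes the usual form of the bound given in [D], ε = sqrt(R² ln(1/δ) / (2n)); the function name `hoeffding_epsilon` and the chosen values of R, δ and n are hypothetical.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean is at
    least the observed mean minus epsilon after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Epsilon shrinks as 1 / sqrt(n): more observations give a tighter bound.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_epsilon(value_range=1.0, delta=1e-7, n=n), 4))
```

Note that the bound does not depend on the distribution of a, only on its range R, which is what makes it usable on a stream whose distribution is unknown.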

Incremental classifier

The Hoeffding bound :

The example of the Hoeffding Trees [D]


Page 95: Data Stream Processing and Analytics

• Let G(Xi) be the criterion used to choose the test attributes (e.g. Information Gain, Gini)

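• Xa : the attribute with the highest observed value $\bar{G}$ after having processed n examples.
• Xb : the attribute with the second highest observed value $\bar{G}$ after having processed n examples.

• If, after having processed n examples at a node, $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) > \epsilon$ (where $\epsilon$ is the Hoeffding bound computed for n examples), the Hoeffding bound guarantees that the true $\Delta G \geq \Delta\bar{G} - \epsilon > 0$ with probability $1 - \delta$.
• This node can then be split using Xa; the succeeding examples will be passed to the new leaves.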
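The split decision described on this slide reduces to one comparison. The following sketch assumes the form ε = sqrt(R² ln(1/δ) / (2n)) of the bound from [D]; the function name `should_split`, the default values and the criterion values 0.30 and 0.22 are hypothetical. With the information gain and two classes, the range R of the criterion is 1.

```python
import math

def should_split(g_best, g_second, n, value_range=1.0, delta=1e-7):
    """Return True when the observed gap between the two best attributes
    exceeds the Hoeffding epsilon computed for n examples."""
    epsilon = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    return (g_best - g_second) > epsilon

# Hypothetical observed criterion values for Xa (0.30) and Xb (0.22).
print(should_split(0.30, 0.22, n=200))    # too few examples for this gap
print(should_split(0.30, 0.22, n=5000))   # enough examples: split on Xa
```

The smaller the gap between the two best attributes, the more examples are needed before the node can be split.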

Incremental classifier

How many examples ?

The example of the Hoeffding Trees [D]


Page 96: Data Stream Processing and Analytics

Incremental classifier

The algorithm

Input stream
• Find the two best attributes
• Check the condition

If not satisfied :
• Keep reading examples from the input stream

If satisfied :
• Create a new test at the current node
• Split the stream of examples
• Create 2 new leaves
• Recursively run the algorithm on new leaves

[Figure : tree under construction with the tests "Age<30?", "Car Type=Sports Car?" and "Status=Married?"]
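The loop above can be sketched for a single leaf. This is a simplified illustration, not the algorithm of [D] as published: it assumes categorical attributes, the information gain as criterion, two classes (range R = 1) and the form ε = sqrt(R² ln(1/δ) / (2n)) of the bound. The class name `HoeffdingLeaf` and the toy stream are hypothetical, and the creation of the two new leaves is left out.

```python
import math
from collections import defaultdict

def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in class_counts.values() if c > 0)

class HoeffdingLeaf:
    """Leaf that stores counts only (no examples) and decides when to split."""

    def __init__(self, attributes, delta=1e-7, value_range=1.0):
        self.attributes = attributes
        self.delta, self.value_range = delta, value_range
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[attribute][value][label] = number of examples seen
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}

    def update(self, example, label):
        self.n += 1
        self.class_counts[label] += 1
        for a in self.attributes:
            self.counts[a][example[a]][label] += 1

    def information_gain(self, attribute):
        remainder = sum(sum(by_label.values()) / self.n * entropy(by_label)
                        for by_label in self.counts[attribute].values())
        return entropy(self.class_counts) - remainder

    def attribute_to_split(self):
        """Best attribute if the Hoeffding condition holds, else None."""
        if self.n == 0 or len(self.attributes) < 2:
            return None
        ranked = sorted(self.attributes, key=self.information_gain, reverse=True)
        gap = self.information_gain(ranked[0]) - self.information_gain(ranked[1])
        epsilon = math.sqrt(self.value_range ** 2
                            * math.log(1 / self.delta) / (2 * self.n))
        return ranked[0] if gap > epsilon else None

# Toy stream: the class depends on "young" only, "sports" is uninformative.
leaf = HoeffdingLeaf(["young", "sports"])
for i in range(200):
    young, sports = i % 2 == 0, i % 4 < 2
    leaf.update({"young": young, "sports": sports}, "A" if young else "B")
    if leaf.attribute_to_split() is not None:
        break                      # here the leaf would become a test node
print(leaf.n, leaf.attribute_to_split())
```

Only counters are kept in the leaf, so the memory used does not grow with the number of examples read.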

The example of the Hoeffding Trees [D]

This algorithm has been adapted in order to manage concept drift [E] :
• An incremental tree is maintained on a sliding window, which allows the old tuples to be forgotten
• A collection of alternative sub-trees is maintained in memory and used in case of drift


Page 97: Data Stream Processing and Analytics

Incremental classifier

To learn more about incremental decision trees …

The example of the Hoeffding Trees [D]

The 4 essential elements :

• Bounds – How many examples before cutting an attribute ?
• Split criterion – Which attribute and which cut point ?
• Summaries in the leaves – How to manage high speed data streams ?
• Local models – How to improve the classifier ?


Page 98: Data Stream Processing and Analytics

Incremental classifier

To learn more about incremental decision trees …

The example of the Hoeffding Trees [D]

Examples of summaries in the leaves :

• Numerical attributes
– Exhaustive counts [Gama2003]
– Incremental discretization [Gama2006]
– Gaussian approximation [Pfahringer2008]
– Quantile-based summary [GK2001]

• Categorical attributes
– For each categorical variable and for each modality, the number of occurrences is stored
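Two of these summaries are simple enough to sketch. The code below is an illustration under assumptions, not the implementation of the cited papers: it keeps one Gaussian per class for a numerical attribute (updated with Welford's running mean and variance) and per-class occurrence counts for a categorical attribute. The class names and the three-example stream are hypothetical.

```python
from collections import defaultdict

class GaussianSummary:
    """Running mean and variance of a numerical attribute (Welford's update)."""

    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self._m2 += d * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

class CategoricalSummary:
    """Number of occurrences of each modality, per class."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, modality, label):
        self.counts[label][modality] += 1

# Hypothetical stream of (categorical value, numerical value, class).
stream = [("O", 12, "A"), ("Y", 98, "B"), ("Y", 4, "A")]
var1 = CategoricalSummary()
var2 = defaultdict(GaussianSummary)      # one Gaussian per class (assumption)
for v1, v2, label in stream:
    var1.update(v1, label)
    var2[label].update(v2)

print(dict(var1.counts["A"]), var2["A"].mean, var2["A"].variance)
```

Both summaries are updated in constant time per example and never store the examples themselves.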


Page 99: Data Stream Processing and Analytics

Drift detection

General schema :

Fixed classifier (applied online) → Drift detection

If detected :
• Train a new classifier on the recent past
• Adapt the size of the history
• Replace the classifier


Page 100: Data Stream Processing and Analytics


Drift detection

How to detect the drift ?

Based on the online evaluation :

• Main idea : if the performance of the classifier changes, that means a drift is occurring ...

• For instance : if the error rate increases, the size of the sliding window is decreased and the classifier is retrained [F].

• Limitation : the user has to define a threshold
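A detector based on the online evaluation can be sketched in a few lines. This is a deliberately simplified illustration, not the window-adaptation method of [F]: it only raises a signal when the error rate over the last predictions exceeds the user-defined threshold. The class name `ErrorRateDriftDetector`, the window size, the threshold and the toy stream are hypothetical.

```python
from collections import deque

class ErrorRateDriftDetector:
    """Signals a drift when the error rate over the last `window` predictions
    exceeds a user-defined threshold."""

    def __init__(self, window=50, threshold=0.3):
        self.errors = deque(maxlen=window)   # sliding window of 0/1 errors
        self.threshold = threshold

    def add(self, predicted, actual):
        self.errors.append(predicted != actual)
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold

# Toy stream: the fixed classifier always predicts "A"; the true label
# switches from "A" to "B" at t = 200 (abrupt drift).
detector = ErrorRateDriftDetector(window=50, threshold=0.3)
detected_at = None
for t in range(400):
    actual = "A" if t < 200 else "B"
    if detector.add("A", actual):
        detected_at = t
        break
print(detected_at)
```

The detection delay depends directly on the window size and on the threshold, which illustrates the limitation mentioned above.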

[Figure : loop between the learning algorithm, the classifier ("Update"), its online evaluation (plot of "Perf" over "Time") and the drift detection ("If detected")]

Page 101: Data Stream Processing and Analytics

Drift detection

How to detect the drift ?

Based on the distribution of tuples :

• Main idea : if the distributions of the “current window” and the “reference window” are significantly different, that means a drift is occurring …

[Figure : a reference window and a current window placed on the time axis]


Page 102: Data Stream Processing and Analytics

Drift detection

How to detect the drift ?

Based on the distribution of tuples :

• Statistical tests [G] can be used to compare the two distributions :
– Welch test – are the mean values the same ?
– Kolmogorov-Smirnov test – do both samples of tuples come from the same distribution ?

• A classifier can be exploited to discriminate the tuples belonging to the two windows [H] :
– If the quality of the classifier is good, that means a drift is occurring …
– Explanatory variables : X
– Target variable : W (the window)
– Detection of covariate shift : P(X)

• In [I] a classifier is also exploited, and the class value is considered as an additional input variable :
– Explanatory variables : X and Y
– Target variable : W (the window)
– Detection of concept shift : P(Y|X)
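The two test statistics mentioned on this slide can be computed directly on a reference window and a current window. The sketch below is written in plain Python under assumptions: the function names and the toy windows are hypothetical, and only the statistics are computed. Turning them into a decision requires critical values or p-values, which are left out here.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for the difference of means (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical cumulative distribution functions of a and b."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

# Toy windows of one numerical variable.
reference = [float(i % 10) for i in range(100)]        # values 0..9
same = [float((i + 3) % 10) for i in range(100)]       # same distribution
shifted = [float(i % 10) + 5.0 for i in range(100)]    # mean moved by +5

print(welch_t(reference, same), ks_statistic(reference, same))
print(welch_t(reference, shifted), ks_statistic(reference, shifted))
```

Both statistics stay at zero when the current window follows the reference distribution and move far away from zero after the shift.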


Page 103: Data Stream Processing and Analytics

Conclusion

Examples of incremental learning algorithms :

• Decision Tree
– ID4 (Schlimmer - ML’86)
– ID5/ITI (Utgoff – ML’97)
– SPRINT (Shafer - VLDB’96)

• Naive Bayes
– Incremental (for the standard NB)
– Learns quickly, with a low variance (Domingos – ML’97)
– Can be combined with a decision tree : NBTree (Kohavi – KDD’96)

• Neural Networks
– IOLIN (Cohen - TDM’04)
– learn++ (Polikar - IJCNN’02), …

• Support Vector Machine
– TSVM (Transductive SVM – Klinkenberg IJCAI’01)
– PSVM (Proximal SVM – Mangasarian KDD’01), …
– LASVM (Bordes 2005)

• Rules based systems
– AQ15 (Michalski - AAAI’86), AQ-PM (Maloof/Michalski - ML’00)
– STAGGER (Schlimmer - ML’86)
– FLORA (Widmer - ML’96)


Page 104: Data Stream Processing and Analytics

Conclusion

Examples of online learning algorithms :

• Rules
– FACIL (Ferrer-Troyano – SAC’04,05,06)

• Ensemble
– SEA (Street - KDD’01) based on C4.5

• K-nn
– ANNCAD (Law – LNCS’05)
– IBLS-Stream (Shaker et al – Evolving Systems journal, 2012)

• SVM
– CVM (Tsang – JMLR’06)

• Decision Tree
– Domingos : VFDT (KDD’00), CVFDT (KDD’01)
– Gama : VFDTc (KDD’03), UFFT (SAC’04)
– Kirkby : Ensemble of Hoeffding Trees (KDD’09)
– del Campo-Avila : IADEM (LNCS’06)

An extensive literature


Page 105: Data Stream Processing and Analytics

Conclusion

Main ideas to retain :

• Online learning algorithms are designed in accordance with specific constraints :
– One pass
– Low latency
– Adaptive … etc

• In practice the true labels are delayed : an online classifier predicts the labels before observing them

• The evaluation of the classifiers is specific to data stream processing

• The distribution of the tuples may change over time :
– Some approaches detect the drifts, and then update the classifier (abrupt drift)
– Other approaches progressively adapt the classifier (incremental drift)

• In practice, the type of expected drift must be known in order to choose an appropriate approach

• The distinction between noise and drifts can be viewed as a plasticity / stability dilemma

Page 106: Data Stream Processing and Analytics

References

[A] Salganicoff, M. Density-adaptive learning and forgetting. In Proc. of the Int. Conf. on Mach. Learn. (ICML), 276–283, 1993.

[B] Widmer, G. and Kubat, M. Effective learning in dynamic environments by explicit context tracking. In Proc. of the Eur. Conf. on Mach. Learn. (ECML), 227–243, 1993.

[C] Zliobaite, I. Learning under Concept Drift: an Overview.

[D] Domingos, P. and Hulten, G. Mining High-Speed Data Streams. KDD 2000.

[E] Bifet, A. and Gavalda, R. Adaptive Parameter-free Learning from Evolving Data Streams. In Advances in Intelligent Data Analysis VIII, Springer, 2009.

[F] Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.

[G] Kanji, G. K. 100 Statistical Tests, volume 129. Sage, 2006.

[H] Bondu, A., Grossin, B. and Picard, M.L. Density estimation on data streams: an application to Change Detection. In EGC (Extraction et Gestion de la connaissance), 2010.

[I] Salperwyck, C., Boulle, M. and Lemaire, V. Grille bivariée pour la détection de changement dans un flux étiqueté. EGC 2013.

Related documents :

Thesis of Salperwyck, C. Apprentissage incrémental en-ligne sur flux de données, 2012


Page 107: Data Stream Processing and Analytics

Outline

• Introduction on data-streams
• Part 1 : Querying
• Part 2 : Unsupervised Learning
• Part 3 : Supervised Learning
• Tutorial : Let’s try a CEP !


Page 108: Data Stream Processing and Analytics

Thanks to Benoît Grossin

Tutorial

An introduction to StreamBase

[Figure : StreamBase screenshot with four annotated areas : "Files of your project", "Drag and drop operators", "Running graph of operators", "Run your project"]


Page 109: Data Stream Processing and Analytics

Tutorial

An introduction to StreamBase

Used data :

• Open Data
• Provided by EnerNOC (a provider of energy intelligence software)
• Two data files :
– Information on 100 industrial sites
– 5-minute electricity consumption of these sites over 2 years

Objective of this tutorial :

• Detect the peaks of consumption
• Implement an “energy efficiency score” for each type of industry


Page 110: Data Stream Processing and Analytics

Tutorial

An introduction to StreamBase

Overview :
