iot big data stream mining tutorial · iot big data stream mining tutorial bda 2017 albert bifet...
TRANSCRIPT
![Page 1: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/1.jpg)
IoT Big Data Stream Mining
Tutorial BDA 2017
Albert Bifet (@abifet), Gianmarco De Francisci Morales,
Joao Gama and Wei Fan
![Page 2: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/2.jpg)
Outline• IoT Fundamentals of
Stream Mining
• IoT Setting
• Classification
• Concept Drift
• Regression
• Clustering
• Frequent Itemset Mining
• IoT Distributed
Stream Mining
• Distributed Stream Processing Engines
• Classification
• Regression
• Open Source Tools
• Applications
• Conclusions
2
![Page 3: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/3.jpg)
IoT Fundamentals of
Stream MiningPart I
![Page 4: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/4.jpg)
IoT Setting
4
![Page 5: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/5.jpg)
INTERNET OF THINGS
IoT: sensors and actuators connected by networks to computing systems.
• Gartner predicts 20.8 billion IoT devices by 2020.• IDC projects 32 billion IoT devices by 2020
![Page 6: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/6.jpg)
Applications IoT Analytics
6
![Page 7: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/7.jpg)
Applications IoT Analytics
7
![Page 8: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/8.jpg)
IoT versus Big Data
8
![Page 9: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/9.jpg)
IoT Applications For Energy Management
9
![Page 10: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/10.jpg)
IoT Applications For Connected/Smart Home
10
![Page 11: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/11.jpg)
IoT Applications For Smart Cities
11
![Page 12: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/12.jpg)
IoT Applications For Industrial Automation
12
![Page 13: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/13.jpg)
Analytic Standard ApproachFinite training sets
Static models13
Data Set
Model
Classifier Algorithm builds Model
![Page 14: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/14.jpg)
Data Stream ApproachInfinite training sets
Dynamic models14
D
M
Update Model
D
M
D
M
D
M
D
M
D
M
D
M
D
M
D
M
D
M
D
M
D
M
![Page 15: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/15.jpg)
Pain Points
• Need to retrain!
• Things change over time
• How often?
• Data unused until next update!
• Value of data wasted
15
![Page 16: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/16.jpg)
IoT Stream Mining
• Maintain models online
• Incorporate data on the fly
• Unbounded training sets
• Resource efficient
• Detect changes and adapts
• Dynamic models
16
![Page 17: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/17.jpg)
Streaming Algorithms
Example
17
![Page 18: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/18.jpg)
Streaming Algorithms
Example
18
![Page 19: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/19.jpg)
Streaming Algorithms
Examp
19
![Page 20: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/20.jpg)
Streaming Algorithms
Examp
20
![Page 21: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/21.jpg)
Approximation Algorithms
• General idea, good for streaming algorithms
• Small error ε with high probability 1-δ
• True hypothesis H, and learned hypothesis Ĥ
• Pr[ |H - Ĥ| < ε|H| ] > 1-δ
21
![Page 22: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/22.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
22
![Page 23: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/23.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
23
![Page 24: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/24.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
24
![Page 25: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/25.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
25
![Page 26: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/26.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
26
![Page 27: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/27.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
27
![Page 28: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/28.jpg)
Approximation Algorithms
28
![Page 29: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/29.jpg)
Predictive Learning
29
• Classification • Regression • Concept Drift
![Page 30: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/30.jpg)
Classification
30
![Page 31: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/31.jpg)
DefinitionGiven a set of training examples belonging to nC different classes, a classifier algorithm builds a model that predicts for every unlabeled instance x the class C to which it belongs
31
Examples • Email spam filter • Activities of users smartphone
Photo: Stephen Merity http://smerity.com
![Page 32: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/32.jpg)
Process• One example at at time,
used at most once
• Limited memory
• Limited time
• Anytime prediction
32
![Page 33: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/33.jpg)
• Based on Bayes’ theorem
• Probability of observing feature xi given class C
• Prior class probability P(C)
• Just counting!
Naïve Bayes
33
posterior =likelihood× prior
evidence
P (C|x) =P (x|C)P (C)
P (x)
P (C|x) ∝Y
xi∈x
P (xi|C)P (C)
C = argmaxC
P (C|x)
![Page 34: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/34.jpg)
Attribute 1
Attribute 2
Attribute 3
Attribute 4
Attribute 5
Output h~w (~xi)
w1
w2
w3
w4
w5
Perceptron• Linear classifier
• Data stream: ⟨xi,yi⟩
• ỹi = hw(xi) = σ(wiT xi)
• σ(x) = 1/(1+e-x) σʹ=σ(x)(1-σ(x))
• Minimize MSE J(w)=½∑(yi-ỹi)2
• SGD wi+1 = wi - η∇J xi
• ∇J = -(yi-ỹi)ỹi(1-ỹi)
• wi+1 = wi + η(yi-ỹi)ỹi(1-ỹi)xi
34
![Page 35: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/35.jpg)
Perceptron Learning
35
PERCEPTRON LEARNING(Stream, ⌘)
1 for each class
2 do PERCEPTRON LEARNING(Stream, class, ⌘)
PERCEPTRON LEARNING(Stream, class, ⌘)
1 ⇤ Let w0 and ~w be randomly initialized
2 for each example (~x , y) in Stream
3 do if class = y
4 then δ = (1 − h~w (~x)) · h~w (~x) · (1 − h~w (~x))5 else δ = (0 − h~w (~x)) · h~w (~x) · (1 − h~w (~x))6 ~w = ~w + ⌘ · δ · ~x
PERCEPTRON PREDICTION(~x)
1 return arg maxclass h~wclass(~x)
![Page 36: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/36.jpg)
Decision Tree• Each node tests a features
• Each branch represents a value
• Each leaf assigns a class
• Greedy recursive induction
• Sort all examples through tree
• xi = most discriminative attribute
• New node for xi, new branch for each value, leaf assigns majority class
• Stop if no error | limit on #instances
36
RoadTested?
Mileage?
Age?
NoYes
High
✅
❌
Low
OldRecent
✅ ❌
Car deal?
![Page 37: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/37.jpg)
Very Fast Decision Tree
• AKA, Hoeffding Tree
• A small sample can often be enough to choose a near optimal decision
• Collect sufficient statistics from a small set of examples
• Estimate the merit of each alternative attribute
• Choose the sample size that allows to differentiate between the alternatives
37
Pedro Domingos, Geoff Hulten: “Mining high-speed data streams”. KDD ’00
![Page 38: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/38.jpg)
Leaf Expansion
• When should we expand a leaf?
• Let x1 be the most informative attribute, x2 the second most informative one
• Is x1 a stable option?
• Hoeffding bound
• Split if G(x1) - G(x2) > ε =
r
R2 ln(1/δ)
2n
38
![Page 39: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/39.jpg)
HT Induction
39
HT(Stream, δ)
1 ⇤ Let HT be a tree with a single leaf(root)
2 ⇤ Init counts nijk at root
3 for each example (x , y) in Stream
4 do HTGROW((x , y),HT , δ)
HTGROW((x , y),HT , δ)
1 ⇤ Sort (x , y) to leaf l using HT
2 ⇤ Update counts nijk at leaf l
3 if examples seen so far at l are not all of the same class
4 then ⇤ Compute G for each attribute
5 if G(Best Attr.)−G(2nd best) >
q
R2 ln 1/δ2n
6 then ⇤ Split leaf on best attribute
7 for each branch
8 do ⇤ Start new leaf and initiliatize counts
![Page 40: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/40.jpg)
Properties• Number of examples to expand node depends only on
Hoeffding bound (ε decreases with √n)
• Low variance model (stable decisions with statistical support)
• Low overfitting (examples processed only once, no need for pruning)
• Theoretical guarantees on error rate with high probability
• Hoeffding algorithms asymptotically close to batch learner.Expected disagreement δ/p (p = probability instance falls into a leaf)
• Ties: broken when ε < τ even if ΔG < ε
40
![Page 41: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/41.jpg)
Regression
41
![Page 42: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/42.jpg)
DefinitionGiven a set of training examples with a numeric label, a regression algorithm builds a model that predicts for every unlabeled instance x the value with high accuracy
y=ƒ(x)
42
Examples • Stock price • Airplane delay
Photo: Stephen Merity http://smerity.com
![Page 43: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/43.jpg)
Attribute 1
Attribute 2
Attribute 3
Attribute 4
Attribute 5
Output h~w (~xi)
w1
w2
w3
w4
w5
Perceptron
• Linear regressor
• Data stream: ⟨xi,yi⟩
• ỹi = hw(xi) = wT xi
• Minimize MSE J(w)=½∑(yi-ỹi)2
• SGD w' = w - η∇J xi
• ∇J = -(yi-ỹi)
• w' = w + η(yi-ỹi)xi
43
![Page 44: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/44.jpg)
Regression Tree
• Same structure as decision tree
• Predict = average target value or linear model at leaf (vs majority)
• Gain = reduction in standard deviation (vs entropy)
44
σ =q
X
(yi − yi)2/(N − 1)
![Page 45: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/45.jpg)
Rules• Problem: very large decision trees
have context that is complex and hard to understand
• Rules: self-contained, modular, easier to interpret, no need to cover universe
• � keeps sufficient statistics to:
• make predictions
• expand the rule
• detect changes and anomalies
45
![Page 46: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/46.jpg)
E.g: x = [4,−1, 1, 2]
f (x) =X
Rl∈S(xi )
θl yl ,
Adaptive Model Rules
• Ruleset: ensemble of rules
• Rule prediction: mean, linear model
• Ruleset prediction
• Weighted avg. of predictions of rules covering instance x
• Weights inversely proportional to error
• Default rule covers uncovered instances
46
E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams." ECML-PKDD ‘13
![Page 47: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/47.jpg)
Algorithm 1: Training AMRules
Input: S: Stream of examples
beginR ← {}, D ← 0
foreach (x, y) ∈ S do
foreach Rule r ∈ S(x) do
if ¬IsAnomaly(x, r ) then
if PHTest(errorr , λ) thenRemove the rule from R
elseUpdate sufficient statistics Lr
ExpandRule(r)
if S(x) = ∅ thenUpdate LD
ExpandRule(D)if D expanded then
R ← R∪ D
D ← 0
return (R, LD)
AMRules Induction
• Rule creation: default rule expansion
• Rule expansion: split on attribute maximizing σ reduction
• Hoeffding bound ε
• Expand when σ1st/σ2nd < 1 - ε
• Evict rule when P-H test error large
• Detect and explain local anomalies
47
=
r
R2 ln(1/δ)
2n
![Page 48: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/48.jpg)
Concept Drift
48
![Page 49: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/49.jpg)
DefinitionGiven an input sequence ⟨x1,x2,…,xt⟩, output at instant
t an alarm signal if there is a distribution change, and a prediction xt+1 minimizing the error |xt+1 − xt+1|
49
Outputs • Alarm indicating change • Estimate of parameter
Photo: http://www.logsearch.io
![Page 50: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/50.jpg)
-xt
Estimator
- -Alarm
ChangeDetector
-Estimation
Memory-
6
6?
Application• Change detection on
evaluation of model
• Training error should decrease with more examples
• Change in distribution of training error
• Input = stream of real/binary numbers
• Trade-off between detecting true changes and avoiding false alarms
50
![Page 51: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/51.jpg)
Cumulative Sum
• Alarm when mean of input data differs from zero
• Memoryless heuristic (no statistical guarantee)
• Parameters: threshold h, drift speed v
• g0 = 0, gt = max(0, gt-1 + εt - v)
• if gt > h then alarm; gt = 0
51
![Page 52: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/52.jpg)
Page-Hinckley Test
• Similar structure to Cumulative Sum
• g0 = 0, gt = gt-1 + (εt - v)
• Gt = mint(gt)
• if gt - Gt > h then alarm; gt = 0
52
![Page 53: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/53.jpg)
Number of examples processed (time)E
rro
r ra
te
concept drift
pmin
+ smin
Drift level
Warning level
0 50000
0.8
new window
Statistical Process Control
• Monitor error in sliding window
• Null hypothesis: no change between windows
• If error > warning levellearn in parallel new modelon the current window
• if error > drift levelsubstitute new model for old
53
J Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with Drift Detection”. SBIA '04
![Page 54: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/54.jpg)
Concept-adapting VFDT• Model consistent with sliding window on stream
• Keep sufficient statistics also at internal nodes
• Recheck periodically if splits pass Hoeffding test
• If test fails, grow alternate subtree and swap-inwhen accuracy of alternate is better
• Processing updates O(1) time, +O(W) memory
• Increase counters for incoming instance, decrease counters for instance going out window
54
G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01
![Page 55: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/55.jpg)
VFDTc: Adapting to Change
• Monitor error rate
• When drift is detected
• Start learning alternative subtree in parallel
• When accuracy of alternative is better
• Swap subtree
• No need for window of instances
55
J. Gama, R. Fernandes, R. Rocha: “Decision Trees for Mining Data Streams”. IDA (2006)
![Page 56: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/56.jpg)
Hoeffding Adaptive Tree• Replace frequency counters by estimators
• No need for window of instances
• Sufficient statistics kept by estimators separately
• Parameter-free change detector + estimator with theoretical guarantees for subtree swap (ADWIN)
• Keeps sliding window consistent with “no-change hypothesis”
56
A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams” IDA (2009)
A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ‘07
![Page 57: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/57.jpg)
Clustering
57
![Page 58: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/58.jpg)
DefinitionGiven a set of unlabeled instances, distribute them into homogeneous groups according to some common relations or affinities
58
Examples • Sensor segmentation • Social network communities
Photo: W. Kandinsky - Several Circles (edited)
![Page 59: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/59.jpg)
Approaches
• Distance based (CluStream)
• Density based (DenStream)
• Kernel based, Coreset based, much more…
• Most approaches combine online + offline phase
• Formally: minimize cost function over a partitioning of the data
59
![Page 60: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/60.jpg)
Micro-Clusters• AKA, Cluster Features CF
Statistical summary structure
• Maintained in online phase, input for offline phase
• Data stream ⟨xi⟩, d dimensions
• Cluster feature vector N: number of pointsLSj: sum of values (for dim. j)
SSj: sum of squared values (for dim. j)
• Easy to update, easy to merge
• Constant space irrespective to the number of examples!
60
Tian Zhang, Raghu Ramakrishnan, Miron Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96
![Page 61: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/61.jpg)
CluStream• Timestamped data stream ⟨ti, xi⟩, represented in d+1 dimensions
• Seed algorithm with q micro-clusters (k-means on initial data)
• Online phase. For each new point, either:
• Update one micro-cluster (point within maximum boundary)
• Create a new micro-cluster (delete/merge other micro-clusters)
• Offline phase. Determine k macroclusters on demand:
• K-means on micro-clusters (weighted pseudo-points)
• Time-horizon queries via pyramidal snapshot mechanism
61
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03
![Page 62: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/62.jpg)
DBSCAN• ε-n(p) = set of points at distance ≤ ε
• Core object q = ε-n(q) has weight ≥ μ
• p is directly density-reachable from q
• p ∈ ε-n(q) ∧ q is a core object
• pn is density-reachable from p1
• chain of points p1,…,pn such that pi+1 is directly d-r from pi
• Cluster = set of points that are mutually density-connected
62
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96
![Page 63: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/63.jpg)
DenStream• Based on DBSCAN
• Core-micro-cluster: CMC(w,c,r) weight w > μ, center c, radius r < ε
• Potential/outlier micro-clusters
• Online: merge point into p (or o)micro-cluster if new radius r'< ε
• Promote outlier to potential if w > βμ
• Else create new o-micro-cluster
• Offline: DBSCAN
63
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06
![Page 64: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/64.jpg)
Static Evaluation• Internal (validation)
• Sum of squared distance (point to centroid)
• Dunn index (on distance d) D = min(inter-cluster d) / max(intra-cluster d)
• External (ground truth)
• Rand = #agreements / #choices = 2(TP+TN)/(N(N-1))
• Purity = #majority class per cluster / N
64
![Page 65: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/65.jpg)
Streaming Evaluation• Clusters may: appear, fade, move, merge
• Missed points (unassigned)
• Misplaced points (assigned to different cluster)
• Noise
• Cluster Mapping Measure CMM
• External (ground truth)
• Normalized sum of penalties of these errors
65
H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer:“An effective evaluation measure for clustering on evolving data streams”. KDD ’11
![Page 66: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/66.jpg)
Frequent Itemset Mining
66
![Page 67: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/67.jpg)
DefinitionGiven a collection of sets of items, find all the subsets that occur frequently, i.e., more than a minimum support of times
67
Examples • Market basket mining • Item recommendation
![Page 68: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/68.jpg)
Fundamentals
• Dataset D, set of items t ∈ D,
constant s (minimum support)
• Support(t) = number of sets in D that contain t
• Itemset t is frequent if support(t) ≥ s
• Frequent Itemset problem:
• Given D and s, find all frequent itemsets
68
![Page 69: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/69.jpg)
Example
69
Dataset ExampleDocument Patterns
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent
d1,d2,d3,d4,d5,d6 c
d1,d2,d3,d4,d5 e,ce
d1,d3,d4,d5 a,ac,ae,ace
d1,d3,d5,d6 b,bc
d2,d4,d5,d6 d,cd
d1,d3,d5 ab,abc,abe
be,bce,abce
d2,d4,d5 de,cde
minimal support = 3
Support Frequent
6 c
5 e,ce
4 a,ac,ae,ace
4 b,bc
4 d,cd
3 ab,abc,abe
be,bce,abce
3 de,cde
![Page 70: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/70.jpg)
Variations• A priori property: t ⊆ t' ➝ support(t) ≥ support(t’)
• Closed: none of its supersets has the same support
• Can generate all freq. itemsets and their support
• Maximal: none of its supersets is frequent
• Can generate all freq. itemsets (without support)
• Maximal ⊆ Closed ⊆ Frequent
70
![Page 71: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/71.jpg)
Example
71
Dataset ExampleDocument Patterns
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
![Page 72: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/72.jpg)
Itemset Streams
• Support as fraction of stream length
• Exact vs approximate
• Incremental, sliding window, adaptive
• Frequent, closed, maximal
72
![Page 73: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/73.jpg)
Pattern mining in streamsKeydatastructure:La.ceofpa2erns,withcounts
73
{A},20 {B},18 {C},18 {D},25
{A,B},15 {A,C},12 {B,C},10 {A,D},5 {B,D},12 {C,D},12
{A,B,C},4 {A,B,D},3 {A,C,D},3 {B,C,D},8
{A,B,C,D},2 count≤7
count>7
![Page 74: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/74.jpg)
The vast majority of stream pattern mining algorithms (implicitly or explicitly) build and update the pattern lattice.
General scheme:
let L be initial, empty lattice;
forever do {
collect a batch of items of size B;
build a summary S of the batch;
merge S into L;
}
74
Pattern mining in streams
![Page 75: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/75.jpg)
Lossy Counting
• Keep data structure D with tuples (x, freq(x), error(x))
• Imagine to divide the stream in buckets of size⎡1/ε⎤
• Foreach itemset x in the stream, Bid = current sequential bucket id starting from 1
• if x ∈ D, freq(x)++
• else D ← D ∪ (x, 1, Bid - 1)
• Prune D at bucket boundaries: evict x if freq(x) + error(x) ≤ Bid
75
G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02
![Page 76: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/76.jpg)
Moment• Keeps track of boundary below frequent itemsets in a window
• Closed Enumeration Tree (CET) (~ prefix tree)
• Infrequent gateway nodes (infrequent)
• Unpromising gateway nodes (frequent non-closed, child non-closed)
• Intermediate nodes (frequent non-closed, child closed)
• Closed nodes (frequent)
• By adding/removing transactions closed/infreq. do not change
76
Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ‘04
![Page 77: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/77.jpg)
FP-Stream
• Multiple time granularities
• Based on FP-Growth (depth-first search over itemset lattice)
• Pattern-tree + Tilted-time window
• Time sensitive queries, emphasis on recent history
• High time and memory complexity
77
C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)
![Page 78: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/78.jpg)
Itemset mining
CLOSTREAM (Yen+ 09) (Sliding window, all closed, exact)
MFI (Li+ 09) (Transaction-sensitive window, frequent closed, exact)
IncMine (Cheng+ 08) (Sliding window, frequent closed, approximate; faster for moderate approximate ratios)
78
![Page 79: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/79.jpg)
Sequence, tree, graph mining
MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11): Frequent subsequence (aka sequential pattern) mining
Bifet+ 08: Frequent closed unlabeled subtree mining
Bifet+ 11: Frequent closed labeled subtree mining
Bifet+11: Frequent closed subgraph mining
79
![Page 80: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/80.jpg)
IoT Distributed
Stream MiningPart II
![Page 81: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/81.jpg)
Outline• IoT Fundamentals of
Stream Mining
• IoT Setting
• Classification
• Concept Drift
• Regression
• Clustering
• Frequent Itemset Mining
• Concept Evolution
• Limited Labeled Learning
• IoT Distributed
Stream Mining
• Distributed Stream Processing Engines
• Classification
• Regression
• Open Source Tools
• Applications
• Conclusions
81
![Page 82: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/82.jpg)
Distributed Stream Processing Engines
82
![Page 83: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/83.jpg)
A Tale of two Tribes
83
DBDBDBDBDBDBData
App App App
Faster Larger
Database
M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05
![Page 84: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/84.jpg)
SPE Evolution
—2003
—2004
—2005
—2006
—2008
—2010
—2011
—2013
Aurora
STREAM
Borealis
SPC
SPADE
Storm
S4
1st generation
2nd generation
3rd generation
Abadi et al., “Aurora: a new model and architecture for
data stream management,” VLDB Journal, 2003
Arasu et al., “STREAM: The Stanford Data Stream
Management System,” Stanford InfoLab, 2004.
Abadi et al., “The Design of the Borealis Stream
Processing Engine,” in CIDR ’05
Amini et al., “SPC: A Distributed, Scalable Platform
for Data Mining,” in DMSSP ’06
Gedik et al., “SPADE: The System S Declarative
Stream Processing Engine,” in SIGMOD ’08
Neumeyer et al., “S4: Distributed Stream Computing
Platform,” in ICDMW ’10
http://storm.apache.org
Samza http://samza.incubator.apache.org
84
![Page 85: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/85.jpg)
Actors Model
85
Live Streams
Stream 1
Stream 2
Stream 3
PE
PE
PE
PE
PE
External
Persister
Output 1
Output 2
Event
routing
![Page 86: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/86.jpg)
S4 Example
86
status.text:"Introducing #S4: a distributed #stream processing system"
PE1
PE2 PE3
PE4
RawStatusnulltext="Int..."
EV
KEY
VAL
Topictopic="S4"count=1
EV
KEY
VAL
Topictopic="stream"count=1
EV
KEY
VAL
TopicreportKey="1"topic="S4", count=4
EV
KEY
VAL
TopicExtractorPE (PE1)extracts hashtags from status.text
TopicCountAndReportPE (PE2-3)keeps counts for each topic acrossall tweets. Regularly emits report event if topic count is above a configured threshold.
TopicNTopicPE (PE4)keeps counts for top topics and outputs top-N topics to external persister
![Page 87: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/87.jpg)
PE PE
PEI
PEI
PEI
PEI
Groupings
• Key Grouping
(hashing)
• Shuffle Grouping
(round-robin)
• All Grouping
(broadcast)
87
![Page 88: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/88.jpg)
PE PE
PEI
PEI
PEI
PEI
Groupings
• Key Grouping
(hashing)
• Shuffle Grouping
(round-robin)
• All Grouping
(broadcast)
88
![Page 89: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/89.jpg)
PE PE
PEI
PEI
PEI
PEI
Groupings
• Key Grouping
(hashing)
• Shuffle Grouping
(round-robin)
• All Grouping
(broadcast)
89
![Page 90: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/90.jpg)
PE PE
PEI
PEI
PEI
PEI
Groupings
• Key Grouping
(hashing)
• Shuffle Grouping
(round-robin)
• All Grouping
(broadcast)
90
![Page 91: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/91.jpg)
Big Data Processing Engines
• Low latency
• High Latency (Not real time)
91
apache storm
Storm characteristics for real-time data processing workloads:
1 Fast2 Scalable3 Fault-tolerant4 Reliable5 Easy to operate
8
apache sa za fro li kedi
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,2 a distributed execution environment,3 an API for stream processing,4 fault tolerance,5 Kafka integration
10
real ti e co putatio : strea i g co putatio
MapReduce Limitations
ExampleHow compute in real time (latency less than 1 second):
1 predictions2 frequent items as Twitter hashtags3 sentiment analysis
14
apache spark strea i g
Spark Streaming is an extension of Spark that allowsprocessing data stream using micro-batches of data.
11
![Page 92: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/92.jpg)
Kappa Architecture
• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
92
![Page 93: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/93.jpg)
Apache Spark
• Spark Streaming is an extension of Spark that allows processing data stream using micro-batches of data.
93
![Page 94: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/94.jpg)
Apache Spark
• Discretized Stream or DStream represents a continuous stream of data
• either the input data stream received from source, or
• the processed data stream generated by transforming the input stream.
• Internally, a DStream is represented by a continuous series of RDDs
94
![Page 95: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/95.jpg)
Apache Spark
• Any operation applied on a DStream translates to operations on the underlying RDDs
95
![Page 96: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/96.jpg)
Apache Flink
• Streaming engine
96
real ti e co putatio : strea i g co putatio
MapReduce Limitations
ExampleHow compute in real time (latency less than 1 second):
1 predictions2 frequent items as Twitter hashtags3 sentiment analysis
14
![Page 97: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/97.jpg)
Apache Flink
• Streaming engine
97
real ti e co putatio : strea i g co putatio
MapReduce Limitations
ExampleHow compute in real time (latency less than 1 second):
1 predictions2 frequent items as Twitter hashtags3 sentiment analysis
14
![Page 98: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/98.jpg)
Apache Beam
98
![Page 99: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/99.jpg)
Apache Beam
• Apache Beam code can run in:
• Apache Flink
• Apache Spark
• Google Cloud Dataflow
• Google Cloud Dataflow replaced MapReduce:
• It is based on FlumeJava and MillWheel, a stream engine as Storm, Samza
• It writes and reads to Google Pub/Sub, a service similar to Kafka
99
![Page 100: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/100.jpg)
Apache Beam
100
![Page 101: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/101.jpg)
Lambda Architecture
101
![Page 102: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/102.jpg)
Kappa Architecture
102
![Page 103: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/103.jpg)
Classification
103
![Page 104: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/104.jpg)
Hadoop AllReduce
• MPI AllReduce on MapReduce
• Parallel SGD + L-BFGS
• Aggregate + Redistribute
• Each node computes partial gradient
• Aggregate (sum) complete gradient
• Each node gets updated model
• Hadoop for data locality (map-only job)
104
A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)
![Page 105: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/105.jpg)
7 5
1
4
9
3
8
7
13
5 3 4
15
3737 37 37
3737
AllReduceReduction Tree
Upward = Reduce Downward = Broadcast (All)105
![Page 106: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/106.jpg)
Parallel Decision Trees
• Which kind of parallelism?
• Task
• Data
• Horizontal
• Vertical
106
Data
Attributes
Instances
Class
Instance
Attributes
![Page 107: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/107.jpg)
Aggregation to compute splits
Single attribute tracked in
multiple nodes
Stats
Stats
Stats
StreamHistograms
Model
Instances
Model Updates
Horizontal Partitioning
107
Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)
![Page 108: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/108.jpg)
Hoeffding Tree Profiling
108
Other
6 %
Split
24 %
Learn
70 %
Training time for
100 nominal +
100 numeric
attributes
![Page 109: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/109.jpg)
Stats
Stats
Stats
Stream
Model
Attributes
Splits
Vertical Partitioning
109
Single attribute tracked in
single node
N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, 2016 https://arxiv.org/abs/1607.08325
![Page 110: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/110.jpg)
Control
Split
Result
Source (n) Model (n) Stats (n) Evaluator (1)
InstanceStream
Shuffle GroupingKey GroupingAll Grouping
Vertical Hoeffding Tree
110
![Page 111: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/111.jpg)
Advantages of Vertical Parallelism
• High number of attributes => high level of parallelism(e.g., documents)
• vs. task parallelism
• Parallelism observed immediately
• vs. horizontal parallelism
• Reduced memory usage (no model replication)
• Parallelized split computation
111
![Page 112: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/112.jpg)
Regression
112
![Page 113: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/113.jpg)
ModelAggregator
Learner1
Learner2
Learnerp
Predictions
Instances
New Rules
RuleUpdates
VAMR• Vertical AMRules
• Model: rule body + head
• Target mean updated continuously with covered instances for predictions
• Default rule (creates new rules)
• Learner: statistics
• Vertical: Learner tracks statistics of independent subset of rules
• One rule tracked by only one Learner
• Model -> Learner: key grouping on rule ID
113
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14
![Page 114: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/114.jpg)
HAMR
• VAMR single model is bottleneck
• Hybrid AMRules(Vertical + Horizontal)
• Shuffle among multiple Models for parallelism
• Problem: distributed default rule decreases performance
• Separate dedicate Learner for default rule
114
A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14
Learners
Model Aggregator
1
Model Aggregator
2
Model Aggregator
r
Predictions
Instances
New Rules
RuleUpdates
LearnersLearners
Predictions
Instances
New Rules
RuleUpdates
LearnersLearners
Learners
Model Aggregator
2
Model Aggregator
2
Model Aggregators
Default Rule Learner
New Rules
![Page 115: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/115.jpg)
Open Source Tools
115
![Page 116: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/116.jpg)
MOA• {M}assive {O}nline {A}nalysis is a framework for online
learning from data streams.
• It is closely related to WEKA
• It includes a collection of offline and online as well as tools for evaluation:
• classification, regression, clustering
• outlier detection, frequent pattern mining
• Easy to extend, design and run experiments
116
http://moa.cms.waikato.ac.nz/
![Page 117: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/117.jpg)
streamDM C++
117
http://huawei-noah.github.io/streamDM-Cpp/
![Page 118: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/118.jpg)
Streaming
Vision
118
Distributed
IoT Big Data Stream Mining
![Page 119: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/119.jpg)
SAMOA
119
http://samoa-project.net
Data Mining
Distributed
Batch
Hadoop
Mahout
Stream
Storm, S4, Samza
SAMOA
Non Distributed
Batch
R, WEKA,…
Stream
MOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
![Page 121: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/121.jpg)
Conclusions
121
![Page 122: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/122.jpg)
IOT AND INDUSTRY 4.0
Interoperability: IoTInformation transparency: virtual copy of the physical worldTechnical assistance: support human decisionsDecentralized decisions: make decisions on their own
![Page 123: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/123.jpg)
INTERNET OF THINGS
• EMC Digital Universe, 2014
digital universe
Figure 3: EMC Digital Universe, 2014
7
![Page 124: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/124.jpg)
INTERNET OF THINGS
• EMC Digital Universe, 2014
![Page 125: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/125.jpg)
IOT (MC KINSEY)
![Page 126: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/126.jpg)
Summary
• IoT Streaming useful for finding approximate solutions with reasonable amount of time & limited resources
• Algorithms for classification, regression, clustering, frequent itemset mining
• Distributed systems for very large streams
126
![Page 127: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/127.jpg)
Open Challenges
• Structured output
• Multi-target learning
• Millions of classes
• Representation learning
• Ease of use
127
![Page 128: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/128.jpg)
Thanks!
128
![Page 129: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/129.jpg)
References
129
![Page 130: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/130.jpg)
• IDC’s Digital Universe Study. EMC (2011)
• P. Domingos, G. Hulten: “Mining high-speed data streams”. KDD ’00
• J Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with drift detection”. SBIA’04
• G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01
• J. Gama, R. Fernandes, R. Rocha: “Decision trees for mining data streams”. IDA (2006)
• A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams”. IDA (2009)
• A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ’07
• E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams”. ECML-PKDD ‘13
• H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation measure for clustering on evolving data streams”. KDD ’11
• T. Zhang, R. Ramakrishnan, M. Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96
• C. C. Aggarwal, J. Han, J. Wang, P. S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03
• M. Ester, H. Kriegel, J. Sander, X. Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96
130
![Page 131: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan](https://reader033.vdocuments.net/reader033/viewer/2022051806/6002c639aadd61793264321a/html5/thumbnails/131.jpg)
• F. Cao, M. Ester, W. Qian, A. Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06
• G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02
• Y. Chi , H. Wang, P. Yu , R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ’04
• C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)
• M. Stonebraker U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05
• A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)
• Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)
• A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ’14
• G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
• J. Gama: “Knowledge Discovery from Data Streams”. Chapman and Hall (2010)
• J. Gama: “Data Stream Mining: the Bounded Rationality”. Informatica 37(1): 21-25 (2013)
131