moa : massive online analysis
DESCRIPTION
Talk about MOA, a framework similar to WEKA dealing with massive data and data streams.TRANSCRIPT
MOA: {M}assive {O}nline {A}nalysis.
University of WaikatoHamilton, New Zealand
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learningfrom data streams.
It is closely related to WEKAIt includes a collection of offline and online as well as toolsfor evaluation:
boosting and baggingHoeffding Trees
with and without Naïve Bayes classifiers at the leaves.
2 / 21
WEKA
Waikato Environment for Knowledge AnalysisCollection of state-of-the-art machine learning algorithmsand data processing tools implemented in Java
Released under the GPLSupport for the whole process of experimental data mining
Preparation of input dataStatistical evaluation of learning schemesVisualization of input data and the result of learning
Used for education, research and applicationsComplements “Data Mining” by Witten & Frank
3 / 21
WEKA: the bird
4 / 21
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
5 / 21
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
5 / 21
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
5 / 21
Data stream classification cycle
1 Process an example at a time,and inspect it only once (atmost)
2 Use a limited amount ofmemory
3 Work in a limited amount oftime
4 Be ready to predict at anypoint
6 / 21
Experimental setting
Evaluation procedures for DataStreams
HoldoutInterleaved Test-Then-Train orPrequential
EnvironmentsSensor Network: 100KbHandheld Computer: 32 MbServer: 400 Mb
7 / 21
Experimental setting
Data SourcesRandom Tree GeneratorRandom RBF GeneratorLED GeneratorWaveform GeneratorFunction Generator
7 / 21
Experimental setting
ClassifiersNaive BayesDecision stumpsHoeffding TreeHoeffding Option TreeBagging and Boosting
Prediction strategiesMajority classNaive Bayes LeavesAdaptive Hybrid
7 / 21
Hoeffding Option TreeHoeffding Option TreesRegular Hoeffding tree containing additional option nodes thatallow several tests to be applied, leading to multiple Hoeffdingtrees as separate paths.
8 / 21
Extension to Evolving Data Streams
New Evolving Data Stream Extensions
New Stream GeneratorsNew UNION of StreamsNew Classifiers
9 / 21
Extension to Evolving Data Streams
New Evolving Data Stream Generators
Random RBF with DriftLED with DriftWaveform with Drift
HyperplaneSEA GeneratorSTAGGER Generator
9 / 21
Concept Drift Framework
t
f (t) f (t)
α
α
t0W
0.5
1
DefinitionGiven two data streams a, b, we define c = a⊕W
t0 b as the datastream built joining the two data streams a and b
Pr[c(t) = b(t)] = 1/(1+ e−4(t−t0)/W ).Pr[c(t) = a(t)] = 1−Pr[c(t) = b(t)]
10 / 21
Concept Drift Framework
t
f (t) f (t)
α
α
t0W
0.5
1
Example
(((a⊕W0t0 b)⊕W1
t1 c)⊕W2t2 d) . . .
(((SEA9⊕Wt0 SEA8)⊕W
2t0SEA7)⊕W
3t0SEA9.5)
CovPokElec = (CoverType⊕5,000581,012 Poker)⊕5,000
1,000,000 ELEC2
10 / 21
Extension to Evolving Data Streams
New Evolving Data Stream Classifiers
Adaptive Hoeffding Option TreeDDM Hoeffding TreeEDDM Hoeffding Tree
OCBoostFLBoost
11 / 21
Ensemble Methods
New ensemble methods:Adaptive-Size Hoeffding Tree bagging:
each tree has a maximum sizeafter one node splits, it deletes some nodes to reduce itssize if the size of the tree is higher than the maximum value
ADWIN bagging:When a change is detected, the worst classifier is removedand a new classifier is added.
12 / 21
Adaptive-Size Hoeffding Tree
T1 T2 T3 T4
Ensemble of trees of different sizesmaller trees adapt more quickly to changes,larger trees do better during periods with little changediversity
13 / 21
Adaptive-Size Hoeffding Tree
0,2
0,21
0,22
0,23
0,24
0,25
0,26
0,27
0,28
0,29
0,3
0 0,1 0,2 0,3 0,4 0,5 0,6
Kappa
Err
or
0,25
0,255
0,26
0,265
0,27
0,275
0,28
0,1 0,12 0,14 0,16 0,18 0,2 0,22 0,24 0,26 0,28 0,3
Kappa
Err
or
Figure: Kappa-Error diagrams for ASHT bagging (left) and bagging(right) on dataset RandomRBF with drift, plotting 90 pairs ofclassifiers.
14 / 21
ADWIN Bagging
ADWIN
An adaptive sliding window whose size is recomputed onlineaccording to the rate of change observed.
ADWIN has rigorous guarantees (theorems)On ratio of false positives and negativesOn the relation of the size of the current window andchange rates
ADWIN BaggingWhen a change is detected, the worst classifier is removed anda new classifier is added.
15 / 21
Web
http://www.cs.waikato.ac.nz/∼abifet/MOA/
16 / 21
GUIjava -cp .:moa.jar:weka.jar-javaagent:sizeofag.jar moa.gui.TaskLauncher
17 / 21
GUIjava -cp .:moa.jar:weka.jar-javaagent:sizeofag.jar moa.gui.TaskLauncher
17 / 21
Command Line
EvaluatePeriodicHeldOutTestjava -cp .:moa.jar:weka.jar -javaagent:sizeofag.jarmoa.DoTask "EvaluatePeriodicHeldOutTest-l DecisionStump -s generators.WaveformGenerator-n 100000 -i 100000000 -f 1000000" > dsresult.csv
This command creates a comma separated values file:
training the DecisionStump classifier on the WaveformGeneratordata,
using the first 100 thousand examples for testing,
training on a total of 100 million examples, and
testing every one million examples:
18 / 21
Easy Design of a MOA classifier
void resetLearningImpl ()
void trainOnInstanceImpl (Instance inst)
double[] getVotesForInstance (Instance i)
void getModelDescription (StringBuilderout, int indent)
19 / 21
Example of a classifier: MajorityClass
public class MajorityClass extends AbstractClassifier {
protected DoubleVector observedClassDistribution;
@Overridepublic void resetLearningImpl() {this.observedClassDistribution = new DoubleVector();
}
@Overridepublic void trainOnInstanceImpl(Instance inst) {this.observedClassDistribution.addToValue(
(int) inst.classValue(), inst weight());}
public double[] getVotesForInstance(Instance i) {return this.observedClassDistribution.getArrayCopy();
}...
}
20 / 21
Summary{M}assive {O}nline {A}nalysis is a framework for online learningfrom data streams.
http://www.cs.waikato.ac.nz/∼abifet/MOA/
It is closely related to WEKAIt includes a collection of offline and online as well as toolsfor evaluation:
boosting and baggingHoeffding Trees
MOA deals with evolving data streamsMOA is easy to use and extend
21 / 21