shifu plugin-trainer and pmml-adapter

Shifu-Plugin Demo

Lisa Hua, 7/21/2014

Recap

1. Convert PMML back to ML model
2. Integrate into Shifu as Shifu-plugin-*
3. Add examples
4. Performance test for PMML evaluator

Miscellaneous

1. Compatibility issue: Spark depends on Akka 2.2.3, while Shifu uses 2.1.1
2. Spark overview
3. About showcase
   ● a video that introduces Shifu
   ● a poster that describes my project
   ● a project title and project description

PMML Adapter Demo

Lisa Hua, 06/23/14

ML Framework   Neural Network   Logistic Regression   SVM   Decision Tree
Encog          Support          Support               TBD   None
Spark          None             Support               TBD   TBD
Mahout         Support          Support               TBD   TBD
H2O            TBD              None                  TBD   TBD

Outline

1. Neural Network Model Conversion
   a. Encog NN model
   b. Mahout NN model
2. Logistic Regression Model Conversion
   a. Encog LR model (NN)
   b. Spark LR model
   c. Mahout LR model
3. PMML Adapter API and how to extend PMML Adapter

Performance Test

protected void initMLModel() {
    ...
    mlModel = new MultilayerPerceptron();
    // addLayer arguments: numInputFields, isFinalLayer, squashFunction
    mlModel.addLayer(20, false, "Identity");
    mlModel.addLayer(45, false, "Sigmoid");
    mlModel.addLayer(45, false, "Sigmoid");
    mlModel.addLayer(1, true, "Sigmoid");
    for (MahoutData data : inputDataSet) {
        mlModel.trainOnline(data.getInput());
        ...
    }
}

protected void adaptToPMML() {
    ...
    Matrix[] matrixList = nnModel.getWeightMatrices();
    ...
}

squashFunctions:
1. Only identity and sigmoid are supported now.
2. squashFunctionList is protected and has no getter, so for now we set activationFunction to sigmoid by default.

Mahout NN Model - trainOnline()

// in Adapter
for (int k = 1; k < columnSize; k++) {
    neuron.withConnections(new Connection(matrix.get(j, k)));
}
// bias neuron for each layer, set to bias=1
neuron.withConnections(new Connection(matrix.get(j, 0)));

Bias is the first Neuron in each layer that is not the final layer
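The bias convention above can be illustrated with a small, self-contained sketch. It does not call Mahout or JPMML; the class BiasColumnForward and its methods are hypothetical names introduced here. It assumes each weight-matrix row j holds the bias weight in column 0 followed by one weight per input, that the bias neuron always outputs 1, and that the layer uses a sigmoid squashing function.

```java
// Hypothetical illustration only; not Mahout or JPMML code.
final class BiasColumnForward {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * Computes the outputs of one layer.
     * Assumption: row j of weights is [biasWeight, w1, ..., wn],
     * and the bias neuron of the previous layer always outputs 1.
     */
    static double[] forwardLayer(double[][] weights, double[] inputs) {
        double[] outputs = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            if (weights[j].length != inputs.length + 1) {
                throw new IllegalArgumentException("row " + j + " must have inputs.length + 1 columns");
            }
            double z = weights[j][0] * 1.0; // contribution of the bias neuron
            for (int k = 1; k < weights[j].length; k++) {
                z += weights[j][k] * inputs[k - 1]; // column k pairs with input k-1
            }
            outputs[j] = sigmoid(z);
        }
        return outputs;
    }
}
```

In the PMML output the same bias weight becomes an ordinary connection from a neuron whose value is 1, which is why the adapter loop starts at column 1 and handles column 0 separately.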

protected void evaluatePMML() {
    for (int i = 0; i < mahoutDataSet.size(); i++) {
        Assert.assertEquals(
            getPMMLEvaluatorResult(pmmlEvalResultList.get(i)),
            getMahoutResult(mahoutDataSet.get(i)),
            DELTA); // DELTA = 10^-5
    }
}

private double getMahoutResult(MahoutData data) {
    return mlModel.getOutput(data.getEvalInput()).get(0);
}

Mahout NN Model - getOutput()

Encog LR Model - compute()


protected void initMLModel() {
    ...
    lrModel = (BasicNetwork) networkReader.read(new FileInputStream("EncogLR.lr"));
}

protected void adaptToPMML() {
    ...
    double[] weights = lrModel.getWeights();
    ...
}

protected void evaluatePMML() {
    for (int i = 0; i < dataSet.size(); i++) {
        Assert.assertEquals(
            getPMMLEvaluatorResult(index++),
            getNextEncogLRResult(mlResultIterator),
            DELTA);
    }
}

private double getNextEncogLRResult(Iterator<MLDataPair> mlResultIterator) {
    MLData result = lrModel.compute(mlResultIterator.next().getInput());
    return result.getData(0);
}

Spark LR Model: train() and predict()


protected void initMLModel() {
    ...
    lrModel = LogisticRegressionWithSGD.train(points.rdd(), iterations, stepSize);
}

protected void adaptToPMML() {
    ...
    double[] weights = lrModel.weights();
    ...
}

protected void evaluatePMML() {
    ...
    List<Double> sparkEvalList = lrModel.predict(evalRDD).cache().collect();
    for (...) {
        Assert.assertEquals(
            getPMMLEvaluatorResult(i),
            sparkEvalList.get(i),
            DELTA);
    }
}

Notes:
1. The method lrModel.weights() returns the intercept followed by the weight list.
2. Compatibility issue: Spark depends on Akka 2.2.3, while Shifu uses 2.1.1. Currently there is a compatibility issue if we change the Akka version of shifu-core from 2.1.1 to 2.2.3. Based on the build history I suspect the issue lies in Guagua, but the root cause is still unknown to me.
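Note 1 matters when the weights are copied into a PMML regression table, because the intercept and the per-field coefficients are stored separately there. The following is a minimal sketch under that assumption: InterceptFirstWeights is a hypothetical helper, not part of Spark or the PMML Adapter, and it takes a plain double[] whose first element is the intercept.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Spark or the PMML Adapter.
final class InterceptFirstWeights {

    // Assumes all[0] is the intercept, as described in note 1.
    static double intercept(double[] all) {
        return all[0];
    }

    // Everything after the first element is treated as a per-field coefficient.
    static double[] coefficients(double[] all) {
        return Arrays.copyOfRange(all, 1, all.length);
    }

    // Logistic score: sigmoid(intercept + sum of coefficient * feature).
    static double score(double[] all, double[] features) {
        double[] coef = coefficients(all);
        if (coef.length != features.length) {
            throw new IllegalArgumentException("expected " + coef.length + " features");
        }
        double z = intercept(all);
        for (int i = 0; i < coef.length; i++) {
            z += coef[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

Keeping the split in one place avoids an off-by-one shift between field names and coefficients when the regression table is built.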

Mahout LR Model - train() and classifyScalar()


protected void initMLModel() {
    ...
    // arguments: numCategories, numFeatures, PriorFunction
    lrModel = new OnlineLogisticRegression(2, 20, new L1());
    for (MahoutDataPair pair : inputDataSet) {
        lrModel.train(pair.getActual(), pair.getFeatureField());
    }
}

protected void adaptToPMML() {
    ...
    // Coefficients: a dense matrix that is (numCategories-1) x numFeatures
    Matrix matrix = lrModel.getBeta();
}

private double getMahoutResult(MahoutDataPair data) {
    // Returns a single scalar probability in the case where we have two categories
    return lrModel.classifyScalar(data.getVector());
}
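Because the coefficient matrix is (numCategories-1) x numFeatures, the two-category case leaves a single row to convert. As a rough sketch of that step, the hypothetical helper below pairs one row of a plain double[][] with field names; BetaToCoefficients, its method, and the field names are illustrative assumptions, and the real adapter works on Mahout's Matrix type instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the real adapter reads a Mahout Matrix, not a double[][].
final class BetaToCoefficients {

    /**
     * Pairs one row of the coefficient matrix with field names.
     * Assumption: column i of the row belongs to fieldNames[i].
     */
    static Map<String, Double> rowToCoefficients(double[][] beta, int row, String[] fieldNames) {
        if (beta[row].length != fieldNames.length) {
            throw new IllegalArgumentException("one field name is required per column");
        }
        // LinkedHashMap keeps the original column order.
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            result.put(fieldNames[i], beta[row][i]);
        }
        return result;
    }
}
```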

Summary of Evaluation Dataset

Model                 ML Framework       Input Data Fields   Input Data / Evaluation Data   Nodes in each layer
Neural Network        Encog, 2 layers    20                  450 / 118 / 560                20,45,45,1
Neural Network        Encog, 3 layers    25                  450 / 550                      25,20,15,20,1
Neural Network        Mahout, 2 layers   20                  450 / 118 / 560                20,45,45,1
Neural Network        Mahout, 3 layers   25                  450 / 550                      25,20,15,20,1
Logistic Regression   Encog              20                  450 / 118 / 560
Logistic Regression   Spark              20                  450 / 118 / 560
Logistic Regression   Mahout             20                  450 / 118 / 560

Summary of the Functions

Encog, Neural Network and Logistic Regression
    model class name: BasicNetwork
    parent class/interface: MLClassification
    training method: compute(MLDataSet data)
    retrieve training result: getWeights(): double[]
    evaluation method: compute()
    basic data structure: MLData: Double[], MLDataSet: Set<Double[]>

Spark, Logistic Regression
    model class name: LogisticRegressionModel
    parent class/interface: GeneralizedLinearModel, ClassificationModel
    training method: train(RDD data)
    retrieve training result: weights(): double[]
    evaluation method: predict(RDD<Vector>): RDD<Double>
    basic data structure: RDD: Resilient Distributed Dataset

Mahout, Neural Network
    model class name: MultilayerPerceptron
    parent class/interface: NeuralNetwork
    training method: trainOnline(Vector instance)
    retrieve training result: getWeightMatrices(): Matrix[]
    evaluation method: getOutput(Vector): Vector
    basic data structure: Vector, Matrix: List<Vector>

Mahout, Logistic Regression
    model class name: OnlineLogisticRegression
    parent class/interface: AbstractOnlineLogisticRegression
    training method: train(Vector actual, Vector instance)
    retrieve training result: getBeta(): Matrix
    evaluation method: classifyScalar(Vector instance): double

3. PMML Adapter API

1. For new ML model conversion
   a. Implement a subclass of PMMLModelBuilder<TargetPMMLModel, SourceMLModel> and implement adaptMLModelToPMML()

https://github.paypal.com/jihua/PMMLAdapter/blob/master/src/main/java/pmmlAdapter/PMMLModelBuilder.java

https://github.paypal.com/jihua/PMMLAdapter/blob/master/src/main/java/pmmlAdapter/spark/PMMLSparkLogisticRegressionModel.java
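The extension step can be sketched as follows. The interface declared here is only a stand-in that mirrors the generic parameters and method name mentioned above; the actual signature in PMMLModelBuilder.java may differ. RegressionSketch and WeightsToRegressionBuilder are hypothetical types used to keep the example self-contained, and the source model is assumed to be a plain double[] with the intercept first.

```java
import java.util.Arrays;

// Stand-in for the adapter's builder type; the real signature may differ.
interface PMMLModelBuilder<TargetPMMLModel, SourceMLModel> {
    TargetPMMLModel adaptMLModelToPMML(SourceMLModel mlModel);
}

// Hypothetical target type standing in for a PMML regression model.
final class RegressionSketch {
    final double intercept;
    final double[] coefficients;

    RegressionSketch(double intercept, double[] coefficients) {
        this.intercept = intercept;
        this.coefficients = coefficients;
    }
}

// Hypothetical builder: converts intercept-first weights into the target type.
final class WeightsToRegressionBuilder implements PMMLModelBuilder<RegressionSketch, double[]> {

    @Override
    public RegressionSketch adaptMLModelToPMML(double[] weights) {
        if (weights == null || weights.length == 0) {
            throw new IllegalArgumentException("weights must not be empty");
        }
        // Assumption: weights[0] is the intercept, the rest are coefficients.
        return new RegressionSketch(weights[0], Arrays.copyOfRange(weights, 1, weights.length));
    }
}
```

A new framework would get its own subclass, with the framework's model class as SourceMLModel and the matching PMML model class as TargetPMMLModel.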

Next Step

● Support: supported by PMML Adapter
● None: the ML framework doesn't currently support this ML model
● TBD: to be determined

ML Framework   Neural Network   Logistic Regression   SVM   Decision Tree
Encog          Support          Support               TBD   None
Spark          None             Support               TBD   TBD
Mahout         Support          Support               TBD   TBD
H2O            TBD              None                  TBD   TBD

1. PMML skeleton - Neural Network

<PMML>
  <Header></Header>
  <DataDictionary></DataDictionary>  <!-- specify the format of the input csv -->
  <NeuralNetwork functionName="classification">  <!-- models -->
    <MiningSchema></MiningSchema>  <!-- how to use the input data -->
    <LocalTransformation></LocalTransformation>  <!-- specify derived field -->
    <NeuralInput></NeuralInput>  <!-- input layer, which field should be used -->
    <NeuralLayers>  <!-- layers, not including input layer and output layer -->
      <NeuralLayer activationFunction="logistic">
        <Neuron id="X,Y" bias="0.0">
          <Con from="X-1,Y" weight=""/>
        </Neuron>
      </NeuralLayer>
    </NeuralLayers>
    <NeuralOutputs numberOfOutputs="1">
      <NeuralOutput outputNeuron="3,0"></NeuralOutput>
    </NeuralOutputs>
  </NeuralNetwork>
</PMML>
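To make the id convention in the skeleton concrete, here is a small sketch that prints one NeuralLayer element from a weight matrix using plain string building. NeuralLayerXmlSketch is a hypothetical name, the code does not use the JPMML model classes, and it assumes neuron ids take the form "layer,index", that every neuron connects to every neuron of the previous layer, and that bias is written as 0.0 as in the skeleton.

```java
// Hypothetical illustration; the adapter itself builds JPMML objects, not strings.
final class NeuralLayerXmlSketch {

    /**
     * Renders one layer. weights[j][k] is assumed to be the weight from
     * neuron k of the previous layer to neuron j of this layer.
     */
    static String layerXml(int layer, double[][] weights) {
        StringBuilder sb = new StringBuilder();
        sb.append("<NeuralLayer activationFunction=\"logistic\">\n");
        for (int j = 0; j < weights.length; j++) {
            sb.append("  <Neuron id=\"").append(layer).append(',').append(j)
              .append("\" bias=\"0.0\">\n");
            for (int k = 0; k < weights[j].length; k++) {
                sb.append("    <Con from=\"").append(layer - 1).append(',').append(k)
                  .append("\" weight=\"").append(weights[j][k]).append("\"/>\n");
            }
            sb.append("  </Neuron>\n");
        }
        sb.append("</NeuralLayer>\n");
        return sb.toString();
    }
}
```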

2.1 PMML Neural Network - Mahout

2,3,1
{
  0 => {0:-0.2861259717601905,1:-0.4079344783742465,2:-0.43218273192749174}
  1 => {0:0.223912887382075,1:-0.08865866120943716,2:0.4095464158191267}
  2 => {0:0.14754755237008804,1:0.2638192545136143,2:0.06633581725392071}
}
{
  0 => {0:0.04388751672411058,1:-0.35597268769777723,2:0.21149680575173224,3:0.34402628331423807}
}
0.5635827615510126,0.5482023969601073,0.5609684690326279,0.5751568027254008,

         Propagation       Weight     train                         evaluate
Encog    backpropagation   double[]   MLTrain/Propagation
Mahout   feed-forward      Matrix     network.trainOnline(vector)   network.getOutput(vector)

3. PMML Evaluation

public Map<String, Double> evaluateRaw(EvaluationContext context) {
    NeuralNetwork neuralNetwork = getModel();
    Map<String, Double> result = Maps.newLinkedHashMap();
    NeuralInputs neuralInputs = neuralNetwork.getNeuralInputs();
    for (NeuralInput neuralInput : neuralInputs) {
        DerivedField derivedField = neuralInput.getDerivedField();
        FieldValue value = ExpressionUtil.evaluate(derivedField, context);
        ...
        result.put(neuralInput.getId(), (value.asNumber()).doubleValue());
    }
    List<NeuralLayer> neuralLayers = neuralNetwork.getNeuralLayers();
    for (NeuralLayer neuralLayer : neuralLayers) {
        List<Neuron> neurons = neuralLayer.getNeurons();
        for (Neuron neuron : neurons) {
            double z = neuron.getBias(); // the bias for each Neuron, should be set to 0
            List<Connection> connections = neuron.getConnections();
            for (Connection connection : connections) {
                double input = result.get(connection.getFrom());
                z += input * connection.getWeight();
            }
            double output = activation(z, neuralLayer);
            result.put(neuron.getId(), output);
        }
        normalizeNeuronOutputs(neuralLayer, result);
    }
    return result;
}

private double activation(double z, NeuralLayer neuralLayer) {
    ...
    switch (activationFunction) {
        case LOGISTIC:
            return 1.0 / (1.0 + Math.exp(-z)); // Sigmoid
        case IDENTITY:
            return z; // Linear
        ...
    }
}

How to get score from PMML evaluator - EvaluatorTest

PMML pmml = loadPMML(getClass());
// InputStream is = getResourceAsStream("/pmml/" + getSimpleName() + ".pmml");
// return IOUtil.unmarshal(is);

NeuralNetworkEvaluator evaluator = new NeuralNetworkEvaluator(pmml);
InputStream is = getClass().getResourceAsStream("/pmml/NormalizedData.csv");
List<Map<FieldName, String>> input = CsvUtil.load(is);
for (Map<FieldName, String> maps : input) {
    Map<FieldName, NeuronClassificationMap> evaluateList =
        (Map<FieldName, NeuronClassificationMap>) evaluator.evaluate(maps);
    for (NeuronClassificationMap cMap : evaluateList.values())
        for (Map.Entry<?, Double> entry : cMap.entrySet())
            System.out.println(index++ + ":" + entry.getKey() + ":" + entry.getValue() * 1000);
    List<FieldName> activeFields = evaluator.getActiveFields();
}
