Deep Learning through Examples. 0xdata / H2O.ai: Scalable In-Memory Machine Learning. Silicon Valley Big Data Science Meetup, Palo Alto, 9/3/14. Arno Candel.

Uploaded by 0xdata on 29-Nov-2014. Category: Software.

DESCRIPTION

Suggestions: 1) For best quality, download the PDF before viewing. 2) Open at least two windows: one for the YouTube video, one for the screencast (link below), and optionally one for the slides themselves. 3) The YouTube video is shown on the first page of the slide deck; for the slides, just skip to page 2.

Screencast: http://youtu.be/VoL7JKJmr2I
Video recording: http://youtu.be/CJRvb8zxRdE (Thanks to Al Friedrich!)

In this talk, we take Deep Learning to task with real-world data puzzles to solve. Data:
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8,500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection

TRANSCRIPT

Deep Learning through Examples

0xdata / H2O.ai: Scalable In-Memory Machine Learning

Silicon Valley Big Data Science Meetup, Palo Alto, 9/3/14

Arno Candel

Who am I? PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC: Accelerator Physics Modeling
2 years at Skytree, Inc.: Machine Learning
9 months at 0xdata/H2O: Machine Learning
15 years in HPC/Supercomputing/Modeling

Named "2014 Big Data All-Star" by Fortune Magazine

@ArnoCandel

Outline

- Intro & Live Demo (10 mins)
- Methods & Implementation (20 mins)
- Results & Live Demos (25 mins)
  - Higgs boson detection
  - MNIST handwritten digits
  - Text classification
- Q & A (5 mins)

About H2O (aka 0xdata)

Java, Apache v2 Open Source

Join the www.h2o.ai community! #1 Java Machine Learning in Github

Customer Demands for Practical Machine Learning

Requirement | Value
In-Memory | Fast (Interactive)
Distributed | Big Data (No Sampling)
Open Source | Ownership of Methods
API / SDK | Extensibility

H2O was developed by 0xdata from scratch to meet these requirements

H2O Integration

[Diagram: H2O (Java) runs Standalone, over YARN, or on MRv1 (Hadoop MR), reading from HDFS, with R, Scala, JSON, and Python APIs on top.]

H2O Architecture

[Diagram: Distributed In-Memory K-V store with column compression and memory manager; MapReduce-based Machine Learning algorithms (e.g. Deep Learning); R engine; nano-fast scoring and prediction engines.]

H2O - The Killer App on Spark

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

H2O Deep Learning on Spark

Brand-Sparkling-New: Sneak Preview!

// Test if we can correctly learn A, B, where Y = logistic(A + B*X)
test("deep learning log regression") {
  val nPoints = 10000
  val A = 2.0
  val B = -1.5
  // Generate testing data
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
  // Create RDD from testing data
  val trainRDD = sc.parallelize(trainData, 2)
  trainRDD.cache()
  import H2OContext._
  // Create H2O data frame (will be implicit in the future)
  val trainH2ORDD = toDataFrame(sc, trainRDD)
  // Create a H2O DeepLearning model
  val dlParams = new DeepLearningParameters()
  dlParams.source = trainH2ORDD
  dlParams.response = trainH2ORDD.lastVec()
  dlParams.classification = true
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.train().get()
  // Score validation data
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
  val validationRDD = sc.parallelize(validationData, 2)
  val validationH2ORDD = toDataFrame(sc, validationRDD)
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future
  // Validate prediction
  validatePrediction(
    predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)),
    validationData)
}

H2O R CRAN package: John Chambers (creator of the S language, R-core member) names the H2O R API among the top three promising R projects

H2O + R = Happy Data Scientist

Machine Learning on Big Data with R: data resides on the H2O cluster

Machine Learning Meets Physics: Higgs Particle Discovery

Higgs vs Background

Large Hadron Collider: largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456°F, 1 PB/day, etc. Higgs boson discovery (July '12) led to the 2013 Nobel prize.

http://arxiv.org/pdf/1402.4735v2.pdf

Images courtesy CERN / LHC

Or rather: Back to the roots (the WWW was invented at CERN in '89…)

Higgs Binary Classification Problem

Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: must first add derived high-level features (physics formulae)!

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

H2O AUC without (low-level only) and with the derived features:

Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net, 1 hidden layer | 0.760 | 0.830

Metric: AUC = Area under the ROC curve (range 0.5…1, higher is better)

Higgs: Can Deep Learning Do Better?

Let's build a H2O Deep Learning model and find out! (That was my last weekend)

Algorithm | low-level H2O AUC | all features H2O AUC
Generalized Linear Model | 0.596 | 0.684
Random Forest | 0.764 | 0.840
Gradient Boosted Trees | 0.753 | 0.839
Neural Net, 1 hidden layer | 0.760 | 0.830
Deep Learning | <Your guess goes here> |

Reference paper results: baseline 0.733

What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: Input data (image) -> Prediction (who is it?)

Facebook's DeepFace (Yann LeCun) recognises faces as well as humans

What is NOT Deep?

- Linear models are not deep (by definition)
- Neural nets with 1 hidden layer are not deep (only 1 layer, no feature hierarchy)
- SVMs and Kernel methods are not deep (2 layers: kernel + linear)
- Classification trees are not deep (operate on original input space, no new features generated)

Deep Learning is Trending

[Google Trends chart for "deep learning", 2009-2013]

Businesses are using Deep Learning techniques:
- Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
- FBI FACE: $1 billion face recognition project
- Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)

Deep Learning History (slides by Yann LeCun, now at Facebook)

Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!) smarter

Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)

+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)

+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)

+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)

= Top-notch prediction engine!

Example Neural Network

"fully connected" directed graph of neurons

[Diagram: information flows from an input layer (3 neurons: age, income, employment) through hidden layer 1 (4 neurons) and hidden layer 2 (3 neurons) to an output layer (2 neurons: married, single); 3x4, 4x3, and 3x2 connections.]

Prediction: Forward Propagation

"neurons activate each other via weighted sums"

With inputs x_i (age, income, employment) and per-class output probabilities p_l (married, single), sum_l(p_l) = 1:

y_j = tanh(sum_i(x_i * u_ij) + b_j)
z_k = tanh(sum_j(y_j * v_jk) + c_k)
p_l = softmax(sum_k(z_k * w_kl) + d_l)

softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

b_j, c_k, d_l: bias values (independent of inputs)

Activation function: tanh. Alternative: x -> max(0, x), the "rectifier".

p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
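To make the forward pass concrete, here is a minimal, self-contained Scala sketch of the formulas above (our illustration, not H2O's implementation; all weight values are made up):

object ForwardPass {
  // One tanh hidden layer: y_j = tanh(sum_i(x_i * u_ij) + b_j)
  def tanhLayer(x: Array[Double], u: Array[Array[Double]], b: Array[Double]): Array[Double] =
    b.indices.map(j => math.tanh(x.indices.map(i => x(i) * u(i)(j)).sum + b(j))).toArray

  // Linear output scores: s_l = sum_k(z_k * w_kl) + d_l
  def linear(z: Array[Double], w: Array[Array[Double]], d: Array[Double]): Array[Double] =
    d.indices.map(l => z.indices.map(k => z(k) * w(k)(l)).sum + d(l)).toArray

  // softmax(s_l) = exp(s_l) / sum_l(exp(s_l)); subtract max for numerical stability
  def softmax(s: Array[Double]): Array[Double] = {
    val exps = s.map(v => math.exp(v - s.max))
    exps.map(_ / exps.sum)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(0.3, -1.2, 0.5) // standardized age, income, employment
    val u = Array(                // 3x4 input-to-hidden weights (made up)
      Array(0.1, -0.2, 0.4, 0.0),
      Array(0.3, 0.1, -0.1, 0.2),
      Array(-0.5, 0.2, 0.0, 0.1))
    val b = Array(0.0, 0.1, -0.1, 0.0)
    val y = tanhLayer(x, u, b)    // hidden layer 1 (the deck's net stacks a second one)
    val w = Array(                // 4x2 hidden-to-output weights (made up)
      Array(0.2, -0.2), Array(-0.1, 0.1), Array(0.4, -0.3), Array(0.0, 0.2))
    val d = Array(0.0, 0.0)
    val p = softmax(linear(y, w, d)) // P(married), P(single); sums to 1
    println(p.mkString(", "))
  }
}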

Data Preparation & Initialization

Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of data x_i: mean = 0, stddev = 1

"Horizontalize" categorical variables, e.g. full-time, part-time, none, self-employed ->
0,1,0 = part-time; 0,0,0 = self-employed

Automatic initialization of weights:
- Poor man's initialization: random weights w_kl
- Default (better): uniform distribution in ±sqrt(6 / (units + units_previous_layer))
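A minimal sketch of these two preparation steps, assuming nothing beyond the formulas on the slide (our illustration, not H2O's code):

object PrepAndInit {
  // Standardize a column to mean 0, stddev 1
  def standardize(col: Array[Double]): Array[Double] = {
    val mean = col.sum / col.length
    val sd = math.sqrt(col.map(v => (v - mean) * (v - mean)).sum / col.length)
    col.map(v => if (sd == 0) 0.0 else (v - mean) / sd)
  }

  // Uniform init in +/- sqrt(6 / (units + units_previous_layer))
  def initWeights(unitsPrev: Int, units: Int, rng: scala.util.Random): Array[Array[Double]] = {
    val bound = math.sqrt(6.0 / (units + unitsPrev))
    Array.fill(unitsPrev, units)((rng.nextDouble() * 2 - 1) * bound)
  }

  def main(args: Array[String]): Unit = {
    println(standardize(Array(35.0, 42.0, 51.0, 28.0)).mkString(", "))
    val w = initWeights(3, 4, new scala.util.Random(42))
    println(w.map(_.mkString(" ")).mkString("\n"))
  }
}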

Training: Update Weights & Biases

For each training row, we make a prediction and compare with the actual label (supervised learning). Example: predicted married = 0.8, single = 0.2; actual: married = 1, single = 0.

Objective: minimize prediction error (MSE or cross-entropy)

Mean Square Error = (0.2^2 + 0.2^2) / 2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")

Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation):

w <- w - rate * ∂E/∂w
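Plugging the slide's example numbers into both objectives, as a quick sanity check (illustrative sketch):

object Losses {
  def mse(pred: Array[Double], actual: Array[Double]): Double =
    pred.indices.map(i => math.pow(pred(i) - actual(i), 2)).sum / pred.length

  def crossEntropy(pred: Array[Double], actual: Array[Double]): Double =
    -pred.indices.map(i => actual(i) * math.log(pred(i))).sum

  def main(args: Array[String]): Unit = {
    val pred = Array(0.8, 0.2)   // predicted (married, single)
    val actual = Array(1.0, 0.0) // actual label
    println(f"MSE = ${mse(pred, actual)}%.4f")                    // (0.2^2 + 0.2^2)/2 = 0.04
    println(f"Cross-entropy = ${crossEntropy(pred, actual)}%.4f") // -log(0.8) ≈ 0.223
  }
}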

Backward Propagation

How to compute ∂E/∂w_i for w_i <- w_i - rate * ∂E/∂w_i ?

Naive: for every i, evaluate E twice at (w_1, …, w_i ± Δ, …, w_N)… Slow!

Backprop: compute ∂E/∂w_i via the chain rule, going backwards:

net = sum_i(w_i * x_i) + b
y = activation(net)
E = error(y)

∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
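The two approaches on this slide can be compared directly. This sketch (ours) computes the chain-rule gradient of one tanh neuron with squared error and checks it against the naive two-evaluations-per-weight estimate:

object GradCheck {
  val x = Array(0.5, -1.0, 0.25)
  val target = 0.3

  def error(w: Array[Double], b: Double): Double = {
    val net = w.indices.map(i => w(i) * x(i)).sum + b
    val y = math.tanh(net)
    0.5 * (y - target) * (y - target) // E = error(y)
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.1, 0.2, -0.3)
    val b = 0.05
    val net = w.indices.map(i => w(i) * x(i)).sum + b
    val y = math.tanh(net)
    // Backprop: dE/dw_i = (y - target) * (1 - y^2) * x_i
    val analytic = x.map(xi => (y - target) * (1 - y * y) * xi)
    // Naive: evaluate E twice per weight (slow, but a good sanity check)
    val delta = 1e-6
    val numeric = w.indices.map { i =>
      val wp = w.clone(); wp(i) += delta
      val wm = w.clone(); wm(i) -= delta
      (error(wp, b) - error(wm, b)) / (2 * delta)
    }
    println(analytic.mkString(", "))
    println(numeric.mkString(", "))
  }
}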

H2O Deep Learning Architecture

[Diagram: several nodes/JVMs, each with an HTTPD and a K-V store; nodes synchronize, threads communicate asynchronously.]

Initial model: weights and biases w.

map: each node trains a copy of the weights and biases with (some or all of) its local data, using asynchronous F/J threads; auto-tuned (default) or user-specified number of points per MapReduce iteration.

reduce: model averaging: average the weights and biases from all nodes, e.g. w = (w1 + w2 + w3 + w4) / 4; the updated model w lands in H2O's atomic in-memory K-V store.

Keep iterating over the data ("epochs"), score from time to time. Query & display the model via JSON / WWW.

Speedup is at least nodes/log(rows) (arxiv: 1209.4129v3)
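A toy sketch of this map/reduce scheme (ours, not H2O's code), with a stand-in linear model so the averaging step is visible:

object ModelAveraging {
  type Weights = Array[Double]

  // Stand-in for one node's local SGD pass over its data shard
  def trainLocal(w: Weights, shard: Seq[(Array[Double], Double)], rate: Double): Weights = {
    val local = w.clone()
    for ((x, y) <- shard) {
      val pred = local.indices.map(i => local(i) * x(i)).sum
      val err = pred - y
      for (i <- local.indices) local(i) -= rate * err * x(i) // gradient step
    }
    local
  }

  // reduce: average weights element-wise across all node copies
  def average(models: Seq[Weights]): Weights =
    models.transpose.map(ws => ws.sum / ws.length).toArray

  def main(args: Array[String]): Unit = {
    val w0 = Array(0.0, 0.0)
    val shards = Seq( // each node's local data: (features, label)
      Seq((Array(1.0, 0.0), 1.0), (Array(0.0, 1.0), 2.0)),
      Seq((Array(1.0, 1.0), 3.0), (Array(2.0, 0.0), 2.0)))
    val trained = shards.map(s => trainLocal(w0, s, rate = 0.1)) // "map": per-node training
    val w = average(trained)                                     // "reduce": model averaging
    println(w.mkString(", "))
  }
}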

"Secret" Sauce to Higher Accuracy

- Adaptive learning rate (ADADELTA, Google): automatically set the learning rate for each neuron based on its training history
- Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
- Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs

Detail: Adaptive Learning Rate

Compute the moving average of Δw_i² at time t, for window length rho:

E[Δw_i²]_t = rho * E[Δw_i²]_{t-1} + (1 - rho) * Δw_i²

Compute the RMS of Δw_i at time t, with smoothing epsilon:

RMS[Δw_i]_t = sqrt(E[Δw_i²]_t + epsilon)

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:

rate(w_i, t) = RMS[Δw_i]_{t-1} / RMS[∂E/∂w_i]_t

(cf. ADADELTA paper)

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD: no window).

Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time.
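These recurrences map directly onto code. A per-weight sketch of the update (our illustration; H2O exposes the knobs as its rho/epsilon parameters):

class AdaDelta(n: Int, rho: Double = 0.95, eps: Double = 1e-6) {
  private val avgSqGrad = Array.fill(n)(0.0) // E[(dE/dw)^2], moving average
  private val avgSqUpd  = Array.fill(n)(0.0) // E[dw^2], moving average

  // Returns the updates dw_i to apply for gradient g
  def update(g: Array[Double]): Array[Double] =
    g.indices.map { i =>
      avgSqGrad(i) = rho * avgSqGrad(i) + (1 - rho) * g(i) * g(i)
      // rate(w_i, t) = RMS[dw]_{t-1} / RMS[dE/dw]_t
      val rate = math.sqrt(avgSqUpd(i) + eps) / math.sqrt(avgSqGrad(i) + eps)
      val dw = -rate * g(i)
      avgSqUpd(i) = rho * avgSqUpd(i) + (1 - rho) * dw * dw
      dw
    }.toArray
}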

Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration: ignore (zero out) a different random fraction p of input activations.

Testing: use all activations, but reduce them by a factor p (to "simulate" the missing activations during training).

(cf. Geoff Hinton's paper)
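In code, the train/test asymmetry looks like this sketch (ours); note that when p is the dropped fraction, the test-time scaling uses the retention fraction 1 - p:

object Dropout {
  // Training: zero out a random fraction p of the incoming activations
  def trainPass(acts: Array[Double], p: Double, rng: scala.util.Random): Array[Double] =
    acts.map(a => if (rng.nextDouble() < p) 0.0 else a)

  // Testing: keep all activations, scaled so expected totals match training
  def testPass(acts: Array[Double], p: Double): Array[Double] =
    acts.map(_ * (1 - p))

  def main(args: Array[String]): Unit = {
    val acts = Array(0.7, -0.3, 0.9, 0.1)
    println(trainPass(acts, 0.2, new scala.util.Random(1)).mkString(", "))
    println(testPass(acts, 0.2).mkString(", "))
  }
}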

MNIST Digits Classification

MNIST = digitized handwritten digits database (Yann LeCun)

Data: 28x28 = 784 pixels with (gray-scale) values in 0…255

Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes.

Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).

Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."

Let's see how H2O does on the MNIST dataset!

H2O Deep Learning on MNIST: 0.87% test set error (so far)

Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours.

World-class results! No pre-training, no distortions, no convolutions, no unsupervised training. Running on 4 nodes with 16 cores each.

Frequent errors: confuses 2/7 and 4/9.

Weather Dataset

Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction

Interactive ROC curve with real-time updates.

3 hidden Rectifier layers, Dropout, L1-penalty, 5-fold cross-validation.

12.7% 5-fold cross-validation error: at least as good as the GBM/RF/GLM models.

Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)

Then continue training the best model.

Text Classification

Goal: predict the item from the seller's text description.

"Vintage 18KT gold Rolex 2 Tone in great condition"

Data: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 with 1s at the vocabulary positions of "vintage", "gold", "condition", etc.

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.

Let's see how H2O does on the eBay dataset!
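The encoding on this slide is a plain binary bag-of-words; a small sketch (ours, with a toy vocabulary standing in for the 8,647-word one):

object BagOfWords {
  // Map a description to a 0/1 vector over a fixed vocabulary
  def vectorize(text: String, vocab: Array[String]): Array[Int] = {
    val words = text.toLowerCase.split("\\W+").toSet
    vocab.map(w => if (words.contains(w)) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    val vocab = Array("vintage", "gold", "rolex", "condition", "silver")
    println(vectorize("Vintage 18KT gold Rolex 2 Tone in great condition", vocab).mkString(","))
    // prints 1,1,1,1,0
  }
}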

Text Classification

Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!

Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB).

Note 2: No tuning was done (results are for illustration only).

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.

Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)

[Charts: Speedup and Training Time (in minutes) vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce; training time drops to 2.7 mins on the largest cluster.]

Deep Learning Auto-Encoders for Anomaly Detection

Toy example: find an anomaly in ECG heart-beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows reconstruction of the original data. Also works for categorical data.
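The anomaly score is just the reconstruction error; a minimal sketch (ours), with a stand-in for the trained auto-encoder's forward pass:

object AnomalyScore {
  // Mean squared reconstruction error between a sample and its reconstruction
  def score(sample: Array[Double], reconstruct: Array[Double] => Array[Double]): Double = {
    val rec = reconstruct(sample)
    sample.indices.map(i => math.pow(sample(i) - rec(i), 2)).sum / sample.length
  }

  def main(args: Array[String]): Unit = {
    val normalBeat = Array.tabulate(210)(i => math.sin(i / 10.0))
    // Stand-in "model" that always reproduces the learned "normal" shape;
    // a real auto-encoder compresses to a low-dimensional code and decodes back
    val model = (_: Array[Double]) => normalBeat
    val anomalousBeat = normalBeat.updated(100, 5.0) // inject a spike
    println(f"normal reconstruction error:    ${score(normalBeat, model)}%.4f")
    println(f"anomalous reconstruction error: ${score(anomalousBeat, model)}%.4f")
  }
}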

Model of what's "normal" + test set with anomaly => the test set prediction (i.e. reconstruction) looks "normal"; the anomaly is found by its large reconstruction error!

H2O Brings Deep Learning to R

R vignette with example R scripts: http://0xdata.com/h2o/algorithms

All parameters are available from R…

POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

Higgs Particle Discovery with H2O: How Well Did H2O Deep Learning Do?

Let's see how H2O did in the past 30 minutes!

Any guesses for the AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN (reference paper results).

Deep Learning: <Your guess goes here>

Live Demo: H2O Steam Scoring Platform

Higgs dataset demo on a 10-node cluster: let's score all our H2O models and compare them!

http://server:port/steam/index.html

Scoring Higgs Models in H2O Steam

Live demo on a 10-node cluster: <10 minutes runtime for all algos! Better than the LHC baseline of AUC = 0.73!

Higgs Particle Detection with H2O

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Parameters (not heavily tuned), H2O running on 10 nodes:

Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net, 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2 = 1e-5, 40 epochs
Deep Learning, 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1 = L2 = 1e-5, 300 epochs
Deep Learning, 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1 = L2 = 1e-5

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity)
- More neurons per layer to detect finer structure in data ("memorizing")
- Add some regularization for less overfitting (lower validation set error)

Specifically:
- Do a grid search to get a feel for convergence, then continue training
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set
- Input dropout is recommended for noisy high-dimensional input
- Distributed: more training samples per iteration: faster, but less accuracy
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10; rho = 0.9, 0.95, 0.99
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
- Try balance_classes = true for datasets with large class imbalance
- Enable force_load_balance for small datasets
- Enable replicate_training_data if each node can hold all the data

Extensions for H2O Deep Learning

- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O! Add your own JIRA tickets!

Key Take-Aways

H2O is a distributed, in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups: https://github.com/0xdata, http://docs.0xdata.com, www.h2o.ai community, @hexadata

Thank you!


H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 5: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Customer Demands for Practical Machine Learning

5

Requirements Value

In-Memory Fast (Interactive)

Distributed Big Data (No Sampling)

Open Source Ownership of Methods

API SDK Extensibility

H2O was developed by 0xdata from scratch to meet these requirements

H2O Deep Learning ArnoCandel

H2O Integration

H2O

HDFS HDFS HDFS

YARN Hadoop MR

R ScalaJSON Python

Standalone Over YARN On MRv1

6

H2O H2O

Java

H2O Deep Learning ArnoCandel

H2O Architecture

Distributed In-Memory K-V storeCol compression

Machine Learning

Algorithms

R EngineNano fast

Scoring Engine

Prediction Engine

Memory manager

eg Deep Learning

7

MapReduce

H2O Deep Learning ArnoCandel

H2O - The Killer App on Spark8

httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml

H2O Deep Learning ArnoCandel

H2O DeepLearning on Spark9

Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)

Brand-Sparkling-New Sneak Preview

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 6: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O Integration

H2O

HDFS HDFS HDFS

YARN Hadoop MR

R ScalaJSON Python

Standalone Over YARN On MRv1

6

H2O H2O

Java

H2O Deep Learning ArnoCandel

H2O Architecture

Distributed In-Memory K-V storeCol compression

Machine Learning

Algorithms

R EngineNano fast

Scoring Engine

Prediction Engine

Memory manager

eg Deep Learning

7

MapReduce

H2O Deep Learning ArnoCandel

H2O - The Killer App on Spark8

httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml

H2O Deep Learning ArnoCandel

H2O DeepLearning on Spark9

Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)

Brand-Sparkling-New Sneak Preview

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 7: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O Architecture

Distributed In-Memory K-V storeCol compression

Machine Learning

Algorithms

R EngineNano fast

Scoring Engine

Prediction Engine

Memory manager

eg Deep Learning

7

MapReduce

H2O Deep Learning ArnoCandel

H2O - The Killer App on Spark8

httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml

H2O Deep Learning ArnoCandel

H2O DeepLearning on Spark9

Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)

Brand-Sparkling-New Sneak Preview

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

Backward Propagation
How to compute ∂E/∂w_i for the update w_i <- w_i - rate · ∂E/∂w_i?
Naive: for every i, evaluate E twice at (w_1, ..., w_i ± Δ, ..., w_N)... Slow!
Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = Σ_i (w_i·x_i) + b
y = activation(net)
E = error(y):
∂E/∂w_i = ∂E/∂y · ∂y/∂net · ∂net/∂w_i = ∂(error(y))/∂y · ∂(activation(net))/∂net · x_i
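A single-neuron Scala sketch that checks the chain-rule gradient against the naive two-evaluation finite difference (the inputs and weights are made-up values; a tanh activation and squared error are assumed):

// Analytic backprop gradient vs. naive finite difference for one neuron.
object BackpropCheck {
  val x = Array(0.5, -0.3, 0.8); val b = 0.1; val target = 1.0

  def E(w: Array[Double]): Double = {
    val net = x.indices.map(i => w(i) * x(i)).sum + b // net = sum_i w_i x_i + b
    val y = math.tanh(net)                            // y = activation(net)
    0.5 * (y - target) * (y - target)                 // E = error(y)
  }

  // dE/dw_i = (y - target) * (1 - y^2) * x_i  (chain rule; tanh' = 1 - tanh^2)
  def analyticGrad(w: Array[Double], i: Int): Double = {
    val y = math.tanh(x.indices.map(j => w(j) * x(j)).sum + b)
    (y - target) * (1 - y * y) * x(i)
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.2, 0.4, -0.1); val delta = 1e-6
    for (i <- w.indices) {
      // the "naive" way: two evaluations of E per weight
      val numeric = (E(w.updated(i, w(i) + delta)) - E(w.updated(i, w(i) - delta))) / (2 * delta)
      println(f"dE/dw$i: backprop = ${analyticGrad(w, i)}%.6f, numeric = $numeric%.6f")
    }
  }
}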

H2O Deep Learning Architecture
Nodes/JVMs communicate synchronously, threads asynchronously, via the H2O atomic in-memory K-V store (each node also runs an HTTPD).
1. Initial model: weights and biases w.
2. map: each node trains a copy of the weights and biases with (some or all of) its local data, using asynchronous Fork/Join threads.
3. reduce: model averaging - average the weights and biases from all nodes, e.g. w = (w1 + w2 + w3 + w4)/4.
4. Updated model w; keep iterating over the data ("epochs"), scoring from time to time.
The number of training points per MapReduce iteration is auto-tuned (default) or user-specified.
Query & display the model via JSON / WWW.
Speedup is at least nodes/log(rows) (arxiv:1209.4129v3).
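The reduce step in miniature, as a hedged Scala sketch (four made-up per-node weight vectors; H2O averages the full set of weights and biases of the network):

// Model averaging: w = (w1 + w2 + ... + wN) / N, element-wise.
object ModelAverage {
  def average(models: Seq[Array[Double]]): Array[Double] =
    models.head.indices.map(i => models.map(_(i)).sum / models.size).toArray

  def main(args: Array[String]): Unit = {
    val perNode = Seq(
      Array(0.10, 0.20), // w1, trained on node 1's local data
      Array(0.30, 0.10), // w2
      Array(0.20, 0.40), // w3
      Array(0.40, 0.30)  // w4
    )
    println(average(perNode).mkString(", ")) // 0.25, 0.25 -> the updated model w
  }
}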

"Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): automatically set the learning rate for each neuron based on its training history.
Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s).
Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs.

Detail: Adaptive Learning Rate
Compute the moving average of Δw_i² at time t, for window length rho:
E[Δw_i²]_t = rho · E[Δw_i²]_{t-1} + (1 - rho) · Δw_i²
Compute the RMS of Δw_i at time t, with smoothing epsilon:
RMS[Δw_i]_t = sqrt( E[Δw_i²]_t + epsilon )
Do the same for ∂E/∂w_i, then obtain the per-weight learning rate (cf. ADADELTA paper):
rate(w_i, t) = RMS[Δw_i]_{t-1} / RMS[∂E/∂w_i]_t
Adaptive annealing/progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD, which has no window).
Adaptive acceleration/momentum: accumulates previous weight updates, but over a window of time.
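The per-weight recursion above, written out as a minimal Scala sketch (rho, epsilon and the toy objective E = w² are made-up illustration values, not H2O defaults):

// ADADELTA update for a single weight, following the formulas above.
class AdaDeltaWeight(rho: Double = 0.95, eps: Double = 1e-6) {
  private var avgGrad2 = 0.0 // E[g^2]_t, moving average of squared gradients
  private var avgDw2   = 0.0 // E[dw^2]_t, moving average of squared updates

  def update(grad: Double): Double = {
    avgGrad2 = rho * avgGrad2 + (1 - rho) * grad * grad
    val rate = math.sqrt(avgDw2 + eps) / math.sqrt(avgGrad2 + eps) // RMS[dw]_{t-1} / RMS[g]_t
    val dw = -rate * grad
    avgDw2 = rho * avgDw2 + (1 - rho) * dw * dw
    dw // per-weight step; no global learning rate needed
  }
}

object AdaDeltaDemo {
  def main(args: Array[String]): Unit = {
    val ada = new AdaDeltaWeight()
    var w = 1.0 // minimize E = w^2, so dE/dw = 2w
    for (t <- 1 to 5) { w += ada.update(2 * w); println(f"t=$t w=$w%.4f") }
  }
}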

Detail: Dropout Regularization
Training: for each hidden neuron, for each training sample, for each iteration: ignore (zero out) a different random fraction p of input activations.
Testing: use all activations, but reduce them by a factor p (to "simulate" the missing activations during training).
(cf. Geoff Hinton's paper)
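A Scala sketch of the two modes (illustrative; here p is taken as the dropped fraction, so test-time activations are scaled by the retained fraction 1 - p, which matches the slide's "factor p" when p denotes the retained fraction):

// Dropout: zero a random fraction p while training, scale while testing.
object Dropout {
  val rng = new scala.util.Random(42)

  def trainMode(activations: Array[Double], p: Double): Array[Double] =
    activations.map(a => if (rng.nextDouble() < p) 0.0 else a) // ignore fraction p

  def testMode(activations: Array[Double], p: Double): Array[Double] =
    activations.map(_ * (1 - p)) // use all activations, scaled by the retained fraction

  def main(args: Array[String]): Unit = {
    val acts = Array(0.9, 0.1, 0.7, 0.4)
    println(trainMode(acts, 0.5).mkString(", ")) // e.g. 0.9, 0.0, 0.0, 0.4
    println(testMode(acts, 0.5).mkString(", "))  // 0.45, 0.05, 0.35, 0.2
  }
}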

MNIST: digits classification
MNIST = digitized handwritten digits database (Yann LeCun). Data: 28x28 = 784 pixels with (gray-scale) values in 0...255.
Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes.
Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Let's see how H2O does on the MNIST dataset!

H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 minutes, 1.0% after 1.5 hours, 0.87% after 4 hours.
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training.
Running on 4 nodes with 16 cores each.
Frequent errors: confusing 2/7 and 4/9.

Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction
5-fold cross-validation, with an interactive ROC curve and real-time updates.
3 hidden Rectifier layers, Dropout, L1-penalty.
The 12.7% 5-fold cross-validation error is at least as good as the GBM/RF/GLM models.

Live Demo: Grid Search
How did I find those parameters? Grid Search! (It works for multiple hyper-parameters at once.)
Then continue training the best model - see the toy sketch below.
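As an illustration of the idea only (not H2O's grid-search API), a toy Scala scan over a few hyper-parameters that keeps the best model by validation error; the parameter values and the scoring function are stand-ins:

// Toy grid search: scan hyper-parameter combinations, keep the most promising.
object GridSearch {
  val hiddenLayers = Seq(Seq(50, 50), Seq(100, 100, 100))
  val l1Penalties  = Seq(1e-5, 1e-4, 1e-3)
  val dropoutRates = Seq(0.0, 0.2, 0.5)

  // Stand-in for training a model and scoring it on a validation set
  def validationError(hidden: Seq[Int], l1: Double, dropout: Double): Double =
    math.abs(l1 - 1e-4) + math.abs(dropout - 0.2) + 0.001 * hidden.sum // made-up

  def main(args: Array[String]): Unit = {
    val grid = for (h <- hiddenLayers; l1 <- l1Penalties; d <- dropoutRates)
      yield ((h, l1, d), validationError(h, l1, d))
    val ((bestH, bestL1, bestD), bestErr) = grid.minBy(_._2)
    println(s"best: hidden=$bestH l1=$bestL1 dropout=$bestD err=$bestErr")
    // ...then checkpoint and continue training the best model with more epochs
  }
}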

Text Classification
Goal: predict the item from the seller's text description, e.g. "Vintage 18KT gold Rolex 2 Tone in great condition".
Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,...,0 marking the presence of "vintage", "gold", "condition", ...
Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.
Let's see how H2O does on the eBay dataset!
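A minimal Scala sketch of this encoding, with a tiny stand-in vocabulary (the real dataset has 8,647 word columns):

// Binary word vector: one column per vocabulary word, 1.0 if present.
object WordVector {
  val vocabulary = Array("condition", "gold", "rolex", "tone", "vintage") // stand-in

  def encode(description: String): Array[Double] = {
    val words = description.toLowerCase.split("\\W+").toSet
    vocabulary.map(w => if (words.contains(w)) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // all five vocabulary words occur -> 1.0,1.0,1.0,1.0,1.0
    println(encode("Vintage 18KT gold Rolex 2 Tone in great condition").mkString(","))
  }
}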

Text Classification
Out-of-the-box: 11.6% test set error after 10 epochs! That is, H2O predicts the correct class (out of 143) 88.4% of the time.
Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB).
Note 2: No tuning was done (results are for illustration only).

Parallel Scalability (for 64 epochs on MNIST, with the "0.87%" parameters; 4 cores per node, 1 epoch per node per MapReduce)
[Charts: speedup (scale 0-40x) and training time in minutes (scale 0-100) vs. number of H2O nodes: 1, 2, 4, 8, 16, 32, 63; "2.7 mins" annotated]

Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find an anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each.
Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows reconstruction of the original data. Also works for categorical data.
Test set with anomaly + model of what's "normal" => the test set prediction is the reconstruction, which looks "normal"; the anomaly is found via its large reconstruction error.
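How the scoring works in principle, as a hedged Scala sketch: a simple smoothing function stands in for the trained auto-encoder, and the spike value is made up; the point is that the anomalous sample has a much larger reconstruction error.

// Anomaly scoring by reconstruction error (auto-encoder replaced by a stand-in).
object AnomalyScore {
  def reconstructionError(original: Array[Double], reconstructed: Array[Double]): Double =
    original.indices.map(i => math.pow(original(i) - reconstructed(i), 2)).sum / original.length

  // Stand-in for a trained auto-encoder: reconstructs a smoothed signal
  def reconstruct(x: Array[Double]): Array[Double] =
    x.indices.map(i => (x(math.max(i - 1, 0)) + x(i) + x(math.min(i + 1, x.length - 1))) / 3).toArray

  def main(args: Array[String]): Unit = {
    val normal  = Array.tabulate(210)(i => math.sin(i / 10.0)) // a "normal" heart beat sample
    val anomaly = normal.updated(100, 5.0)                     // one spiked data point
    println(f"normal  MSE = ${reconstructionError(normal, reconstruct(normal))}%.4f")
    println(f"anomaly MSE = ${reconstructionError(anomaly, reconstruct(anomaly))}%.4f") // much larger
  }
}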

H2O brings Deep Learning to R
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms
All parameters are available from R...

POJO Model Export for Production Scoring
Plain old Java code is auto-generated to take your H2O Deep Learning models into production.

Higgs Particle Discovery with H2O
How well did H2O Deep Learning do? Let's see how H2O did in the past 30 minutes.
Any guesses for the AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN (reference paper results).
<Your guess goes here>

Live Demo: H2O Steam Scoring Platform
Higgs dataset demo on a 10-node cluster: let's score all our H2O models and compare them!
http://server:port/steam/index.html

Scoring Higgs Models in H2O Steam
Live demo on a 10-node cluster: <10 minutes runtime for all algos! Better than the LHC baseline of AUC = 0.73!

Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.
Parameters (not heavily tuned), H2O running on 10 nodes:

Algorithm                     | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters
Generalized Linear Model      | -               | 0.596             | 0.684                | default, binomial
Random Forest                 | -               | 0.764             | 0.840                | 50 trees, max depth 50
Gradient Boosted Trees        | 0.73            | 0.753             | 0.839                | 50 trees, max depth 15
Neural Net 1 layer            | 0.733           | 0.760             | 0.830                | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836           | 0.850             | -                    | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868           | 0.869             | -                    | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880           | running           | -                    | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else! H2O's preliminary results compare well with the paper's results (TMVA & Theano).
Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Tips for H2O Deep Learning
General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).
Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2 = 10...50, L1 = 1e-5...1e-3 and/or L2 = 1e-5...1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.
Distributed:
- More training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4...1e-2, rate_annealing = 1e-5...1e-9, momentum_start = 0.5...0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!

Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!
Join our Community and Meetups!
https://github.com/0xdata
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
Thank you!

Page 8: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O - The Killer App on Spark8

httpdatabrickscomblog20140630sparkling-water-h20-sparkhtml

H2O Deep Learning ArnoCandel

H2O DeepLearning on Spark9

Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)

Brand-Sparkling-New Sneak Preview

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 9: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O DeepLearning on Spark9

Test if we can correctly learn A B where Y = logistic(A + BX) test(deep learning log regression) val nPoints = 10000 val A = 20 val B = -15 Generate testing data val trainData = DeepLearningSuitegenerateLogisticInput(A B nPoints 42) Create RDD from testing data val trainRDD = scparallelize(trainData 2) trainRDDcache() import H2OContext_ Create H2O data frame (will be implicit in the future) val trainH2ORDD = toDataFrame(sc trainRDD) Create a H2O DeepLearning model val dlParams = new DeepLearningParameters() dlParamssource = trainH2ORDD dlParamsresponse = trainH2ORDDlastVec() dlParamsclassification = true val dl = new DeepLearning(dlParams) val dlModel = dltrain()get() Score validation data val validationData = DeepLearningSuitegenerateLogisticInput(A B nPoints 17) val validationRDD = scparallelize(validationData 2) val validationH2ORDD = toDataFrame(sc validationRDD) val predictionH2OFrame = new DataFrame(dlModelscore(validationH2ORDD))(predict) val predictionRDD = toRDD[DoubleHolder](sc predictionH2OFrame) will be implicit in the future Validate prediction validatePrediction( predictionRDDcollect()map (_predictgetOrElse(DoubleNaN)) validationData)

Brand-Sparkling-New Sneak Preview

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 10: Deep Learning through Examples

H2O Deep Learning ArnoCandel 10

John Chambers (creator of the S language R-core member) names H2O R API in top three promising R projects

H2O R CRAN package

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning

46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2O: Add your own JIRA tickets!

H2O Deep Learning ArnoCandel

Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups: https://github.com/0xdata, http://docs.0xdata.com, www.h2o.ai/community, @hexadata

47

Thank you

Page 11: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O + R = Happy Data Scientist

11

Machine Learning on Big Data with RData resides on the H2O cluster

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 12: Deep Learning through Examples

H2O Deep Learning ArnoCandel 12

Higgs Particle Discovery

Higgsvs

Background

Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize

httparxivorgpdf14024735v2pdf

Images courtesy CERN LHC

Machine Learning Meets Physics

Or rather Back to the roots (WWW was invented at CERN in rsquo89hellip)

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 13: Deep Learning through Examples

H2O Deep Learning ArnoCandel 13

Higgs Binary Classification ProblemCurrent methods of choice for physicists - Boosted Decision Trees - Neural networks with 1 hidden layer BUT Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Metric AUC = Area under the ROC curve (range 05hellip1 higher is better)

add derived

features

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 14: Deep Learning through Examples

H2O Deep Learning ArnoCandel 14

Higgs Can Deep Learning Do Better

Letrsquos build a H2O Deep Learning model and find out (That was my last weekend)

Algorithm low-level H2O AUC all features H2O AUC

Generalized Linear Model 0596 0684

Random Forest 0764 0840

Gradient Boosted Trees 0753 0839

Neural Net 1 hidden layer 0760 0830

Deep Learning

ltYour guess goes heregt

reference paper results baseline 0733

H2O Deep Learning ArnoCandel

WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using

architectures composed of multiple non-linear transformations

What is Deep Learning

Example Input data(image)

Prediction (who is it)

15

Facebooks DeepFace (Yann LeCun) recognises faces as well as humans

H2O Deep Learning ArnoCandel

What is NOT DeepLinear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep (2 layers kernel + linear)

Classification trees are not deep (operate on original input space no new features generated)

16

H2O Deep Learning ArnoCandel

Deep Learning is Trending

20132009

Google trends

2011

17

Businesses are usingDeep Learning techniques

Google Brain (Andrew Ng Jeff Dean amp Geoffrey Hinton) FBI FACE $1 billion face recognition project Chinese Search Giant Baidu Hires Man Behind the ldquoGoogle Brainrdquo (Andrew Ng)

H2O Deep Learning ArnoCandel

Deep Learning Historyslides by Yan LeCun (now Facebook)

18

Deep Learning wins competitions AND

makes humans businesses and machines (cyborgs) smarter

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O ForkJoin worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization adaptive learning rate momentum dropout regularization l1L2 regularization grid search checkpointing auto-tuning model averaging)

= Top-notch prediction engine

Deep Learning in H2O19

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes

Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | (running) | - | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else. H2O preliminary results compare well with the paper's results (TMVA & Theano).

Higgs Particle Detection with H2O

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

H2O Deep Learning ArnoCandel

Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).

Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

45
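
For concreteness, roughly how these knobs map onto code, using today's h2o-3 Python API as an assumption (the Python package and some parameter names postdate this talk; the file path and response column are made up):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("higgs_train.csv")          # hypothetical dataset
train["response"] = train["response"].asfactor()

model = H2ODeepLearningEstimator(
    hidden=[500, 500, 500, 500],                    # 4 hidden layers
    activation="RectifierWithDropout",              # Rectifier + hidden dropout
    input_dropout_ratio=0.1,                        # for noisy high-dim input
    l1=1e-5, l2=1e-5, max_w2=10.0,                  # regularization
    adaptive_rate=True, rho=0.99, epsilon=1e-8,     # ADADELTA
    balance_classes=True, force_load_balance=True,
    epochs=300,
)
model.train(x=train.columns[:-1], y="response", training_frame=train)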

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning

46

- Vision: Convolutional & Pooling Layers (PUB-644)

- Anomaly Detection (PUB-806)

- Pre-Training: Stacked Auto-Encoders (PUB-1014)

- Faster Training: GPGPU support (PUB-1013)

- Language/Sequences: Recurrent Neural Networks

- Benchmark vs. other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2O: add your own JIRA tickets!

H2O Deep Learning ArnoCandel

Key Take-Aways: H2O is a distributed in-memory data science platform, designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups: https://github.com/0xdata, http://docs.0xdata.com, www.h2o.ai/community, @hexadata

47

Thank you

Page 15: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Wikipedia: "Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations."

What is Deep Learning?

Example: input data (image) -> prediction (who is it?)

15

Facebook's DeepFace (Yann LeCun) recognises faces as well as humans


Page 16: Deep Learning through Examples

H2O Deep Learning ArnoCandel

What is NOT Deep:

Linear models are not deep (by definition)

Neural nets with 1 hidden layer are not deep (only 1 layer, no feature hierarchy)

SVMs and kernel methods are not deep (2 layers: kernel + linear)

Classification trees are not deep (they operate on the original input space, no new features generated)

16


Page 17: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Deep Learning is Trending

[Google Trends chart for deep learning, 2009-2013]

17

Businesses are using Deep Learning techniques:

Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)


Page 18: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Deep Learning History: slides by Yann LeCun (now at Facebook)

18

Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!) smarter


Page 19: Deep Learning through Examples

H2O Deep Learning ArnoCandel

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)

= Top-notch prediction engine!

Deep Learning in H2O

19

H2O Deep Learning ArnoCandel

"fully connected" directed graph of neurons

[Diagram: Input layer (age, income, employment; 3 neurons) -> Hidden layer 1 (4 neurons) -> Hidden layer 2 (3 neurons) -> Output layer (married, single; 2 neurons); 3x4, 4x3, 3x2 connections; information flows from inputs to outputs]

Example Neural Network

20

H2O Deep Learning ArnoCandel

Prediction: Forward Propagation

"neurons activate each other via weighted sums"

Inputs x_i (age, income, employment) activate hidden layer 1:
y_j = tanh(sum_i(x_i u_ij) + b_j)

Hidden layer 2:
z_k = tanh(sum_j(y_j v_jk) + c_k)

Output layer (married, single), per-class probabilities with sum_l(p_l) = 1:
p_l = softmax(sum_k(z_k w_kl) + d_l), where softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

b_j, c_k, d_l: bias values (independent of the inputs)

Activation function: tanh; alternative: x -> max(0, x), the "rectifier"

p_l is a non-linear function of the x_i: with enough layers it can approximate ANY function

21
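
In code, one forward pass through this toy network could look like the following NumPy sketch (random placeholder weights, shapes as in the diagram above):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
U, b = rng.standard_normal((3, 4)), np.zeros(4)   # u_ij, b_j
V, c = rng.standard_normal((4, 3)), np.zeros(3)   # v_jk, c_k
W, d = rng.standard_normal((3, 2)), np.zeros(2)   # w_kl, d_l

x = np.array([0.3, -1.2, 0.5])   # standardized age, income, employment
y = np.tanh(x @ U + b)           # hidden layer 1
z = np.tanh(y @ V + c)           # hidden layer 2
p = softmax(z @ W + d)           # per-class probabilities, p.sum() == 1
print(p)                         # e.g. P(married), P(single)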

H2O Deep Learning ArnoCandel

Data preparation & Initialization

Neural Networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of data: inputs x_i rescaled to mean = 0, stddev = 1

"Horizontalize" (one-hot encode) categorical variables, e.g. employment: full-time, part-time, none, self-employed -> 0,1,0 = part-time; 0,0,0 = self-employed

Automatic initialization of weights w_kl:
Poor man's initialization: random weights
Default (better): uniform distribution in +/- sqrt(6 / (units + units_previous_layer))

22
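
A small sketch of these preparation steps with made-up column values:

import numpy as np

# Standardize a numeric column to mean 0, stddev 1
income = np.array([30_000.0, 55_000.0, 80_000.0, 120_000.0])
income_std = (income - income.mean()) / income.std()

# One-hot ("horizontalized") categorical column; the all-zeros row encodes
# the reference level, as on the slide (self-employed -> 0,0,0)
levels = ["full-time", "part-time", "none"]
def one_hot(value):
    return [1 if value == lvl else 0 for lvl in levels]
print(one_hot("part-time"))   # [0, 1, 0]

# Default weight initialization: uniform in +/- sqrt(6 / (fan_in + fan_out))
fan_in, fan_out = 4, 3
scale = np.sqrt(6.0 / (fan_in + fan_out))
W = np.random.uniform(-scale, scale, size=(fan_in, fan_out))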

H2O Deep Learning ArnoCandel

Training: Update Weights & Biases

For each training row we make a prediction and compare it with the actual label (supervised learning), e.g. married: predicted 0.8, actual 1; single: predicted 0.2, actual 0.

Objective: minimize the prediction error E (MSE or cross-entropy):
Mean Square Error = (0.2^2 + 0.2^2)/2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")

Stochastic Gradient Descent: update weights and biases via the gradient of the error (obtained by back-propagation):
w <- w - rate * dE/dw

23
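
The same objectives and update rule in NumPy, with the slide's numbers:

import numpy as np

predicted = np.array([0.8, 0.2])   # P(married), P(single) from forward prop
actual    = np.array([1.0, 0.0])   # true label: married

mse = np.mean((predicted - actual) ** 2)   # (0.2^2 + 0.2^2) / 2
cross_entropy = -np.log(predicted[0])      # -log(0.8), for the actual class

# One stochastic gradient descent step for any weight w
# (dE_dw would come from back-propagation; numbers here are made up):
rate, w, dE_dw = 0.005, 0.13, 0.42
w = w - rate * dE_dw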

H2O Deep Learning ArnoCandel

Backward Propagation

How to compute ∂E/∂w_i for w_i <- w_i - rate * ∂E/∂w_i ?

Naive: For every i, evaluate E twice at (w_1, …, w_i ± ∆, …, w_N)… Slow!

Backprop: Compute ∂E/∂w_i via the chain rule, going backwards:

net = sum_i(w_i x_i) + b
y = activation(net)
E = error(y)

∂E/∂w_i = ∂E/∂y · ∂y/∂net · ∂net/∂w_i
        = ∂(error(y))/∂y · ∂(activation(net))/∂net · x_i

24
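Written out for a single tanh neuron, the chain rule above becomes a few lines; this sketch assumes E = ½(y − t)², so ∂E/∂y = y − t and ∂(tanh)/∂net = 1 − y²:

```scala
object Backprop {
  // One neuron: net = sum_i(w_i * x_i) + b, y = tanh(net), E = 0.5*(y - t)^2
  def gradients(w: Array[Double], x: Array[Double], b: Double, t: Double): Array[Double] = {
    val net = w.indices.map(i => w(i) * x(i)).sum + b
    val y = math.tanh(net)
    val dEdy = y - t           // d(0.5*(y-t)^2)/dy
    val dydnet = 1 - y * y     // d(tanh(net))/dnet
    x.map(xi => dEdy * dydnet * xi) // dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
  }
}
```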

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

[Diagram: two nodes/JVMs, each with HTTPD and K-V store; communication: nodes/JVMs sync, threads async; per-node weight copies w1…w4 are averaged: w = (w1 + w2 + w3 + w4)/4]

initial model: weights and biases w

map: each node trains a copy of the weights and biases with (some or all of) its local data, using asynchronous F/J threads

reduce: model averaging - average the weights and biases from all nodes

updated model w, stored in the H2O atomic in-memory K-V store

speedup is at least nodes/log(rows) (arxiv:1209.4129v3)

Keep iterating over the data ("epochs"), score from time to time

Query & display the model via JSON, WWW

auto-tuned (default) or user-specified number of points per MapReduce iteration

25
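The map/reduce cycle above in miniature; trainLocal is a stand-in for the per-node asynchronous training, not H2O's actual code:

```scala
object ModelAveraging {
  type Weights = Array[Double]

  // Stand-in for local training: nudge this node's copy of the weights.
  def trainLocal(w: Weights, localData: Seq[Array[Double]]): Weights =
    w.map(_ + 0.01 * scala.util.Random.nextGaussian()) // pretend SGD happened here

  // reduce step: w = (w1 + w2 + ... + wn) / n
  def averaged(copies: Seq[Weights]): Weights =
    copies.transpose.map(ws => ws.sum / ws.length).toArray

  def main(args: Array[String]): Unit = {
    val w0: Weights = Array.fill(4)(0.0)
    val shards: Seq[Seq[Array[Double]]] = Seq.fill(4)(Seq.empty)    // 4 "nodes"
    val copies = shards.map(shard => trainLocal(w0.clone(), shard)) // map
    val w = averaged(copies)                                        // reduce
    println(w.mkString(", "))
  }
}
```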

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google): Automatically set the learning rate for each neuron based on its training history

Grid Search and Checkpointing: Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)

Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs

26

"Secret" Sauce to Higher Accuracy
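Grid search itself is just an exhaustive loop over hyper-parameter combinations, keeping the best model by validation error; a generic sketch (trainAndScore is a placeholder for training a model and returning its validation error):

```scala
object GridSearch {
  case class Params(hidden: Seq[Int], l1: Double, inputDropout: Double)

  // Placeholder: train a model with these parameters, return validation error.
  def trainAndScore(p: Params): Double = scala.util.Random.nextDouble()

  def main(args: Array[String]): Unit = {
    val grid = for {
      hidden  <- Seq(Seq(100), Seq(200, 200))
      l1      <- Seq(1e-5, 1e-3)
      dropout <- Seq(0.0, 0.2)
    } yield Params(hidden, l1, dropout)
    val best = grid.minBy(trainAndScore) // then continue training this model
    println(s"most promising: $best")
  }
}
```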

H2O Deep Learning ArnoCandel

Detail: Adaptive Learning Rate

Compute the moving average of ∆w_i² at time t, for window length rho:
E[∆w_i²]_t = rho · E[∆w_i²]_{t-1} + (1 - rho) · ∆w_i²

Compute the RMS of ∆w_i at time t, with smoothing epsilon:
RMS[∆w_i]_t = sqrt( E[∆w_i²]_t + epsilon )

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[∆w_i]_{t-1} / RMS[∂E/∂w_i]_t

(cf. ADADELTA paper)

Adaptive annealing/progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD: no window)

Adaptive acceleration/momentum: accumulate previous weight updates, but over a window of time

27
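Transcribing the ADADELTA formulas above directly into code gives a per-weight update rule like this sketch (not H2O's implementation):

```scala
class AdaDelta(n: Int, rho: Double = 0.95, eps: Double = 1e-8) {
  private val avgSqGrad  = Array.fill(n)(0.0) // E[(dE/dw)^2]
  private val avgSqDelta = Array.fill(n)(0.0) // E[(dw)^2]

  // Returns the weight updates for gradient g, updating the moving averages.
  def step(g: Array[Double]): Array[Double] =
    g.indices.map { i =>
      avgSqGrad(i) = rho * avgSqGrad(i) + (1 - rho) * g(i) * g(i)
      // rate(w_i, t) = RMS[dw]_{t-1} / RMS[dE/dw]_t
      val rate = math.sqrt(avgSqDelta(i) + eps) / math.sqrt(avgSqGrad(i) + eps)
      val dw = -rate * g(i)
      avgSqDelta(i) = rho * avgSqDelta(i) + (1 - rho) * dw * dw
      dw
    }.toArray
}
```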

H2O Deep Learning ArnoCandel

Detail: Dropout Regularization

28

Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations

[Diagram: some activations (age, income, employment, …) crossed out at random]

Testing: Use all activations, but reduce them by a factor p (to "simulate" the missing activations during training)

(cf. Geoff Hinton's paper)
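The train/test asymmetry above in code; note the slide reuses p for both the dropped fraction and the test-time factor, so this sketch takes p as the drop fraction and scales test activations by (1 − p):

```scala
import scala.util.Random

object Dropout {
  // Training: zero out a random fraction p of activations.
  def trainMask(a: Array[Double], p: Double, rng: Random): Array[Double] =
    a.map(v => if (rng.nextDouble() < p) 0.0 else v)

  // Testing: keep all activations, scaled to match the training-time expectation.
  def testScale(a: Array[Double], p: Double): Array[Double] =
    a.map(_ * (1 - p))
}
```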

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record: Without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft)

29

Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes

MNIST = digitized handwritten digits database (Yann LeCun)

Data: 28x28 = 784 pixels with (gray-scale) values in 0…255

Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."

Let's see how H2O does on the MNIST dataset!

H2O Deep Learning ArnoCandel

Frequent errors: confuses 2/7 and 4/9

H2O Deep Learning on MNIST: 0.87% test set error (so far)

30

test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours

World-class results!

No pre-training, no distortions, no convolutions, no unsupervised training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset

31

Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.

H2O Deep Learning A Candel

Live Demo: Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers, Dropout, L1-penalty

12.7% 5-fold cross-validation error - at least as good as GBM/RF/GLM models

H2O Deep Learning ArnoCandel

Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal: Predict the item from the seller's text description

34

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes

"Vintage 18KT gold Rolex 2 Tone in great condition"

Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (1s marking "vintage", "gold", "condition")

Let's see how H2O does on the eBay dataset!

Text Classification
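The binary word-vector encoding is simple enough to show inline: one 0/1 column per vocabulary word, 1 if the description contains it; a small Scala sketch with a made-up vocabulary:

```scala
object BagOfWords {
  // 1/0 per vocabulary word: 1 if the description mentions it.
  def encode(description: String, vocab: Seq[String]): Array[Int] = {
    val words = description.toLowerCase.split("\\W+").toSet
    vocab.map(w => if (words.contains(w)) 1 else 0).toArray
  }

  def main(args: Array[String]): Unit = {
    val vocab = Seq("vintage", "gold", "rolex", "condition", "silver") // made-up vocabulary
    println(encode("Vintage 18KT gold Rolex 2 Tone in great condition", vocab).mkString(","))
    // prints 1,1,1,1,0
  }
}
```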

H2O Deep Learning ArnoCandel

Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!

35

Note 2: No tuning was done (results are for illustration only)

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes

Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)

36

(4 cores per node, 1 epoch per node per MapReduce)

[Chart: Speedup (0.00…40.00) vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63)]

[Chart: Training Time in minutes (0…100) vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63), annotated "2.7 mins"]

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example: Find the anomaly in ECG heart beat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each

Deep Auto-Encoder: Learn the low-dimensional non-linear "structure" of the data that allows one to reconstruct the original data

Also works for categorical data!

H2O Deep Learning ArnoCandel 38

Model of what's "normal" + test set with anomaly => found the anomaly: large reconstruction error

Test set prediction is the reconstruction: looks "normal"

Deep Learning Auto-Encoders for Anomaly Detection
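Scoring then reduces to "large reconstruction error = anomaly"; in this sketch, reconstruct is a placeholder for the trained auto-encoder's forward pass, not H2O's API:

```scala
object AnomalyScore {
  // Placeholder for the trained auto-encoder's reconstruction of a sample.
  def reconstruct(sample: Array[Double]): Array[Double] = sample // identity stand-in

  // Mean squared reconstruction error: large values flag anomalies.
  def reconstructionError(sample: Array[Double]): Double = {
    val rec = reconstruct(sample)
    sample.indices.map(i => math.pow(sample(i) - rec(i), 2)).sum / sample.length
  }

  def isAnomaly(sample: Array[Double], threshold: Double): Boolean =
    reconstructionError(sample) > threshold
}
```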

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts: http://0xdata.com/h2o/algorithms

All parameters are available from R…

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production.

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do?

Let's see how H2O did in the past 30 minutes!

Higgs Particle Discovery with H2O

<Your guess goes here> (vs. the reference paper's results)

Any guesses for the AUC on low-level features? AUC = 0.76 was the best for RF/GBM/NN

H2O Deep Learning ArnoCandel

H2O Steam: Scoring Platform

42

Higgs Dataset Demo on a 10-node cluster: Let's score all our H2O models and compare them!

http://server:port/steam/index.html

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on a 10-node cluster: <10 minutes runtime for all algos! Better than the LHC baseline of AUC = 0.73!

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

Algorithm | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano)

Higgs Particle Detection with H2O

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 20: Deep Learning through Examples

H2O Deep Learning ArnoCandel

ldquofully connectedrdquo directed graph of neurons

age

income

employment

married

single

Input layerHidden layer 1

Hidden layer 2

Output layer

3x4 4x3 3x2connections

information flow

inputoutput neuronhidden neuron

4 3 2neurons 3

Example Neural Network20

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 21: Deep Learning through Examples

H2O Deep Learning ArnoCandel

age

income

employmentyj = tanh(sumi(xiuij)+bj)

uij

xi

yj

per-class probabilities sum(pl) = 1

zk = tanh(sumj(yjvjk)+ck)

vjk

zk pl

pl = softmax(sumk(zkwkl)+dl)

wkl

softmax(xk) = exp(xk) sumk(exp(xk))

ldquoneurons activate each other via weighted sumsrdquo

Prediction Forward Propagation

activation function tanh alternative

x -gt max(0x) ldquorectifierrdquo

pl is a non-linear function of xi can approximate ANY function

with enough layers

bj ck dl bias values(indep of inputs)

21

married

single

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 22: Deep Learning through Examples

H2O Deep Learning ArnoCandel

age

income

employment

xi

Automatic standardization of data xi mean = 0 stddev = 1

horizontalize categorical variables eg

full-time part-time none self-employed -gt

010 = part-time 000 = self-employed

Automatic initialization of weights

Poor manrsquos initialization random weights wkl

Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))

Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)

22

married

single

wkl

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 23: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo

Training Update Weights amp Biases

Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)

For each training row we make a prediction and compare with the actual label (supervised learning)

married108predicted actual

Objective minimize prediction error (MSE or cross-entropy)

w ltmdash w - rate partEpartw

1

23

single002

E

wrate

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 24: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Backward Propagation

partEpartwi = partEparty partypartnet partnetpartwi

= part(error(y))party part(activation(net))partnet xi

Backprop Compute partEpartwi via chain rule going backwards

wi

net = sumi(wixi) + b

xiE = error(y)

y = activation(net)

How to compute partEpartwi for wi ltmdash wi - rate partEpartwi

Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow

24

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning, Arno Candel

MNIST digits classification

MNIST = digitized handwritten digits database (Yann LeCun). Data: 28x28 = 784 pixels with (gray-scale) values in 0…255.

Train: 60,000 rows, 784 integer columns, 10 classes. Test: 10,000 rows, 784 integer columns, 10 classes.

Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).

Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."

Let's see how H2O does on the MNIST dataset!

29

H2O Deep Learning, Arno Candel

H2O Deep Learning on MNIST: 0.87% test set error (so far)

Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours. Running on 4 nodes with 16 cores each.

World-class results! No pre-training, no distortions, no convolutions, no unsupervised training.

Frequent errors: confuses 2 with 7 and 4 with 9.

30

H2O Deep Learning, Arno Candel

Weather Dataset

Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.

31

H2O Deep Learning, Arno Candel

Live Demo: Weather Prediction

Interactive ROC curve with real-time updates.

3 hidden Rectifier layers, Dropout, L1-penalty, 5-fold cross-validation.

The 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models.

32

H2O Deep Learning, Arno Candel

Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)

Then continue training the best model.

33
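A generic sketch of multi-hyper-parameter grid search (not H2O's grid-search API): enumerate the Cartesian product of settings, score each, keep the best for continued training. train_and_score is a hypothetical stand-in for fitting a model and returning its validation error.

from itertools import product

grid = {
    "hidden": [[200, 200], [500, 500, 500]],
    "l1": [1e-5, 1e-3],
    "input_dropout": [0.0, 0.2],
}

def train_and_score(params):
    # stand-in for training a model and returning its validation error;
    # a deterministic toy score so the sketch runs end-to-end
    return params["l1"] * 100 + 0.01 * len(params["hidden"]) + 0.1 * params["input_dropout"]

results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((train_and_score(params), params))

best_error, best_params = min(results, key=lambda t: t[0])
print(best_error, best_params)
# then: continue training (checkpointing) the model with best_params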

H2O Deep Learning, Arno Candel

Text Classification

Goal: predict the item from the seller's text description.

"Vintage 18KT gold Rolex 2 Tone in great condition"

Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 with 1s at the positions of "vintage", "gold" and "condition".

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.

Let's see how H2O does on the eBay dataset!

34
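A sketch of the binary word-vector encoding above; the vocabulary and word positions are made up for illustration.

import numpy as np

vocab = ["vintage", "18kt", "gold", "rolex", "tone", "great", "condition"]
index = {w: i for i, w in enumerate(vocab)}

def to_binary_vector(text):
    vec = np.zeros(len(vocab), dtype=np.int8)
    for word in text.lower().split():
        if word in index:
            vec[index[word]] = 1   # 1 if the word occurs, regardless of count
    return vec

print(to_binary_vector("Vintage 18KT gold Rolex 2 Tone in great condition"))
# -> [1 1 1 1 1 1 1]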

H2O Deep Learning, Arno Candel

Text Classification

Out-of-the-box: 11.6% test set error after 10 epochs. Predicts the correct class (out of 143) 88.4% of the time!

Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB).

Note 2: no tuning was done (results are for illustration only).

Train: 578,361 rows, 8,647 cols, 467 classes. Test: 64,263 rows, 8,647 cols, 143 classes.

35

H2O Deep Learning, Arno Candel

Parallel Scalability (for 64 epochs on MNIST, with "0.87%" parameters)

[Charts: Speedup (0-40x) and Training Time in minutes (0-100) vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce. Training time drops to 2.7 mins at the largest cluster size.]

36

H2O Deep Learning, Arno Candel

Deep Learning Auto-Encoders for Anomaly Detection

Toy example: find the anomaly in ECG heart-beat data. First train a model on what's "normal": 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learns the low-dimensional, non-linear "structure" of the data, which allows it to reconstruct the original data. Also works for categorical data.

37
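A sketch of reconstruction-based anomaly detection on synthetic "heart-beat-like" series; a linear autoencoder (via SVD, the optimal linear case) stands in for the deep auto-encoder, since only the train-on-normal / flag-large-reconstruction-error logic matters here. All data and dimensions are made up.

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 210)
normal = np.array([np.sin(5 * t + rng.normal(0, 0.1)) for _ in range(20)])  # 20 x 210

# "Train": capture the low-dimensional structure of the normal data
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
V = Vt[:3].T                                   # 210 -> 3 bottleneck

def reconstruction_error(x):
    code = (x - mean) @ V                      # encode to 3 dims
    recon = mean + code @ V.T                  # decode back to 210 dims
    return np.mean((x - recon) ** 2)

good = np.sin(5 * t)                           # looks "normal"
anomaly = np.sin(5 * t); anomaly[100:110] += 3 # spike mid-series
print(reconstruction_error(good))              # small
print(reconstruction_error(anomaly))           # large -> anomaly found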

H2O Deep Learning, Arno Candel

Deep Learning Auto-Encoders for Anomaly Detection

[Figure: test set with anomaly + model of what's "normal" => the test set prediction (i.e. the reconstruction) looks "normal"; the anomaly is found where the reconstruction error is large.]

38

H2O Deep Learning, Arno Candel

H2O brings Deep Learning to R

R Vignette with example R scripts: http://0xdata.com/h2o/algorithms

All parameters are available from R…

39

H2O Deep Learning, Arno Candel

POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

40

H2O Deep Learning, Arno Candel

Higgs Particle Discovery with H2O

How well did H2O Deep Learning do? Let's see how H2O did in the past 30 minutes!

Reference paper results: AUC = 0.76 was the best for RF/GBM/NN. Any guesses for the AUC on low-level features? <Your guess goes here>

41

H2O Deep Learning, Arno Candel

Live Demo: H2O Steam Scoring Platform

Higgs dataset demo on a 10-node cluster: let's score all our H2O models and compare them!

http://server:port/steam/index.html

42

H2O Deep Learning, Arno Candel

Scoring Higgs Models in H2O Steam

Live demo on a 10-node cluster: less than 10 minutes runtime for all algos! Better than the LHC baseline of AUC = 0.73!

43
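For reference, AUC comparisons like these can be computed directly from model scores; a small sketch using the rank-statistic (Mann-Whitney U) formulation, with made-up scores and labels (ties between scores are not handled).

import numpy as np

def auc(scores, labels):
    order = np.argsort(scores)
    ranks = np.empty(len(scores)); ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
model_a = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.3, 0.7, 0.1])
model_b = np.array([0.55, 0.5, 0.4, 0.7, 0.6, 0.2, 0.45, 0.3])
print(auc(model_a, labels), auc(model_b, labels))  # higher AUC wins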

H2O Deep Learning, Arno Candel

Higgs Particle Detection with H2O

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows.

Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Algorithm                     | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters (not heavily tuned), H2O running on 10 nodes
Generalized Linear Model      | -               | 0.596             | 0.684                | default, binomial
Random Forest                 | -               | 0.764             | 0.840                | 50 trees, max depth 50
Gradient Boosted Trees        | 0.73            | 0.753             | 0.839                | 50 trees, max depth 15
Neural Net 1 layer            | 0.733           | 0.760             | 0.830                | 1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers | 0.836           | 0.850             | -                    | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers | 0.868           | 0.869             | -                    | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers | 0.880           | running           | -                    | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else! H2O preliminary results compare well with the paper's results (TMVA & Theano).

44

H2O Deep Learning, Arno Candel

Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).

Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accuracy.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

A configuration sketch tying these tips together follows below.

45
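To tie the tips above to concrete settings, here is a sketch using the h2o Python API (which postdates this deck; at the time the interfaces were R and the web UI). The parameter names come from the slide; the file "train.csv" and the response column "response" are hypothetical placeholders.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
frame = h2o.import_file("train.csv")            # hypothetical dataset
train, valid = frame.split_frame(ratios=[0.8])

model = H2ODeepLearningEstimator(
    activation="RectifierWithDropout",   # Tanh/Rectifier (+ Dropout variants)
    hidden=[500, 500, 500],              # more layers/neurons: more capacity
    input_dropout_ratio=0.2,             # input dropout for noisy input
    hidden_dropout_ratios=[0.5, 0.5, 0.5],
    l1=1e-5, l2=1e-5, max_w2=10,         # regularization
    adaptive_rate=True, rho=0.99, epsilon=1e-8,   # ADADELTA
    balance_classes=True,                # for large class imbalance
    epochs=10,
)
model.train(y="response", training_frame=train, validation_frame=valid)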

H2O Deep Learning, Arno Candel

Extensions for H2O Deep Learning

- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O! Add your own JIRA tickets!

46

H2O Deep Learning, Arno Candel

Key Take-Aways

H2O is a distributed, in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!

Join our Community and Meetups! https://github.com/0xdata http://docs.0xdata.com www.h2o.ai/community @hexadata

47

Thank you!

Page 25: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O Deep Learning Architecture

K-V

K-V

HTTPD

HTTPD

nodesJVMs sync

threads async

communication

w

w w

w w w w

w1 w3 w2w4

w2+w4w1+w3

w = (w1+w2+w3+w4)4

map each node trains a copy of the weights

and biases with (some or all of) its

local data with asynchronous FJ

threads

initial model weights and biases w

updated model w

H2O atomic in-memoryK-V store

reduce model averaging

average weights and biases from all nodes

speedup is at least nodeslog(rows) arxiv12094129v3

Keep iterating over the data (ldquoepochsrdquo) score from time to time

Query amp display the model via

JSON WWW

2

2 431

1

1

1

43 2

1 2

1

i

auto-tuned (default) or user-specified number of points per MapReduce iteration

25

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 26: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history

Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)

RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs

26

ldquoSecretrdquo Sauce to Higher Accuracy

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 27: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Detail Adaptive Learning Rate

Compute moving average of ∆wi2 at time t for window length rho

E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2

Compute RMS of ∆wi at time t with smoothing epsilon

RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )

Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)

Adaptive acceleration momentum accumulate previous weight updates but over a window of time

RMS[∆wi]t-1

RMS[partEpartwi]t

rate(wi t) =

Do the same for partEpartwi then obtain per-weight learning rate

cf ADADELTA paper

27

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 28: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Detail Dropout Regularization28

Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations

age

income

employment

married

singleX

X

X

Testing Use all activations but reduce them by a factor p

(to ldquosimulaterdquo the missing activations during training)

cf Geoff Hintons paper

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 29: Deep Learning through Examples

H2O Deep Learning ArnoCandel

MNIST digits classification

Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)

29

Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes

MNIST = Digitized handwritten digits database (Yann LeCun)

Data 28x28=784 pixels with (gray-scale) values in 0hellip255

Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo

Letrsquos see how H2O does on the MNIST dataset

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 30: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Frequent errors confuse 27 and 49

H2O Deep Learning on MNIST 087 test set error (so far)

30

test set error 15 after 10 mins 10 after 15 hours 087 after 4 hours

World-class results

No pre-training No distortions

No convolutions No unsupervised

training

Running on 4 nodes with 16 cores each

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 31: Deep Learning through Examples

H2O Deep Learning A Candel

Weather Dataset31

Predict ldquoRainTomorrowrdquo from Temperature Humidity Wind Pressure etc

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

Scoring Higgs Models in H2O Steam (slide 43)

Live Demo on a 10-node cluster: <10 minutes runtime for all algos. Better than the LHC baseline of AUC = 0.73.
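The same side-by-side scoring can be scripted in R (glm_m, rf_m, gbm_m, dl_m are hypothetical handles to the models trained earlier; test is the held-out Higgs frame):

models <- list(GLM = glm_m, RF = rf_m, GBM = gbm_m, DL = dl_m)
sapply(models, function(m) h2o.auc(h2o.performance(m, newdata = test)))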

Higgs Particle Detection with H2O (slide 44)

HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows. Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Algorithm | Paper's low-level AUC | H2O low-level AUC | H2O all-features AUC | Parameters (not heavily tuned; H2O running on 10 nodes)
Generalized Linear Model | - | 0.596 | 0.684 | default, binomial
Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15
Neural Net, 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers | 0.880 | running | - | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else. H2O preliminary results compare well with the paper's results (TMVA & Theano).
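As a sketch, here is one row of the table above in R (3x1000 Rectifier, L2 = 1e-5, 40 epochs on the 21 low-level features); the CSV path is an assumption, and the talk's 10M/500k train/test split is omitted for brevity:

higgs <- h2o.importFile("HIGGS.csv")   # UCI HIGGS: label + 21 low-level + 7 high-level cols
higgs[, 1] <- as.factor(higgs[, 1])    # column 1: signal vs. background
dl3 <- h2o.deeplearning(x = 2:22, y = 1,   # low-level features only
                        training_frame = higgs,
                        hidden = c(1000, 1000, 1000),
                        activation = "Rectifier",
                        l2 = 1e-5, epochs = 40)
h2o.auc(h2o.performance(dl3))  # training metrics; the table reports 0.850 held-out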

Tips for H2O Deep Learning (slide 45)

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).

Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2 = 10…50, L1 = 1e-5…1e-3 and/or L2 = 1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy, high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accurate.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

(An R sketch of these knobs follows below.)
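The sketch referenced above, mapping these tips onto the current h2o R API (predictors, response, train, and valid are placeholders; the values are illustrative starting points, not recommendations):

dl <- h2o.deeplearning(x = predictors, y = response,
                       training_frame = train, validation_frame = valid,
                       activation = "RectifierWithDropout",
                       hidden = c(200, 200, 200),
                       input_dropout_ratio = 0.1,                 # input dropout, up to ~20%
                       hidden_dropout_ratios = c(0.3, 0.3, 0.3),  # hidden dropout, up to ~50%
                       l1 = 1e-5, l2 = 1e-5, max_w2 = 10,
                       adaptive_rate = TRUE, rho = 0.99, epsilon = 1e-8,  # ADADELTA
                       balance_classes = TRUE,
                       force_load_balance = TRUE,
                       replicate_training_data = TRUE,
                       epochs = 10)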

Extensions for H2O Deep Learning (slide 46)

- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O! Add your own JIRA tickets.

Key Take-Aways (slide 47)

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!

Join our Community and Meetups! https://github.com/0xdata, http://docs.0xdata.com, www.h2o.ai/community, @hexadata

Thank you!

Page 32: Deep Learning through Examples

H2O Deep Learning A Candel

Live Demo Weather Prediction

Interactive ROC curve with real-time updates

32

3 hidden Rectifier layers Dropout

L1-penalty

127 5-fold cross-validation error is at least as good as GBMRFGLM models

5-fold cross validation

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 33: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Live Demo Grid Search

How did I find those parameters Grid Search(works for multiple hyper parameters at once)

33

Then continue training the best model

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 34: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Goal Predict the item from sellerrsquos text description

34

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo

Data Binary word vector 0010000010001hellip0

vintagegold condition

Letrsquos see how H2O does on the ebay dataset

Text Classification

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 35: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time

35

Note 2 No tuning was done(results are for illustration only)

Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes

Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)

Text Classification

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 36: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)

36

Speedup

000

1000

2000

3000

4000

1 2 4 8 16 32 63

H2O Nodes

(4 cores per node 1 epoch per node per MapReduce)

27 mins

Training Time

0

25

50

75

100

1 2 4 8 16 32 63

H2O Nodes

in minutes

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 37: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Deep Learning Auto-Encoders for Anomaly Detection

37

Toy example Find anomaly in ECG heart beat data First train a model on whatrsquos ldquonormalrdquo 20 time-series samples of 210 data points each

Deep Auto-Encoder Learn low-dimensional non-linear ldquostructurerdquo of the data that allows to reconstruct the orig data

Also for categorical data

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 38: Deep Learning through Examples

H2O Deep Learning ArnoCandel 38

Test set with anomaly

Test set prediction is reconstruction looks ldquonormalrdquo

Found anomaly large reconstruction error

Model of whatrsquos ldquonormalrdquo

+

=gt

Deep Learning Auto-Encoders for Anomaly Detection

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 39: Deep Learning through Examples

H2O Deep Learning ArnoCandel 39

R Vignette with example R scripts http0xdatacomh2oalgorithms

All parameters are available from Rhellip

H2O brings Deep Learning to R

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 40: Deep Learning through Examples

H2O Deep Learning ArnoCandel

POJO Model Export for Production Scoring

40

Plain old Java code is auto-generated to take your H2O Deep Learning models into production

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 41: Deep Learning through Examples

H2O Deep Learning ArnoCandel 41

How well did H2O Deep Learning do

Letrsquos see how H2O did in the past 30 minutes

Higgs Particle Discovery with H2O

ltYour guess goes heregt

reference paper results

Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 42: Deep Learning through Examples

H2O Deep Learning ArnoCandel

H2O Steam Scoring Platform

42

Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them

httpserverportsteamindexhtml

Live Demo

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 43: Deep Learning through Examples

H2O Deep Learning ArnoCandel 43

Live Demo on 10-node cluster lt10 minutes runtime for all algos Better than LHC baseline of AUC=073

Scoring Higgs Models in H2O Steam

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows

H2O Deep Learning ArnoCandel

Tips for H2O Deep LearningGeneral More layers for more complex functions (exp more non-linearity) More neurons per layer to detect finer structure in data (ldquomemorizingrdquo) Add some regularization for less overfitting (lower validation set error) Specifically Do a grid search to get a feel for convergence then continue training Try TanhRectifier try max_w2=10hellip50 L1=1e-51e-3 andor L2=1e-5hellip1e-3 Try Dropout (input up to 20 hidden up to 50) with testvalidation set Input dropout is recommended for noisy high-dimensional input Distributed More training samples per iteration faster but less accuracy With ADADELTA Try epsilon = 1e-41e-61e-81e-10 rho = 09095099 Without ADADELTA Try rate = 1e-4hellip1e-2 rate_annealing = 1e-5hellip1e-9 momentum_start = 05hellip09 momentum_stable = 099 momentum_ramp = 1rate_annealing Try balance_classes = true for datasets with large class imbalance Enable force_load_balance for small datasets Enable replicate_training_data if each node can h0ld all the data

45

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning46

- Vision Convolutional amp Pooling Layers PUB-644

- Anomaly Detection PUB-806

- Pre-Training Stacked Auto-Encoders PUB-1014

- Faster Training GPGPU support PUB-1013

- LanguageSequences Recurrent Neural Networks

- Benchmark vs other Deep Learning packages

- Investigate other optimization algorithms

Contribute to H2OAdd your own JIRA tickets

H2O Deep Learning ArnoCandel

Key Take-AwaysH2O is a distributed in-memory data science platform It was designed for high-performance machine learning applications on big data

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data

Join our Community and Meetups httpsgithubcom0xdata httpdocs0xdatacom wwwh2oaicommunity hexadata

47

Thank you

Page 44: Deep Learning through Examples

H2O Deep Learning ArnoCandel 44

AlgorithmPaperrsquosl-l AUC

low-level H2O AUC

all featuresH2O AUC

Parameters (not heavily tuned) H2O running on 10 nodes

Generalized Linear Model - 0596 0684 default binomial

Random Forest - 0764 0840 50 trees max depth 50

Gradient Boosted Trees 073 0753 0839 50 trees max depth 15

Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs

Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs

Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs

Deep Learning 6 hidden layers 0880 running - 6x500 Rectifier L1=L2=1e-5

Deep Learning on low-level features alone beats everything else H2O prelim results compare well with paperrsquos results (TMVA amp Theano)

Higgs Particle Detection with H2O

Nature paper httparxivorgpdf14024735v2pdf

HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows


Page 45: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).

Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set.
- Input dropout is recommended for noisy high-dimensional input.

Distributed:
- More training samples per iteration: faster, but less accurate.
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.

A parameter-setting sketch follows below.

45
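As a worked example, the sketch below wires several of these tips into the DeepLearningParameters API from the Sparkling Water example earlier. The option names (input_dropout_ratio, hidden_dropout_ratios, l1, l2, max_w2, adaptive_rate, rho, epsilon, balance_classes, force_load_balance, replicate_training_data) mirror H2O's documented deep learning options, but the Scala field names are assumptions that may differ by version; trainFrame is a hypothetical, already-parsed H2O frame.

val p = new DeepLearningParameters()
p.source = trainFrame
p.response = trainFrame.lastVec()
p.classification = true

// Regularization, per the tips above
p.activation = Activation.RectifierWithDropout  // Rectifier plus hidden-layer dropout
p.hidden = Array(200, 200)                      // 2 hidden layers for this sketch
p.input_dropout_ratio = 0.2                     // input dropout up to 20% for noisy inputs
p.hidden_dropout_ratios = Array(0.5, 0.5)       // hidden dropout up to 50%, one per layer
p.l1 = 1e-5                                     // within the suggested 1e-5…1e-3 range
p.l2 = 1e-5
p.max_w2 = 10                                   // cap on the squared sum of incoming weights per neuron

// ADADELTA adaptive learning rate
p.adaptive_rate = true
p.rho = 0.99
p.epsilon = 1e-8

// Distributed settings
p.balance_classes = true            // for large class imbalance
p.force_load_balance = true         // small datasets: keep all cores busy
p.replicate_training_data = true    // each node holds the full data

val model = new DeepLearning(p).train().get()

Note that with adaptive_rate = true, the manual rate/momentum settings from the "Without ADADELTA" branch are ignored, so only rho and epsilon need tuning.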


Page 46: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Extensions for H2O Deep Learning

46

- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O! Add your own JIRA tickets.


Page 47: Deep Learning through Examples

H2O Deep Learning ArnoCandel

Key Take-Aways

H2O is a distributed, in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups! https://github.com/0xdata http://docs.0xdata.com www.h2o.ai/community @hexadata

47

Thank you