
Page 1

Intel HPC Developer Convention Salt Lake City 2016
Machine Learning Track

Franz J. Király

Data Analytics, Machine Learning and HPC in today's changing application environment

Page 2

An overview of data analytics

[Diagram: the data-analytics workflow, linking DATA, Scientific Questions, Exploration, Statistical Questions, Methods, and Quantitative Modelling (Predictive/Inferential, Descriptive/Explanatory), with Scientific and Statistical Validation leading to (practical) Knowledge. Side labels: "Statistical Programming (R, python)" and "The Scientific Method".]

Page 3

Data analytics and data science in a broader context

[Diagram: Raw data → Clean data → Data analytics (data mining, machine learning, statistics, modelling) → Knowledge.]

There are a lot of problems and subtleties at the raw-data/clean-data stages already; often, most of the manpower in a „data“ project needs to go here first, before one can attempt reliable modelling.

Relevant findings and the underlying arguments need to be explained well and properly.

Page 4

Big Data?

Page 5

What „Big Data“ may mean in practice

[Chart: number of data samples (100 / 1.000 / 10.000 / 10.000.000 / 10.000.000.000) and number of features (1.000), against the strategies that stop working in reasonable time at each scale: manual exploratory data analysis; kernel methods, OLS; L1, LASSO (around the same order); super-linear algorithms; random forests; linear algorithms, including reading in all the data. Corresponding solution strategies: sub-sampling, on-line models, feature extraction, feature selection, large-scale strategies for super-linear algorithms, distributed computing.]

Page 6

Large-scale motifs in data science

„Big models“ = where high-performance computing is helpful/impactful
Not necessarily a lot of data, but computationally intensive models.
Classical example: finite elements and other numerical models.
New fancy example: large neural networks, aka „deep learning“.
Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes.

„Big data“ = what it says, a lot of data (ca. 1 million samples or more)
The „classic“, beloved by everyone.
The computational challenge arises from processing all of the data.
Example: histogram or linear regression with huge amounts of data.
Common HPC motif: divide/conquer training/fitting of the model, e.g. batchwise/epoch fitting.

Model validation and model selection = this talk's focus
Answers the question: which model is best for your data?
Demanding even for simple models and small amounts of data!
Example: is deep learning better than logistic regression, or guessing?

Page 7

Meta-modelling: stylized case studies

Customer: Hospital specializing in the treatment of patients with a certain disease.
Patients with this disease are at risk of experiencing an adverse event (e.g. death).
Scientific question: depending on patient characteristics, predict the event risk.
Data set: complete clinical records of 1.000 patients, including the event if it occurred.

Customer: Retailer who wants to accurately model the behaviour of customers.
Customers can buy (or not buy) any of a number of products, or churn.
Scientific question: predict future customer behaviour given past behaviour.
Data set: complete customer and purchase records of 100.000 customers.

Customer: Manufacturer who wishes to find the best parameter settings for machines.
Parameters influence the amount/quality of product (or whether the machine breaks).
Scientific question: find the parameter settings which optimize the above.
Data set: outcomes for 10.000 parameter settings on those machines.

Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the „real world“.
Not of interest: which algorithm/strategy, out of many, exactly solves the task.

Page 8

Model validation and model selection = data-centric and data-dependent modelling

A scientific necessity implied by the scientific method and the following:

1. There is no model that is good for all data.
(otherwise the concept of a model would be unnecessary)

2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one.
(any such belief is not empirically justified, hence pseudoscientific)

3. No model can be trusted unless its validity has been verified by a model-independent argument.
(otherwise the justification of validity is circular, hence faulty)

Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality.

Page 9

Machine Learning and Meta-Modelling in a Nutshell

Page 10

Leitmotifs of Machine Learning, from the intersection of engineering, statistics and computer science:

Engineering & statistics idea: statistical models are objects in their own right: „learning machines“ embodying a modelling strategy.

Engineering & computer science idea: any abstract algorithm (possibly non-explicit) can be a modelling strategy/learning machine („computational learning“).

Computer science & statistics idea: the future performance of an algorithm/learning machine can (and should) be estimated („model validation“, „model selection“).

Page 11

Problem types in Machine Learning

Supervised Learning: some data is labelled by an expert/oracle.
Task: predict the label from the covariates.
Statistical models are usually discriminative.
Examples: regression, classification.

Page 12

Problem types in Machine Learning

Unsupervised Learning: the training data is not pre-labelled.
Task: find „structure“ or „patterns“ in the data.
Statistical models are usually generative.
Examples: clustering, dimension reduction.

Page 13

Advanced learning tasks

Complications in the labelling:
Semi-supervised learning: some training data are labelled, some are not.
Anomaly detection: all or most data are „positive examples“; the task is to flag „test negatives“.
Reinforcement learning: data are not directly labelled, only by indirect gain/loss.

Complications through correlated data and/or time:
On-line learning: the data is revealed over time; models need to update.
Forecasting: each data point has a time stamp; predict the temporal future.
Transfer learning: the data comes in dissimilar batches; train and test may be distinct.

Page 14

What is a Learning Machine?

… an algorithm that solves, e.g., the previous tasks.

Examples: generalized linear model, linear regression, support vector machine, neural networks (= „deep learning“), random forests, gradient boosting, …

Illustration: supervised learning machine.

[Diagram: observations („training data“) → model fitting (“learning”) → fitted model; new data + fitted model → prediction → predictions; model tuning parameters (e.g., to base decisions on) feed into model fitting.]
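To make the fit/predict picture concrete, here is a minimal sketch in Python of what a learning-machine interface might look like; the class names are illustrative, not taken from any toolbox:

```python
import numpy as np

class LearningMachine:
    """A supervised learning machine: fit on training data, predict on new data."""
    def __init__(self, **tuning_params):
        self.tuning_params = tuning_params  # decisions such as "fit intercept or not"

    def fit(self, X_train, y_train):
        raise NotImplementedError  # model fitting ("learning")

    def predict(self, X_new):
        raise NotImplementedError  # prediction on new data

class MeanPredictor(LearningMachine):
    """An 'uninformed' baseline machine: always predicts the training mean."""
    def fit(self, X_train, y_train):
        self.mean_ = np.mean(y_train)
        return self

    def predict(self, X_new):
        return np.full(len(X_new), self.mean_)
```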

Page 15

Example: Linear Regression

[Diagram: the supervised learning machine picture instantiated for linear regression: observations („training data“) → model fitting (“learning”) → fitted model; new data → prediction → predictions. Model tuning parameter: fit intercept or not?]
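In scikit-learn (discussed later in the deck), this picture is two method calls, and „fit intercept or not?“ is literally a constructor argument. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))           # observations ("training data")
y_train = X_train @ [1.0, -2.0, 0.5] + 3.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression(fit_intercept=True)  # tuning parameter: fit intercept or not?
model.fit(X_train, y_train)                   # model fitting ("learning")

X_new = rng.normal(size=(5, 3))               # new data
predictions = model.predict(X_new)            # prediction
```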

Page 16

Model validation: does the model make sense?

[Diagram: „training data“ („in-sample“) → model learning (e.g. regression, GLM, advanced methods) → learnt model → prediction on „test data“ („out-of-sample“, „hold-out“) → predictions; the predictions are compared & quantified against the „test labels“, i.e. „the truth“, thereby evaluating the regression model / prediction strategy / learning machine.]

Predictive models need to be validated on unseen data!
The only (general) way to test goodness of prediction is actually observing prediction!
Which means: the part of the data used for testing has not been seen by the algorithm before!
(note: this includes the case where the machine = linear regression, deep learning, etc.)
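A minimal hold-out validation sketch with scikit-learn, on synthetic data; the test part is never touched during fitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ [1.0, -2.0, 0.5] + rng.normal(scale=0.1, size=200)

# hold-out: the test part has not been seen by the algorithm before
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # "in-sample" fitting
y_pred = model.predict(X_test)                     # "out-of-sample" prediction
rmse = mean_squared_error(y_test, y_pred) ** 0.5   # compare & quantify vs "test labels"
print(f"hold-out RMSE: {rmse:.3f}")
```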

Page 17

„Re-sampling“: multiple algorithms are compared on multiple data splits/sub-datasets

[Diagram: all data is re-sampled into training data 1/2/3 with test data 1/2/3; Predictors 1, 2, 3 are fitted and evaluated on each split, yielding errors 1, 2, 3 per split; the errors are aggregated for comparison.]

Types of re-sampling, how to obtain the training/test splits, and pros/cons:

k-fold cross-validation:
1. divide the data into k (almost) equal parts
2. obtain k train/test splits: each part is test data exactly once, the rest of the data is the training set
Often k=5: a good compromise between runtime and accuracy when k is small compared to the data size.

leave-one-out:
= [number of data points]-fold cross-validation.
Very accurate, high run-time.

repeated sub-sampling (parameters: training/test size, number of repetitions):
1. obtain a random sub-sample of training/test data of the specified sizes (train/test need not cover all data)
2. repeat 1. the desired number of times
Can be arbitrarily quick, and arbitrarily inaccurate (depending on the parameter choice); can be combined with k-fold.

This is the state-of-the-art principle in model validation, model comparison and meta-modelling.
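A sketch of two of these schemes in scikit-learn: KFold implements k-fold cross-validation (LeaveOneOut would be the [number of data points]-fold variant) and ShuffleSplit implements repeated sub-sampling. The data and parameter choices are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# k-fold: each part is test data exactly once; often k=5
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# repeated sub-sampling: train/test of specified sizes; need not cover all data
subsampling = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

for name, cv in [("5-fold", kfold), ("sub-sampling", subsampling)]:
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {-scores.mean():.2f} ± {scores.std():.2f}")
```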

Page 18

Quantitative model comparison: a „benchmarking experiment“ results in a table like this:

model     RMSE         MAE
model 1   15.3 ± 1.2   12.3 ± 1.1
model 2    9.5 ± 0.9    7.3 ± 0.8
model 3   13.6 ± 0.7   11.4 ± 0.9
model 4   20.1 ± 1.4   18.1 ± 1.7

Confidence regions (or paired tests) are used to compare models to each other:
A is better than B / B is better than A / A and B are equally good.

An uninformed model (stupid model/random guess) needs to be included;
otherwise the statement „is better than an uninformed guess“ cannot be made.

„Useful model“ = (significantly) better than the uninformed baseline.
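One way such a comparison can be sketched with scikit-learn and scipy: per-fold errors for a candidate model and for an uninformed baseline (a DummyRegressor predicting the training mean), compared by a paired t-test over the same folds. Note that paired tests on cross-validation folds have known caveats, since the folds are not independent; this is an illustration, not a recommendation of one particular test:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# per-fold RMSE for the candidate model and for the uninformed baseline
model_err = -cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
dummy_err = -cross_val_score(DummyRegressor(strategy="mean"), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")

print(f"model:      RMSE {model_err.mean():.1f} ± {model_err.std():.1f}")
print(f"uninformed: RMSE {dummy_err.mean():.1f} ± {dummy_err.std():.1f}")
# paired test over the same folds: is the model significantly better than guessing?
print("paired t-test p-value:", ttest_rel(model_err, dummy_err).pvalue)
```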

Page 19

Meta-model: automated parameter tuning

[Diagram: the training data is re-sampled into inner training/test splits; Parameters 1, 2, 3 are evaluated on these splits, yielding model goodness values (15.3 ± 1.2, 9.5 ± 0.9, 13.6 ± 0.7, 20.1 ± 1.4); the best parameters are then re-fitted on the whole training data.]

Re-sampling is used to determine the best parameter setting.
For validation, new unseen data needs to be used:
all data is split into training data and test data; the „inner“ training/test splits (tuning train / tuning test) need to be part of any „outer“ training set, otherwise validation is not out-of-sample! The model with the best parameters is fit to all of the training data, then predicts on the „real“ test data, where its goodness is quantified.

Important caveat: the measure of predictive goodness and the inner re-sampling scheme are themselves choices, i.e. „new“ tuning parameters; methods are usually less sensitive to these.

Multi-fold schemes are nested: „splits within splits“.
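A sketch of this nested „splits within splits“ scheme in scikit-learn: GridSearchCV performs the inner re-sampling over parameter settings, and the outer cross-validation validates the whole tuned machine on data the tuner never saw. The model and grid are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# inner re-sampling: used only to pick the best parameter setting
inner = KFold(n_splits=5, shuffle=True, random_state=1)
tuned = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner,
                     scoring="neg_root_mean_squared_error")

# outer re-sampling: validates the tuned machine out-of-sample
outer = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(tuned, X, y, cv=outer,
                         scoring="neg_root_mean_squared_error")
print(f"nested-CV RMSE: {-scores.mean():.2f} ± {scores.std():.2f}")
```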

Page 20

Meta-strategies in ML

„Model tuning“: a model with tuning parameters; the best tuning parameters are determined using a data-driven tuning algorithm.

„Ensemble learning“: a number of (possibly „weak“) models A, B, C, D are combined into a „strong“ ensemble model.
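A minimal ensembling sketch in scikit-learn: several (possibly weak) base models combined into one voting ensemble. The choice of base models is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# several (possibly "weak") models A, B, C ...
weak = [("A", DecisionTreeClassifier(max_depth=2)),
        ("B", GaussianNB()),
        ("C", LogisticRegression(max_iter=1000))]

# ... combined into one "strong" ensemble model
ensemble = VotingClassifier(estimators=weak, voting="soft")
print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```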

Page 21

Object dependencies in the ML workflow

One interesting dataset (all data; N = 100-100.000 data points, „small data“) is re-sampled into multiple train/test splits.

„Typical numbers“: 5-10 outer splits, on each of which the strategies are compared (M = 5-20 strategies), most of which are parameter-tuned by the same principle: 10-10.000 parameter combinations on 3-5 nested splits. Ensembles add further nesting: 10-1.000 base learners.

Runtime = 10 x 10 x 5 x 1.000 (x 100) x one run on N samples (one run usually being O(N²) or O(N³)).

Page 22

Machine Learning Toolboxes

Page 23

An incomplete list of influential toolboxes

[Comparison table, not fully recoverable: toolboxes (among them caret in R and scikit-learn in python; others in R, in Java with 3rd-party wrappers, multi-interface, and in python) rated on modular API (e.g., methods: common models / mostly kernels / few, mostly classifiers), model tuning & meta-methods (some / few / not entirely), model validation and comparison, GUI, and language.]

scikit-learn is perhaps the most widely used ML toolbox.

Page 24

The object-oriented ML Toolbox API, as found in the R/mlr or scikit-learn packages

Leading principles: encapsulation, modularization.

Modular structure: a „learning machine“ object (e.g. linear regression) exposes fit(traindata) and predict(testdata), plus metadata & model info.

Object orientation: abstraction models the concepts as objects with a unified API:

Concept | Public interface | in R/mlr | in sklearn
Learning machines | fitting, predicting, set parameters | Learner | estimator
Re-sampling schemes | sample, apply & get results | ResampleDesc | splitter classes in model_selection
Evaluation metrics | compute from results, tabulate | Measure | metrics classes in metrics
Meta-modelling (tuning, ensembling, pipelining) | wrapping machines by strategy | various wrappers | various wrappers; fused classes; Pipeline
Learning task | benchmark, list strategies/measures | Task | implicit, not encapsulated
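A sketch of how these abstractions compose in scikit-learn: an estimator (here a Pipeline, covering the pipelining row), a splitter object as the re-sampling scheme, and a metric function, all used against the unified fit/predict interface. Data and parameter choices are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# estimator: a pipelined learning machine with the unified fit/predict interface
estimator = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
# splitter: the re-sampling scheme as an object
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

errors = []
for train_idx, test_idx in splitter.split(X):   # sample, apply & get results
    estimator.fit(X[train_idx], y[train_idx])
    # metric: compute from results, tabulate
    errors.append(mean_absolute_error(y[test_idx], estimator.predict(X[test_idx])))
print(f"MAE: {sum(errors) / len(errors):.2f}")
```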

Page 25

HPC for benchmarking/validation today

The same nested workflow as before (all data → 5-10 outer train/test splits → M = 5-20 strategies → 10-10.000 parameter combinations on 3-5 nested splits → 10-1.000 base learners; N = 100-100.000 data points, „small data“) offers four levels (1-4) at which to parallelize.

At the selected level (one of 1-4): distribute to clusters/cores.

scikit-learn: joblib. mlr: parallelMap.

Plus algorithm-specific HPC interfaces, e.g. for deep learning (mutually exclusive).
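In scikit-learn, selecting the level typically reduces to where one passes n_jobs, which hands the loop at that level to joblib. A sketch (model and grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
    n_jobs=-1,  # distribute the inner parameter/fold loop over all cores via joblib
)
# the outer validation loop runs serially here; distributing several levels
# of the nested workflow at once is exactly the hard part
scores = cross_val_score(grid, X, y, cv=5)
print(scores.mean())
```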

Page 26

HPC support tomorrow?

A possible stack, combining (?) existing pieces:

Layer 1: full graph of dependencies: re-samples, algorithms, parameters.
Layer 2: scheduler for algorithms and meta-algorithms; data/task pipeline; DATA (e.g. Hadoop); e.g. MapReduce, DAAL, dask, joblib -> TBB?
Layer 3: optimized primitives: linear systems, convex optimization, stoch. gradient descent; e.g. MKL, CUDA, BLAS.
Layer 4: hardware API, e.g. distributed, multi-core, multi-type/heterogeneous.

(image sources: continuum analytics; Intel math kernel library)

Page 27

Challenges in ML APIs and HPC

Surprisingly few resources have been invested in ML toolboxes; the most advanced toolboxes are currently open-source & academic.

Features that would be desirable to the practitioner, but are not available without mid-scale software development:

Integration of (a) data management, (b) exploration and (c) modelling.

Non-standard modelling tasks and structured data (incl. time series): data heterogeneity, multiple datasets, time series, spatial features, images, etc.; forecasting, on-line learning, anomaly detection, change point detection. Meta-modelling and re-sampling for these is an order of magnitude more costly; especially challenging: integration in large-scale scenarios.

Full HPC integration at a granular level for distributed ML benchmarking: e.g. MapReduce for divide/conquer over data, model parts, and models; making full use of parallelism for nesting and computational redundancies; a complete HPC architecture for the whole model-benchmarking workflow.