Quick presentation for the OpenML workshop in Eindhoven 2014

Manuel Martín Salvador ([email protected])

OpenML workshop, Eindhoven, 21/10/2014

Background
● MSc. Computer Engineering
● Master in Soft Computing and Intelligent Systems

Currently
● PhD Student – Automatic and adaptive pre-processing for building predictive models
● Teaching – Data Mining lab

Data preparation and pre-processing

Labour-intensive tasks (up to 80% of a data mining process)

Automating pre-processing

A lot of available techniques

No free lunch

Multiple combinations

Order of pre-processing methods matters

No semantics → some approaches use ontologies

Meta-learning → needs a good database of experiments
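
A minimal sketch of two of the points above: the number of possible pipelines grows quickly, and the order of the methods changes the result. The two toy methods (`clip`, `min_max_scale`), their bounds and the data are illustrative assumptions, not taken from any particular tool.

```python
from math import perm


def clip(values, low=0.0, high=10.0):
    """Cap every value to the [low, high] range (a crude outlier treatment)."""
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def count_ordered_pipelines(n_methods, max_length):
    """Count ordered sequences of 1..max_length distinct methods."""
    return sum(perm(n_methods, k) for k in range(1, max_length + 1))


data = [1.0, 2.0, 3.0, 100.0]  # toy values with one outlier (made up)

clip_then_scale = min_max_scale(clip(data))
scale_then_clip = clip(min_max_scale(data))

print("clip, then scale:", clip_then_scale)
print("scale, then clip:", scale_then_clip)
print("ordered pipelines from 10 methods, up to 3 steps:",
      count_ordered_pipelines(10, 3))
```

Clipping first removes the outlier before the range is computed, so the remaining values stay spread out; scaling first squeezes them close to zero and the later clipping has no effect.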

Scientific workflow platforms and repositories with experiments

Software                  Repository              Applications
DiscoveryNet              (inactive)              -
Kepler                    -                       Various
Taverna                   MyExperiment (open)     Bioinformatics
Pegasus                   -                       Various
Galaxy                    -                       Biomedical
Pipeline Pilot            Accelrys (commercial)
*                         MLComp (“open”)         Machine Learning
Weka, MOA, R, RapidMiner  OpenML (open)           Machine Learning

OpenML statistics

Datasets: 1042
Tasks: 3025
Flows: 640
Runs: 31540
  Valid: 24410
  With errors: 7130
Datasets: 300
Individual components: 136
Paired components: 635
“Flow size”:
  1 – 8198
  2 – 12178
  3 – 1993
  4 – 1533
  5 – 502
  6 – 6
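
A quick consistency check of the figures above, as a sketch. It assumes that each “flow size” count is the number of valid runs whose flow has that many components; the slide does not say so explicitly, but the totals agree with that reading.

```python
# Figures copied from the slide.
runs_total = 31540
runs_valid = 24410
runs_with_errors = 7130

# Flow size -> count (assumed to be valid runs per flow size).
flow_size_counts = {1: 8198, 2: 12178, 3: 1993, 4: 1533, 5: 502, 6: 6}

# Valid and failed runs add up to the total number of runs.
assert runs_valid + runs_with_errors == runs_total

# The flow-size counts add up to the number of valid runs.
assert sum(flow_size_counts.values()) == runs_valid

# Share of valid runs per flow size.
share_by_size = {
    size: count / runs_valid for size, count in flow_size_counts.items()
}
for size, share in share_by_size.items():
    print(f"flow size {size}: {share:.1%}")
```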

[Figure: Distribution of components]

[Figure: Distribution of datasets]

Only 3 Weka filters: Principal Components, Discretize, PLSFilter

TO DO

How to increase the number of pre-processing methods in OpenML?
- The only way right now is using FilteredClassifier in Weka
- What about R, MOA, RapidMiner?

Improving flow representation
- Right now it is difficult to see how components are connected
- Clear distinction of parameters
- What about including Weka flows (XML based) and ADAMS flows?
- PMML support?

Statistics for available data, tasks, flows and runs

Flow recommendation system for a given dataset
[dataset, data characteristics, prediction accuracy, flow_id]

Flow validation before executing it
[dataset, data characteristics, flow characteristics, failure]
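
One possible reading of the recommendation idea, as a rough sketch: store past experiments as [dataset, data characteristics, prediction accuracy, flow_id] records, find the stored dataset closest to the new one, and return the flow that scored best on it. The records, the three characteristics and the function name `recommend_flow` are hypothetical and only illustrate the lookup.

```python
from math import dist

# Hypothetical records: (dataset, data characteristics, accuracy, flow_id).
# Characteristics: (n_instances, n_features, fraction_missing). All made up.
experiments = [
    ("ds_a", (100.0, 10.0, 0.0), 0.80, "flow_1"),
    ("ds_a", (100.0, 10.0, 0.0), 0.90, "flow_2"),
    ("ds_b", (5000.0, 200.0, 0.3), 0.70, "flow_1"),
    ("ds_b", (5000.0, 200.0, 0.3), 0.60, "flow_2"),
]


def recommend_flow(characteristics, records):
    """Return the best-scoring flow on the most similar stored dataset."""
    # Nearest stored dataset by Euclidean distance on raw characteristics.
    nearest = min(records, key=lambda r: dist(r[1], characteristics))
    # Among the runs on that dataset, pick the highest accuracy.
    same_dataset = [r for r in records if r[0] == nearest[0]]
    best = max(same_dataset, key=lambda r: r[2])
    return best[3]


print(recommend_flow((120.0, 12.0, 0.0), experiments))
print(recommend_flow((4000.0, 180.0, 0.2), experiments))
```

The characteristics are not normalised here, so the number of instances dominates the distance; a real system would need to scale them. The same lookup could serve flow validation by storing a failure flag instead of the accuracy.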

A little bit further

Adapting flows while processing data streams

- Detecting changes in data characteristics

- Locally checking input/output in each flow component

- Change propagation

- Reducing cost of adaptation
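
A minimal sketch of the first point, detecting a change in one data characteristic (here the mean of a single attribute) over a stream. The detector, its window size and its threshold are illustrative assumptions, not a method proposed in the slides.

```python
from collections import deque
from statistics import mean, stdev


class MeanShiftDetector:
    """Flag a change when the recent window mean drifts from a reference."""

    def __init__(self, window_size=5, threshold=3.0):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = []  # the first window_size values seen
        self.recent = deque(maxlen=window_size)  # sliding window

    def update(self, value):
        """Feed one value; return True if a change is detected."""
        if len(self.reference) < self.window_size:
            self.reference.append(value)
            return False
        self.recent.append(value)
        if len(self.recent) < self.window_size:
            return False
        ref_mean = mean(self.reference)
        ref_std = stdev(self.reference) or 1e-9  # avoid a zero threshold
        return abs(mean(self.recent) - ref_mean) > self.threshold * ref_std


# Toy stream (made up): stable values, then a shift in the mean.
stream = [1.0, 1.1, 0.9, 1.0, 1.05,
          1.0, 0.95, 1.1, 1.0, 1.0,
          5.0, 5.2, 4.9, 5.1, 5.0]

detector = MeanShiftDetector()
flags = [detector.update(x) for x in stream]
first_change = flags.index(True) if True in flags else None
print("first detected change at index:", first_change)
```

A detected change would be the trigger for adapting the flow; how the change propagates through the components and at what cost is left open here.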

Photos CC by Cristina Granados

Visit us! Data Science Institute @ Bournemouth University
