
Page 1

DATA MINING AND MACHINE LEARNING
Lecture 6: Data preprocessing and model selection

Lecturer: Simone Scardapane

Academic Year 2016/2017

Page 2

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 3

The importance of feature normalization (1)

Suppose you are a bank, and you wish to draw inferences from your clients, who are described by only two attributes:

• Sex, encoded as [0, 1] (male) or [1, 0] (female).

• Income [$/month], ranging from 0 to 10,000.

Now consider a generic client x ∈ R^3, and two other clients very similar to x:

• x′ is the same as x, but of the opposite sex. We have that ‖x′ − x‖₂ = √2.

• x′′ is the same as x, but with a monthly income 100 $/month higher. We have that ‖x′′ − x‖₂ = 100.
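As a quick numerical check of the two distances above, here is a minimal NumPy sketch; the baseline income of 3000 $/month is just a hypothetical value for illustration.

```python
import numpy as np

x1 = np.array([0, 1, 3000.0])  # male client with a 3000 $/month income (hypothetical values)
x2 = np.array([1, 0, 3000.0])  # same client, opposite sex
x3 = np.array([0, 1, 3100.0])  # same client, income 100 $/month higher

print(np.linalg.norm(x2 - x1))  # 1.414... = sqrt(2)
print(np.linalg.norm(x3 - x1))  # 100.0
```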

Page 4

The importance of feature normalization (2)

Any algorithm based on the Euclidean distance between patterns (e.g., k-NN) will therefore weight a 100 $/month income difference roughly 70 times more than a change of sex (100/√2 ≈ 71), irrespective of the actual importance of either feature for the discriminative task.

In practice, almost no machine learning algorithm will work properly if the features are measured in units with vastly different scales.

The process of transforming data so that all features lie in the same range is called feature normalization.

Page 5

Min-max scaling

Feature normalization is generally applied feature-wise, although a few techniques scale each pattern separately.

Denote by x a generic pattern. A common normalization step, min-max scaling, scales each feature to the range [0, 1] by applying the following transformation:

x_i = (x_i − x_{i,min}) / (x_{i,max} − x_{i,min}) ,

where x_{i,min} (x_{i,max}) is the minimum (maximum) value of the ith feature in the dataset. We can subsequently scale to a generic range [α, β] by applying a second transformation:

x_i = x_i (β − α) + α .
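A minimal NumPy sketch of the two transformations above, applied column-wise (assuming no constant feature, to avoid a division by zero):

```python
import numpy as np

def min_max_scale(X, alpha=0.0, beta=1.0):
    """Scale each column of X to [0, 1], then to the generic range [alpha, beta]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    X01 = (X - x_min) / (x_max - x_min)      # first transformation: features in [0, 1]
    return X01 * (beta - alpha) + alpha      # second transformation: features in [alpha, beta]

X = np.array([[1.0, 200.0], [2.0, 500.0], [4.0, 1000.0]])
print(min_max_scale(X))               # columns in [0, 1]
print(min_max_scale(X, -1.0, 1.0))    # columns in [-1, 1]
```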

Page 6

Standard scaling and robust scaling

Another popular technique is standard scaling, where:

x_i = (x_i − μ_i) / σ_i ,

where μ_i is the empirical mean and σ_i the empirical standard deviation of the ith feature, computed on the dataset.

Most feature normalization techniques can be applied by excluding ‘extreme’ data from the computation of the statistics (e.g., using only data between the 15th and the 85th percentile) in order to provide more robustness to outliers. Sometimes, this is called robust scaling.
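The same normalizations are available in scikit-learn; a hedged sketch (RobustScaler centres on the median and scales by a quantile range, here set to the 15th–85th percentiles mentioned above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.RandomState(0)
X = rng.lognormal(size=(500, 1))     # a skewed feature with some extreme values

X_std = StandardScaler().fit_transform(X)                            # (x - mean) / std
X_rob = RobustScaler(quantile_range=(15.0, 85.0)).fit_transform(X)   # median / quantile range

print(X_std.mean(), X_std.std())     # approximately 0 and 1
print(np.median(X_rob))              # approximately 0
```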

Page 7

An example of feature normalization

[Figure: histograms of the feature (x-axis: Value, y-axis: # values) after (a) Min-max scaling, (b) Standard scaling, (c) Robust scaling.]

Figure 1: A comparison of several feature normalization techniques on the third feature of the ‘Boston’ dataset. Note how the distribution of the data is never affected. Robust scaling is computed on the interquartile range (i.e., between the 25th and 75th percentile).

Page 8

Log normalization

Many features in the real world exhibit highly skewed distributions with a very long tail. A classic example is the income of a person.

A common normalization in this case is log normalization:

x_i = log(x_i) .

Log normalization is important when we only care about relative changes in a quantity. For example, multiplying a quantity by a factor c always corresponds to an additive change of log(c) after log normalization (e.g., doubling always adds log(2)), irrespective of the original value of the quantity.
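A short NumPy sketch; np.log assumes strictly positive values, and log1p is a common variant (an assumption on my part, not from the slides) that also handles zeros:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])   # long-tailed feature (e.g., income)

x_log = np.log(x)      # plain log normalization (requires x > 0)
x_log1p = np.log1p(x)  # log(1 + x), also defined for x = 0

# A fixed relative change becomes a fixed additive change: doubling always adds log(2).
print(np.diff(np.log([50.0, 100.0, 200.0])))   # [0.693..., 0.693...]
```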

Page 9

An example of log normalization

[Figure: histograms of the feature (x-axis: Value, y-axis: # values): (a) Original distribution, (b) After log normalization.]

Figure 2: Log normalization applied to the first feature of the ‘Boston’ dataset (crime rate).

Page 10

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 11

The problem of missing data

Another common problem in real-world data is missing data, wherein several entries of the input matrix X are missing.

Most learning methods are not able to cope with missing data without some form of human preprocessing.

A baseline strategy is to simply remove any pattern with one or more missing features. This approach can be sub-optimal for two reasons:

1. If many features have missing values, we can end up with an extremely small dataset.

2. If the lack of data is correlated with some underlying factor (e.g., male users are more reluctant to share some information), then the resulting dataset will be biased.

Page 12

Mean and median imputation

An effective approach in practice is to replace missing data with the mean (or median) computed with respect to the corresponding column.

For classification problems, we can also do conditional data imputation, where the mean (or median) is computed only using data of the corresponding class.

If we have prior knowledge on the missing data process, we can do conditional imputation with respect to some other input features, e.g., compute the mean with respect to the user’s nationality.
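A possible scikit-learn/pandas sketch of plain and class-conditional mean imputation (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0], [np.nan, 3.0], [4.0, np.nan]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)     # column-wise mean imputation
X_med = SimpleImputer(strategy="median").fit_transform(X)    # column-wise median imputation

# Conditional imputation: fill each missing value with the mean of its class.
df = pd.DataFrame({"income": [1000.0, np.nan, 3000.0, np.nan], "y": [0, 0, 1, 1]})
df["income"] = df.groupby("y")["income"].transform(lambda s: s.fillna(s.mean()))
```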

Page 13

Indicator variables for missing data

• If a categorical variable has missing values, an alternative approach is to add a new category corresponding to the missing case. For example, considering a feature x = {male, female}, we can construct a dummy encoding as:

male = [1, 0, 0] , female = [0, 1, 0] , missing value = [0, 0, 1] .

• The same can be done with a continuous variable; in that case, we use a default value for the feature when missing, and we add a binary indicator feature which is 1 if the feature is missing, 0 otherwise.
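A hedged sketch of both cases, assuming a pandas/scikit-learn version that supports dummy_na and add_indicator:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Categorical case: the missing value becomes its own category in the dummy encoding.
sex = pd.Series(["male", "female", np.nan])
print(pd.get_dummies(sex, dummy_na=True))   # columns: female, male, NaN

# Continuous case: default value plus a binary column flagging the missing entries.
X = np.array([[1000.0], [np.nan], [3000.0]])
imp = SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True)
print(imp.fit_transform(X))   # column 1: imputed feature, column 2: missing-value indicator
```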

Page 14

Imputation as a learning problem

Suppose that only the kth feature has missing values. We can model data imputation as a supervised learning problem, where we look for a function:

f(x_{−k}, y) = x_k ,

where x_{−k} denotes the set of all features excluding the kth one. Mean and median imputation are a special case of this setup, where the predicted value is constant (or piecewise constant for conditional imputation).

It is rare to obtain good results with overly complicated models for f. Linear regression (or logistic regression) is a common choice in this setup.
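A minimal sketch of imputation as a learning problem, fitting a linear regressor on the complete rows to predict the missing feature (here only the other inputs x_{−k} are used; the target y could be appended as well):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
X[rng.rand(200) < 0.1, 2] = np.nan     # feature k = 2 has roughly 10% missing entries

missing = np.isnan(X[:, 2])
X_other = np.delete(X, 2, axis=1)      # x_{-k}: all features except the k-th

reg = LinearRegression().fit(X_other[~missing], X[~missing, 2])  # train on complete rows
X[missing, 2] = reg.predict(X_other[missing])                    # impute the missing entries
```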

Page 15

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 16

Outlier and novelty detection

A third common problem is given by outliers, i.e., points that are not ‘consistent’ (in some sense) with respect to the rest of the dataset. Outliers can be due to a wide range of reasons:

• Noise in the data collection process.

• Real anomalous patterns, e.g., frauds in a credit card dataset.

If we are interested in detecting these anomalous patterns (e.g., for fraud detection), the problem is called anomaly detection or novelty detection. If we only wish to remove outliers in order to improve accuracy, we have outlier detection.

Page 17

Elliptic envelope

For low-dimensional problems, a common solution for outlier detection is to fit a known parametric distribution (typically a multivariate Gaussian), and remove points whose Mahalanobis distance is greater than a given threshold.

The covariance of the Gaussian is generally computed using techniques that are robust to outliers (e.g., the minimum covariance determinant estimator).

Automatic techniques for outlier detection can severely damage the dataset if used improperly. If outliers are only due to a single feature, it is more common to visually inspect the dataset using box-plots.
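A hedged scikit-learn sketch of the elliptic envelope approach on synthetic data (the contamination value is an assumption, not something fixed by the slides):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2),                  # inliers
               rng.uniform(-6, 6, size=(10, 2))])  # a few gross outliers

# Fit a robust Gaussian and flag points whose Mahalanobis distance is too large.
detector = EllipticEnvelope(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)     # +1 for inliers, -1 for outliers
X_clean = X[labels == 1]
```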

Page 18

Minimum covariance estimation

Figure 3: Example of outlier detection using elliptical envelopes (http://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html).

Page 19

Example of box-plot

[Figure: box-plots of the features INDUS, RAD and LSTAT (x-axis: Feature, y-axis: Values).]

Figure 4: An example of box-plot on three features of the ‘Boston’ dataset. The red line is the median; the box covers the interquartile range (between the 25th percentile and the 75th percentile). Points outside the ‘whiskers’ are considered outliers.

Page 20

Other preprocessing techniques

There are countless other operations that can be applied to data before learning, at the discretion of the specialist and depending on the domain knowledge. Just to give a few examples (a short sketch follows the list):

• Generalization: replacing a categorical value with a coarser one (e.g., region instead of postcode).

• Aggregation: replacing a continuous variable by binning its value into a set of possible ranges (this can also be done on the output to transform a regression problem into a classification one).

• Constructing polynomial features from the original ones.
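A brief scikit-learn sketch of the aggregation and polynomial-features examples (KBinsDiscretizer assumes a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

X = np.array([[1.0], [2.5], [4.0], [7.5], [9.0]])

# Aggregation: bin a continuous variable into a small set of ranges.
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(bins.fit_transform(X).ravel())

# Polynomial features: x -> (1, x, x^2).
print(PolynomialFeatures(degree=2).fit_transform(X))
```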

Page 21

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 22

K-fold cross-validation

Up to now, we used the holdout procedure to estimate the (expected) test error of a function. A single holdout can be misleading, particularly for estimators having a large variance.

A more principled (and common) way to assess a model is to run a K-fold cross-validation:

1. Split the dataset into K equally sized portions S_1, ..., S_K.

2. Train on S_2, ..., S_K, and test on S_1. Get error E_1.

3. Train on S_1, S_3, ..., S_K, and test on S_2. Get error E_2.

4. ...

5. Train on S_1, ..., S_{K−1}, and test on S_K. Get error E_K.

6. Compute the final error estimate as E = (1/K) ∑_{i=1}^{K} E_i.
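A minimal sketch of the procedure with scikit-learn’s KFold (the dataset and the 5-NN classifier are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))   # E_i: error on fold i

print(np.mean(errors))   # E: average error over the K folds
```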

Page 23

Visualization of K-fold cross-validation

Figure 5: https://sebastianraschka.com/faq/docs/evaluate-a-model.html

Page 24

Bootstrap method

A third possibility is the bootstrap method:

1. Construct a bootstrap dataset by drawing N examples with replacement from the original dataset.

2. Build B different bootstrap datasets (e.g., B = 100), and use them to train B different functions.

3. For each point in the training set, compute the average error of all functions trained from bootstrap datasets not containing that point.

4. Obtain the final estimate by averaging this error over the entire training dataset.

This procedure can be extended in a large number of ways; see Section 7.11 in the book. A minimal sketch of the basic estimate is given below.
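The sketch implements the leave-one-out bootstrap estimate described above (decision trees and B = 100 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
N, B = len(X), 100
err_sum, counts = np.zeros(N), np.zeros(N)

rng = np.random.RandomState(0)
for _ in range(B):
    idx = rng.randint(0, N, size=N)              # bootstrap sample drawn with replacement
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    oob = np.setdiff1d(np.arange(N), idx)        # points not contained in this bootstrap
    err_sum[oob] += (model.predict(X[oob]) != y[oob])
    counts[oob] += 1

valid = counts > 0
print(np.mean(err_sum[valid] / counts[valid]))   # average per-point out-of-bootstrap error
```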

Page 25

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 26

Finding the hyper-parameters

Most learning methods require the user to select one or more hyper-parameters, most notably complexity-based parameters:

• The C parameter in ridge regression and LASSO.

• The k parameter in k-NN.

• The ε parameter in the ε-insensitive loss function.

Suppose we try several possibilities for each hyper-parameter. How do we select which one is the best? We already know that training error is a poor proxy for the generalization capability of a learning model. As a matter of fact, these two problems are almost equivalent:

• Model assessment: evaluate the expected error of a model.

• Model selection: evaluate the expected error of two models to select the best one.

Page 27

Nested cross-validation

We can solve the model selection problem in the same way as the model assessment problem, e.g., using holdout:

1. Keep a portion of the training dataset apart (called the validation dataset).

2. Train several models using the remaining portion, each one with a different selection of hyper-parameters.

3. Select the hyper-parameter configuration with the lowest error on the validation dataset, and retrain the model using the entire training dataset.

One can equivalently use holdout, K-fold cross-validation, or bootstrap, either for testing or for validation.
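A minimal sketch of this holdout-based selection, tuning k for a k-NN classifier on a validation split (the candidate values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_err = None, float("inf")
for k in [1, 3, 5, 7, 9]:
    err = 1.0 - KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    if err < best_err:
        best_k, best_err = k, err       # keep the value with the lowest validation error

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)  # retrain on all the data
```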

Page 28

Visualizing nested cross-validation

Figure 6: https://sebastianraschka.com/faq/docs/evaluate-a-model.html

Page 29

The proper way of doing model selection

Figure 7: https://sebastianraschka.com/faq/docs/evaluate-a-model.html

Page 30

Grid search for model fine-tuning

For models with one/two hyper-parameters, the most common way to do fine-tuning is a grid search, where we specify a list of possible configurations, and we exhaustively search for the best one.

For example, we can optimize k-NN by searching for k in {5, 10, ..., 50}. Some parameters are commonly optimized in log space, since we assume that small changes are not significant, e.g., it is common to optimize C by searching in an interval such as:

2^{−10}, 2^{−9}, ..., 2^{9}, 2^{10} .

We can also repeat several grid search procedures at different levels of resolution, by focusing around the optimum of the coarser grid search.
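A sketch with scikit-learn’s GridSearchCV, using the C parameter of a linear SVM as the log-spaced hyper-parameter (the estimator choice is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"C": 2.0 ** np.arange(-10, 11)}                   # 2^-10, ..., 2^10
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)   # 5-fold CV per value
search.fit(X, y)

print(search.best_params_, search.best_score_)
```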

Page 31

Random search

Another popular choice is random search:

1. We specify a sampling distribution for each parameter, e.g., a uniform distribution in [−10, 10] for log(C).

2. At each iteration, we randomly sample one combination of parameters, and optimize a model up to some budget of computational time.

3. We conclude the search after a given number of iterations.

Note how both grid search and random search can be trivially parallelized by training and evaluating several models at once on a parallel architecture.
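A corresponding sketch with RandomizedSearchCV, sampling C from a log-uniform distribution (loguniform assumes a recent SciPy):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_dist = {"C": loguniform(2.0 ** -10, 2.0 ** 10)}   # uniform in log(C)
search = RandomizedSearchCV(SVC(kernel="linear"), param_dist,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```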

Page 32

Comparing grid search and random search

Figure 8: Bergstra, J. and Bengio, Y., 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb), pp. 281-305.

Page 33

Hyper-parameter selection as an optimization problem

Hyper-parameter selection can itself be seen as an optimization problem; denoting by Θ the set of hyper-parameters, we want to solve:

Θ* = arg min_Θ Err(Θ) ,

where Err(Θ) can be, for example, a 5-fold cross-validated error over the training dataset.

The problems are that (i) we generally cannot compute any derivative information on Err(·); (ii) even a single evaluation of the objective function can be expensive. Bayesian optimization can be used as a black-box technique to this end.

Page 34

Other forms of fine-tuning

Model selection is not limited to selecting hyper-parameters related to complexity:

• We can use cross-validated accuracy as a proxy measure inside wrapper methods for feature selection (see later), e.g., to recursively remove one feature at a time.

• In an iterative optimization algorithm, we can use a validation set to choose when to stop the optimization process (early stopping). Early stopping can be as effective as regularization, particularly for neural network models.

Page 35

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 36

Feature selection

Feature selection is the task of selecting a subset of features which is ‘optimal’ (in some sense) for our learning task.

Feature selection methods belong to three distinct categories (a brief sketch of each follows the list):

• Filter methods select features by computing some ‘measure of interest’, such as the Pearson correlation coefficient between the feature and the output. They are fast, but their accuracy is in general not optimal.

• Wrapper methods iteratively train several models by removing or adding features, and use their validated accuracy to choose the final subset.

• Embedded methods are learning methods which automatically select features (e.g., the LASSO algorithm).
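A compact scikit-learn sketch of the three families on synthetic regression data (the specific scorers and models are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter: keep the k features with the highest univariate score.
X_filter = SelectKBest(f_regression, k=3).fit_transform(X, y)

# Wrapper: recursively drop the least useful feature according to a fitted model.
X_wrap = RFE(LinearRegression(), n_features_to_select=3).fit_transform(X, y)

# Embedded: LASSO drives the coefficients of irrelevant features to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print((lasso.coef_ != 0).sum(), "features kept by LASSO")
```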

Page 37

Dimensionality reduction

We already saw in the linear case that selecting an optimal subset of features is a very difficult problem. This is a combinatorial optimization problem, which can (in principle) be solved using techniques such as genetic algorithms or particle swarm optimization, which are outside the scope of this course.

Sometimes, we are interested in reducing the dimensionality of the input, but we do not need to keep the original features. In this case, we can apply dimensionality reduction techniques, which will be introduced in a future lecture.

Page 38

Table of contents

Data preprocessing
  Feature normalization
  Missing data imputation
  Outlier detection

Model selection, evaluation and fine-tuning
  K-fold cross-validation
  Model fine-tuning
  Other topics in model selection
  Advanced model evaluation techniques

Page 39

Evaluating a model after training

Once a model is trained and fine-tuned, we need to evaluate its performance. Reporting the MSE or the classification accuracy on the test set is a good measure, but it might not provide all the necessary information.

As a concrete example, consider a binary classification problem where 99% of the data is of class 0. In this case, a classifier that always predicts 0 as output would have 99% accuracy! This is an example of an unbalanced dataset.

Page 40

Confusion matrix

For binary classification, the outputs of the classifier can be grouped in the so-called confusion matrix:

                    Real class is 0                       Real class is 1
Prediction is 0     TN: True negative                     FN: False negative (Type II error)
Prediction is 1     FP: False positive (Type I error)     TP: True positive

This is easily extended to a multiclass case.
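A small sketch with scikit-learn; note that confusion_matrix places the real classes on the rows and the predictions on the columns, i.e., the transpose of the table above:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 3 1 1 3 for these toy labels
```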

Page 41

Precision and recall

From the values of the confusion matrix, we can define several new metrics of accuracy:

Precision = TP / (FP + TP)   (1)

Recall (or true positive rate, TPR) = TP / (FN + TP)   (2)

False positive rate (FPR) = FP / (FP + TN)   (3)

The F1 score is defined as the harmonic mean of precision and recall, and it is especially useful in unbalanced scenarios:

F1 = 2 / (1/Precision + 1/Recall) = 2 · (Precision · Recall) / (Precision + Recall)   (4)
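The same metrics are available in scikit-learn; a quick sketch on the toy labels used above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (FP + TP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (FN + TP) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 3/4
```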

Page 42

ROC curve

In practice, we can obtain insights into the classification process at an even finer level of granularity, by remembering that the outputs of a classifier are almost always real numbers, which we binarize in order to obtain the final classes.

Up to now, we considered the simple case where we round these outputs to the nearest integer (when y ∈ {0, 1}), or we take the sign (when y ∈ {−1, 1}). By varying the threshold for binarization, we can obtain different values of recall and FPR:

• For a very low threshold, we obtain perfect recall (all patterns are classified as positive), but a very high FPR.

• Conversely, for a very high threshold, we obtain a poor recall with an extremely small FPR.

Page 43

ROC curve (2)

By plotting the recall against the FPR for several choices of the threshold, we obtain the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) is another very common measure of performance for a binary classifier.

We can compare two classifiers by plotting their ROC curves to get a wider set of insights into their behavior. Another related curve plots the precision against the recall (PR curve).
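A minimal sketch of computing the ROC curve and its AUC from the real-valued outputs of a classifier (logistic regression is just an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]  # real-valued outputs
fpr, tpr, thr = roc_curve(y_te, scores)   # recall (TPR) vs. FPR for every threshold
print(roc_auc_score(y_te, scores))        # area under the ROC curve
```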

Page 44

Plotting the ROC curve

Figure 9: If we only consider a single threshold, each classifier is a point in the ROC space [Wikipedia, en: Receiver Operating Characteristic].

Page 45

Plotting the ROC curve (2)

Figure 10: More generally, we can plot an entire curve in the ROC space [Wikipedia, en: Receiver Operating Characteristic].

Page 46

Further readings

Some material in this lecture is taken from the following sections of the book: 3.3 (subset selection), 9.6 (missing data), 7 (model assessment).

The following is a classic introduction to feature selection:

Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.

An accessible entry point for Bayesian optimization (outside the scope of the course) can be found in the following article:

Shahriari, B., Swersky, K., Wang, Z., Adams, R.P. and de Freitas, N., 2016. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), pp. 148-175.