
COSC 4360 - Instructor’s notes Ch.2 - Supervised Learning

Notation: □ Means pencil-and-paper QUIZ ► Means coding QUIZ

□ Review questions for linear regression algorithms:

1. What are the three regression algorithms we studied?
2. Which one is considered the "go to" linear regression algorithm?
3. What is the main difference between OLS and RR?
4. In RR, does a large alpha parameter mean that we have more or less regularization? More or less model complexity? How does a large alpha impact the weights/slopes/coefficients?
5. How is LaR different from RR? Why does this lead to potentially better results? Explain when you would use LaR rather than RR.
6. Under what circumstances is OLS better than RR (and LaR)?


Linear models for classification

Introduction to linear classification: Decision Boundaries and Regularization

[Figure: two plots, "regression" vs. "classification"]

Instead of placing the predictions on the line/plane/hyperplane, classification uses the line/plane/hyperplane as the border or boundary between two regions of space: points on one side are assigned to class 0, points on the other to class 1.

The most common linear classification algorithms are:

1. Logistic Regression (!) (LogR) --> Binary Classification (can also be adapted for multiclass)

2. Support Vector Machines (SVM) --> Binary or Multiclass

For classification, the regularization parameter is named C, and it acts in the opposite way from alpha:

large C means less regularization


1. Binary classification with Logistic Regression (LogR, pp.61-64)

Note well: Despite its name, LogR is not an algorithm for regression! It is for (binary) classification!

Example (not in text): Predicting the success of a loan application based on the applicant’s net worth. We have the following data:

Net worth [$10,000]:              -4  -3.5  -3  -2  -1.5   1  1.5   2  2.5  3.1  3.5   4
Result (1=approved, 0=rejected):   0    0    0   1    0    0   1    0    0    1    1   1

Note: In this example, the net worth is the only feature of our dataset. In statistics, it is called the independent variable, usually denoted x.

LogR explained in three plots:

[Figure: three plots]  f(x) = x;  g(x) = e^x = 1/e^(-x);  p(x) = 1/(1+e^(-x))

p(x) = 1/(1+e^(-x)) is called the sigmoid or logistic function; it "squashes" the original likelihood x into a value between 0 and 1, which can be interpreted as a probability.
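To make the squashing concrete, here is a tiny sketch (not in the text) that evaluates p(x) at a few points:

```python
# A tiny illustration: the sigmoid maps any real number into (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-10), sigmoid(0), sigmoid(10))
# -> ~0.0000454  0.5  ~0.9999546
```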


As in the case of linear regression, we have two model parameters that the algorithm uses to tweak the initial line f(x) = x into f(x) = α + βx.

After the transformation, their effect on the logistic function looks like this:

[Figure: three plots]  p(x) = 1/(1+e^(-x));  with beta = 5:  p(x) = 1/(1+e^(-5x));  with alpha = 2:  p(x) = 1/(1+e^(-(x+2)))

In general, the probability is p(x) = 1/(1+e^(-(α + βx)))

Conclusion:

α controls the location of the midpoint (p = 0.5)

β controls the slope around the midpoint

□ If the data has a large overlap between the classes, do you think β will be small or large?

□ If the data has little or no overlap between the classes, do you think β will be small or large?

□ In LogR, what is the meaning of a negative coefficient β ?

As with linear regression, α and β are chosen in order to minimize an error function. The details are beyond the scope of this course.


Not in text: Where are the probabilities?

The whole point of the logistic function in LogR is to obtain a probability, rather than a "hard" yes-no or 1-0 value for prediction. The code example below shows the prediction of just one datapoint – the very first datapoint (row index zero) in the forge dataset:
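The notes show that example as a screenshot; here is a minimal sketch of what it might look like, assuming the forge data comes from the mglearn helper package used by our textbook:

```python
# Sketch: fit LogR on the forge dataset and predict the first datapoint.
import mglearn
from sklearn.linear_model import LogisticRegression

X, y = mglearn.datasets.make_forge()
logreg = LogisticRegression(solver="lbfgs").fit(X, y)

# predict/predict_proba expect a 2-D array: one row per point to predict
print(logreg.predict(X[0:1]))        # "hard" 0/1 prediction
print(logreg.predict_proba(X[0:1]))  # [P(class 0), P(class 1)]
```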

Notes:

• LBFGS is one of several algorithms that LogR can use under the hood to find the best coefficients. The acronym stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno, a quasi-Newton algorithm used in non-linear optimization. (The decision boundary is linear in LogR, but finding the decision boundary is not a linear problem.)

• The predict function expects a two-dimensional array, with a row for each point to predict.


Solutions:

□ If the data has a large overlap between the classes, do you think β will be small or large?

A: Small, i.e. the original slope is low, i.e. the probability changes slowly.

□ In LogR, what is the meaning of a negative coefficient β ?

A: It means that the probability decreases as the independent variable x increases, as in this example:


Extending the LogR model to account for multiple features (independent variables)

Example: in addition to the net worth, x1, we also know the monthly income of the applicant, x2. We can still perform LogR by linearly combining the two features. In general, we can have any number of features, d:

z = α + β1x1 + … + βdxd,  and the probability is  P(z) = 1/(1+e^(-z))

Wait a second! How can this be a linear classifier when we are using division and exponentiation?!

A: The division and exponentiation are only used to calculate the probability – the two classes are still separated by a line/plane/hyperplane: α + β1x1 + … + βdxd = 0.

Code example for the cancer dataset (d=30 features):
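(The original code appears as a screenshot; a sketch along the same lines:)

```python
# Sketch: LogR on the breast-cancer dataset (30 features), default C=1.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# max_iter raised so LBFGS converges on this unscaled data
logreg = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
```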

The LogR algorithm performs L2 (Euclidean) regularization by default.

Effect of regularization coefficient, C (default value is C = 1)
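A sketch of the experiment, reusing the train/test split from the code above (the C values are chosen just for illustration):

```python
# Sketch: vary C (inverse regularization strength) and compare accuracy.
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, solver="lbfgs", max_iter=5000).fit(X_train, y_train)
    print("C={:<6} train: {:.3f}  test: {:.3f}".format(
        C, lr.score(X_train, y_train), lr.score(X_test, y_test)))
```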


As in regression, we can visually inspect the coefficients:
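(A sketch of that inspection, continuing from the cancer code above and following the book's plotting style:)

```python
# Sketch: plot the 30 beta coefficients, one per feature.
import matplotlib.pyplot as plt

plt.plot(logreg.coef_.T, 'o')
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])   # zero reference line
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
```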

Note: These are only the beta coefficients, because they are usually the ones that matter when judging the classification. If you also need the alpha (intercept), it is available in the attribute logreg.intercept_.

□ Does the mean radius correlate positively or negatively with malignancy? (Trick question – see below!)

Let us perform this experiment (not in text). Classify based only on the mean radius:
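(A sketch of this experiment; in the cancer dataset, the mean radius is feature 0:)

```python
# Sketch: classify using only the mean radius (feature 0).
X1 = cancer.data[:, [0]]                  # keep it 2-D: (n_samples, 1)
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, cancer.target, stratify=cancer.target, random_state=42)
lr1 = LogisticRegression(solver="lbfgs").fit(X1_train, y1_train)
print("Test score: {:.3f}".format(lr1.score(X1_test, y1_test)))
print("Coefficient:", lr1.coef_)          # comes out negative here
```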

□ What does the negative coefficient mean? To answer this, we must find out how malignant and benign have been coded into 1 and 0.
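Continuing the sketch, one way to check the coding:

```python
# The position in target_names gives the numeric coding:
print(cancer.target_names)   # ['malignant' 'benign'] -> malignant=0, benign=1
```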


So malignant is 0 and benign is 1. Negative beta means that the plot is flipped - ones (benign) on the left and zeros (malignant) on the right. As expected, smaller tumors tend to be benign:

Indeed, if we visualize the mean radius size, we get this¹:

Now we add the next two features, mean_texture and mean_perimeter:
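(A sketch, now keeping the first three features:)

```python
# Sketch: mean radius, mean texture, mean perimeter (features 0, 1, 2).
X3 = cancer.data[:, :3]
X3_train, X3_test, y3_train, y3_test = train_test_split(
    X3, cancer.target, stratify=cancer.target, random_state=42)
lr3 = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X3_train, y3_train)
print("Train: {:.3f}  Test: {:.3f}".format(
    lr3.score(X3_train, y3_train), lr3.score(X3_test, y3_test)))
print("Coefficients:", lr3.coef_)
```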

Note that both scores went up, but the mean_radius coefficient changed sign!

The reason for these unintuitive behaviors is the presence of correlated features. For example, in the cancer dataset, mean radius is obviously correlated with mean perimeter and mean area².

¹ This will be a homework problem.
² Also a homework problem.


The text points out another paradox: A coefficient may also change sign due to regularization! This can be seen in the coefficient for mean_perimeter:
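A sketch of how one might observe this, varying C on the full 30-feature model (mean_perimeter is feature 2; continues from the cancer split above):

```python
# Sketch: watch the mean_perimeter coefficient as regularization changes.
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, solver="lbfgs", max_iter=5000).fit(X_train, y_train)
    print("C={:<6} mean_perimeter coefficient: {:+.4f}".format(C, lr.coef_[0, 2]))
```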


It is possible to specify the L1 norm instead of the default L2, which, as in Lasso regression, sets many coefficients to zero:
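(A sketch, continuing from the cancer split above; note that in scikit-learn the "l1" penalty needs a solver that supports it, e.g. liblinear:)

```python
# Sketch: L1-penalized LogR sets many coefficients to exactly zero.
import numpy as np

lr_l1 = LogisticRegression(C=1, penalty="l1", solver="liblinear").fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(lr_l1.score(X_train, y_train)))
print("Test accuracy: {:.3f}".format(lr_l1.score(X_test, y_test)))
print("Nonzero coefficients:", np.sum(lr_l1.coef_ != 0), "of", lr_l1.coef_.size)
```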

■End half-week 1


Linear Multiclass Classification (pp.65-70)

Any binary classification algorithm can be extended to multiclass classification. One technique used to perform the extension (or reduction, if viewed in reverse) is one-vs-rest.

• Note well: One-vs-rest only works with binary classification algorithms that provide a “soft” measure of confidence for classification, like the probability in Logistic Regression classification!

• There are other techniques (e.g. One-vs-one) that can be applied to strictly 0/1 classification algorithms.

Explanation of one-vs-rest (OvR) (See last paragraph on p.65 in our text):

Say, for example, that there are three classes: A, B, C.

Three classifiers are trained, with the targets relabeled like this:

ClsfA - A is 1, B and C are 0

ClsfB - B is 1, C and A are 0

ClsfC - C is 1, A and B are 0

When a new point x_new needs to be classified, all three classifiers are applied; for example, let us say that the measures (probabilities) returned are:

ClsfA(x_new) = 0.45

ClsfB(x_new) = 0.88

ClsfC(x_new) = 0.15

The class chosen is the one that had the highest measure (probability) - in our example B.
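In code, the decision rule is just an argmax over the per-class scores (a tiny sketch with the numbers above):

```python
# Sketch: OvR picks the class whose classifier returns the highest score.
scores = {"A": 0.45, "B": 0.88, "C": 0.15}
print(max(scores, key=scores.get))   # -> B
```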

Note: The Scikit-learn algorithm LogisticRegression has the one-vs-rest already built in. If the data has more than two targets, this technique is applied automatically! (see exercise after linear SVM later in this section!)


Explanation of Support Vector Machines (SVM) (Not in text)

□ In each binary classification problem, rate the boundaries (hyperplanes) shown from worst to best:

□ Draw the best linear boundary for this dataset (handout):


[Figure³]

Idea: Find the line/plane/hyperplane that provides the maximum margin:

[Figure⁴: the maximum-margin hyperplane]

Usually, the "corridor" around the optimal hyperplane will "touch" one or two datapoints in one class and one or two in the other – these points, or vectors, are called support vectors, hence the name of the algorithm.

³ Image from: http://www.m8j.net/(All)Geometry%20of%20Support%20Vector%20Machines
⁴ Image based on: https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html


Low C (strong regularization) vs. large C (weak/no regularization) in a linear SVM:

Trick question: Which of these classifiers generalizes best?

[Figure⁵: linear SVM boundaries for small vs. large C]

⁵ Source: https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel/


Back to our text:

Now we use SVM/SVC on the same dataset, blobs:
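(The code is a screenshot in the notes; a sketch following the book's example, assuming a three-class make_blobs dataset:)

```python
# Sketch: a linear SVM (LinearSVC) on a synthetic three-class dataset.
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)          # three classes by default
linear_svm = LinearSVC().fit(X, y)
print("Coefficient shape:", linear_svm.coef_.shape)     # (3, 2): one line per class
print("Intercept shape:", linear_svm.intercept_.shape)  # (3,)
```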

The visualization code (in the text) lets us see the three linear boundaries:

What happens in the “no man’s land”?

A: As explained in the introduction to one-vs-rest, the classifier must generate a "soft" value; for SVM/SVC, that value is the distance from the point to each class's boundary. We then pick the class whose boundary is closest, i.e. the one with the shortest distance. Geometrically, this means that the boundaries in the "no man's land" are the bisectors of the intersection triangle:


► Earlier we noted that the Scikit-learn algorithm LogisticRegression has the one-vs-rest already built in. Modify the SVM/SVC code above to use LogR instead. Compare the results!

Conclusions on linear classification algorithms:

There are two main hyperparameters (“knobs”) to tune:

Regularization parameter, named C in classification models. High C means complex models, with less regularization.

Metric or norm used to calculate the penalty for the coefficients. The parameter is named penalty, with two possible values: "l2" and "l1". The default is "l2". We use "l2" unless we have grounds to believe that many features are irrelevant for classification, in which case we use "l1".

Advantages over non-linear models:

Linear is faster (fewer computations).

Linear algorithms work well when the number of features is large compared to the number of data points.

Model is easier to understand (But remember the dangers of examining coefficients in isolation when features are correlated!)

Disadvantage over non-linear models:

Sometimes the data just isn’t linear :)

Read Method Chaining (p.70)


Handout


Solutions:

► Earlier we noted that the Scikit-learn algorithm LogisticRegression has the one-vs-rest already built in. Modify the SVM/SVC code above to use LogR instead. Compare the results!

A: We import and use the LogisticRegression model instead of LinearSVC:
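(A sketch of what that modified code might look like, reusing the blobs setup from earlier:)

```python
# Sketch: LogisticRegression on the same blobs dataset; per the text,
# one-vs-rest kicks in automatically for more than two classes.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(random_state=42)
logreg = LogisticRegression().fit(X, y)
print("Coefficient shape:", logreg.coef_.shape)   # (3, 2)
```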

and then we simply replace linear_svm with logreg in the code. The plot is similar: