Learning: Linear Methods — CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Fall 2020, Soleymani. Some slides are based on Klein and Abbeel, CS188, UC Berkeley.


  • Learning: Linear Methods
    CE417: Introduction to Artificial Intelligence

    Sharif University of Technology

    Fall 2020

    Soleymani

    Some slides are based on Klein and Abbeel, CS188, UC Berkeley.

  • Paradigms of ML

    Reinforcement learning

    partial (indirect) feedback, no explicit guidance

    given rewards for a sequence of moves, learn a policy and
    utility functions

    Supervised learning (regression, classification)

    predicting a target variable for which we get to see examples

    Unsupervised learning

    revealing structure in the observed unlabeled data

  • Components of (Supervised) Learning

    Unknown target function: f: 𝒳 → 𝒴

    Input space: 𝒳

    Output space: 𝒴

    Training data: (𝒙1, y1), (𝒙2, y2), … , (𝒙N, yN)

    Pick a formula g: 𝒳 → 𝒴 that approximates the target function f

    selected from a set of hypotheses ℋ

  • Supervised Learning:

    Regression vs. Classification

    Regression: predict a continuous target variable

    E.g., y ∈ [0,1]

    Classification: predict a discrete target variable

    E.g., y ∈ {1, 2, … , C}

  • Regression: Example

    Housing price prediction

    [Scatter plot: price ($, in 1000’s) vs. size in feet²]

    Figure adopted from slides of Andrew Ng

  • Training data: Example

    x1   x2    y
    0.9  2.3   1
    3.5  2.6   1
    2.6  3.3   1
    2.7  4.1   1
    1.8  3.9   1
    6.5  6.8  -1
    7.2  7.5  -1
    7.9  8.3  -1
    6.9  8.3  -1
    8.8  7.9  -1
    9.1  6.2  -1

    [Scatter plot of the training data in the (x1, x2) plane]

  • Classification: Example

    Cat vs. Dog by weight

    [Plot: target 1 (Dog) vs. 0 (Cat) against weight]

  • Linear regression

    Cost function:

    J(𝒘) = Σ_{i=1..n} (y^(i) − g(x^(i); 𝒘))²
         = Σ_{i=1..n} (y^(i) − w0 − w1 x^(i))²

    [Plot: fitted line with the residuals y^(i) − g(x^(i); 𝒘) marked]
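The sum-of-squares cost above can be sketched in a few lines. This is a minimal illustration (the data below are made up; `sse_cost` is a hypothetical helper name, not from the slides):

```python
import numpy as np

# SSE cost J(w) = sum_i (y_i - w0 - w1*x_i)^2 for simple linear regression.
def sse_cost(w0, w1, x, y):
    residuals = y - (w0 + w1 * x)   # y^(i) - g(x^(i); w)
    return np.sum(residuals ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])     # exactly y = 2x

print(sse_cost(0.0, 2.0, x, y))   # perfect fit -> 0.0
print(sse_cost(0.0, 1.0, x, y))   # residuals 1, 2, 3 -> 14.0
```

The cost is zero only when the line passes through every point; any residual contributes its square.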

  • Cost function

    [Left: price ($, in 1000’s) vs. size in feet² (x) with a fitted line;
     right: surface of J(𝒘) as a function of the parameters w0, w1]

    This example has been adapted from: Prof. Andrew Ng’s slides

  • Review:

    Gradient descent

    First-order optimization algorithm to find 𝒘* = argmin_𝒘 J(𝒘)

    Also known as “steepest descent”

    In each step, takes steps proportional to the negative of the
    gradient vector of the function at the current point 𝒘^t:

    𝒘^(t+1) = 𝒘^t − γ_t ∇J(𝒘^t)

    J(𝒘) decreases fastest if one goes from 𝒘^t in the direction of −∇J(𝒘^t)

    Assumption: J(𝒘) is defined and differentiable in a neighborhood of a point 𝒘^t

    Gradient ascent takes steps proportional to (the positive
    of) the gradient to find a local maximum of the function

  • Review:

    Gradient descent

    Minimize J(𝒘)

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    ∇_𝒘 J(𝒘) = [∂J(𝒘)/∂w1, ∂J(𝒘)/∂w2, … , ∂J(𝒘)/∂w_d]

    η is the step size (learning rate parameter).

    If η is small enough, then J(𝒘^(t+1)) ≤ J(𝒘^t).

    η can be allowed to change at every iteration as η_t.
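The update rule above is easy to demonstrate on a toy convex cost. This is a minimal sketch: the cost J(w) = (w1 − 3)² + (w2 + 1)² and its minimizer (3, −1) are made-up choices for illustration, not from the slides:

```python
import numpy as np

# Gradient descent: w_{t+1} = w_t - eta * grad J(w_t).
# Toy cost J(w) = (w1 - 3)^2 + (w2 + 1)^2, gradient [2(w1-3), 2(w2+1)].
def grad_J(w):
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)
eta = 0.1                      # step size (learning rate)
for _ in range(200):
    w = w - eta * grad_J(w)

print(np.round(w, 4))          # converges close to [3, -1]
```

Because this J is convex, the iterates approach the unique minimizer for any small enough η.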

  • Review:

    Gradient descent disadvantages

    Local minima problem

    However, when J is convex, all local minima are also global
    minima ⇒ gradient descent can converge to the global solution.

  • Review: Problem of gradient descent with

    non-convex cost functions

    [Two surface plots of J(w0, w1): different starting points lead
     gradient descent to different local minima]

    This example has been adopted from: Prof. Ng’s slides (ML Online Course, Stanford)

  • Gradient descent for SSE cost function

    Minimize J(𝒘):  𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    J(𝒘): Sum of squares error

    J(𝒘) = Σ_{i=1..n} (y^(i) − g(𝒙^(i); 𝒘))²

    Weight update rule for g(𝒙; 𝒘) = 𝒘ᵀ𝒙:

    𝒘^(t+1) = 𝒘^t + η Σ_{i=1..n} (y^(i) − (𝒘^t)ᵀ𝒙^(i)) 𝒙^(i)

    𝒘 = (w0, w1, … , w_d)ᵀ,  𝒙 = (1, x1, … , x_d)ᵀ

  • Gradient descent for SSE cost function

    Weight update rule for g(𝒙; 𝒘) = 𝒘ᵀ𝒙:

    𝒘^(t+1) = 𝒘^t + η Σ_{i=1..n} (y^(i) − 𝒘ᵀ𝒙^(i)) 𝒙^(i)

    Batch mode: each step considers all training data.

    η too small → gradient descent can be slow.

    η too large → gradient descent can overshoot the minimum.
    It may fail to converge, or even diverge.
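The batch update rule above can be sketched directly. A minimal illustration with made-up data generated from y = 1 + 2x; the step size 0.01 is an assumption chosen small enough to converge on this data:

```python
import numpy as np

# Batch update w <- w + eta * sum_i (y_i - w^T x_i) x_i
# for g(x; w) = w^T x, with a bias feature x_0 = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # each row is x = (1, x1)

w = np.zeros(2)
eta = 0.01
for _ in range(5000):
    w = w + eta * (y - X @ w) @ X           # batch mode: all training data

print(np.round(w, 3))                       # approaches [1, 2]
```

With a too-large η (say 0.2 here) the same loop diverges, which is exactly the overshoot behavior the slide warns about.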

  • g(x; w0, w1) = w0 + w1 x and J(w0, w1) (function of the parameters w0, w1)

    [Slides 17–25: successive gradient descent iterations shown as a
     sequence of points on the contour plot of J(w0, w1), with the
     corresponding fitted line g(x; w0, w1) at each step]

    This example has been adopted from: Prof. Ng’s slides (ML Online Course, Stanford)

  • Linear Classifiers


  • Error-Driven Classification


  • Errors, and What to Do

    Examples of errors

    “Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft
    to offer you the latest version of OmniPage Pro, for just $99.99* -
    the regular list price is $499! The most common question we've
    received about this offer is - Is this genuine? We would like to
    assure you that this offer is authorized by ScanSoft, is genuine
    and valid. You can get the . . .”

    “. . . To receive your $30 Amazon.com promotional certificate, click
    through to http://www.amazon.com/apparel and see the prominent link
    for the $30 offer. All details are there. We hope you enjoyed
    receiving this message. However, if you'd rather not receive future
    e-mails announcing new store launches, please click . . .”

  • What to Do About Errors

    Problem: there’s still spam in your inbox

    Need more features – words aren’t enough!

    Have you emailed the sender before?

    Have 1M other people just gotten the same email?

    Is the sending information consistent?

    Is the email in ALL CAPS?

    Do inline URLs point where they say they point?

    Does the email address you by (your) name?

    Naïve Bayes models can incorporate a variety of features, but tend to
    do best in homogeneous cases (e.g. all features are word occurrences)

  • Feature Vectors

    We show the input by 𝒙 or f(𝒙)

    “Hello, Do you want free printr cartriges? Why pay more when you
    can get them ABSOLUTELY FREE! Just …”

    # free : 2
    YOUR_NAME : 0
    MISSPELLED : 2
    FROM_FRIEND : 0
    ...
    → SPAM or +

    [Image of a handwritten digit “2”]

    PIXEL-7,12 : 1
    PIXEL-7,13 : 0
    ...
    NUM_LOOPS : 1
    ...
    → “2”

  • Weights

    Binary case: compare features to a weight vector to identify the class

    Learning: figure out the weight vector from examples

    Weight vector 𝒘:

    # free : 4
    YOUR_NAME :-1
    MISSPELLED : 1
    FROM_FRIEND :-3
    ...

    Feature vectors f(𝒙):

    # free : 2
    YOUR_NAME : 0
    MISSPELLED : 2
    FROM_FRIEND : 0
    ...

    # free : 0
    YOUR_NAME : 1
    MISSPELLED : 1
    FROM_FRIEND : 1
    ...

    Dot product positive means the positive class

  • Binary Decision Rule

    In the space of feature vectors

    Examples are points

    Any weight vector is a hyperplane

    One side corresponds to Y = +1

    Other corresponds to Y = -1

    BIAS : -3
    free : 4
    money : 2
    ...

    [Plot: the line −3 + 4·free + 2·money = 0 in the (free, money) plane]

    +1 = SPAM

    -1 = HAM
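The decision rule above is just the sign of a dot product. A minimal sketch using the slide's weight values (BIAS −3, free 4, money 2); the two feature vectors below are hypothetical emails, not from the slides:

```python
# Binary decision rule: classify by the sign of w . f(x).
w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

def score(w, f):
    # Dot product over the named features; missing features count as 0.
    return sum(w[k] * f.get(k, 0.0) for k in w)

spam_features = {"BIAS": 1.0, "free": 1.0, "money": 1.0}   # -3 + 4 + 2 = 3
ham_features  = {"BIAS": 1.0, "free": 0.0, "money": 0.0}   # -3

print(score(w, spam_features))   # 3.0 > 0  -> +1 = SPAM
print(score(w, ham_features))    # -3.0 < 0 -> -1 = HAM
```

Geometrically, the two emails lie on opposite sides of the hyperplane −3 + 4·free + 2·money = 0.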

  • Square error loss function for classification!

    Square error loss is not suitable for classification:

    Least square loss penalizes ‘too correct’ predictions (those that
    lie a long way on the correct side of the decision boundary)

    Least square loss also lacks robustness to noise

    J(𝒘) = Σ_{i=1..N} (𝒘ᵀ𝒙^(i) + w0 − y^(i))²    (K = 2)

  • Perceptron criterion

    Two-class: y ∈ {−1, 1} (y = −1 for C2, y = 1 for C1)

    Goal: ∀i, 𝒙^(i) ∈ C1 ⇒ 𝒘ᵀ𝒙^(i) > 0
          ∀i, 𝒙^(i) ∈ C2 ⇒ 𝒘ᵀ𝒙^(i) < 0

    J_P(𝒘) = − Σ_{i∈ℳ} 𝒘ᵀ𝒙^(i) y^(i)

    ℳ: subset of training data that are misclassified

    Many solutions? Which solution among them?

  • Batch Perceptron

    “Gradient descent” to solve the optimization problem:

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J_P(𝒘^t)

    ∇_𝒘 J_P(𝒘) = − Σ_{i∈ℳ} 𝒙^(i) y^(i)

    Batch Perceptron converges in a finite number of steps for linearly
    separable data:

    Initialize 𝒘
    Repeat
        𝒘 = 𝒘 + η Σ_{i∈ℳ} 𝒙^(i) y^(i)
    Until convergence

  • Stochastic gradient descent for Perceptron

    Single-sample perceptron:

    If 𝒙^(i) is misclassified:
        𝒘^(t+1) = 𝒘^t + η 𝒙^(i) y^(i)

    Perceptron convergence theorem: for linearly separable data

    If training data are linearly separable, the single-sample perceptron
    is also guaranteed to find a solution in a finite number of steps

    Fixed-increment single-sample Perceptron (η can be set to 1 and the
    proof still works):

    Initialize 𝒘, t ← 0
    repeat
        t ← t + 1
        i ← t mod N
        if 𝒙^(i) is misclassified then 𝒘 = 𝒘 + 𝒙^(i) y^(i)
    until all patterns properly classified
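The fixed-increment rule above can be sketched on a few rows of the earlier training-data table (first three points of class +1, last three of class −1); the pass structure and stopping check are an implementation choice, not from the slides:

```python
import numpy as np

# Fixed-increment single-sample perceptron (eta = 1):
# on a mistake, update w <- w + x^(i) y^(i).
X = np.array([[1.0, 0.9, 2.3],   # x = (1, x1, x2); first three are class +1
              [1.0, 3.5, 2.6],
              [1.0, 2.6, 3.3],
              [1.0, 6.5, 6.8],   # last three are class -1
              [1.0, 7.2, 7.5],
              [1.0, 7.9, 8.3]])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(3)
for _ in range(100):                      # passes over the data
    mistakes = 0
    for i in range(len(y)):
        if y[i] * (w @ X[i]) <= 0:        # misclassified (or on the boundary)
            w = w + X[i] * y[i]
            mistakes += 1
    if mistakes == 0:                     # all patterns properly classified
        break

print(np.sign(X @ w))                     # matches y once converged
```

Because this data is linearly separable, the convergence theorem guarantees the loop reaches a mistake-free pass in finitely many updates.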

  • Weight Updates


  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights

    If correct (i.e., ŷ = y), no change!

    If wrong: adjust the weight vector

    𝒘^(t+1) = 𝒘^t + 𝒙^(i) y^(i)

  • Example

  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights: ŷ = +1 if 𝒘 · f(𝒙) ≥ 0, else −1

  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights

    If correct (i.e., ŷ = y), no change!

    If wrong: adjust the weight vector by adding or subtracting the
    feature vector (subtract if y is -1):

    𝒘 = 𝒘 + y f(𝒙)

  • Multiclass Decision Rule

    If we have multiple classes:

    A weight vector for each class: 𝒘_y

    Score (activation) of a class y: 𝒘_yᵀ f(𝒙)

    Prediction: highest score wins, ŷ = argmax_y 𝒘_yᵀ f(𝒙)

    Binary = multiclass where the negative class has weight zero

  • Learning: Multiclass Perceptron

    Start with all weights = 0

    Pick up training examples one by one

    Predict with current weights: ŷ = argmax_y 𝒘_yᵀ f(𝒙)

  • Learning: Multiclass Perceptron

    Start with all weights = 0

    Pick up training examples one by one

    Predict with current weights: ŷ = argmax_y 𝒘_yᵀ f(𝒙)

    If correct, no change!

    If wrong: lower the score of the wrong answer, raise the score of
    the right answer:

    𝒘_ŷ = 𝒘_ŷ − f(𝒙)
    𝒘_y = 𝒘_y + f(𝒙)
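The two-sided update above is easy to sketch with one weight row per class. The three-class toy data (clusters near the origin and along each axis, with a bias feature) are made up for illustration:

```python
import numpy as np

# Multiclass perceptron: predict argmax_k w_k . f(x); on a mistake,
# lower the wrong class's weights and raise the right class's weights.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 5.0, 0.0],
              [1.0, 0.0, 5.0],
              [1.0, 0.5, 0.5],
              [1.0, 4.5, 0.5],
              [1.0, 0.5, 4.5]])
y = np.array([0, 1, 2, 0, 1, 2])

W = np.zeros((3, X.shape[1]))           # one row of weights per class
for _ in range(1000):                   # passes over the data
    mistakes = 0
    for i in range(len(y)):
        y_hat = int(np.argmax(W @ X[i]))
        if y_hat != y[i]:
            W[y_hat] -= X[i]            # lower score of wrong answer
            W[y[i]]  += X[i]            # raise score of right answer
            mistakes += 1
    if mistakes == 0:
        break

print(np.argmax(X @ W.T, axis=1))       # matches y once converged
```

Setting one class's weights to zero and using two classes recovers the binary rule from the previous slides.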

  • Properties of Perceptrons

    Separability: true if some parameters get the training set
    perfectly classified

    Convergence: if the training set is separable, perceptron will
    eventually converge (binary case)

    Mistake Bound: the maximum number of mistakes (binary case) is
    related to the margin or degree of separability

    [Plots: a separable and a non-separable point set]

  • Examples: Perceptron

    Non-Separable Case

  • Discriminative approach: logistic regression

    g(𝒙; 𝒘) = σ(𝒘ᵀ𝒙)    (K = 2)

    σ(.) is an activation function

    Sigmoid (logistic) activation function:

    σ(z) = 1 / (1 + e^(−z))

    𝒙 = (1, x1, … , x_d),  𝒘 = (w0, w1, … , w_d)

  • Logistic regression: cost function

    ŵ = argmin_𝒘 J(𝒘)

    J(𝒘) = Σ_{i=1..n} [ −y^(i) log σ(𝒘ᵀ𝒙^(i)) − (1 − y^(i)) log(1 − σ(𝒘ᵀ𝒙^(i))) ]

    J(𝒘) is convex w.r.t. the parameters.

  • Logistic regression: loss function

    Loss(y, f(𝒙; 𝒘)) = −y log σ(𝒙; 𝒘) − (1 − y) log(1 − σ(𝒙; 𝒘))

    Since y = 1 or y = 0:

    Loss(y, σ(𝒙; 𝒘)) = −log σ(𝒙; 𝒘)        if y = 1
                       −log(1 − σ(𝒙; 𝒘))   if y = 0

    where σ(𝒙; 𝒘) = 1 / (1 + exp(−𝒘ᵀ𝒙))

    How is it related to zero-one loss?

    Loss(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ

  • Logistic regression: Gradient descent

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    ∇_𝒘 J(𝒘) = Σ_{i=1..n} (σ(𝒘ᵀ𝒙^(i)) − y^(i)) 𝒙^(i)

    Is it similar to the gradient of SSE for linear regression?

    ∇_𝒘 J(𝒘) = Σ_{i=1..n} (𝒘ᵀ𝒙^(i) − y^(i)) 𝒙^(i)
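The logistic-regression gradient step above can be sketched with a few lines; the 1-D toy data (with a bias feature) and the step size are made-up choices for illustration:

```python
import numpy as np

# Gradient descent for logistic regression with y in {0, 1}:
# w <- w - eta * sum_i (sigma(w^T x_i) - y_i) x_i
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
eta = 0.1
for _ in range(2000):
    grad = (sigmoid(X @ w) - y) @ X      # sum_i (sigma(w^T x_i) - y_i) x_i
    w = w - eta * grad

preds = (sigmoid(X @ w) >= 0.5).astype(float)
print(preds)                             # separates the two classes
```

Note the gradient has exactly the (prediction − target) × input shape of the SSE gradient, with σ(𝒘ᵀ𝒙) in place of 𝒘ᵀ𝒙.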

  • Multi-class classifier

    g(𝒙; 𝑾) = [g1(𝒙, 𝑾), … , g_K(𝒙, 𝑾)]

    𝑾 = [𝒘1 ⋯ 𝒘_K] contains one vector of parameters for each class

    In linear classifiers, 𝑾 is d × K where d shows the number of features

    𝑾ᵀ𝒙 provides us a vector

    g(𝒙; 𝑾) contains K numbers giving class scores for the input 𝒙

  • Multi-class logistic regression

    g(𝒙; 𝑾) = [g1(𝒙, 𝑾), … , g_K(𝒙, 𝑾)]ᵀ

    𝑾 = [𝒘1 ⋯ 𝒘_K] contains one vector of parameters for each class

    g_k(𝒙; 𝑾) = exp(𝒘_kᵀ𝒙) / Σ_{j=1..K} exp(𝒘_jᵀ𝒙)

  • Logistic regression: multi-class

    Ŵ = argmin_𝑾 J(𝑾)

    J(𝑾) = − Σ_{i=1..n} Σ_{k=1..K} y_k^(i) log g_k(𝒙^(i); 𝑾)

    𝑾 = [𝒘1 ⋯ 𝒘_K]

    𝒚 is a vector of length K (1-of-K coding)
    e.g., 𝒚 = [0, 0, 1, 0]ᵀ when the target class is C3

  • Logistic regression: multi-class

    𝒘_j^(t+1) = 𝒘_j^t − η ∇_{𝒘_j} J(𝑾^t)

    ∇_{𝒘_j} J(𝑾) = Σ_{i=1..n} (g_j(𝒙^(i); 𝑾) − y_j^(i)) 𝒙^(i)
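The multi-class gradient above can be sketched end to end. A deliberately tiny illustration: the indicator features (one per class) and 1-of-K targets are made up so that convergence is immediate and easy to check:

```python
import numpy as np

# Softmax (multi-class logistic regression) gradient descent:
# grad_{w_j} J(W) = sum_i (g_j(x_i; W) - y_j^(i)) x_i, with one-hot y.
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stabilized
    return e / e.sum(axis=-1, keepdims=True)

X = np.eye(3)          # toy indicator features, one example per class
Y = np.eye(3)          # 1-of-K coding: example i belongs to class i

W = np.zeros((3, 3))   # d x K, one column of parameters per class
eta = 0.5
for _ in range(200):
    G = softmax(X @ W)                 # n x K class probabilities
    W = W - eta * X.T @ (G - Y)        # gradient step on J(W)

print(np.argmax(softmax(X @ W), axis=1))   # recovers classes [0, 1, 2]
```

The matrix form `X.T @ (G - Y)` stacks the per-class gradients ∇_{𝒘_j} J(𝑾) as columns, so one line updates all K weight vectors.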

  • Logistic regression: Probabilistic perspective

    Maximum likelihood estimation with:

    P(y^(i) = +1 | 𝒙^(i); 𝒘) = 1 / (1 + e^(−𝒘ᵀ𝒙))

    P(y^(i) = 0 | 𝒙^(i); 𝒘) = 1 − 1 / (1 + e^(−𝒘ᵀ𝒙))

    = Two-class Logistic Regression

  • Multiclass Logistic Regression

    Multi-class linear classification

    A weight vector for each class: 𝒘_k

    Score (activation) of a class k: z_k = 𝒘_kᵀ𝒙

    Prediction w/highest score wins: ŷ = argmax_k 𝒘_kᵀ𝒙

    How to make the scores into probabilities?

    [Diagram: original activations 𝒘1ᵀ𝒙, 𝒘2ᵀ𝒙, 𝒘3ᵀ𝒙 → softmax activations]

  • Logistic regression: Probabilistic perspective

    Maximum likelihood estimation with:

    P(y^(i) | 𝒙^(i); 𝒘) = exp(𝒘_{y^(i)}ᵀ𝒙^(i)) / Σ_{k=1..K} exp(𝒘_kᵀ𝒙^(i))

    = Multi-Class Logistic Regression

  • Logistic regression: multi-class

    𝒘_j^(t+1) = 𝒘_j^t − η ∇_{𝒘_j} J(𝑾^t)

    ∇_{𝒘_j} J(𝑾) = Σ_{i=1..n} (P(y_j | 𝒙^(i); 𝑾) − y_j^(i)) 𝒙^(i)

  • Example

    How can we tell whether this W and b is good or bad?

    𝒙 = (x1, … , x4)ᵀ,  𝑾ᵀ = [𝒘1; 𝒘2; 𝒘3] (3×4),  𝒘0 = 𝒃 = (b1, b2, b3)ᵀ

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017

  • Bias can also be included in the W matrix

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017

  • Softmax classifier loss: example

    L^(i) = −log( exp(s_{y^(i)}) / Σ_{j=1..K} exp(s_j) ),   s_k = 𝒘_kᵀ𝒙

    L^(1) = −log 0.13 = 0.89

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017
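The number on the slide can be reproduced numerically. Assuming the standard cs231n example scores (3.2, 5.1, −1.7, with the correct class first): the softmax probability of the correct class comes out to 0.13, and the slide's 0.89 corresponds to a base-10 log (the natural log gives 2.04):

```python
import numpy as np

# Numeric check of L^(i) = -log( e^{s_y} / sum_j e^{s_j} ).
scores = np.array([3.2, 5.1, -1.7])        # correct class has score 3.2
probs = np.exp(scores) / np.exp(scores).sum()

print(np.round(probs, 2))                  # [0.13 0.87 0.  ]
print(round(-np.log10(probs[0]), 2))       # 0.89 with a base-10 log
print(round(-np.log(probs[0]), 2))         # 2.04 with the natural log
```

Either base gives a valid loss (they differ by a constant factor), but implementations almost always use the natural log.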

  • Generalized linear

    Linear combination of fixed non-linear functions of the input vector

    f(𝒙; 𝒘) = w0 + w1 φ1(𝒙) + … + w_m φ_m(𝒙)

    {φ1(𝒙), … , φ_m(𝒙)}: set of basis functions (or features)

    φ_i(𝒙): ℝ^d → ℝ

  • Basis functions: examples

    Linear: φ_i(𝒙) = x_i

    Polynomial (univariate): φ_i(x) = x^i

  • Polynomial regression: example

    [Four fits to the same data with polynomial degrees m = 1, 3, 5, 7]
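Fits like the ones in the figure can be sketched with least squares on polynomial basis functions φ_i(x) = x^i. The noisy sine data below are synthetic, chosen to mimic the usual textbook example:

```python
import numpy as np

# Polynomial regression: least-squares fit with phi(x) = (1, x, ..., x^m).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def fit(x, y, m):
    Phi = np.vander(x, m + 1, increasing=True)     # columns 1, x, ..., x^m
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, Phi

errs = {}
for m in (1, 3, 9):
    w, Phi = fit(x, y, m)
    errs[m] = np.mean((y - Phi @ w) ** 2)          # training error
    print(m, round(errs[m], 4))
```

The training error only shrinks as m grows (the models are nested), which is exactly why training error alone cannot be used to pick m; that is the overfitting discussion later in the deck.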

  • Generalized linear classifier

    Assume a transformation φ: ℝ^d → ℝ^m on the feature space

    𝒙 → 𝝓(𝒙),  𝝓(𝒙) = [φ1(𝒙), … , φ_m(𝒙)]

    {φ1(𝒙), … , φ_m(𝒙)}: set of basis functions (or features), φ_i(𝒙): ℝ^d → ℝ

    Find a hyperplane in the transformed feature space:

    𝒘ᵀ𝝓(𝒙) + w0 = 0

    [Plot: data that are non-separable in (x1, x2) become separable in
     (φ1(𝒙), φ2(𝒙))]

  • Model complexity and overfitting

    With limited training data, models may achieve zero training error
    but a large test error.

    Training (empirical) loss:  (1/n) Σ_{i=1..n} (y^(i) − g(𝒙^(i); 𝒘))² ≈ 0

    Expected (true) loss:  E_{𝒙,y}[ (y − g(𝒙; 𝒘))² ] ≫ 0

    Over-fitting: when the training loss no longer bears any relation to
    the test (generalization) loss.

    Fails to generalize to unseen examples.

  • Polynomial regression

    [Four fits with m = 0, 1, 3, 9: underfitting at m = 0 and m = 1, a
     good fit at m = 3, overfitting at m = 9]

    [Bishop]

  • Over-fitting causes

    Model complexity

    E.g., a model with a large number of parameters (degrees of freedom)

    Low number of training data

    Small data size compared to the complexity of the model

  • Number of training data & overfitting

    The over-fitting problem becomes less severe as the size of the
    training data increases.

    [Two m = 9 fits: n = 15 (overfits) vs. n = 100 (fits well)]

    [Bishop]

  • How to evaluate the learner’s performance?

    Generalization error: true (or expected) error that we would like
    to optimize

    Two ways to assess the generalization error are:

    Practical: use a separate data set to test the model

    Theoretical: Law of Large Numbers

    statistical bounds on the difference between training and expected
    errors

  • Avoiding over-fitting


    Determine a suitable value for model complexity (Model

    Selection)

    Simple hold-out method

    Cross-validation

    Regularization (Occam’s Razor)

    Explicit preference towards simple models

    Penalize for the model complexity in the objective function

  • Model Selection

    The learning algorithm defines the data-driven search over the
    hypothesis space (i.e. search for good parameters)

    Hyperparameters are the tunable aspects of the model that the
    learning algorithm does not select

    This slide has been adopted from CMU ML course:
    http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  • Model Selection


    Model selection is the process by which we choose the

    “best” model from among a set of candidates

    assume access to a function capable of measuring the quality of

    a model

    typically done “outside” the main training algorithm

    Model selection / hyperparameter optimization is just

    another form of learning

    This slide has been adopted from CMU ML course:

    http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  • Simple hold out:

    training, validation, and test sets

    Simple hold-out chooses the model that minimizes error on the
    validation set.

    Estimate the generalization error on the test set:

    the performance of the selected model is finally evaluated on the
    test set

    [Data split into Training | Validation | Test]

  • Simple hold-out: model selection

    Steps:

    Divide training data into a training set and a validation set v_set

    Use only the training set to train a set of models

    Evaluate each learned model on the validation set:

    J_v(𝒘) = (1/|v_set|) Σ_{i∈v_set} (y^(i) − f(𝒙^(i); 𝒘))²

    Choose the best model based on the validation set error

    Usually wasteful of valuable training data:

    Training data may be limited.

    On the other hand, a small validation set gives a relatively noisy
    estimate of performance.

    The cross-validation technique is used instead.
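The steps above can be sketched as a small script; the synthetic data, the 20/10 split, and the candidate polynomial degrees are all made-up choices for illustration:

```python
import numpy as np

# Simple hold-out: fit candidate models on the training part only,
# then pick the one with the lowest validation error J_v.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

train, val = np.arange(20), np.arange(20, 30)      # hold out the last 10 points

def val_error(m):
    w = np.polyfit(x[train], y[train], m)          # train on the training set only
    r = y[val] - np.polyval(w, x[val])
    return np.mean(r ** 2)                         # J_v on the validation set

degrees = [1, 3, 5, 9]
errors = {m: val_error(m) for m in degrees}
best = min(errors, key=errors.get)
print(best, round(errors[best], 4))                # model chosen by hold-out
```

Cross-validation replaces the single split with an average over several rotated splits, which reduces the noise of this estimate at the cost of more training runs.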

  • Regularization

    Adding a penalty term in the cost function to discourage the
    coefficients from reaching large values.

    Ridge regression (weight decay):

    J(𝒘) = Σ_{i=1..n} (y^(i) − 𝒘ᵀ𝝓(𝒙^(i)))² + λ 𝒘ᵀ𝒘

  • Regularization parameter

    Generalization

    λ now controls the effective complexity of the model and hence
    determines the degree of over-fitting

    [Bishop]

  • Choosing the regularization parameter

    A set of models with different values of λ.

    Find ŵ for each model based on training data

    Find J_v(ŵ) (or J_cv(ŵ)) for each model:

    J_v(ŵ) = (1/n_v) Σ_{i∈v_set} (y^(i) − f(x^(i); ŵ))²

    Select the model with the best J_v(ŵ) (or J_cv(ŵ))

  • Resources

    Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin,
    “Learning from Data”, Sections 2.3, 3.2, 3.4.