Learning: Linear Methods — CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Fall 2020, Soleymani. Some slides are based on Klein and Abbeel, CS188, UC Berkeley.


  • Learning: Linear Methods
    CE417: Introduction to Artificial Intelligence

    Sharif University of Technology

    Fall 2020

    Soleymani

    Some slides are based on Klein and Abbeel, CS188, UC Berkeley.

  • Paradigms of ML

    Reinforcement learning

    partial (indirect) feedback, no explicit guidance

    given rewards for a sequence of moves, learn a policy and
    utility functions

    Supervised learning (regression, classification)

    predicting a target variable for which we get to see examples

    Unsupervised learning

    revealing structure in the observed unlabeled data

  • Components of (Supervised) Learning

    Unknown target function: f: 𝒳 → 𝒴

    Input space: 𝒳

    Output space: 𝒴

    Training data: (𝒙1, y1), (𝒙2, y2), … , (𝒙N, yN)

    Pick a formula g: 𝒳 → 𝒴 that approximates the target function f

    selected from a set of hypotheses ℋ

  • Supervised Learning:

    Regression vs. Classification

    Regression: predict a continuous target variable

    E.g., y ∈ [0,1]

    Classification: predict a discrete target variable

    E.g., y ∈ {1, 2, … , C}

  • Regression: Example

    Housing price prediction

    [Scatter plot: price ($, in 1000’s) vs. size in feet²]

    Figure adopted from slides of Andrew Ng

  • Training data: Example

    x1   x2    y
    0.9  2.3   1
    3.5  2.6   1
    2.6  3.3   1
    2.7  4.1   1
    1.8  3.9   1
    6.5  6.8  -1
    7.2  7.5  -1
    7.9  8.3  -1
    6.9  8.3  -1
    8.8  7.9  -1
    9.1  6.2  -1

    [Scatter plot of the training data in the (x1, x2) plane]

  • Classification: Example

    Cat vs. Dog by weight

    [Plot: target 1 (Dog) vs. 0 (Cat) against weight]

  • Linear regression

    Cost function:

    J(𝒘) = Σ_{i=1..n} (y^(i) − g(x^(i); 𝒘))²
         = Σ_{i=1..n} (y^(i) − w0 − w1 x^(i))²

    [Plot: fitted line with the residuals y^(i) − g(x^(i); 𝒘) marked]
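The sum-of-squares cost above can be sketched in a few lines. This is a minimal illustration (the data below are made up; `sse_cost` is a hypothetical helper name, not from the slides):

```python
import numpy as np

# SSE cost J(w) = sum_i (y_i - w0 - w1*x_i)^2 for simple linear regression.
def sse_cost(w0, w1, x, y):
    residuals = y - (w0 + w1 * x)   # y^(i) - g(x^(i); w)
    return np.sum(residuals ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])     # exactly y = 2x

print(sse_cost(0.0, 2.0, x, y))   # perfect fit -> 0.0
print(sse_cost(0.0, 1.0, x, y))   # residuals 1, 2, 3 -> 14.0
```

The cost is zero only when the line passes through every point; any residual contributes its square.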

  • Cost function

    [Left: price ($, in 1000’s) vs. size in feet² (x) with a fitted line;
     right: surface of J(𝒘) as a function of the parameters w0, w1]

    This example has been adapted from: Prof. Andrew Ng’s slides

  • Review:

    Gradient descent

    First-order optimization algorithm to find 𝒘* = argmin_𝒘 J(𝒘)

    Also known as “steepest descent”

    In each step, takes steps proportional to the negative of the
    gradient vector of the function at the current point 𝒘^t:

    𝒘^(t+1) = 𝒘^t − γ_t ∇J(𝒘^t)

    J(𝒘) decreases fastest if one goes from 𝒘^t in the direction of −∇J(𝒘^t)

    Assumption: J(𝒘) is defined and differentiable in a neighborhood of a point 𝒘^t

    Gradient ascent takes steps proportional to (the positive
    of) the gradient to find a local maximum of the function

  • Review:

    Gradient descent

    Minimize J(𝒘)

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    ∇_𝒘 J(𝒘) = [∂J(𝒘)/∂w1, ∂J(𝒘)/∂w2, … , ∂J(𝒘)/∂w_d]

    η is the step size (learning rate parameter).

    If η is small enough, then J(𝒘^(t+1)) ≤ J(𝒘^t).

    η can be allowed to change at every iteration as η_t.
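The update rule above is easy to demonstrate on a toy convex cost. This is a minimal sketch: the cost J(w) = (w1 − 3)² + (w2 + 1)² and its minimizer (3, −1) are made-up choices for illustration, not from the slides:

```python
import numpy as np

# Gradient descent: w_{t+1} = w_t - eta * grad J(w_t).
# Toy cost J(w) = (w1 - 3)^2 + (w2 + 1)^2, gradient [2(w1-3), 2(w2+1)].
def grad_J(w):
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)
eta = 0.1                      # step size (learning rate)
for _ in range(200):
    w = w - eta * grad_J(w)

print(np.round(w, 4))          # converges close to [3, -1]
```

Because this J is convex, the iterates approach the unique minimizer for any small enough η.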

  • Review:

    Gradient descent disadvantages

    Local minima problem

    However, when J is convex, all local minima are also global
    minima ⇒ gradient descent can converge to the global solution.

  • Review: Problem of gradient descent with

    non-convex cost functions

    [Two surface plots of J(w0, w1): different starting points lead
     gradient descent to different local minima]

    This example has been adopted from: Prof. Ng’s slides (ML Online Course, Stanford)

  • Gradient descent for SSE cost function

    Minimize J(𝒘):  𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    J(𝒘): Sum of squares error

    J(𝒘) = Σ_{i=1..n} (y^(i) − g(𝒙^(i); 𝒘))²

    Weight update rule for g(𝒙; 𝒘) = 𝒘ᵀ𝒙:

    𝒘^(t+1) = 𝒘^t + η Σ_{i=1..n} (y^(i) − (𝒘^t)ᵀ𝒙^(i)) 𝒙^(i)

    𝒘 = (w0, w1, … , w_d)ᵀ,  𝒙 = (1, x1, … , x_d)ᵀ

  • Gradient descent for SSE cost function

    Weight update rule for g(𝒙; 𝒘) = 𝒘ᵀ𝒙:

    𝒘^(t+1) = 𝒘^t + η Σ_{i=1..n} (y^(i) − 𝒘ᵀ𝒙^(i)) 𝒙^(i)

    Batch mode: each step considers all training data.

    η too small → gradient descent can be slow.

    η too large → gradient descent can overshoot the minimum.
    It may fail to converge, or even diverge.
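The batch update rule above can be sketched directly. A minimal illustration with made-up data generated from y = 1 + 2x; the step size 0.01 is an assumption chosen small enough to converge on this data:

```python
import numpy as np

# Batch update w <- w + eta * sum_i (y_i - w^T x_i) x_i
# for g(x; w) = w^T x, with a bias feature x_0 = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # each row is x = (1, x1)

w = np.zeros(2)
eta = 0.01
for _ in range(5000):
    w = w + eta * (y - X @ w) @ X           # batch mode: all training data

print(np.round(w, 3))                       # approaches [1, 2]
```

With a too-large η (say 0.2 here) the same loop diverges, which is exactly the overshoot behavior the slide warns about.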

  • g(x; w0, w1) = w0 + w1 x and J(w0, w1) (function of the parameters w0, w1)

    [Slides 17–25: successive gradient descent iterations shown as a
     sequence of points on the contour plot of J(w0, w1), with the
     corresponding fitted line g(x; w0, w1) at each step]

    This example has been adopted from: Prof. Ng’s slides (ML Online Course, Stanford)

  • Linear Classifiers


  • Error-Driven Classification


  • Errors, and What to Do

    Examples of errors

    “Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft
    to offer you the latest version of OmniPage Pro, for just $99.99* -
    the regular list price is $499! The most common question we've
    received about this offer is - Is this genuine? We would like to
    assure you that this offer is authorized by ScanSoft, is genuine
    and valid. You can get the . . .”

    “. . . To receive your $30 Amazon.com promotional certificate, click
    through to http://www.amazon.com/apparel and see the prominent link
    for the $30 offer. All details are there. We hope you enjoyed
    receiving this message. However, if you'd rather not receive future
    e-mails announcing new store launches, please click . . .”

  • What to Do About Errors

    Problem: there’s still spam in your inbox

    Need more features – words aren’t enough!

    Have you emailed the sender before?

    Have 1M other people just gotten the same email?

    Is the sending information consistent?

    Is the email in ALL CAPS?

    Do inline URLs point where they say they point?

    Does the email address you by (your) name?

    Naïve Bayes models can incorporate a variety of features, but tend to
    do best in homogeneous cases (e.g. all features are word occurrences)

  • Feature Vectors

    We show the input by 𝒙 or f(𝒙)

    “Hello, Do you want free printr cartriges? Why pay more when you
    can get them ABSOLUTELY FREE! Just …”

    # free : 2
    YOUR_NAME : 0
    MISSPELLED : 2
    FROM_FRIEND : 0
    ...
    → SPAM or +

    [Image of a handwritten digit “2”]

    PIXEL-7,12 : 1
    PIXEL-7,13 : 0
    ...
    NUM_LOOPS : 1
    ...
    → “2”

  • Weights

    Binary case: compare features to a weight vector to identify the class

    Learning: figure out the weight vector from examples

    Weight vector 𝒘:

    # free : 4
    YOUR_NAME :-1
    MISSPELLED : 1
    FROM_FRIEND :-3
    ...

    Feature vectors f(𝒙):

    # free : 2
    YOUR_NAME : 0
    MISSPELLED : 2
    FROM_FRIEND : 0
    ...

    # free : 0
    YOUR_NAME : 1
    MISSPELLED : 1
    FROM_FRIEND : 1
    ...

    Dot product positive means the positive class

  • Binary Decision Rule

    In the space of feature vectors

    Examples are points

    Any weight vector is a hyperplane

    One side corresponds to Y = +1

    Other corresponds to Y = -1

    BIAS : -3
    free : 4
    money : 2
    ...

    [Plot: the line −3 + 4·free + 2·money = 0 in the (free, money) plane]

    +1 = SPAM

    -1 = HAM
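The decision rule above is just the sign of a dot product. A minimal sketch using the slide's weight values (BIAS −3, free 4, money 2); the two feature vectors below are hypothetical emails, not from the slides:

```python
# Binary decision rule: classify by the sign of w . f(x).
w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

def score(w, f):
    # Dot product over the named features; missing features count as 0.
    return sum(w[k] * f.get(k, 0.0) for k in w)

spam_features = {"BIAS": 1.0, "free": 1.0, "money": 1.0}   # -3 + 4 + 2 = 3
ham_features  = {"BIAS": 1.0, "free": 0.0, "money": 0.0}   # -3

print(score(w, spam_features))   # 3.0 > 0  -> +1 = SPAM
print(score(w, ham_features))    # -3.0 < 0 -> -1 = HAM
```

Geometrically, the two emails lie on opposite sides of the hyperplane −3 + 4·free + 2·money = 0.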

  • Square error loss function for classification!

    Square error loss is not suitable for classification:

    Least square loss penalizes ‘too correct’ predictions (those that
    lie a long way on the correct side of the decision boundary)

    Least square loss also lacks robustness to noise

    J(𝒘) = Σ_{i=1..N} (𝒘ᵀ𝒙^(i) + w0 − y^(i))²    (K = 2)

  • Perceptron criterion

    Two-class: y ∈ {−1, 1} (y = −1 for C2, y = 1 for C1)

    Goal: ∀i, 𝒙^(i) ∈ C1 ⇒ 𝒘ᵀ𝒙^(i) > 0
          ∀i, 𝒙^(i) ∈ C2 ⇒ 𝒘ᵀ𝒙^(i) < 0

    J_P(𝒘) = − Σ_{i∈ℳ} 𝒘ᵀ𝒙^(i) y^(i)

    ℳ: subset of training data that are misclassified

    Many solutions? Which solution among them?

  • Batch Perceptron

    “Gradient descent” to solve the optimization problem:

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J_P(𝒘^t)

    ∇_𝒘 J_P(𝒘) = − Σ_{i∈ℳ} 𝒙^(i) y^(i)

    Batch Perceptron converges in a finite number of steps for linearly
    separable data:

    Initialize 𝒘
    Repeat
        𝒘 = 𝒘 + η Σ_{i∈ℳ} 𝒙^(i) y^(i)
    Until convergence

  • Stochastic gradient descent for Perceptron

    Single-sample perceptron:

    If 𝒙^(i) is misclassified:
        𝒘^(t+1) = 𝒘^t + η 𝒙^(i) y^(i)

    Perceptron convergence theorem: for linearly separable data

    If training data are linearly separable, the single-sample perceptron
    is also guaranteed to find a solution in a finite number of steps

    Fixed-increment single-sample Perceptron (η can be set to 1 and the
    proof still works):

    Initialize 𝒘, t ← 0
    repeat
        t ← t + 1
        i ← t mod N
        if 𝒙^(i) is misclassified then 𝒘 = 𝒘 + 𝒙^(i) y^(i)
    until all patterns properly classified
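The fixed-increment rule above can be sketched on a few rows of the earlier training-data table (first three points of class +1, last three of class −1); the pass structure and stopping check are an implementation choice, not from the slides:

```python
import numpy as np

# Fixed-increment single-sample perceptron (eta = 1):
# on a mistake, update w <- w + x^(i) y^(i).
X = np.array([[1.0, 0.9, 2.3],   # x = (1, x1, x2); first three are class +1
              [1.0, 3.5, 2.6],
              [1.0, 2.6, 3.3],
              [1.0, 6.5, 6.8],   # last three are class -1
              [1.0, 7.2, 7.5],
              [1.0, 7.9, 8.3]])
y = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(3)
for _ in range(100):                      # passes over the data
    mistakes = 0
    for i in range(len(y)):
        if y[i] * (w @ X[i]) <= 0:        # misclassified (or on the boundary)
            w = w + X[i] * y[i]
            mistakes += 1
    if mistakes == 0:                     # all patterns properly classified
        break

print(np.sign(X @ w))                     # matches y once converged
```

Because this data is linearly separable, the convergence theorem guarantees the loop reaches a mistake-free pass in finitely many updates.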

  • Weight Updates


  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights

    If correct (i.e., ŷ = y), no change!

    If wrong: adjust the weight vector

    𝒘^(t+1) = 𝒘^t + 𝒙^(i) y^(i)

  • Example

  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights: ŷ = +1 if 𝒘 · f(𝒙) ≥ 0, else −1

  • Learning: Binary Perceptron

    Start with weights = 0

    For each training instance:

    Classify with current weights

    If correct (i.e., ŷ = y), no change!

    If wrong: adjust the weight vector by adding or subtracting the
    feature vector (subtract if y is -1):

    𝒘 = 𝒘 + y f(𝒙)

  • Multiclass Decision Rule

    If we have multiple classes:

    A weight vector for each class: 𝒘_y

    Score (activation) of a class y: 𝒘_yᵀ f(𝒙)

    Prediction: highest score wins, ŷ = argmax_y 𝒘_yᵀ f(𝒙)

    Binary = multiclass where the negative class has weight zero

  • Learning: Multiclass Perceptron

    Start with all weights = 0

    Pick up training examples one by one

    Predict with current weights: ŷ = argmax_y 𝒘_yᵀ f(𝒙)

  • Learning: Multiclass Perceptron

    Start with all weights = 0

    Pick up training examples one by one

    Predict with current weights: ŷ = argmax_y 𝒘_yᵀ f(𝒙)

    If correct, no change!

    If wrong: lower the score of the wrong answer, raise the score of
    the right answer:

    𝒘_ŷ = 𝒘_ŷ − f(𝒙)
    𝒘_y = 𝒘_y + f(𝒙)
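The two-sided update above is easy to sketch with one weight row per class. The three-class toy data (clusters near the origin and along each axis, with a bias feature) are made up for illustration:

```python
import numpy as np

# Multiclass perceptron: predict argmax_k w_k . f(x); on a mistake,
# lower the wrong class's weights and raise the right class's weights.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 5.0, 0.0],
              [1.0, 0.0, 5.0],
              [1.0, 0.5, 0.5],
              [1.0, 4.5, 0.5],
              [1.0, 0.5, 4.5]])
y = np.array([0, 1, 2, 0, 1, 2])

W = np.zeros((3, X.shape[1]))           # one row of weights per class
for _ in range(1000):                   # passes over the data
    mistakes = 0
    for i in range(len(y)):
        y_hat = int(np.argmax(W @ X[i]))
        if y_hat != y[i]:
            W[y_hat] -= X[i]            # lower score of wrong answer
            W[y[i]]  += X[i]            # raise score of right answer
            mistakes += 1
    if mistakes == 0:
        break

print(np.argmax(X @ W.T, axis=1))       # matches y once converged
```

Setting one class's weights to zero and using two classes recovers the binary rule from the previous slides.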

  • Properties of Perceptrons

    Separability: true if some parameters get the training set
    perfectly classified

    Convergence: if the training set is separable, perceptron will
    eventually converge (binary case)

    Mistake Bound: the maximum number of mistakes (binary case) is
    related to the margin or degree of separability

    [Plots: a separable and a non-separable point set]

  • Examples: Perceptron

    Non-Separable Case

  • Discriminative approach: logistic regression

    g(𝒙; 𝒘) = σ(𝒘ᵀ𝒙)    (K = 2)

    σ(.) is an activation function

    Sigmoid (logistic) activation function:

    σ(z) = 1 / (1 + e^(−z))

    𝒙 = (1, x1, … , x_d),  𝒘 = (w0, w1, … , w_d)

  • Logistic regression: cost function

    ŵ = argmin_𝒘 J(𝒘)

    J(𝒘) = Σ_{i=1..n} [ −y^(i) log σ(𝒘ᵀ𝒙^(i)) − (1 − y^(i)) log(1 − σ(𝒘ᵀ𝒙^(i))) ]

    J(𝒘) is convex w.r.t. the parameters.

  • Logistic regression: loss function

    Loss(y, f(𝒙; 𝒘)) = −y log σ(𝒙; 𝒘) − (1 − y) log(1 − σ(𝒙; 𝒘))

    Since y = 1 or y = 0:

    Loss(y, σ(𝒙; 𝒘)) = −log σ(𝒙; 𝒘)        if y = 1
                       −log(1 − σ(𝒙; 𝒘))   if y = 0

    where σ(𝒙; 𝒘) = 1 / (1 + exp(−𝒘ᵀ𝒙))

    How is it related to zero-one loss?

    Loss(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ

  • Logistic regression: Gradient descent

    𝒘^(t+1) = 𝒘^t − η ∇_𝒘 J(𝒘^t)

    ∇_𝒘 J(𝒘) = Σ_{i=1..n} (σ(𝒘ᵀ𝒙^(i)) − y^(i)) 𝒙^(i)

    Is it similar to the gradient of SSE for linear regression?

    ∇_𝒘 J(𝒘) = Σ_{i=1..n} (𝒘ᵀ𝒙^(i) − y^(i)) 𝒙^(i)
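The logistic-regression gradient step above can be sketched with a few lines; the 1-D toy data (with a bias feature) and the step size are made-up choices for illustration:

```python
import numpy as np

# Gradient descent for logistic regression with y in {0, 1}:
# w <- w - eta * sum_i (sigma(w^T x_i) - y_i) x_i
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
eta = 0.1
for _ in range(2000):
    grad = (sigmoid(X @ w) - y) @ X      # sum_i (sigma(w^T x_i) - y_i) x_i
    w = w - eta * grad

preds = (sigmoid(X @ w) >= 0.5).astype(float)
print(preds)                             # separates the two classes
```

Note the gradient has exactly the (prediction − target) × input shape of the SSE gradient, with σ(𝒘ᵀ𝒙) in place of 𝒘ᵀ𝒙.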

  • Multi-class classifier

    g(𝒙; 𝑾) = [g1(𝒙, 𝑾), … , g_K(𝒙, 𝑾)]

    𝑾 = [𝒘1 ⋯ 𝒘_K] contains one vector of parameters for each class

    In linear classifiers, 𝑾 is d × K where d shows the number of features

    𝑾ᵀ𝒙 provides us a vector

    g(𝒙; 𝑾) contains K numbers giving class scores for the input 𝒙

  • Multi-class logistic regression

    g(𝒙; 𝑾) = [g1(𝒙, 𝑾), … , g_K(𝒙, 𝑾)]ᵀ

    𝑾 = [𝒘1 ⋯ 𝒘_K] contains one vector of parameters for each class

    g_k(𝒙; 𝑾) = exp(𝒘_kᵀ𝒙) / Σ_{j=1..K} exp(𝒘_jᵀ𝒙)

  • Logistic regression: multi-class

    Ŵ = argmin_𝑾 J(𝑾)

    J(𝑾) = − Σ_{i=1..n} Σ_{k=1..K} y_k^(i) log g_k(𝒙^(i); 𝑾)

    𝑾 = [𝒘1 ⋯ 𝒘_K]

    𝒚 is a vector of length K (1-of-K coding)
    e.g., 𝒚 = [0, 0, 1, 0]ᵀ when the target class is C3

  • Logistic regression: multi-class

    𝒘_j^(t+1) = 𝒘_j^t − η ∇_{𝒘_j} J(𝑾^t)

    ∇_{𝒘_j} J(𝑾) = Σ_{i=1..n} (g_j(𝒙^(i); 𝑾) − y_j^(i)) 𝒙^(i)
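The multi-class gradient above can be sketched end to end. A deliberately tiny illustration: the indicator features (one per class) and 1-of-K targets are made up so that convergence is immediate and easy to check:

```python
import numpy as np

# Softmax (multi-class logistic regression) gradient descent:
# grad_{w_j} J(W) = sum_i (g_j(x_i; W) - y_j^(i)) x_i, with one-hot y.
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stabilized
    return e / e.sum(axis=-1, keepdims=True)

X = np.eye(3)          # toy indicator features, one example per class
Y = np.eye(3)          # 1-of-K coding: example i belongs to class i

W = np.zeros((3, 3))   # d x K, one column of parameters per class
eta = 0.5
for _ in range(200):
    G = softmax(X @ W)                 # n x K class probabilities
    W = W - eta * X.T @ (G - Y)        # gradient step on J(W)

print(np.argmax(softmax(X @ W), axis=1))   # recovers classes [0, 1, 2]
```

The matrix form `X.T @ (G - Y)` stacks the per-class gradients ∇_{𝒘_j} J(𝑾) as columns, so one line updates all K weight vectors.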

  • Logistic regression: Probabilistic perspective

    Maximum likelihood estimation with:

    P(y^(i) = +1 | 𝒙^(i); 𝒘) = 1 / (1 + e^(−𝒘ᵀ𝒙))

    P(y^(i) = 0 | 𝒙^(i); 𝒘) = 1 − 1 / (1 + e^(−𝒘ᵀ𝒙))

    = Two-class Logistic Regression

  • Multiclass Logistic Regression

    Multi-class linear classification

    A weight vector for each class: 𝒘_k

    Score (activation) of a class k: z_k = 𝒘_kᵀ𝒙

    Prediction w/highest score wins: ŷ = argmax_k 𝒘_kᵀ𝒙

    How to make the scores into probabilities?

    [Diagram: original activations 𝒘1ᵀ𝒙, 𝒘2ᵀ𝒙, 𝒘3ᵀ𝒙 → softmax activations]

  • Logistic regression: Probabilistic perspective

    Maximum likelihood estimation with:

    P(y^(i) | 𝒙^(i); 𝒘) = exp(𝒘_{y^(i)}ᵀ𝒙^(i)) / Σ_{k=1..K} exp(𝒘_kᵀ𝒙^(i))

    = Multi-Class Logistic Regression

  • Logistic regression: multi-class

    𝒘_j^(t+1) = 𝒘_j^t − η ∇_{𝒘_j} J(𝑾^t)

    ∇_{𝒘_j} J(𝑾) = Σ_{i=1..n} (P(y_j | 𝒙^(i); 𝑾) − y_j^(i)) 𝒙^(i)

  • Example

    How can we tell whether this W and b is good or bad?

    𝒙 = (x1, … , x4)ᵀ,  𝑾ᵀ = [𝒘1; 𝒘2; 𝒘3] (3×4),  𝒘0 = 𝒃 = (b1, b2, b3)ᵀ

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017

  • Bias can also be included in the W matrix

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017

  • Softmax classifier loss: example

    L^(i) = −log( exp(s_{y^(i)}) / Σ_{j=1..K} exp(s_j) ),   s_k = 𝒘_kᵀ𝒙

    L^(1) = −log 0.13 = 0.89

    This slide has been adopted from Fei-Fei Li and colleagues’ lectures, cs231n, Stanford 2017
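The number on the slide can be reproduced numerically. Assuming the standard cs231n example scores (3.2, 5.1, −1.7, with the correct class first): the softmax probability of the correct class comes out to 0.13, and the slide's 0.89 corresponds to a base-10 log (the natural log gives 2.04):

```python
import numpy as np

# Numeric check of L^(i) = -log( e^{s_y} / sum_j e^{s_j} ).
scores = np.array([3.2, 5.1, -1.7])        # correct class has score 3.2
probs = np.exp(scores) / np.exp(scores).sum()

print(np.round(probs, 2))                  # [0.13 0.87 0.  ]
print(round(-np.log10(probs[0]), 2))       # 0.89 with a base-10 log
print(round(-np.log(probs[0]), 2))         # 2.04 with the natural log
```

Either base gives a valid loss (they differ by a constant factor), but implementations almost always use the natural log.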

  • Generalized linear

    Linear combination of fixed non-linear functions of the input vector

    f(𝒙; 𝒘) = w0 + w1 φ1(𝒙) + … + w_m φ_m(𝒙)

    {φ1(𝒙), … , φ_m(𝒙)}: set of basis functions (or features)

    φ_i(𝒙): ℝ^d → ℝ

  • Basis functions: examples

    Linear: φ_i(𝒙) = x_i

    Polynomial (univariate): φ_i(x) = x^i

  • Polynomial regression: example

    [Four fits to the same data with polynomial degrees m = 1, 3, 5, 7]
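Fits like the ones in the figure can be sketched with least squares on polynomial basis functions φ_i(x) = x^i. The noisy sine data below are synthetic, chosen to mimic the usual textbook example:

```python
import numpy as np

# Polynomial regression: least-squares fit with phi(x) = (1, x, ..., x^m).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def fit(x, y, m):
    Phi = np.vander(x, m + 1, increasing=True)     # columns 1, x, ..., x^m
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, Phi

errs = {}
for m in (1, 3, 9):
    w, Phi = fit(x, y, m)
    errs[m] = np.mean((y - Phi @ w) ** 2)          # training error
    print(m, round(errs[m], 4))
```

The training error only shrinks as m grows (the models are nested), which is exactly why training error alone cannot be used to pick m; that is the overfitting discussion later in the deck.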

  • Generalized linear classifier

    Assume a transformation φ: ℝ^d → ℝ^m on the feature space

    𝒙 → 𝝓(𝒙),  𝝓(𝒙) = [φ1(𝒙), … , φ_m(𝒙)]

    {φ1(𝒙), … , φ_m(𝒙)}: set of basis functions (or features), φ_i(𝒙): ℝ^d → ℝ

    Find a hyperplane in the transformed feature space:

    𝒘ᵀ𝝓(𝒙) + w0 = 0

    [Plot: data that are non-separable in (x1, x2) become separable in
     (φ1(𝒙), φ2(𝒙))]

  • Model complexity and overfitting

    With limited training data, models may achieve zero training error
    but a large test error.

    Training (empirical) loss:  (1/n) Σ_{i=1..n} (y^(i) − g(𝒙^(i); 𝒘))² ≈ 0

    Expected (true) loss:  E_{𝒙,y}[ (y − g(𝒙; 𝒘))² ] ≫ 0

    Over-fitting: when the training loss no longer bears any relation to
    the test (generalization) loss.

    Fails to generalize to unseen examples.

  • Polynomial regression

    [Four fits with m = 0, 1, 3, 9: underfitting at m = 0 and m = 1, a
     good fit at m = 3, overfitting at m = 9]

    [Bishop]

  • Over-fitting causes

    Model complexity

    E.g., a model with a large number of parameters (degrees of freedom)

    Low number of training data

    Small data size compared to the complexity of the model

  • Number of training data & overfitting

    The over-fitting problem becomes less severe as the size of the
    training data increases.

    [Two m = 9 fits: n = 15 (overfits) vs. n = 100 (fits well)]

    [Bishop]

  • How to evaluate the learner’s performance?

    Generalization error: true (or expected) error that we would like
    to optimize

    Two ways to assess the generalization error are:

    Practical: use a separate data set to test the model

    Theoretical: Law of Large Numbers

    statistical bounds on the difference between training and expected
    errors

  • Avoiding over-fitting


    Determine a suitable value for model complexity (Model

    Selection)

    Simple hold-out method

    Cross-validation

    Regularization (Occam’s Razor)

    Explicit preference towards simple models

    Penalize for the model complexity in the objective function

  • Model Selection

    The learning algorithm defines the data-driven search over the
    hypothesis space (i.e. search for good parameters)

    Hyperparameters are the tunable aspects of the model that the
    learning algorithm does not select

    This slide has been adopted from CMU ML course:
    http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  • Model Selection


    Model selection is the process by which we choose the

    “best” model from among a set of candidates

    assume access to a function capable of measuring the quality of

    a model

    typically done “outside” the main training algorithm

    Model selection / hyperparameter optimization is just

    another form of learning

    This slide has been adopted from CMU ML course:

    http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  • Simple hold out:

    training, validation, and test sets

    Simple hold-out chooses the model that minimizes error on the
    validation set.

    Estimate the generalization error on the test set:

    the performance of the selected model is finally evaluated on the
    test set

    [Data split into Training | Validation | Test]

  • Simple hold-out: model selection

    Steps:

    Divide training data into a training set and a validation set v_set

    Use only the training set to train a set of models

    Evaluate each learned model on the validation set:

    J_v(𝒘) = (1/|v_set|) Σ_{i∈v_set} (y^(i) − f(𝒙^(i); 𝒘))²

    Choose the best model based on the validation set error

    Usually wasteful of valuable training data:

    Training data may be limited.

    On the other hand, a small validation set gives a relatively noisy
    estimate of performance.

    The cross-validation technique is used instead.
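The steps above can be sketched as a small script; the synthetic data, the 20/10 split, and the candidate polynomial degrees are all made-up choices for illustration:

```python
import numpy as np

# Simple hold-out: fit candidate models on the training part only,
# then pick the one with the lowest validation error J_v.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

train, val = np.arange(20), np.arange(20, 30)      # hold out the last 10 points

def val_error(m):
    w = np.polyfit(x[train], y[train], m)          # train on the training set only
    r = y[val] - np.polyval(w, x[val])
    return np.mean(r ** 2)                         # J_v on the validation set

degrees = [1, 3, 5, 9]
errors = {m: val_error(m) for m in degrees}
best = min(errors, key=errors.get)
print(best, round(errors[best], 4))                # model chosen by hold-out
```

Cross-validation replaces the single split with an average over several rotated splits, which reduces the noise of this estimate at the cost of more training runs.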

  • Regularization

    Adding a penalty term in the cost function to discourage the
    coefficients from reaching large values.

    Ridge regression (weight decay):

    J(𝒘) = Σ_{i=1..n} (y^(i) − 𝒘ᵀ𝝓(𝒙^(i)))² + λ 𝒘ᵀ𝒘

  • Regularization parameter

    Generalization

    λ now controls the effective complexity of the model and hence
    determines the degree of over-fitting

    [Bishop]

  • Choosing the regularization parameter

    A set of models with different values of λ.

    Find ŵ for each model based on training data

    Find J_v(ŵ) (or J_cv(ŵ)) for each model:

    J_v(ŵ) = (1/n_v) Σ_{i∈v_set} (y^(i) − f(x^(i); ŵ))²

    Select the model with the best J_v(ŵ) (or J_cv(ŵ))

  • Resources

    Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin,
    “Learning from Data”, Sections 2.3, 3.2, 3.4.