
Page 1:

Machine Learning and Data Mining I

DS 4400

Alina Oprea
Associate Professor, CCIS
Northeastern University

October 23, 2019

Page 2:

Midterm Review

Page 3:

Machine learning is everywhere


Page 4:

What we covered so far

Linear Regression
• Metrics
• Cross-validation
• Regularization
• Feature selection
• Gradient Descent
• Maximum Likelihood Estimation (MLE)

Linear classification
• Perceptron
• Logistic regression
• LDA

Non-linear classification
• kNN
• Decision trees
• Naïve Bayes

Foundations: Linear algebra, Probability and statistics

Page 5:

Terminology

• Hypothesis space: $H = \{f: X \to Y\}$
• Training data: $D = \{(x_i, y_i)\} \subseteq X \times Y$
• Features: $x_i \in X$
• Labels / response variables: $y_i \in Y$
  – Classification: discrete, $y_i \in \{0, 1\}$
  – Regression: $y_i \in \mathbb{R}$
• Loss function: $L(f, D)$
  – Measures how well $f$ fits the training data
• Training algorithm: find a hypothesis $\hat{f}: X \to Y$
  – $\hat{f} = \arg\min_{f \in H} L(f, D)$

Page 6:

Supervised Learning: Classification

Training: labeled data $(x_i, y_i)$ with $y_i \in \{0, 1\}$ → data pre-processing (normalization) → feature extraction (feature selection) → learning model $f(x)$

Testing: new unlabeled data $x'$ → learning model $f(x)$ → classification $y' = f(x') \in \{0, 1\}$ (positive / negative)

Page 7:

Supervised Learning: Regression

Training: labeled data $(x_i, y_i)$ with $y_i \in \mathbb{R}$ → data pre-processing (normalization) → feature extraction (feature selection) → learning model $f(x)$

Testing: new unlabeled data $x'$ → learning model $f(x)$ → predicted response variable $y' = f(x') \in \mathbb{R}$

Page 8:

Linear Regression

$\hat{y} = \theta_0 + \theta_1 x$, i.e., $h_\theta(x) = \theta_0 + \theta_1 x$, with intercept $\theta_0$ and slope $\theta_1 = \Delta y / \Delta x$. The residual at a training point $(x^{(i)}, y^{(i)})$ is the vertical distance from the point to the fitted line.

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)^2$
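A quick way to make this concrete is to fit the two parameters on a toy dataset. The sketch below is not from the slides; it is a minimal NumPy illustration that estimates $\theta_0, \theta_1$ with the least-squares closed form and reports the resulting MSE.

```python
import numpy as np

# Toy data: y is roughly 2 + 3x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

# Closed-form least squares for h_theta(x) = theta0 + theta1 * x
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

# Mean squared error of the fitted line
residuals = y - (theta0 + theta1 * x)
mse = np.mean(residuals ** 2)
print(f"theta0={theta0:.3f}, theta1={theta1:.3f}, MSE={mse:.3f}")
```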

Page 9:

Multiple Linear Regression

• Dataset: $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
• Hypothesis: $h_\theta(x) = \theta^T x$
• Loss / cost: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x_i - y_i \right)^2$ (a fitting sketch follows below)
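In matrix form this loss is minimized by the normal equations $(X^T X)\theta = X^T y$; a minimal sketch, my addition, assuming a full-rank design matrix with a prepended bias column of ones:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares theta for h_theta(x) = theta^T x via the normal equations."""
    # Solve (X^T X) theta = X^T y; lstsq is more stable than an explicit inverse
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Usage: 3 features plus a bias column of ones
rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)
print(fit_linear_regression(X, y))  # close to [1, 2, -1, 0.5]
```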

Page 10:

Maximum Likelihood Estimation (MLE)

Given training data $X = (x_1, \ldots, x_n)$ with labels $Y = (y_1, \ldots, y_n)$:

What is the likelihood of the training data for parameter $\theta$?

Define the likelihood function:

$\max_\theta L(\theta) = P[Y \mid X; \theta] = f(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta)$

Assumption: training points are independent, so

$L(\theta) = \prod_{i=1}^{n} P[y_i \mid x_i; \theta]$

Page 11:

MLE for Linear Regression

$L(\theta) = \prod_{i=1}^{n} P[y_i \mid x_i; \theta] = \prod_{i=1}^{n} f(y_i \mid x_i; \theta, \sigma)$

$\log L(\theta) = -c \sum_{i=1}^{n} \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2 + \text{const}$, with $c = 1/(2\sigma^2) > 0$ under the Gaussian noise model.

The maximum-likelihood $\theta$ is the same as the minimum-MSE $\theta$! The MSE metric has a statistical motivation.
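For the record, the one step the slide compresses (assuming, as above, Gaussian noise $y_i = \theta_0 + \theta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$):

$$\log L(\theta) = \sum_{i=1}^{n} \log \left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(y_i - (\theta_0 + \theta_1 x_i))^2}{2\sigma^2} \right) \right] = -n \log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2$$

Maximizing over $\theta$ ignores the first term and the positive factor $1/(2\sigma^2)$, leaving exactly the minimization of the sum of squared residuals.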

Page 12:

Gradient Descent

Gradient = slope of the line tangent to the curve at a given point; gradient descent repeatedly steps in the direction of the negative gradient.
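The slide's figure does not survive the transcript, so here is a minimal gradient-descent sketch for the MSE loss above (my illustration; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, iters=1000):
    """Minimize MSE(theta) = (1/n) * sum((X @ theta - y)^2) by gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ theta - y)  # gradient of the MSE
        theta -= lr * grad                        # step against the gradient
    return theta
```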

Page 13:

Linear Classifiers

$h_\theta(x) = f(\theta^T x)$, a linear function:
• If $\theta^T x > 0$, classify as 1
• If $\theta^T x < 0$, classify as 0

All points $x$ on the hyperplane satisfy $\theta^T x = 0$.

Page 14:

The Perceptron

Update rule applied to a training example $(x_i, y_i)$:

$\theta_j \leftarrow \theta_j - \frac{1}{2} \left( h_\theta(x_i) - y_i \right) x_{ij}$

Page 15:

The Perceptron

For a training example $(x_i, y_i)$, the update

$\theta_j \leftarrow \theta_j - \frac{1}{2} \left( h_\theta(x_i) - y_i \right) x_{ij}$

reduces, for labels $y_i \in \{-1, +1\}$ and a misclassified point, to

$\theta_j \leftarrow \theta_j + y_i x_{ij}$

Perceptron Rule: If $x_i$ is misclassified, do $\theta \leftarrow \theta + y_i x_i$

Page 16:

Online Perceptron

Let $\theta \leftarrow [0, 0, \ldots, 0]$
Repeat (for each of $T$ rounds):
  Receive training example $(x_i, y_i)$
  If $y_i \theta^T x_i \leq 0$:   // prediction is incorrect
    $\theta \leftarrow \theta + y_i x_i$
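A direct Python transcription of this pseudocode (my sketch; it assumes labels in {−1, +1} and a bias folded into the features as a constant column):

```python
import numpy as np

def online_perceptron(X, y, passes=10):
    """Online perceptron: update theta whenever a point is misclassified."""
    theta = np.zeros(X.shape[1])
    for _ in range(passes):
        for x_i, y_i in zip(X, y):
            if y_i * (theta @ x_i) <= 0:   # incorrect (or zero-margin) prediction
                theta += y_i * x_i         # perceptron rule
    return theta
```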

Page 17:

Batch Perceptron

On each pass over the training data, accumulate the updates $y_i x_i$ over all misclassified examples $(x_i, y_i)$ (those with $y_i \theta^T x_i \leq 0$), then apply them to $\theta$.

Guaranteed to find a separating hyperplane if the data is linearly separable.
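The slide's pseudocode is mostly lost in the transcript; below is a plausible reconstruction of the batch variant (my sketch, under the assumptions above, averaging the accumulated updates, which is one common formulation):

```python
import numpy as np

def batch_perceptron(X, y, max_iters=100):
    """Batch perceptron: sum the updates over all misclassified points per pass."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        margins = y * (X @ theta)
        mis = margins <= 0                 # misclassified (or zero-margin) points
        if not mis.any():
            break                          # separating hyperplane found
        theta += (y[mis][:, None] * X[mis]).sum(axis=0) / n  # averaged batch update
    return theta
```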

Page 18:

• For linearly separable data, one can prove bounds on the number of perceptron mistakes (the bound depends on how well separated the data is, i.e., on the margin)

Page 19:

Perceptron Limitations

• The solution depends on the starting point
• Convergence can take many steps
• The perceptron can overfit
  – It moves the decision boundary for every misclassified example

(Figure: several possible separating lines. Which of these is optimal?)

Page 20:

Logistic Regression

$P[Y = 1 \mid X = x] = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

Logistic Regression is a linear classifier!
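Why the exclamation mark is justified (a one-line check, using the sigmoid assumption stated on the next slide): the predicted class flips exactly where the sigmoid crosses $1/2$,

$$P[Y = 1 \mid X = x] = \frac{1}{1 + e^{-\theta^T x}} \geq \frac{1}{2} \iff \theta^T x \geq 0,$$

so the decision boundary $\theta^T x = 0$ is a hyperplane, just as on Page 13.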

Page 21:

LDA

• Classify to one of $K$ classes
• Logistic regression computes $P[Y = 1 \mid X = x]$ directly
  – Assumes the sigmoid function
• LDA uses Bayes' Theorem to estimate it:
  – $P[Y = k \mid X = x] = \frac{P[X = x \mid Y = k] \, P[Y = k]}{P[X = x]}$
  – Let $\pi_k = P[Y = k]$ be the prior probability of class $k$, and $f_k(x) = P[X = x \mid Y = k]$

Page 22:

LDA

Assume $f_k(x)$ is Gaussian! Unidimensional case ($d = 1$):

$f_k(x) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \right)$

Assumption: $\sigma_1 = \cdots = \sigma_K = \sigma$

Page 23:

LDA decision boundary

Pick the class $k$ that maximizes the discriminant.

Example: $K = 2$, $\pi_1 = \pi_2$. Classify as class 1 if $x > \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}$.

(Figure: true decision boundary vs. estimated decision boundary.)

Page 24:

LDA

Given training data $(x_i, y_i)$, $i = 1, \ldots, n$, $y_i \in \{1, \ldots, K\}$:

1. Estimate the mean and (pooled) variance per class:
   $\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i$, $\quad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2$

2. Estimate the prior: $\hat{\pi}_k = n_k / n$

Given a test point $x$, predict the class $k$ that maximizes the discriminant
$\delta_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log \hat{\pi}_k$ (see the sketch below).
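A compact NumPy version of this recipe for the one-dimensional, shared-variance case (my sketch; the discriminant follows the standard LDA formula above):

```python
import numpy as np

def lda_fit_predict(x, y, x_test):
    """1-D LDA with shared variance: per-class means, pooled variance, priors."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    mu = np.array([x[y == k].mean() for k in classes])               # class means
    var = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, mu)) / (n - K)
    pi = np.array([(y == k).mean() for k in classes])                # priors
    # Discriminant: delta_k(x) = x * mu_k / var - mu_k^2 / (2 var) + log pi_k
    delta = np.outer(x_test, mu) / var - (mu ** 2) / (2 * var) + np.log(pi)
    return classes[np.argmax(delta, axis=1)]
```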

Page 25:

Multi-variate LDA

Given training data $(x_i, y_i)$, $i = 1, \ldots, n$, $y_i \in \{1, \ldots, K\}$, $x_i \in \mathbb{R}^d$:

1. Estimate the mean vector and (pooled) covariance matrix per class:
   $\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i$, $\quad \hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$

2. Estimate the prior: $\hat{\pi}_k = n_k / n$

Given a test point $x$, predict the class $k$ that maximizes the discriminant
$\delta_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k$.

Page 26:

Page 27:

Naïve Bayes Classifier

$P[Y = k \mid X = x] = \frac{P[Y = k] \, P[X_1 = x_1 \wedge \cdots \wedge X_d = x_d \mid Y = k]}{P[X_1 = x_1 \wedge \cdots \wedge X_d = x_d]}$
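The "naïve" step, which the transcript drops, is the conditional-independence assumption that factors the joint likelihood:

$$P[X_1 = x_1 \wedge \cdots \wedge X_d = x_d \mid Y = k] = \prod_{j=1}^{d} P[X_j = x_j \mid Y = k],$$

so prediction reduces to $\hat{y} = \arg\max_k P[Y = k] \prod_{j=1}^{d} P[X_j = x_j \mid Y = k]$ (the denominator is the same for every $k$ and can be ignored).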

Page 28:

Confusion Matrix

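The matrix itself is lost in the transcript; the standard 2×2 layout and the metrics usually read off it (my reconstruction, not verbatim from the slide):

                    Predicted positive     Predicted negative
Actual positive     True Positive (TP)     False Negative (FN)
Actual negative     False Positive (FP)    True Negative (TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN); Precision = TP / (TP + FP); Recall (TPR) = TP / (TP + FN); FPR = FP / (FP + TN).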

Page 29:

ROC Curves

• Receiver Operating Characteristic (ROC): a plot of true positive rate vs. false positive rate as the classification threshold varies
• Determine an operating point (e.g., by fixing the false positive rate)

(Figure: the diagonal corresponds to random guessing and the top-left corner to perfect classification; curves closer to the top-left are better, and each point on a curve is one classifier at a fixed threshold.)

Page 30:

Cross Validation

• k-fold CV (see the sketch below)
  – Split the training data into k partitions (folds) of equal size
  – Pick the optimal value of the hyper-parameter according to the error metric averaged over all folds
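A bare-bones version of the procedure (my sketch; `train_and_score` stands in for whatever model and error metric are being tuned, and is hypothetical):

```python
import numpy as np

def kfold_cv_error(X, y, hyperparam, train_and_score, k=5):
    """Average validation error of one hyper-parameter value over k folds."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx], hyperparam))
    return np.mean(errors)

# Pick the hyper-parameter with the lowest averaged error:
# best = min(candidates, key=lambda h: kfold_cv_error(X, y, h, train_and_score))
```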

Page 31:

Bias-Variance Tradeoff

• Bias = difference between the estimated and the true model
• Variance = how much the model differs across different training sets

(Figure: high bias corresponds to under-fitting, high variance to over-fitting.)

Page 32:

Regularization

• A method for controlling the complexity of the learned hypothesis

Ridge:
$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

LASSO:
$J(\theta) = \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)^2 + \lambda \sum_{j=1}^{d} |\theta_j|$

In both, the first term is the squared residuals and the second is the regularization penalty.
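Ridge has a convenient closed form (LASSO does not, which is one reason the two behave so differently). Setting the gradient of the ridge objective above to zero gives $(X^T X + \lambda I)\theta = X^T y$; a minimal sketch, my illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Note: this penalizes every coefficient; in practice the bias/intercept
    # term is usually excluded from the penalty.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```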

Page 33:

Type I: Conceptual

• Example 1: Describe the difference between classification and regression
• Example 2: List one technique that can be used to improve model generalization
• Example 3: Why do we need multiple metrics to evaluate classifiers?
• Example 4: Give advantages and disadvantages of:
  – Linear classifiers compared to more complex ones

Page 34:

More Examples


DS 5220

Page 35:

Type II: Pseudocode

• Example 1: Write pseudocode for kNN (a sample answer is sketched below)
• Example 2: Write pseudocode for perceptron
• Example 3: Write pseudocode for …
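For instance, a kNN sketch one might write for Example 1 (my illustration, using majority vote over Euclidean neighbors):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest points
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                 # majority label
```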

Page 36:

Type III: Computational

• Example 1: Given a dataset, train a particular ML model
  – E.g., kNN, Naïve Bayes, etc.
  – Evaluate the model on some simple training and testing data
• Example 2: Given a dataset, compute some metrics / loss function
• Example 3: How many parameters does a model need to store?