
Unsupervised vs. Supervised Learning

Marina Sedinkina

Ludwig Maximilian University of Munich, Center for Information and Language Processing

November 27, 2018

Marina Sedinkina (LMU) Unsupervised vs. Supervised Learning November 27, 2018 1 / 66


Overview

1 What Is Machine Learning?

2 Supervised Learning: Classification

3 Unsupervised Learning: Clustering

4 Supervised: K Nearest Neighbors Algorithm

5 Unsupervised: K-Means


What Is Machine Learning?

Modeling: model - specification of a mathematical (or probabilistic) relationship that exists between different variables.

business model: number of users, profit per user, number of employees ⇒ profit is income minus expenses

poker model: the cards that have been revealed so far, the distribution of cards in the deck ⇒ win probability

language model in NLP: a probability that a string is a member of a language (originally developed for the problem of speech recognition)

Machine Learning - creating and using models that are learned from data (predictive modeling or data mining)



What Is Machine Learning?

Goal - use existing data to develop models for predicting various outcomes for new data

Predicting whether an email message is spam or not

Predicting which advertisement a shopper is most likely to click on

Predicting which football team is going to win

Examples in NLP:

Speech Recognition

Language Identification

Machine Translation

Document Summarization

Question Answering

Sentiment Detection

Text Classification


Approaches

supervised: data labeled with the correct answers to learn from


unsupervised: no label given, purely based on the given raw data ⇒ find common structure in data
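To make the difference concrete, here is a minimal sketch of what the input data looks like in each setting. The email texts and labels are invented for illustration and are not from the lecture.

```python
# Supervised setting: every example comes with its correct answer (label).
# Texts and labels are made up for this sketch.
labeled_emails = [
    ("win money now", "spam"),
    ("meeting at noon", "not spam"),
    ("cheap pills online", "spam"),
]

# Unsupervised setting: the same raw data, but without any labels.
unlabeled_emails = [text for text, _label in labeled_emails]

print(labeled_emails[0])    # a (text, label) pair
print(unlabeled_emails[0])  # only the raw text
```

A supervised learner can compare its predictions against the second element of each pair; an unsupervised learner only sees the texts and has to find structure (for example, groups of similar emails) on its own.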


Unsupervised Learning: General Examples

you see a group of people: divide them into groups


cluster city names, trees

cluster similar blog posts: understand what the users are blogging about.


Supervised: K Nearest Neighbors Classification

General Idea

predict how I’m going to vote!

approach - look at how my neighbors are planning to vote

better idea?

imagine you know:

my age

my income

how many kids I have

new approach - look at those neighbors with similar features → better prediction!


Nearest Neighbors: Classification rule

classify a new object

find the object in the training set that is most similar

assign the category of this nearest neighbor
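The rule above can be sketched in a few lines. This is only an illustration: the function name, the toy training set, and the choice of Euclidean distance (`math.dist`, available from Python 3.8) as the measure of similarity are assumptions, not part of the slides.

```python
import math

def nearest_neighbor_classify(labeled_points, new_point):
    """Return the label of the training point closest to new_point.

    labeled_points is a list of (point, label) pairs.
    """
    # Assumption: "most similar" means smallest Euclidean distance.
    _, label = min(labeled_points,
                   key=lambda pair: math.dist(pair[0], new_point))
    return label

# Made-up training set: (age, number of kids) -> vote
training_set = [
    ((25, 0), "party A"),
    ((30, 1), "party A"),
    ((52, 3), "party B"),
    ((60, 2), "party B"),
]

print(nearest_neighbor_classify(training_set, (28, 0)))  # party A
print(nearest_neighbor_classify(training_set, (55, 2)))  # party B
```

With a single neighbor, one unusual training point can decide the outcome, which is the motivation for looking at several neighbors instead.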


K Nearest Neighbor (KNN) Classification

Take k closest neighbors instead of one


e.g., k = 5 or k = 10


K Nearest Neighbor (KNN) Classification: Data points

Data points are vectors in some finite-dimensional space.

’+’ and ’-’ objects are 2-dimensional (2-d) vectors:


Data points

if you have the heights, weights, and ages of a large number of people, treat your data as 3-dimensional vectors (height, weight, age):

    height_weight_age_point = [170,  # height in cm
                               70,   # weight in kg
                               40]   # age in years


Data points: One-hot encoding

1 Task: Represent each word from data as a vector (data point)

2 Form vocabulary (word types) from data:

data: The quick quick brown fox

Vocab(s) = {“The”, “quick”, “brown”, “fox”}

3 One-hot vector is a vector filled with 0s, except for a 1 at the position associated with the word

4 Vocabulary size = 4, so the one-hot 4-d vector of the word “The” at position 0 is $\vec{v}_{\text{The}} = (1, 0, 0, 0)$

One-hot representation

$\vec{v}_{\text{The}} = (1, 0, 0, 0)$, $\vec{v}_{\text{quick}} = (0, 1, 0, 0)$, $\vec{v}_{\text{brown}} = (0, 0, 1, 0)$, $\vec{v}_{\text{fox}} = (0, 0, 0, 1)$
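The four steps can be sketched in plain Python as follows. The helper names `build_vocab` and `one_hot` are hypothetical, and assigning positions in order of first appearance is an assumption that happens to match the positions used on the slide.

```python
def build_vocab(tokens):
    """Map each word type to a position, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

tokens = "The quick quick brown fox".split()
vocab = build_vocab(tokens)

print(vocab)                  # {'The': 0, 'quick': 1, 'brown': 2, 'fox': 3}
print(one_hot("The", vocab))  # [1, 0, 0, 0]
print(one_hot("fox", vocab))  # [0, 0, 0, 1]
```

Note that the repeated token "quick" contributes only one vocabulary entry: the vocabulary holds word types, not word tokens.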


Data points: Document representation

How can we represent a document?


Document representation

fixed set of elements (e.g., documents): $D = \{d_1, \dots, d_n\}$

document d (data point) is represented by a vector of features: $d \in \mathbb{N}^k \rightarrow d = [x_1, x_2, \dots, x_k]$

feature weights are numerical statistics (TF-IDF)


Document Representation: binary

Vectorize a text corpus, by turning each text into a vector where the coefficient for each token could be binary:

    >>> from keras.preprocessing.text import Tokenizer
    >>> tokenizer = Tokenizer()
    >>> X_train = ["first text: first sentence", "second text",
    ...            "third text"]

    >>> tokenizer.fit_on_texts(X_train)
    >>> tokenizer.word_index
    {'first': 2, 'second': 4, 'sentence': 3,
     'text': 1, 'third': 5}

    >>> tokenizer.texts_to_matrix(X_train, mode='binary')
    array([[0., 1., 1., 1., 0., 0.],
           [0., 1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 0., 1.]])


Document Representation: count

Vectorize a text corpus, by turning each text into a vector where the coefficient for each token can be based on word count:

    >>> from keras.preprocessing.text import Tokenizer
    >>> tokenizer = Tokenizer()
    >>> X_train = ["first text: first sentence", "second text",
    ...            "third text"]

    >>> tokenizer.fit_on_texts(X_train)
    >>> tokenizer.word_index
    {'first': 2, 'second': 4, 'sentence': 3,
     'text': 1, 'third': 5}

    >>> tokenizer.texts_to_matrix(X_train, mode='count')
    array([[0., 1., 2., 1., 0., 0.],
           [0., 1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 0., 1.]])


Document Representation: tf-idf

Vectorize a text corpus, by turning each text into a vector where the coefficient for each token can be based on tf-idf:

    >>> from keras.preprocessing.text import Tokenizer
    >>> tokenizer = Tokenizer()
    >>> X_train = ["first text: first sentence", "second text",
    ...            "third text"]

    >>> tokenizer.fit_on_texts(X_train)
    >>> tokenizer.word_index
    {'first': 2, 'second': 4, 'sentence': 3,
     'text': 1, 'third': 5}

    >>> tokenizer.texts_to_matrix(X_train, mode='tfidf')
    [[0 0.55961579 1.55141507 0.91629073 0 0]
     [0 0.55961579 0 0 0.91629073 0]
     [0 0.55961579 0 0 0 0.91629073]]
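The values in the output above can be reproduced by hand if each weight is computed as tf × idf with tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)), where N is the number of documents and df is the number of documents containing the word. Treat this formula as an assumption about the Keras version used for the slides; the sketch below only shows that it matches the printed numbers. The lowercased, punctuation-free token lists are written out by hand rather than produced by a tokenizer.

```python
import math

# Word index copied from the slide; column 0 stays unused, as in the output.
word_index = {"text": 1, "first": 2, "sentence": 3, "second": 4, "third": 5}

# The three training texts, already lowercased and stripped of punctuation.
docs = [["first", "text", "first", "sentence"],
        ["second", "text"],
        ["third", "text"]]

n_docs = len(docs)
# Document frequency: in how many documents does each word occur?
doc_freq = {w: sum(w in doc for doc in docs) for w in word_index}

def tfidf_row(doc):
    """Compute one row of the tf-idf matrix for a tokenized document."""
    row = [0.0] * (len(word_index) + 1)
    for word in set(doc):
        tf = 1 + math.log(doc.count(word))
        idf = math.log(1 + n_docs / (1 + doc_freq[word]))
        row[word_index[word]] = tf * idf
    return row

matrix = [tfidf_row(doc) for doc in docs]
for row in matrix:
    print([round(x, 8) for x in row])
```

The word "text" occurs in every document and therefore gets the lowest weight (about 0.56), while "first" occurs twice in a single document and gets the highest (about 1.55).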


K Nearest Neighbor (KNN) Classification

    def knn_classify(k, labeled_points, new_point):
        """each labeled point is a pair (point, label)"""

        # order the labeled points by descending similarity to new_point
        similarities = sorted(labeled_points,
                              key=lambda x: -cosin_sim(x[0], new_point))

        # find the labels for the k closest
        k_nearest_labels = [label for _, label in similarities[:k]]

        # and choose one
        return choose_one(k_nearest_labels)


Recall: Sort List of Tuples

    >>> students = [('john', 22), ('jane', 20), ('dave', 25)]
    >>> sorted(students)
    [('dave', 25), ('jane', 20), ('john', 22)]
    >>> sorted(students, key=lambda x: x[1])
    [('jane', 20), ('john', 22), ('dave', 25)]
    >>> sorted(students, key=lambda x: x[1], reverse=True)
    [('dave', 25), ('john', 22), ('jane', 20)]
    >>> sorted(students, key=lambda x: -x[1])
    [('dave', 25), ('john', 22), ('jane', 20)]


Requirements. Metric for distance computation

    import math

    def dot_product(v1, v2):
        return sum([value1 * value2 for value1, value2
                    in zip(v1, v2)])

    def cosin_sim(v1, v2):
        # compute cosine similarity
        prod = dot_product(v1, v2)
        len1 = math.sqrt(dot_product(v1, v1))
        len2 = math.sqrt(dot_product(v2, v2))
        return prod / (len1 * len2)

    >>> cosin_sim([1, 2], [3, 4])
    0.9838699100999074


Cosine Similarity

the dot product expresses how much two vectors point in the same direction

if two documents share a lot of common terms, their tf-idf vectors will point in a similar direction

cosine similarity = an indicator of how close the documents are in the semantics of their content
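A small illustration of these points, as a sketch with made-up vectors: the three documents and the vocabulary [goal, match, engine] are hypothetical, and `cosine_similarity` is a compact stand-in for the similarity function used in these slides.

```python
import math

def cosine_similarity(v1, v2):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# hypothetical term weights over the vocabulary [goal, match, engine]
doc_a = [2.0, 1.0, 0.0]
doc_b = [4.0, 2.0, 0.0]   # same direction as doc_a, twice as long
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

same_direction = cosine_similarity(doc_a, doc_b)
no_overlap = cosine_similarity(doc_a, doc_c)
print(same_direction, no_overlap)
```

Because the lengths are divided out, a long document and a short one with the same term proportions come out as (almost exactly) 1, while documents without shared terms come out as 0.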


K Nearest Neighbor (KNN) Classification

What if the vote ends in a tie, i.e. we have two winners (e.g., k = 2)?

Strategies:

1. Pick one of the winners at random

2. Reduce k until we find a unique winner
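The first strategy could be sketched as follows. This is only an illustration: the function name `choose_one_at_random` and the optional `rng` parameter are assumptions made for this example, not part of the slides.

```python
import random
from collections import Counter

def choose_one_at_random(labels, rng=random):
    """Return the most frequent label; break ties by a random pick."""
    counts = Counter(labels)
    top_count = max(counts.values())
    # all labels that share the highest vote count
    winners = [label for label, count in counts.items() if count == top_count]
    return rng.choice(winners)

labels = ['sport', 'cars', 'religion', 'religion', 'sport']
# a seeded generator makes the pick repeatable between runs
picked = choose_one_at_random(labels, random.Random(0))
print(picked)
```

The random pick is simple, but the same input can receive different labels on different runs unless the generator is seeded.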


K Nearest Neighbor (KNN) Classification

    >>> # labels sorted from nearest to farthest
    >>> labels = ['sport', 'cars', 'religion',
    ...           'religion', 'sport']

2 winners: 'sport' and 'religion' (two votes each)

Reduce k until we find a unique winner:

    >>> reduced_labels = labels[:-1]
    >>> print(reduced_labels)
    ['sport', 'cars', 'religion', 'religion']

Now 1 winner: 'religion'


K Nearest Neighbor (KNN) Classification

    # labels sorted from nearest to farthest
    labels = ['sport', 'cars', 'religion', 'politics']

Winner: 'sport' (every label has one vote, so k is reduced until only the nearest label remains)


K Nearest Neighbor (KNN) Classification

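    # labels sorted from nearest to farthest
    labels = ['sport', 'cars', 'cars', 'sport']

Winner: 'cars' ('sport' and 'cars' tie with two votes each; dropping the farthest label leaves 'cars' with two votes against one)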


K Nearest Neighbor (KNN) Classification

    from collections import Counter

    def choose_one(labels):
        """labels are ordered from nearest to farthest"""
        counts = Counter(labels)
        winner, winner_count = counts.most_common(1)[0]

        # count the number of winners in the list,
        # i.e. how many labels have a count equal to winner_count?
        ...

        # if there is a unique winner, return it
        ...

        # else: reduce the list and try again,
        # i.e. call choose_one again, but with the reduced list
        ...
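One possible way to fill in the three elided steps is sketched below. It follows the comments in the skeleton (count the winners, return a unique winner, otherwise recurse on the list without its farthest label); the variable name `num_winners` is an assumption of this sketch.

```python
from collections import Counter

def choose_one(labels):
    """labels are ordered from nearest to farthest"""
    counts = Counter(labels)
    winner, winner_count = counts.most_common(1)[0]

    # how many labels share the highest count?
    num_winners = sum(1 for count in counts.values()
                      if count == winner_count)

    if num_winners == 1:
        return winner                 # unique winner, so return it
    return choose_one(labels[:-1])    # drop the farthest label and retry

print(choose_one(['sport', 'cars', 'religion', 'religion', 'sport']))
print(choose_one(['sport', 'cars', 'religion', 'politics']))
print(choose_one(['sport', 'cars', 'cars', 'sport']))
```

The recursion always ends for a non-empty list, because a list with a single label has a unique winner.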


Counter

    >>> from collections import Counter
    >>> colors = ['red', 'blue', 'red', 'green',
    ...           'blue', 'blue', 'red']
    >>> cnt = Counter(colors)
    >>> print(cnt)
    Counter({'red': 3, 'blue': 3, 'green': 1})
    >>> most_common_tuple = cnt.most_common(1)
    >>> print(most_common_tuple)
    [('red', 3)]
    >>> winner, winner_count = most_common_tuple[0]
    >>> print(winner, winner_count)
    red 3


Document Classification with KNN

fixed set of elements (e.g., documents): $D = \{d_1, \dots, d_n\}$

a document d (data point) is represented by a vector of features: $d \in \mathbb{N}^k \rightarrow d = [x_1, x_2, \dots, x_k]$

feature weights are numerical statistics (like tf-idf)

weights are not re-weighted during learning → KNN is a "non-parametric" classifier

Goal: find the most similar document for a given document d and assign the same category (1NN classification)
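The 1NN goal can be sketched in a few lines. The training vectors, the vocabulary [goal, team, engine, wheel] and the function name `nearest_neighbor_label` are hypothetical; the similarity is the cosine similarity used throughout these slides.

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def nearest_neighbor_label(labeled_docs, new_doc):
    """1NN: return the label of the single most similar document."""
    _, best_label = max(labeled_docs,
                        key=lambda pair: cosine_similarity(pair[0], new_doc))
    return best_label

# hypothetical count vectors over the vocabulary [goal, team, engine, wheel]
training = [
    ([3, 2, 0, 0], 'sport'),
    ([2, 3, 0, 1], 'sport'),
    ([0, 0, 3, 2], 'cars'),
    ([0, 1, 2, 3], 'cars'),
]
new_doc = [1, 2, 0, 0]
prediction = nearest_neighbor_label(training, new_doc)
print(prediction)
```

Nothing is fitted here: the training documents are only stored and compared at prediction time, which is the sense in which no weights are learned.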


Unsupervised: K-Means

clustering algorithm

the number of clusters k is chosen in advance

partition the inputs into sets $S_1, \dots, S_k$ using cluster centroids


K-Means

k-means clustering technique:

1. randomly initialize cluster centroids

2. assign each point to the centroid to which it is closest, using Euclidean distance to measure the distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (1)$$

3. recompute cluster centroids

4. go back to step 2 until nothing changes (or it takes too long)
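The Euclidean distance used in step 2 translates directly into Python. This is a minimal sketch; the name `distance` is chosen to match the helper that the KMeans code in these slides calls, and the example point and centroids are made up.

```python
import math

def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))

point = [1, 2]
centroids = [[4, 6], [1, 3]]   # hypothetical current centroids

# step 2 for a single point: index of the closest centroid
distances = [distance(point, c) for c in centroids]
closest = min(range(len(centroids)), key=lambda i: distances[i])
print(distances, closest)
```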


K-Means

    class KMeans:
        """performs k-means clustering"""

        def __init__(self, k):
            self.k = k          # number of clusters
            self.means = None   # means of clusters

        def classify(self, input):
            """return the index of the cluster
            closest to the input (step 2)"""
            return min(range(self.k),
                       key=lambda i: distance(input, self.means[i]))


Python min() Function

    >>> a = [(0.2222, 1), (0.1111, 2), (0.6666, 3)]
    >>> min(a, key=lambda x: x[0])
    (0.1111, 2)
    >>> min(a, key=lambda x: x[1])
    (0.2222, 1)

    >>> k_clusters = 3
    >>> input_vec = [1, 2, 3]
    >>> means = [[1.5, 2.5, 3.5], [4.5, 5.5, 6.5], [7.5, 8.5, 9.5]]
    >>> list(range(k_clusters))
    [0, 1, 2]
    >>> # distance is the Euclidean distance; the first mean is the closest
    >>> min(range(k_clusters), key=lambda x: distance(input_vec, means[x]))
    0


K-Means

        def train(self, inputs):
            # step 1: choose k random points as the initial means
            # (needs `import random` at the top of the module)
            self.means = random.sample(inputs, self.k)
            assignments = None
            while True:
                # find new assignments; list(...) because map returns
                # a one-shot iterator in Python 3
                new_assignments = list(map(self.classify, inputs))
                if assignments == new_assignments:
                    return  # if nothing changed, we're done
                assignments = new_assignments
                for i in range(self.k):  # compute new means
                    i_points = [p for p, a in zip(inputs, assignments)
                                if a == i]
                    if i_points:  # skip clusters without points
                        self.means[i] = mean(i_points)
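The training loop relies on a `mean` helper that is not shown on the slide. A minimal sketch, assuming that points are plain lists of numbers of equal length (the example cluster is made up):

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    # zip(*points) groups the first components, the second components, ...
    return [sum(component) / n for component in zip(*points)]

cluster_points = [[0, 0], [2, 4], [4, 2]]   # hypothetical cluster members
centroid = mean(cluster_points)
print(centroid)
```

The result is the centroid of the cluster: the point whose coordinates are the averages of the members' coordinates.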


Map

    r = map(func, seq)

    >>> def fahrenheit(T):
    ...     return (9.0 / 5) * T + 32
    ...
    >>> temp = [36.5, 37, 37.5, 39]
    >>> F = map(fahrenheit, temp)
    >>> print(list(F))
    [97.7, 98.60000000000001, 99.5, 102.2]
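One detail worth keeping in mind, assuming Python 3: `map` returns a lazy iterator rather than a list, so its result can be consumed only once. The short sketch below, with made-up values, shows the effect.

```python
squares = map(lambda x: x * x, [1, 2, 3])

first_pass = list(squares)    # consumes the iterator
second_pass = list(squares)   # nothing is left to iterate over

print(first_pass, second_pass)
```

Wrapping the call in `list(...)` right away avoids this whenever the result has to be compared with another list or traversed more than once.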


K-Means: Real Example

organize meetups for users

goal: choose 3 meetup locations convenient for all users

    clusterer = KMeans(3)
    clusterer.train(inputs)
    print(clusterer.means)

you find three clusters and look for meetup venues near those locations


Kmeans with NLTK

    >>> from nltk import cluster
    >>> from nltk.cluster import euclidean_distance
    >>> from numpy import array
    >>> vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2],
    ...                               [4, 0], [2, 3], [3, 1]]]
    >>> clusterer = cluster.KMeansClusterer(2, euclidean_distance)
    >>> clusters = clusterer.cluster(vectors, True)
    >>> print('Clustered:', vectors)
    Clustered: [array([3, 3]), array([1, 2]), array([4, 2]), array([4, 0]), array([2, 3]), array([3, 1])]
    >>> print('As:', clusters)
    As: [0, 0, 0, 1, 0, 1]
    >>> print('Means:', clusterer.means())
    Means: [array([2.5, 2.5]), array([3.5, 0.5])]


Kmeans with NLTK

    ...
    >>> # classify a new vector
    >>> vector = array([3, 3])
    >>> print('classify(%s):' % vector)
    classify([3 3]):
    >>> print(clusterer.classify(vector))
    0


K-Means

Problems

How many clusters to use?

How to initialize cluster centroids?


Conclusion

Is K-means a clustering or a classification algorithm?

→ clustering algorithm: it partitions points into K clusters; points in each cluster tend to be near each other

Supervised or unsupervised?

→ unsupervised: points have no external classification

Is K-nearest neighbors a clustering or a classification algorithm?

→ classification algorithm: it determines the classification of a new point

Supervised or unsupervised?

→ supervised: it classifies a point based on the known classification of other points


Exercise: Euclidean Distance

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (2)$$

What is the distance between person 1 and 2, 1 and 3, 2 and 3?


Exercise: Find Centroid

Compute the mean vector of given vectors:

    vectors = [[1, 2, 3], [4, 5, 6]]
    centroid_vector = ???


Exercise: Document Representation

Represent the list of documents in binary mode:

    >>> documents = ["test document1", "test document2", "test document3"]
    >>> tokenizer = Tokenizer()
    >>> tokenizer.fit_on_texts(documents)
    >>> tokenizer.word_index
    {'document1': 2, 'document2': 3, 'document3': 4, 'test': 1}
    >>> doc_matrix = ???


Supervised vs. Unsupervised

What is the main difference between supervised and unsupervised learning?


Exercise: KNN Classifier

k = 3, the green point = ???

k = 5, the green point = ???

