Overview - University of Illinois
mjt.cs.illinois.edu/courses/ml-s19/files/slides-overview.pdf



Overview

CS 446


What is machine learning?

▶ Machine learning: the study of computational mechanisms that “learn” from data to make predictions and decisions.


Application 1: image classification

▶ Birdwatcher takes photos of birds, organizes by species.

▶ Goal: automatically recognize bird species in new photos.

[Photo: an indigo bunting.]

▶ Why ML: variation in lighting, occlusions, morphology.


Application 2: recommender system

▶ Netflix users watch movies and provide ratings.

▶ Goal: predict user’s rating of unwatched movie.

▶ (Real goal: keep users paying customers.)

▶ (Real effect: reinforce stereotypes found in the data?)

[Figure: latent-factor illustration reproduced from the matrix-factorization literature; its caption reads “Figure 2. A simplified illustration of the latent factor approach, which characterizes both users and movies using two axes—male versus female and serious versus escapist.”]

(Image credit: Koren, Bell, and Volinsky, 2009.)

▶ Why ML: can go far by representing users as normalized view count vectors, and correlating them.
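(An aside, not from the slides: a minimal pytorch sketch of the latent-factor model behind the image credit above. The toy ratings, factor dimension f, step size, and regularization weight are all illustrative assumptions.)

import torch

# Toy ratings matrix (users x movies); 0 marks an unwatched movie.
R = torch.tensor([[5., 4., 0., 1.],
                  [4., 0., 1., 1.],
                  [1., 0., 5., 4.],
                  [0., 1., 5., 4.]])
observed = R > 0
f = 2  # latent factor dimension (assumed)

P = torch.randn(4, f, requires_grad=True)  # one factor vector per user
Q = torch.randn(4, f, requires_grad=True)  # one factor vector per movie
opt = torch.optim.SGD([P, Q], lr=0.05)

for _ in range(1000):
    opt.zero_grad()
    pred = P @ Q.t()                                  # predicted rating for every (user, movie)
    loss = ((pred - R)[observed] ** 2).sum()          # squared error on watched movies only
    loss = loss + 0.1 * (P.norm()**2 + Q.norm()**2)   # regularize to avoid overfitting
    loss.backward()
    opt.step()

print((P @ Q.t()).detach().round())  # now includes guesses for the unwatched entries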


Application 3: machine translation

▶ Linguists provide translations of all English language books into French, sentence by sentence.

▶ Goal: translate any English sentence into French.

Note: the text-to-speech is via ML (recurrent network).

▶ Why ML? Can avoid not just the tedium of endless grammar rules, but also hope to capture idiom and other nuance.


Application 4: chess

▶ Chess enthusiasts construct a large corpus of chess games.

▶ Goal: for any board position, determine the probability that each possible move leads to a win.

▶ Why ML? Magical “interpolation” between various positions.

▶ Note: it can even learn via self-play!


Interlude: technical background

Math.

▶ Linear algebra (e.g., null spaces; eigendecomposition; SVD, . . . ).

▶ Basic probability and statistics (e.g., variance of a random variable).

▶ Multivariable calculus (e.g., gradient of ‖Aw − b‖² with respect to w; worked below).

▶ Basic proof writing (e.g., prove AᵀA is positive semi-definite).
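As a quick check of the calculus example (a standard expansion of the squared norm):

‖Aw − b‖² = (Aw − b)ᵀ(Aw − b) = wᵀAᵀAw − 2bᵀAw + bᵀb,

∇_w ‖Aw − b‖² = 2AᵀAw − 2Aᵀb = 2Aᵀ(Aw − b).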

Software.

▶ python3. It’s slow, but often computation will be inside fast libraries.

▶ numpy, an easy-to-use numeric library.

▶ pytorch, a numeric library with gpu support, auto-differentiation, and deep learning helpers.

My opinion. pytorch is one of the nicest libraries I’ve ever used, for anything. I use it for much more than deep learning.


python3, pytorch, numpy

>>> import numpy
>>> import torch

>>> 3 / 2
1.5
>>> 3 // 2
1

>>> A = torch.randn(5, 5)
>>> b = torch.randn(5, 1)
>>> x = torch.gels(b, A)[0]
>>> (A @ x - b).norm()
tensor(3.7985e-06)

>>> x = numpy.linalg.lstsq(A, b)[0]
>>> (A @ torch.tensor(x) - b).norm()
tensor(1.1999e-06)
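(Aside, not from the slides: torch.gels was later deprecated, and in current pytorch releases the same least-squares solve is spelled roughly as follows.)

>>> x = torch.linalg.lstsq(A, b).solution
>>> (A @ x - b).norm()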


pytorch on gpu is easy

>>> import torch

>>> A = torch.randn(5, 5)
>>> b = torch.randn(5)
>>> (A @ b).norm()
tensor(4.7746)

>>> device = torch.device("cuda:0")
>>> (A.to(device) @ b.to(device)).norm()
tensor(4.7746, device='cuda:0')

>>> (A.to(device) @ b).norm()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.FloatTensor for argument #2 'vec'
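(A common idiom, added here as an aside, keeps the same code runnable with or without a gpu, which matters given the grading note below.)

>>> device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
>>> (A.to(device) @ b.to(device)).norm()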

Note. Homeworks will be graded in a gpu-free container!


Homework 0

▶ Homework 0 is posted on the class webpage.

▶ It is a sanity check of basic (math and coding) background.

▶ It is due Tuesday, January 22.

▶ It has two gradescope components: hw0 is multiple choice, hw0code is coding.


Basic setting: supervised learning

Training data: labeled examples

(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)

where

▶ each input xᵢ is a machine-readable description of an instance (e.g., image, sentence), and

▶ each corresponding label yᵢ is an annotation relevant to the task, typically not easy to obtain automatically.

Goal: learn a function f̂ from the labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]
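(To make the pipeline concrete, here is a deliberately trivial learning algorithm, sketched as an aside: it ignores the input entirely and predicts the most common training label. All names are illustrative.)

from collections import Counter

def learn(examples):
    """A (bad) learning algorithm: always predict the most common training label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    def f_hat(x):  # the learned predictor; a real one would actually use x
        return label
    return f_hat

f_hat = learn([("photo1", "indigo bunting"),
               ("photo2", "indigo bunting"),
               ("photo3", "cardinal")])
print(f_hat("new photo"))  # -> "indigo bunting"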


Learn a function?

The learned function f̂ might look like the following:

if age ≥ 40 then
    if genre = western then
        return 4.3
    else if release date > 1998 then
        return 2.5
    else
        ...
    end if
else if · · · then
    ...
end if

Want the machine to figure out these if-else clauses from the data, rather than hand-code them ourselves.

(This type of decision rule is called a decision tree.)
(How to fit: use one of many recursive splitting heuristics; a toy sketch follows.)
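(For illustration only, here is a toy version of one such heuristic: greedily pick the split that most reduces squared error, then recurse. Nothing here is the course's prescribed algorithm.)

import numpy as np

def fit_tree(X, y, depth=0, max_depth=2):
    """Recursively pick the (feature, threshold) split that most reduces squared error."""
    if depth == max_depth or len(y) < 2:
        return float(y.mean())                # leaf: predict the mean label
    best = None
    for j in range(X.shape[1]):               # try every feature...
        for t in np.unique(X[:, j])[:-1]:     # ...and every candidate threshold
            m = X[:, j] <= t
            err = ((y[m] - y[m].mean())**2).sum() + ((y[~m] - y[~m].mean())**2).sum()
            if best is None or err < best[0]:
                best = (err, j, t)
    if best is None:
        return float(y.mean())
    _, j, t = best
    m = X[:, j] <= t
    return (j, t, fit_tree(X[m], y[m], depth + 1, max_depth),
                  fit_tree(X[~m], y[~m], depth + 1, max_depth))

def predict(tree, x):
    while isinstance(tree, tuple):            # walk the learned if-else clauses
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree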


Example 2: 1-nn

▶ Suppose we are given red and blue data. . .

▶ . . . and label new points according to the closest existing point.

[Figure: red and blue data points scattered over [−1, 1]².]

▶ This is a 1-nearest-neighbor (1-nn) classifier.

▶ The k-nn classifier takes a majority vote of the k nearest data points; a minimal sketch follows.
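(A minimal k-nn sketch in numpy, as an aside; Euclidean distance and the toy points are assumptions made for illustration.)

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Majority vote among the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = np.array([[-0.5, 0.3], [0.6, -0.2], [0.1, 0.8]])  # toy "red and blue data"
y = np.array(["red", "blue", "red"])
print(knn_predict(X, y, np.array([0.5, 0.0]), k=1))   # -> blue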


Example 3: linear classifier

▶ Classify points with x ↦ sgn(aᵀx).

[Figure: contour plot of the linear function on [−1, 1]²; straight contour lines at levels −7.5, −5, −2.5, 0, 2.5, 5, 7.5.]

▶ This gives a contour plot: depicting f⁻¹(c) = {x ∈ R² : f(x) = c} for a few c.

▶ How to fit: convex optimization (a sketch follows).
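(As an illustrative instance of that convex fit: logistic loss minimized by gradient descent in pytorch. The synthetic data, loss choice, and step size are assumptions, not the slide's specific recipe.)

import torch

torch.manual_seed(0)
X = torch.randn(100, 2)                          # toy points
y = torch.sign(X @ torch.tensor([1.0, -2.0]))    # labels in {-1, +1} from a planted direction
a = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([a], lr=0.5)

for _ in range(200):
    opt.zero_grad()
    loss = torch.log1p(torch.exp(-y * (X @ a))).mean()  # logistic loss: convex in a
    loss.backward()
    opt.step()

print((torch.sign(X @ a) == y).float().mean())   # training accuracy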


Example 4: 2-layer ReLU network

▶ Classify points with x ↦ sgn(A₂σᵣ(A₁x + b₁) + b₂), where A₁ ∈ R^{16×2}, b₁ ∈ R^{16}, A₂ ∈ R^{1×16}, b₂ ∈ R^1, and σᵣ(z) = max{0, z} is the Rectified Linear Unit (ReLU).

[Figure: contour plot of the network output on [−1, 1]²; piecewise-linear contour lines at levels −12, −9, −6, −3, 0, 3.]

▶ Here once again is the contour plot: f⁻¹(c) = {x ∈ R² : f(x) = c} for a few c.

▶ How to fit: use convex optimization tools, without knowing why (a pytorch sketch of the model follows).
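(For concreteness, the model from the first bullet written with pytorch's built-in layers; a sketch only, with the training loop, which runs gradient descent despite the non-convexity, omitted.)

import torch

# x ↦ sgn(A2 σ_r(A1 x + b1) + b2), with the shapes from the bullet.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 16),   # A1 ∈ R^{16×2}, b1 ∈ R^{16}
    torch.nn.ReLU(),          # σ_r(z) = max{0, z}
    torch.nn.Linear(16, 1),   # A2 ∈ R^{1×16}, b2 ∈ R^1
)

x = torch.randn(5, 2)                # a batch of five points in R^2
print(torch.sign(net(x)).squeeze())  # predicted labels in {-1, +1}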

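A minimal PyTorch sketch of this exact architecture (the slides specify only the shapes; the XOR-style toy data, loss, step count, and learning rate are my own):

import torch

# The 2-layer ReLU network from the slide: R^2 -> R, 16 hidden units.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 16),   # A1 x + b1
    torch.nn.ReLU(),          # sigma_r
    torch.nn.Linear(16, 1),   # A2 h + b2
)

# Toy data; labels in {-1, +1} as in the sgn(...) classifier.
X = torch.tensor([[0.5, 0.5], [-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])

opt = torch.optim.SGD(net.parameters(), lr=0.1)
for _ in range(500):
    opt.zero_grad()
    # softplus(-y f(x)) = log(1 + exp(-y f(x))), the logistic loss.
    loss = torch.nn.functional.softplus(-y * net(X).squeeze(1)).mean()
    loss.backward()
    opt.step()

print(torch.sign(net(X).squeeze(1)))  # should (usually) recover y

The XOR-like labels here are not linearly separable, which is precisely where the hidden ReLU layer earns its keep over Example 3.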

Other examples

Other examples of supervised learning (learn f : X → Y given pairs (x, y) ∈ X × Y).

I Support Vector Machines (SVMs), least squares, naive Bayes, AdaBoost, . . .

Other types of machine learning.

I Unsupervised learning: find structure in examples (x_i)_{i=1}^n (there are no labels!).

I Time series modeling: the label of x_i also depends on (x_{i−1}, x_{i−2}, . . .).

I Reinforcement learning: the method makes decisions, not just predictions: its outputs affect future inputs (e.g., controlling a robot).


Why is machine learning challenging?

From the examples, supervised learning (fit f̂ : X → Y to ((x_i, y_i))_{i=1}^n) consists of:

I Pick a model and fitting algorithm; feed them data.

Not easy.

I We might overfit the training set.

I We might pick a bad model.

I The fitting method might be fickle.

I Data may be hard to obtain.


Overfitting

Let's fit a polynomial. What's the degree?
Truth: y = 0 · x + ξ, ξ ∼ Gaussian.
Red: seen data. Green: unseen data.

[Figure: polynomial fits of increasing degree on the same data over [0, 1]; high-degree fits hug the red (seen) points and swing wildly elsewhere.]

The slides step through a range of degrees d, each time reporting the seen and unseen error of the degree-d fit alongside the degree-1 baseline (seen 0.00692304, unseen 0.00897457):

degree d | seen error | unseen error
       2 | 0.00690718 | 0.00870077
       3 | 0.0064944  | 0.00998528
       4 | 0.0062397  | 0.013063
       5 | 0.00582684 | 0.00975194
       6 | 0.00571136 | 0.0142185
       7 | 0.00565266 | 0.0282631
       8 | 0.00564127 | 0.0440347
       9 | 0.00541878 | 0.42463
      10 | 0.00540813 | 0.682619
      12 | 0.00503499 | 40.6796
      16 | 0.00312787 | 30910.6
      21 | 0.0026458  | 1.24795e+07
      27 | 0.00262184 | 233052
      34 | 0.00259486 | 3.97751e+10
      38 | 0.00242094 | 1.02285e+13

Seen error decreases monotonically with the degree, while unseen error eventually explodes.

Overfitting: fitting training (seen) data, not fitting testing (unseen) data.
One cure: "regularization". (A toy reproduction in code follows below.)

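A small numpy reproduction of the experiment, under the stated truth y = 0 · x + Gaussian noise (the sample sizes, noise scale 0.1, and degree list are my guesses, not the slides' exact setup):

import numpy as np

rng = np.random.default_rng(0)
# Truth: y = 0 * x + Gaussian noise, as on the slide.
x_seen, y_seen = rng.uniform(0, 1, 30), 0.1 * rng.standard_normal(30)
x_unseen, y_unseen = rng.uniform(0, 1, 30), 0.1 * rng.standard_normal(30)

for d in [1, 4, 8, 12]:
    coef = np.polyfit(x_seen, y_seen, d)   # least-squares polynomial of degree d
    mse = lambda x, y: np.mean((np.polyval(coef, x) - y) ** 2)
    print(f"degree {d:2d}: seen {mse(x_seen, y_seen):.5f}  unseen {mse(x_unseen, y_unseen):.5f}")

Typically the seen column shrinks as d grows while the unseen column grows, mirroring the table above; pushing the degree much higher also makes the fit numerically ill-conditioned.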

Choosing an appropriate model

Should we use 1-nn or a 2-layer ReLU network?

[Figures: the 1-nn decision plot from Example 2 and the 2-layer ReLU network's contour plot from Example 4, side by side.]

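The slide leaves the question open; one standard way to decide (not stated on the slide) is to compare held-out error. A sketch, assuming each classifier is exposed as predict(X_fit, y_fit, X_new), like the knn_predict above:

import numpy as np

def holdout_error(predict, X, y, frac=0.5, seed=0):
    # Fit on a random half of the data, report misclassification rate on the rest.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(frac * len(X))
    fit, val = idx[:cut], idx[cut:]
    return np.mean(predict(X[fit], y[fit], X[val]) != y[val])

For example, holdout_error(knn_predict, X, y) versus the analogous call for the network; whichever scores lower held-out error is the better bet on unseen data, subject to the variance of a single split.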

The “throw data at deep network” / pytorch paradigm

Here’s one approach to machine learning.

1. Throw more data, fitting time, and neural network parameters at a problem until training error becomes near zero.

2. Shrink/regularize the model to reduce test error (one concrete knob is sketched after this slide).

Does this work?

1. For some well-studied problems, yes.

2. For other problems, and/or if you lack resources, no.

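For step 2, one common regularization knob (my example; the slides do not name one) is weight decay, an l2 penalty on the parameters, available directly as a PyTorch optimizer flag:

import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
# weight_decay adds weight_decay * param to each gradient, i.e. an l2 penalty on the weights.
opt = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)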

This course (CS 446 MJT)

Topics

I An overview of standard supervised machine learning methods.

I Some mathematical intuition, motivation, and background.

I A few topics in unsupervised learning.

Course website

http://mjt.cs.illinois.edu/courses/ml-s19/

I Schedule, homeworks, lecture slides, academic integrity, etc.

Course staff

I Instructor: Matus Telgarsky.

I Teaching assistants, office hours, online forum: see course website.

I Many thanks to Daniel Hsu @ Columbia; this class is based upon his.


Key takeaways

1. Examples of machine learning problems and why they are challenging.

2. Course information.

3. Homework 0 due next Tuesday (January 22)!
