TRANSCRIPT
Slides by Jeffrey D. Ullman Stanford University/Infolab
High dim. data
Locality sensitive hashing
Clustering
Dimensionality reduction
Graph data
PageRank, SimRank
Community Detection
Spam Detection
Infinite data
Filtering data streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron, kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
¡ We are given a set of training examples, consisting of input-output pairs (x, y), where:
1. x is an item of the type we want to evaluate.
2. y is the value of some function f(x).
¡ Example: x is an email, and f(x) is +1 if x is spam and -1 if not.
§ Binary classification.
¡ Example: x is a vector giving characteristics of a voter, and y is their preferred candidate.
§ More general classification.
¡ In general, input x can be of any type.
¡ Often, output y is binary.
§ Usually represented as +1 = true, or "in the class," and -1 = false, or "not in the class."
§ Called binary classification.
¡ y can be one of a finite set of categories.
§ Called classification.
¡ y can be a real number.
§ Called regression.
¡ Supervised learning is building, from the training data, a model that closely represents the function y = f(x).
¡ Example: If x and y are real numbers, the model of f might be a straight line.
¡ Example: If x is an email and y is +1 or -1, the model might be weights on words together with a threshold, such that the answer is +1 (spam) if the weighted sum of the words in the email exceeds the threshold.
¡ Would like to do prediction: estimate a function f(x) so that y = f(x)
¡ Where y can be:
§ Real number: Regression
§ Categorical: Classification
§ Complex object:
§ Ranking of items, parse tree, etc.
¡ Data is labeled:
§ Have many pairs {(x, y)}
§ x … vector of binary, categorical, real-valued features
§ y … class ({+1, -1}, or a real number)
[Diagram: training set (X, Y) and test set (X', Y'). Estimate y = f(x) on (X, Y); hope that the same f(x) also works on the unseen (X', Y').]
¡ Task: Given data (X, Y), build a model f() to predict Y' based on X'
¡ Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
§ The "hope" is called generalization
§ Overfitting: f(x) predicts Y well but is unable to predict Y'
§ We want to build a model that generalizes well to unseen data
§ But Jure, how can we do well on data we have never seen before?!?
[Diagram: training data (X, Y); test data (X', Y').]
¡ A decision tree is a model in the form of a tree where:
§ Interior nodes have tests about the value of x, with one child for each possible outcome of the test.
§ Leaves have a value for f(x).
¡ Given an x to be tested, start at the root, and perform the tests, moving to the proper child.
¡ When you reach a leaf, declare the value of f(x) to be whatever is found at that leaf.
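The traversal just described can be sketched in a few lines. The nested-tuple tree encoding and the spam example below are illustrative choices, not from the slides.

```python
# Minimal sketch: a tree is either a leaf value, or a tuple
# (test, children) where test(x) selects which child to descend into.

def classify(tree, x):
    """Start at the root, perform each interior node's test, and move to
    the proper child; the leaf reached holds the value of f(x)."""
    while isinstance(tree, tuple):
        test, children = tree
        tree = children[test(x)]
    return tree

# Hypothetical spam tree: "viagra" present -> spam; otherwise spam only
# if the message is very long.
tree = (lambda x: "viagra" in x,
        {True: +1,
         False: (lambda x: len(x) > 100, {True: +1, False: -1})})

print(classify(tree, "buy viagra now"))   # +1 (spam)
print(classify(tree, "see you at noon"))  # -1 (ham)
```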
[Example decision tree for predicting a voter's preferred candidate: interior tests such as "Age < 30?" (Y/N), an "Intelligence" test (Stupid / Very Stupid), and "Liberal?" (Y/N), with leaves Sanders, Clinton, Cruz, Trump.]
¡ We need to choose a loss function that measures how well or badly a given model represents the function y = f(x).
¡ Common choice: the fraction of x's for which the model gives a value different from y.
¡ Example: If we use a model of email spam that is a weight for each word and a threshold, then the loss for given weights + threshold could be the fraction of misclassified emails.
¡ But are all errors equally bad?
¡ Question for thought: Do you think the loss should be the same for a good email classified as spam and a spam email passed to the user's inbox?
¡ If y is a numerical value, the cost could be the average magnitude of the difference between f(x) as computed by the model and the true y.
§ Or square each error (as in RMSE).
§ Subtle point: squaring errors makes the loss function much more tolerant of small errors, but not of big ones.
¡ Given training data and a form of model you wish to develop, find the instance of the model that minimizes the loss on the training data.
¡ Example: For the email-spam problem, incrementally adjust weights on words until small changes cannot decrease the probability of misclassification.
¡ Example: Design a decision tree top-down, picking at each node the test that makes the branches most homogeneous.
¡ Divide your data randomly into training and test data.
¡ Build your best model based on the training data only.
¡ Apply your model to the test data.
¡ Does your model predict y’ for the test data as well as it predicted y for the training data?
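The random division into training and test data might look like this; the function name and the 70/30 split are arbitrary choices for the sketch.

```python
import random

def train_test_split(pairs, test_fraction=0.3, seed=0):
    """Randomly divide labeled (x, y) pairs into training and test sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = [(i, +1 if i % 2 == 0 else -1) for i in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```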
¡ The test data helps you measure overfitting.
¡ But you want your model to work not only on the test data, but on all the unseen data that it will eventually be called upon to process.
§ The validation set.
¡ If the training and test sets are truly drawn at random from the population of all data, and the test set shows little overfitting, then the model should not exhibit overfitting on real data.
§ A big "if," e.g., with email spam, where the population is always changing.
¡ Sometimes, your model will show much greater loss on the test data than on the training data.
§ Called overfitting.
¡ The problem is that the modeling process has picked up on details of the training data that are so fine-grained that they do not apply to the population from which the data is drawn.
¡ Example: a decision tree with so many levels that the typical leaf is reached by only one member of the training set.
¡ Idea: Pretend we do not know the data/labels we actually do know
§ Build the model f(x) on the training data
§ See how well f(x) does on the test data
§ If it does well, then apply it also to X'
¡ Refinement: Cross-validation
§ Splitting into a single training/validation set is brutal
§ Let's split our data (X, Y) into 10 folds (buckets)
§ Take out 1 fold for validation, train on the remaining 9
§ Repeat this 10 times, report average performance
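The 10-fold procedure can be sketched as follows; `train_fn` and `loss_fn` are hypothetical callables standing in for whatever model and loss you use.

```python
def cross_validate(data, train_fn, loss_fn, folds=10):
    """Hold out each fold once for validation, train on the other nine,
    and report the average loss over the ten runs."""
    losses = []
    for k in range(folds):
        held_out = data[k::folds]  # every folds-th example forms fold k
        training = [d for i, d in enumerate(data) if i % folds != k]
        model = train_fn(training)
        losses.append(loss_fn(model, held_out))
    return sum(losses) / folds

# Toy model: always predict the majority class of the training labels.
def train_majority(pairs):
    labels = [y for _, y in pairs]
    return max(set(labels), key=labels.count)

def zero_one(model, pairs):
    return sum(1 for _, y in pairs if y != model) / len(pairs)

data = [(i, +1) for i in range(70)] + [(i, -1) for i in range(30)]
print(cross_validate(data, train_majority, zero_one))  # approximately 0.3
```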
[Diagram: (X, Y) is divided into a training set and a validation set; X' is the test set.]
¡ We will talk about the following methods:
§ k-Nearest Neighbor (instance-based learning)
§ Perceptron and Winnow algorithms
¡ Main question: How to efficiently train (build a model / find model parameters)?
¡ Instance-based learning
¡ Example: Nearest neighbor
§ Keep the whole training dataset: {(x, y)}
§ A query example (vector) q comes
§ Find the closest example(s) x*
§ Predict y*
¡ Works both for regression and classification
§ Collaborative filtering is an example of a k-NN classifier
§ Find the k most similar people to user x that have rated movie y
§ Predict the rating y_x of x as an average of y_k
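A minimal sketch of the nearest-neighbor idea: Euclidean distance, majority vote over the k closest stored examples. The toy points are made up.

```python
import math
from collections import Counter

def knn_classify(train, q, k=1):
    """Keep the whole training set; at query time, find the k examples
    closest to q and predict the majority label among them."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], q))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

train = [((0.0, 0.0), -1), ((0.1, 0.2), -1),
         ((5.0, 5.0), +1), ((5.2, 4.9), +1)]
print(knn_classify(train, (0.2, 0.1), k=3))  # -1
print(knn_classify(train, (5.1, 5.0), k=3))  # 1
```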
¡ To make Nearest Neighbor work we need 4 things:
§ Distance metric:
§ Euclidean
§ How many neighbors to look at?
§ One
§ Weighting function (optional):
§ Unused
§ How to fit with the local points?
§ Just predict the same output as the nearest neighbor
¡ Distance metric:
§ Euclidean
¡ How many neighbors to look at?
§ k
¡ Weighting function (optional):
§ Unused
¡ How to fit with the local points?
§ Just predict the average output among the k nearest neighbors
[Plot: k-nearest-neighbor fit with k = 9.]
¡ Distance metric:
§ Euclidean
¡ How many neighbors to look at?
§ All of them (!)
¡ Weighting function:
§ w_i = exp(−d(x_i, q)² / K_w)
§ Nearby points to query q are weighted more strongly. K_w … kernel width.
¡ How to fit with the local points?
§ Predict the weighted average: Σ_i w_i y_i / Σ_i w_i
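The weighted-average prediction over all training points can be sketched as follows; the 1-D toy data is illustrative.

```python
import math

def kernel_regression(train, q, Kw=10.0):
    """Weight every training point by w_i = exp(-d(x_i, q)^2 / Kw),
    then predict the weighted average sum(w_i * y_i) / sum(w_i)."""
    weights = [math.exp(-math.dist(x, q) ** 2 / Kw) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)

train = [((0.0,), 1.0), ((1.0,), 3.0)]
# A query equidistant from both points gets the plain average:
print(kernel_regression(train, (0.5,)))  # approximately 2.0
```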
[Plot: weight w_i as a function of d(x_i, q), peaking at d(x_i, q) = 0, for kernel widths K_w = 10, 20, 80.]
¡ Given: a set P of n points in R^d
¡ Goal: Given a query point q § NN: Find the nearest neighbor p of q in P § Range search: Find one/all points in P within distance r from q
¡ Main memory:
§ Linear scan
§ Tree-based:
§ Quadtree
§ kd-tree
§ Hashing:
§ Locality-Sensitive Hashing
¡ Secondary storage:
§ R-trees
¡ Example: Spam filtering
¡ Instance space x ∈ X (|X| = n data points)
§ Binary or real-valued feature vector x of word occurrences
§ d features (words + other things, d ~ 100,000)
¡ Class y ∈ Y
§ y: Spam (+1), Ham (−1)
¡ Binary classification:
¡ Input: Vectors x^(j) and labels y^(j)
§ Vectors x^(j) are real-valued, with ‖x‖₂ = 1
¡ Goal: Find a vector w = (w1, w2, …, wd)
§ Each wi is a real number
f(x) = +1 if w1·x1 + w2·x2 + … + wd·xd ≥ θ, and −1 otherwise
[Diagram: positive and negative points separated by the hyperplane w·x = 0, with the weight vector w normal to it. Note: the decision boundary is linear.]
¡ (very) Loose motivation: Neuron
¡ Inputs are feature values
¡ Each feature has a weight wi
¡ Activation is the sum:
§ f(x) = Σi wi·xi = w·x
¡ If f(x) is:
§ Positive: Predict +1
§ Negative: Predict −1
[Diagrams: a neuron summing weighted inputs w1·x1 + … + w4·x4 and testing ≥ 0, with features such as "viagra" and "nigeria"; and a plane w·x = 0 separating Spam (+1) from Ham (−1) examples x(1), x(2).]
¡ Perceptron: y' = sign(w·x)
¡ How to find parameters w?
§ Start with w(0) = 0
§ Pick training examples x(t) one by one (from disk)
§ Predict the class of x(t) using current weights: y' = sign(w(t)·x(t))
§ If y' is correct (i.e., y(t) = y'): no change, w(t+1) = w(t)
§ If y' is wrong, adjust w(t): w(t+1) = w(t) + η · y(t) · x(t)
§ η is the learning rate parameter
§ x(t) is the t-th training example
§ y(t) is the true t-th class label ({+1, −1})
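Putting the update rule together as runnable code. This sketch absorbs the threshold by using θ = 0, i.e., predicting sign(w·x); the toy data is illustrative.

```python
def perceptron_train(examples, eta=1.0, epochs=10):
    """Start with w = 0; for each example predict sign(w.x) and, only on
    a mistake, update w <- w + eta * y * x (correct examples are ignored)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score >= 0 else -1) != y:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: the label is the sign of the first feature.
data = [((1.0, 0.1), 1), ((2.0, -0.5), 1),
        ((-1.0, 0.3), -1), ((-2.0, 0.2), -1)]
w = perceptron_train(data)
print(all((1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1) == y
          for x, y in data))  # True
```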
[Diagram: a mistake on example (x(t), y(t) = 1) rotates the weight vector toward x(t): w(t+1) = w(t) + η·y(t)·x(t).]
Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.
¡ Perceptron Convergence Theorem:
§ If there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
¡ How long would it take to converge?
¡ Perceptron Cycling Theorem:
§ If the training data is not linearly separable, the Perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop
¡ Perceptron will oscillate and won't converge
¡ When to stop learning?
¡ (1) Slowly decrease the learning rate η
§ A classic way is: η = c1/(t + c2)
§ But we also need to determine the constants c1 and c2
¡ (2) Stop when the training error stops changing
¡ (3) Have a small test dataset and stop when the test set error stops decreasing
¡ (4) Stop when we have reached some maximum number of passes over the data
¡ Winnow: Predict f(x) = +1 iff w·x ≥ θ
§ Similar to Perceptron, just different updates
§ Assume x is a real-valued feature vector, ‖x‖₂ = 1
§ w … weights (can never become negative!)
§ Z(t) = Σi wi·exp(η·y(t)·xi(t)) is the normalizing constant
• Initialize: θ = d/2, w = [1/d, …, 1/d]
• For every training example x(t):
• Compute y' = f(x(t))
• If no mistake (y(t) = y'): do nothing
• If mistake, then: wi ← wi·exp(η·y(t)·xi(t)) / Z(t)
¡ About the update: wi ← wi·exp(η·y(t)·xi(t)) / Z(t)
§ If x is a false negative, increase wi (promote)
§ If x is a false positive, decrease wi (demote)
§ In other words: consider xi(t) ∈ {−1, +1}
§ Then wi(t+1) ∝ wi(t)·e^η if xi(t) = y(t), and wi(t+1) ∝ wi(t)·e^(−η) otherwise
§ Notice: This is a weighted majority algorithm of "experts" xi agreeing with y
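A runnable sketch of the multiplicative update. One caveat: the slides initialize unit-sum weights w = [1/d, …, 1/d]; for the {0,1}-valued toy features here, the threshold is set to 1/2 (an assumption of this sketch) so the unit-sum weights can actually reach it.

```python
import math

def winnow_train(examples, eta=0.5, epochs=20, theta=0.5):
    """Winnow-style training: weights start uniform and positive; on a
    mistake, w_i <- w_i * exp(eta * y * x_i) / Z, where Z renormalizes
    the weights to sum to 1.  Weights can never become negative."""
    d = len(examples[0][0])
    w = [1.0 / d] * d
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score >= theta else -1) != y:
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
                Z = sum(w)                  # normalizing constant Z(t)
                w = [wi / Z for wi in w]
    return w

# Toy target: y = +1 exactly when the first feature is on.
data = [((1, 0, 0, 0), 1), ((1, 1, 0, 0), 1), ((0, 1, 1, 0), -1),
        ((0, 0, 1, 1), -1), ((1, 0, 1, 0), 1), ((0, 1, 0, 1), -1)]
w = winnow_train(data)
print(all(wi > 0 for wi in w))  # True: weights stay positive
```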
¡ Setting:
§ Examples: x ∈ {0,1}^d, weights w ∈ R^d
§ Prediction: f(x) = +1 iff w·x ≥ θ, else −1
¡ Perceptron: additive weight update w ← w + η·y·x
§ If y = +1 but w·x ≤ θ, then wi ← wi + 1 (if xi = 1) (promote)
§ If y = −1 but w·x > θ, then wi ← wi − 1 (if xi = 1) (demote)
¡ Winnow: multiplicative weight update w ← w·exp{η·y·x}
§ If y = +1 but w·x ≤ θ, then wi ← 2·wi (if xi = 1) (promote)
§ If y = −1 but w·x > θ, then wi ← wi / 2 (if xi = 1) (demote)
¡ How to compare learning algorithms?
¡ Considerations:
§ The number of features d is very large
§ The instance space is sparse
§ Only a few features per training example are non-zero
§ The model is sparse
§ Decisions depend on a small subset of features
§ In the "true" model, only a few wi are non-zero
§ We want to learn from a number of examples that is small relative to the dimensionality d
Perceptron
¡ Online: Can adjust to a changing target, over time
¡ Advantages
§ Simple
§ Guaranteed to learn a linearly separable problem
§ Advantage with few relevant features per training example
¡ Limitations
§ Only linear separations
§ Only converges for linearly separable data
§ Not really "efficient with many features"
Winnow
¡ Online: Can adjust to a changing target, over time
¡ Advantages
§ Simple
§ Guaranteed to learn a linearly separable problem
§ Suitable for problems with many irrelevant attributes
¡ Limitations
§ Only linear separations
§ Only converges for linearly separable data
§ Not really "efficient with many features"
¡ New setting: Online Learning
§ Allows for modeling problems where we have a continuous stream of data
§ We want an algorithm to learn from it and slowly adapt to the changes in data
¡ Idea: Do slow updates to the model
§ Both our methods, Perceptron and Winnow, make updates if they misclassify an example
§ So: First train the classifier on training data. Then, for every example from the stream, if we misclassify, update the model (using a small learning rate)
¡ Protocol:
§ A user comes and tells us origin and destination
§ We offer to ship the package for some money ($10 - $50)
§ Based on the price we offer, sometimes the user uses our service (y = 1), sometimes they don't (y = −1)
¡ Task: Build an algorithm to optimize what price we offer to the users
¡ Features x capture:
§ Information about the user
§ Origin and destination
¡ Problem: Will the user accept the price?
¡ Model whether the user will accept our price: y = f(x; w)
§ Accept: y = +1, Not accept: y = −1
§ Build this model with, say, Perceptron or Winnow
¡ The website runs continuously
¡ An online learning algorithm would do something like this:
§ A user comes
§ She is represented as an (x, y) pair, where
§ x: feature vector including the price we offer, origin, destination
§ y: whether they chose to use our service or not
§ The algorithm updates w using just the (x, y) pair
§ Basically, we update the w parameters every time we get some new data
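The per-arrival update described above can be sketched as follows, reusing a perceptron-style mistake-driven update; the two-feature stream is made up.

```python
def online_update(w, x, y, eta=0.1):
    """One step of mistake-driven online learning: predict with the
    current w and, only if the prediction is wrong, nudge w toward
    the new example with a small learning rate."""
    y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
    if y_pred != y:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Examples arrive one at a time; there is no stored data "set".
w = [0.0, 0.0]
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 1.0), 1)]
for x, y in stream:
    w = online_update(w, x, y)
```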
¡ We discard this idea of a data "set"
¡ Instead we have a continuous stream of data
¡ Further comments:
§ For a major website where you have a massive stream of data, this kind of algorithm is pretty reasonable
§ We don't need to deal with all the training data
§ If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset
§ Doing multiple passes over the data
¡ An online algorithm can adapt to changing user preferences
¡ For example, over time users may become more price sensitive
¡ The algorithm adapts and learns this
¡ So the system is dynamic