TRANSCRIPT
Slides by Jeffrey D. Ullman Stanford University/Infolab
High dim. data
Locality sensitive hashing
Clustering
Dimensionality reduction
Graph data
PageRank, SimRank
Community Detection
Spam Detection
Infinite data
Filtering data streams
Web advertising
Queries on streams
Machine learning
SVM
Decision Trees
Perceptron, kNN
Apps
Recommender systems
Association Rules
Duplicate document detection
¡ We are given a set of training examples, consisting of input-output pairs (x, y), where:
1. x is an item of the type we want to evaluate.
2. y is the value of some function f(x).
¡ Example: x is an email, and f(x) is +1 if x is spam and -1 if not.
§ Binary classification.
¡ Example: x is a vector giving characteristics of a voter, and y is their preferred candidate.
§ More general classification.
¡ In general, input x can be of any type.
¡ Often, output y is binary.
§ Usually represented as +1 = true, or "in the class," and -1 = false, or "not in the class."
§ Called binary classification.
¡ y can be one of a finite set of categories.
§ Called classification.
¡ y can be a real number.
§ Called regression.
¡ Supervised learning is building, from the training data, a model that closely represents the function y = f(x).
¡ Example: If x and y are real numbers, the model of f might be a straight line.
¡ Example: If x is an email and y is +1 or -1, the model might be weights on words together with a threshold, such that the answer is +1 (spam) if the weighted sum of the words in the email exceeds the threshold.
¡ Would like to do prediction: estimate a function f(x) so that y = f(x)
¡ Where y can be:
§ Real number: Regression
§ Categorical: Classification
§ Complex object:
§ Ranking of items, parse tree, etc.
¡ Data is labeled:
§ Have many pairs {(x, y)}
§ x … vector of binary, categorical, real-valued features
§ y … class ({+1, -1}, or a real number)
[Diagram: training set (X, Y) and test set (X', Y'). Estimate y = f(x) on (X, Y); hope that the same f(x) also works on the unseen (X', Y').]
¡ Task: Given data (X, Y), build a model f() to predict Y' based on X'
¡ Strategy: Estimate y = f(x) on (X, Y). Hope that the same f(x) also works to predict the unknown Y'
§ The "hope" is called generalization
§ Overfitting: f(x) predicts Y well but is unable to predict Y'
§ We want to build a model that generalizes well to unseen data
§ But Jure, how can we do well on data we have never seen before?!?
[Diagram: training data (X, Y); test data (X', Y').]
¡ A decision tree is a model in the form of a tree where:
§ Interior nodes have tests about the value of x, with one child for each possible outcome of the test.
§ Leaves have a value for f(x).
¡ Given an x to be tested, start at the root, and perform the tests, moving to the proper child.
¡ When you reach a leaf, declare the value of f(x) to be whatever is found at that leaf.
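The traversal just described can be sketched in a few lines. The nested-tuple tree encoding and the spam example below are illustrative choices, not from the slides.

```python
# Minimal sketch: a tree is either a leaf value, or a tuple
# (test, children) where test(x) selects which child to descend into.

def classify(tree, x):
    """Start at the root, perform each interior node's test, and move to
    the proper child; the leaf reached holds the value of f(x)."""
    while isinstance(tree, tuple):
        test, children = tree
        tree = children[test(x)]
    return tree

# Hypothetical spam tree: "viagra" present -> spam; otherwise spam only
# if the message is very long.
tree = (lambda x: "viagra" in x,
        {True: +1,
         False: (lambda x: len(x) > 100, {True: +1, False: -1})})

print(classify(tree, "buy viagra now"))   # +1 (spam)
print(classify(tree, "see you at noon"))  # -1 (ham)
```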
[Example decision tree for predicting a voter's preferred candidate: interior tests such as "Age < 30?" (Y/N), an "Intelligence" test (Stupid / Very Stupid), and "Liberal?" (Y/N), with leaves Sanders, Clinton, Cruz, Trump.]
¡ We need to choose a loss function that measures how well or badly a given model represents the function y = f(x).
¡ Common choice: the fraction of x's for which the model gives a value different from y.
¡ Example: If we use a model of email spam that is a weight for each word and a threshold, then the loss for given weights + threshold could be the fraction of misclassified emails.
¡ But are all errors equally bad?
¡ Question for thought: Do you think the loss should be the same for a good email classified as spam and a spam email passed to the user's inbox?
¡ If y is a numerical value, the cost could be the average magnitude of the difference between f(x) as computed by the model and the true y.
§ Or square each error (as in RMSE).
§ Subtle point: squaring errors makes the loss function much more tolerant of small errors, but not of big ones.
¡ Given training data and a form of model you wish to develop, find the instance of the model that minimizes the loss on the training data.
¡ Example: For the email-spam problem, incrementally adjust weights on words until small changes cannot decrease the probability of misclassification.
¡ Example: Design a decision tree top-down, picking at each node the test that makes the branches most homogeneous.
¡ Divide your data randomly into training and test data.
¡ Build your best model based on the training data only.
¡ Apply your model to the test data.
¡ Does your model predict y’ for the test data as well as it predicted y for the training data?
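The random division into training and test data might look like this; the function name and the 70/30 split are arbitrary choices for the sketch.

```python
import random

def train_test_split(pairs, test_fraction=0.3, seed=0):
    """Randomly divide labeled (x, y) pairs into training and test sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = [(i, +1 if i % 2 == 0 else -1) for i in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```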
¡ The test data helps you measure overfitting.
¡ But you want your model to work not only on the test data, but on all the unseen data that it will eventually be called upon to process.
§ The validation set.
¡ If the training and test sets are truly drawn at random from the population of all data, and the test set shows little overfitting, then the model should not exhibit overfitting on real data.
§ A big "if," e.g., with email spam, where the population is always changing.
¡ Sometimes, your model will show much greater loss on the test data than on the training data.
§ Called overfitting.
¡ The problem is that the modeling process has picked up on details of the training data that are so fine-grained that they do not apply to the population from which the data is drawn.
¡ Example: a decision tree with so many levels that the typical leaf is reached by only one member of the training set.
¡ Idea: Pretend we do not know the data/labels we actually do know
§ Build the model f(x) on the training data
§ See how well f(x) does on the test data
§ If it does well, then apply it also to X'
¡ Refinement: Cross-validation
§ Splitting into a single training/validation set is brutal
§ Let's split our data (X, Y) into 10 folds (buckets)
§ Take out 1 fold for validation, train on the remaining 9
§ Repeat this 10 times, report average performance
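The 10-fold procedure can be sketched as follows; `train_fn` and `loss_fn` are hypothetical callables standing in for whatever model and loss you use.

```python
def cross_validate(data, train_fn, loss_fn, folds=10):
    """Hold out each fold once for validation, train on the other nine,
    and report the average loss over the ten runs."""
    losses = []
    for k in range(folds):
        held_out = data[k::folds]  # every folds-th example forms fold k
        training = [d for i, d in enumerate(data) if i % folds != k]
        model = train_fn(training)
        losses.append(loss_fn(model, held_out))
    return sum(losses) / folds

# Toy model: always predict the majority class of the training labels.
def train_majority(pairs):
    labels = [y for _, y in pairs]
    return max(set(labels), key=labels.count)

def zero_one(model, pairs):
    return sum(1 for _, y in pairs if y != model) / len(pairs)

data = [(i, +1) for i in range(70)] + [(i, -1) for i in range(30)]
print(cross_validate(data, train_majority, zero_one))  # approximately 0.3
```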
[Diagram: (X, Y) is divided into a training set and a validation set; X' is the test set.]
¡ We will talk about the following methods:
§ k-Nearest Neighbor (instance-based learning)
§ Perceptron and Winnow algorithms
¡ Main question: How to efficiently train (build a model / find model parameters)?
¡ Instance-based learning
¡ Example: Nearest neighbor
§ Keep the whole training dataset: {(x, y)}
§ A query example (vector) q comes
§ Find the closest example(s) x*
§ Predict y*
¡ Works both for regression and classification
§ Collaborative filtering is an example of a k-NN classifier
§ Find the k most similar people to user x that have rated movie y
§ Predict the rating y_x of x as an average of y_k
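A minimal sketch of the nearest-neighbor idea: Euclidean distance, majority vote over the k closest stored examples. The toy points are made up.

```python
import math
from collections import Counter

def knn_classify(train, q, k=1):
    """Keep the whole training set; at query time, find the k examples
    closest to q and predict the majority label among them."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], q))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

train = [((0.0, 0.0), -1), ((0.1, 0.2), -1),
         ((5.0, 5.0), +1), ((5.2, 4.9), +1)]
print(knn_classify(train, (0.2, 0.1), k=3))  # -1
print(knn_classify(train, (5.1, 5.0), k=3))  # 1
```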
¡ To make Nearest Neighbor work we need 4 things:
§ Distance metric:
§ Euclidean
§ How many neighbors to look at?
§ One
§ Weighting function (optional):
§ Unused
§ How to fit with the local points?
§ Just predict the same output as the nearest neighbor
¡ Distance metric:
§ Euclidean
¡ How many neighbors to look at?
§ k
¡ Weighting function (optional):
§ Unused
¡ How to fit with the local points?
§ Just predict the average output among the k nearest neighbors
[Plot: k-nearest-neighbor fit with k = 9.]
¡ Distance metric:
§ Euclidean
¡ How many neighbors to look at?
§ All of them (!)
¡ Weighting function:
§ w_i = exp(−d(x_i, q)² / K_w)
§ Nearby points to query q are weighted more strongly. K_w … kernel width.
¡ How to fit with the local points?
§ Predict the weighted average: Σ_i w_i y_i / Σ_i w_i
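The weighted-average prediction over all training points can be sketched as follows; the 1-D toy data is illustrative.

```python
import math

def kernel_regression(train, q, Kw=10.0):
    """Weight every training point by w_i = exp(-d(x_i, q)^2 / Kw),
    then predict the weighted average sum(w_i * y_i) / sum(w_i)."""
    weights = [math.exp(-math.dist(x, q) ** 2 / Kw) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)

train = [((0.0,), 1.0), ((1.0,), 3.0)]
# A query equidistant from both points gets the plain average:
print(kernel_regression(train, (0.5,)))  # approximately 2.0
```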
[Plot: weight w_i as a function of d(x_i, q), peaking at d(x_i, q) = 0, for kernel widths K_w = 10, 20, 80.]
¡ Given: a set P of n points in R^d
¡ Goal: Given a query point q § NN: Find the nearest neighbor p of q in P § Range search: Find one/all points in P within distance r from q
¡ Main memory:
§ Linear scan
§ Tree-based:
§ Quadtree
§ kd-tree
§ Hashing:
§ Locality-Sensitive Hashing
¡ Secondary storage:
§ R-trees
¡ Example: Spam filtering
¡ Instance space x ∈ X (|X| = n data points)
§ Binary or real-valued feature vector x of word occurrences
§ d features (words + other things, d ~ 100,000)
¡ Class y ∈ Y
§ y: Spam (+1), Ham (−1)
¡ Binary classification:
¡ Input: Vectors x^(j) and labels y^(j)
§ Vectors x^(j) are real-valued, with ‖x‖₂ = 1
¡ Goal: Find a vector w = (w1, w2, …, wd)
§ Each wi is a real number
f(x) = +1 if w1·x1 + w2·x2 + … + wd·xd ≥ θ, and −1 otherwise
[Diagram: positive and negative points separated by the hyperplane w·x = 0, with the weight vector w normal to it. Note: the decision boundary is linear.]
¡ (very) Loose motivation: Neuron
¡ Inputs are feature values
¡ Each feature has a weight wi
¡ Activation is the sum:
§ f(x) = Σi wi·xi = w·x
¡ If f(x) is:
§ Positive: Predict +1
§ Negative: Predict −1
[Diagrams: a neuron summing weighted inputs w1·x1 + … + w4·x4 and testing ≥ 0, with features such as "viagra" and "nigeria"; and a plane w·x = 0 separating Spam (+1) from Ham (−1) examples x(1), x(2).]
¡ Perceptron: y' = sign(w·x)
¡ How to find parameters w?
§ Start with w(0) = 0
§ Pick training examples x(t) one by one (from disk)
§ Predict the class of x(t) using current weights: y' = sign(w(t)·x(t))
§ If y' is correct (i.e., y(t) = y'): no change, w(t+1) = w(t)
§ If y' is wrong, adjust w(t): w(t+1) = w(t) + η · y(t) · x(t)
§ η is the learning rate parameter
§ x(t) is the t-th training example
§ y(t) is the true t-th class label ({+1, −1})
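Putting the update rule together as runnable code. This sketch absorbs the threshold by using θ = 0, i.e., predicting sign(w·x); the toy data is illustrative.

```python
def perceptron_train(examples, eta=1.0, epochs=10):
    """Start with w = 0; for each example predict sign(w.x) and, only on
    a mistake, update w <- w + eta * y * x (correct examples are ignored)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score >= 0 else -1) != y:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: the label is the sign of the first feature.
data = [((1.0, 0.1), 1), ((2.0, -0.5), 1),
        ((-1.0, 0.3), -1), ((-2.0, 0.2), -1)]
w = perceptron_train(data)
print(all((1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1) == y
          for x, y in data))  # True
```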
[Diagram: a mistake on example (x(t), y(t) = 1) rotates the weight vector toward x(t): w(t+1) = w(t) + η·y(t)·x(t).]
Note that the Perceptron is a conservative algorithm: it ignores samples that it classifies correctly.
¡ Perceptron Convergence Theorem:
§ If there exists a set of weights that is consistent (i.e., the data is linearly separable), the Perceptron learning algorithm will converge
¡ How long would it take to converge?
¡ Perceptron Cycling Theorem:
§ If the training data is not linearly separable, the Perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop
¡ Perceptron will oscillate and won't converge
¡ When to stop learning?
¡ (1) Slowly decrease the learning rate η
§ A classic way is: η = c1/(t + c2)
§ But we also need to determine the constants c1 and c2
¡ (2) Stop when the training error stops changing
¡ (3) Have a small test dataset and stop when the test set error stops decreasing
¡ (4) Stop when we have reached some maximum number of passes over the data
¡ Winnow: Predict f(x) = +1 iff w·x ≥ θ
§ Similar to Perceptron, just different updates
§ Assume x is a real-valued feature vector, ‖x‖₂ = 1
§ w … weights (can never become negative!)
§ Z(t) = Σi wi·exp(η·y(t)·xi(t)) is the normalizing constant
• Initialize: θ = d/2, w = [1/d, …, 1/d]
• For every training example x(t):
• Compute y' = f(x(t))
• If no mistake (y(t) = y'): do nothing
• If mistake, then: wi ← wi·exp(η·y(t)·xi(t)) / Z(t)
¡ About the update: wi ← wi·exp(η·y(t)·xi(t)) / Z(t)
§ If x is a false negative, increase wi (promote)
§ If x is a false positive, decrease wi (demote)
§ In other words: consider xi(t) ∈ {−1, +1}
§ Then wi(t+1) ∝ wi(t)·e^η if xi(t) = y(t), and wi(t+1) ∝ wi(t)·e^(−η) otherwise
§ Notice: This is a weighted majority algorithm of "experts" xi agreeing with y
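A runnable sketch of the multiplicative update. One caveat: the slides initialize unit-sum weights w = [1/d, …, 1/d]; for the {0,1}-valued toy features here, the threshold is set to 1/2 (an assumption of this sketch) so the unit-sum weights can actually reach it.

```python
import math

def winnow_train(examples, eta=0.5, epochs=20, theta=0.5):
    """Winnow-style training: weights start uniform and positive; on a
    mistake, w_i <- w_i * exp(eta * y * x_i) / Z, where Z renormalizes
    the weights to sum to 1.  Weights can never become negative."""
    d = len(examples[0][0])
    w = [1.0 / d] * d
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score >= theta else -1) != y:
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
                Z = sum(w)                  # normalizing constant Z(t)
                w = [wi / Z for wi in w]
    return w

# Toy target: y = +1 exactly when the first feature is on.
data = [((1, 0, 0, 0), 1), ((1, 1, 0, 0), 1), ((0, 1, 1, 0), -1),
        ((0, 0, 1, 1), -1), ((1, 0, 1, 0), 1), ((0, 1, 0, 1), -1)]
w = winnow_train(data)
print(all(wi > 0 for wi in w))  # True: weights stay positive
```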
¡ Setting:
§ Examples: x ∈ {0,1}^d, weights w ∈ R^d
§ Prediction: f(x) = +1 iff w·x ≥ θ, else −1
¡ Perceptron: additive weight update w ← w + η·y·x
§ If y = +1 but w·x ≤ θ, then wi ← wi + 1 (if xi = 1) (promote)
§ If y = −1 but w·x > θ, then wi ← wi − 1 (if xi = 1) (demote)
¡ Winnow: multiplicative weight update w ← w·exp{η·y·x}
§ If y = +1 but w·x ≤ θ, then wi ← 2·wi (if xi = 1) (promote)
§ If y = −1 but w·x > θ, then wi ← wi / 2 (if xi = 1) (demote)
¡ How to compare learning algorithms?
¡ Considerations:
§ The number of features d is very large
§ The instance space is sparse
§ Only a few features per training example are non-zero
§ The model is sparse
§ Decisions depend on a small subset of features
§ In the "true" model, only a few wi are non-zero
§ We want to learn from a number of examples that is small relative to the dimensionality d
Perceptron
¡ Online: Can adjust to a changing target, over time
¡ Advantages
§ Simple
§ Guaranteed to learn a linearly separable problem
§ Advantage with few relevant features per training example
¡ Limitations
§ Only linear separations
§ Only converges for linearly separable data
§ Not really "efficient with many features"
Winnow
¡ Online: Can adjust to a changing target, over time
¡ Advantages
§ Simple
§ Guaranteed to learn a linearly separable problem
§ Suitable for problems with many irrelevant attributes
¡ Limitations
§ Only linear separations
§ Only converges for linearly separable data
§ Not really "efficient with many features"
¡ New setting: Online Learning
§ Allows for modeling problems where we have a continuous stream of data
§ We want an algorithm to learn from it and slowly adapt to the changes in data
¡ Idea: Do slow updates to the model
§ Both our methods, Perceptron and Winnow, make updates if they misclassify an example
§ So: First train the classifier on training data. Then, for every example from the stream, if we misclassify, update the model (using a small learning rate)
¡ Protocol:
§ A user comes and tells us origin and destination
§ We offer to ship the package for some money ($10 - $50)
§ Based on the price we offer, sometimes the user uses our service (y = 1), sometimes they don't (y = −1)
¡ Task: Build an algorithm to optimize what price we offer to the users
¡ Features x capture:
§ Information about the user
§ Origin and destination
¡ Problem: Will the user accept the price?
¡ Model whether the user will accept our price: y = f(x; w)
§ Accept: y = +1, Not accept: y = −1
§ Build this model with, say, Perceptron or Winnow
¡ The website runs continuously
¡ An online learning algorithm would do something like this:
§ A user comes
§ She is represented as an (x, y) pair, where
§ x: feature vector including the price we offer, origin, destination
§ y: whether they chose to use our service or not
§ The algorithm updates w using just the (x, y) pair
§ Basically, we update the w parameters every time we get some new data
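The per-arrival update described above can be sketched as follows, reusing a perceptron-style mistake-driven update; the two-feature stream is made up.

```python
def online_update(w, x, y, eta=0.1):
    """One step of mistake-driven online learning: predict with the
    current w and, only if the prediction is wrong, nudge w toward
    the new example with a small learning rate."""
    y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
    if y_pred != y:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Examples arrive one at a time; there is no stored data "set".
w = [0.0, 0.0]
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 1.0), 1)]
for x, y in stream:
    w = online_update(w, x, y)
```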
¡ We discard this idea of a data "set"
¡ Instead we have a continuous stream of data
¡ Further comments:
§ For a major website where you have a massive stream of data, this kind of algorithm is pretty reasonable
§ We don't need to deal with all the training data
§ If you had a small number of users, you could save their data and then run a normal algorithm on the full dataset
§ Doing multiple passes over the data
¡ An online algorithm can adapt to changing user preferences
¡ For example, over time users may become more price sensitive
¡ The algorithm adapts and learns this
¡ So the system is dynamic