TRANSCRIPT
Fast Kernel Methods
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Kernel low rank approximation
Incomplete Cholesky factorization of the kernel matrix K of size n × n into R of size d × n, with d ≪ n:

K ≈ R^T R

This gives a Gaussian process posterior:

f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
m_post(x) = R_x^T (R R^T + σ_noise^2 I)^{-1} R Y^T
k_post(x, x') = R_x^T R_{x'} − R_x^T (R R^T + σ_noise^2 I)^{-1} (R R^T) R_{x'}
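As a quick numerical check (my own sketch, not from the slides): the posterior-mean formula above follows from the push-through identity R (R^T R + σ² I)^{-1} = (R R^T + σ² I)^{-1} R, which replaces an n × n solve with a d × d one. With a full-rank Cholesky factor the two forms agree exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 0.1
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)              # a symmetric positive definite "kernel" matrix
y = rng.normal(size=n)

R = np.linalg.cholesky(K).T          # K = R.T @ R (full rank here, so d = n)
# low-rank form of m_post, evaluated at the training inputs: a d x d solve
m_lowrank = R.T @ np.linalg.solve(R @ R.T + sigma2 * np.eye(n), R @ y)
# standard GP posterior mean: an n x n solve
m_exact = K @ np.linalg.solve(K + sigma2 * np.eye(n), y)
assert np.allclose(m_lowrank, m_exact)
```

When R genuinely has d ≪ n rows, the same d × d solve costs O(nd² + d³) instead of O(n³).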
Incomplete Cholesky Decomposition
We have a few things to understand
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis
Q = {u_1, u_2, …, u_n}: u_i^T u_j = 0 for i ≠ j, u_i^T u_i = 1
QR decomposition
Given the orthonormal basis Q, compute the projection of V
onto Q: v_i = Σ_j r_ji u_j, R = (r_ji)
V = QR
Cholesky decomposition with pivots
V ≈ Q(:, 1:k) R(1:k, :)
Kernelization
V^T V = R^T Q^T Q R = R^T R ≈ R(1:k, :)^T R(1:k, :)
K = Φ^T Φ ≈ R(1:k, :)^T R(1:k, :)
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis
Q = {u_1, u_2, …, u_n}: u_i^T u_j = 0 for i ≠ j, u_i^T u_i = 1
u_1 can be found by picking an arbitrary v_1 and normalizing:
u_1 = v_1 / ‖v_1‖
u_2 can be found by picking a vector v_2, subtracting out the multiple of u_1, and then normalizing:
a_2 = v_2 − <v_2, u_1> u_1
u_2 = a_2 / ‖a_2‖
In general:
a_i = v_i − Σ_{j=1}^{i−1} <v_i, u_j> u_j,  u_i = a_i / ‖a_i‖
[Figure: v_1 and v_2 in the plane; the residual a_2 = v_2 − <v_2, u_1> u_1 is normalized to give u_2]
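The steps above can be written as a short sketch (function name mine):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    Q = np.zeros_like(V, dtype=float)
    for i in range(V.shape[1]):
        # a_i = v_i minus its projections onto the previous u_j
        a = V[:, i] - Q[:, :i] @ (Q[:, :i].T @ V[:, i])
        Q[:, i] = a / np.linalg.norm(a)
    return Q

V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q = gram_schmidt(V)
# Q now has orthonormal columns spanning the same space as V
```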
QR decomposition
Essentially Gram-Schmidt orthogonalization, but keep both the orthonormal basis and the weights of the projections
Given a set of vectors V = {𝑣1, 𝑣2, … , 𝑣𝑛}, find a set of orthonormal basis 𝑄 = 𝑢1, 𝑢2, … 𝑢𝑛 using Gram-Schmidt orthogonalization
The projection of v_i onto basis vector u_j is r_ji = <v_i, u_j>
v_1 = u_1 <u_1, v_1>
v_2 = u_1 <u_1, v_2> + u_2 <u_2, v_2>
v_3 = u_1 <u_1, v_3> + u_2 <u_2, v_3> + u_3 <u_3, v_3>
…
v_i = Σ_{j=1}^{i} <v_i, u_j> u_j
QR decomposition
Because we use the original data points to form the basis vectors, vector v_i has only i nonzero components in the basis expansion
v_i = Σ_{j=1}^{i} <v_i, u_j> u_j = Σ_{j=1}^{i} r_ji u_j
Collect terms into matrix format:
V = (v_1, …, v_n), v_i ∈ R^d
Q = (u_1, …, u_d),  R = (r_{:1}, …, r_{:n}) with zeros below the diagonal
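The matrix form can be checked numerically; a small sketch using numpy's built-in QR, which computes the same factorization:

```python
import numpy as np

# r_ji = <v_i, u_j>, collected as R = Q^T V; then V = Q R, with R upper triangular
V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q, R = np.linalg.qr(V)                    # numpy's QR (Householder-based)
assert np.allclose(Q @ R, V)              # V = QR
assert np.allclose(np.tril(R, -1), 0.0)   # zeros below the diagonal
```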
QR decomposition with pivots
QR decomposition
If we only choose a few basis vectors, then it is an approximation
The basis vectors are formed from the original data points:
how should we order/choose from the original data points
such that the approximation error is small?
Ordering/choosing from the data points = choosing pivots
Cholesky decomposition
If K is a symmetric and positive definite matrix, then K can be decomposed as
K = R^T R
Since K is a kernel matrix, we can find an implicit feature space:
K = Φ^T Φ, where Φ = (φ(x_1), …, φ(x_n))
QR decomposition on Φ: Φ = QR
K = R^T Q^T Q R = R^T R
Incomplete Cholesky decomposition
Use QR decomposition with pivots
K ≈ R(1:d, :)^T R(1:d, :)
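Putting the pieces together, here is a minimal sketch of greedy pivoted (incomplete) Cholesky applied directly to K (function name mine): at each step, pivot on the largest diagonal entry of the residual K − R^T R, append one row to R, and stop after d rows or when the residual is negligible.

```python
import numpy as np

def incomplete_cholesky(K, d, tol=1e-8):
    """Pivoted (incomplete) Cholesky: returns R with K ~= R.T @ R, R of size <= d x n."""
    n = K.shape[0]
    R = np.zeros((d, n))
    diag = np.diag(K).astype(float).copy()   # diagonal of the residual K - R.T @ R
    for k in range(d):
        i = int(np.argmax(diag))             # pivot: largest residual diagonal entry
        if diag[i] <= tol:
            return R[:k]                     # residual negligible, stop early
        R[k, :] = (K[i, :] - R[:k, i] @ R[:k, :]) / np.sqrt(diag[i])
        diag -= R[k, :] ** 2
    return R

# usage: Gaussian RBF kernel matrices have fast-decaying spectra, so d << n suffices
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
R = incomplete_cholesky(K, 40)
err = np.abs(K - R.T @ R).max()
```

Note the algorithm only ever reads d columns of K, so the full n × n kernel matrix never needs to be formed in practice.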
Random features
What basis to use?
e^{jω^T(x−y)} can be replaced by cos(ω^T(x − y)), since both k(x − y)
and p(ω) are real functions (here k(x − y) = ∫ p(ω) e^{jω^T(x−y)} dω by Bochner's theorem)
cos(ω^T(x − y)) = cos(ω^T x) cos(ω^T y) + sin(ω^T x) sin(ω^T y)
For each ω, use the feature [cos(ω^T x), sin(ω^T x)]
What randomness to use?
Randomly draw 𝜔 from 𝑝 𝜔
E.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian
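A minimal sketch of this construction (function and variable names mine), assuming the Gaussian RBF kernel k(x, y) = exp(−γ‖x − y‖²), whose spectral density p(ω) is a Gaussian with standard deviation √(2γ):

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """phi(x) such that phi(x) . phi(y) ~= exp(-gamma * ||x - y||^2)."""
    # draw D frequencies omega ~ p(omega) = N(0, 2 * gamma * I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    Z = X @ W
    # per-omega feature [cos(w.x), sin(w.x)], averaged over the D draws
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Phi = random_fourier_features(X, 2000, gamma=0.5, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
err = np.abs(Phi @ Phi.T - K_exact).max()   # Monte Carlo error shrinks as 1/sqrt(D)
```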
String Kernels
Compare two sequences for similarity
Exact matching kernel
Counting all matching substrings
Flexible weighting scheme
Does not work well for noisy case
Successful applications in bio-informatics
Linear time algorithm using suffix trees
[Figure: similarity K(s, s') = 0.7 between two example DNA sequences]
Exact matching string kernels
Bag of Characters
Count single characters; set w_s = 0 for |s| > 1
Bag of Words
Substrings s are bounded by whitespace
Limited range correlations
Set w_s = 0 for all |s| > n, for a fixed n
K-spectrum kernel
Account for matching substrings of length k; set w_s = 0 for all |s| ≠ k
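The k-spectrum case has a direct (naive, quadratic-time) sketch with uniform weights w_s = 1 for |s| = k; the suffix-tree algorithm mentioned below computes the same value in linear time:

```python
from collections import Counter

def k_spectrum_kernel(s, t, k):
    """Inner product of length-k substring count vectors (w_s = 1 for |s| = k)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

# "ab" occurs twice in each string and "ba" once in each: 2*2 + 1*1 = 5
value = k_spectrum_kernel("ababc", "abab", 2)
```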
Suffix trees
Definition: a compact tree built from all the suffixes of a string
E.g., the suffix tree of ababc is denoted by S(ababc)
Node Label = unique path from the root
Suffix links are used to speed up parsing of strings: if we are at node 𝑎𝑥 then suffix links help us to jump to node 𝑥
Represent all the substrings of a given string
Can be constructed in linear time and stored in linear space
Each leaf corresponds to a unique suffix
Leaves of the subtree below a node give the number of occurrences of that node's substring
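The "leaves give occurrences" property can be illustrated without building the tree: the occurrences of a substring equal the number of suffixes that start with it (a naive stand-in for the linear-time suffix-tree query):

```python
def count_occurrences(s, sub):
    """Occurrences of sub in s = number of suffixes of s starting with sub."""
    return sum(s[i:].startswith(sub) for i in range(len(s)))

# suffixes of "ababc": ababc, babc, abc, bc, c -- two of them start with "ab"
n_ab = count_occurrences("ababc", "ab")
```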
Combining classifiers
Average results from several different models
Bagging
Stacking (meta-learning)
Boosting
Why?
Better classification performance than individual classifiers
More resilience to noise
Concerns
Takes more time to obtain the final model
Overfitting
Bagging
Bagging: Bootstrap aggregating
Generate B bootstrap samples of the training data: uniformly random sampling with replacement
Train a classifier or a regression function using each bootstrap sample
For classification: majority vote on the classification results
For regression: average on the predicted values
Advantage:
Simple
Reduce variance
Improves performance for unstable classifiers, which may vary significantly with small changes in the dataset
Bagging Example
Sample with replacement
Original        1 2 3 4 5 6 7 8
Training set 1  2 7 8 3 7 6 3 1
Training set 2  7 8 5 6 4 2 7 1
Training set 3  3 6 2 7 5 6 2 2
Training set 4  4 5 1 4 6 4 3 8
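Sampling with replacement as in the table, plus the majority-vote aggregation, can be sketched directly (a toy illustration, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(1, 9)                    # the original training set {1, ..., 8}
idx = rng.integers(0, len(original), size=(4, len(original)))
training_sets = original[idx]                 # 4 bootstrap samples, with replacement

# aggregation for classification: majority vote over B = 4 predicted labels
preds = np.array([1, 0, 1, 1])
final = np.bincount(preds).argmax()           # label 1 wins 3 votes to 1
```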
Stacking classifiers
Level-0 models are based on different learning models and use original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called “generalizer”
If you have lots of models, you can stack them into deeper hierarchies
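A toy regression sketch of the two levels (all names mine; real stacking would fit the level-1 model on held-out level-0 predictions to avoid overfitting): level-0 consists of two one-feature least-squares models, and level-1 is a least-squares combination of their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=100)

# level-0 models, each trained on the original (level-0) data
w_a = np.linalg.lstsq(X[:, :1], y, rcond=None)[0]
w_b = np.linalg.lstsq(X[:, 1:], y, rcond=None)[0]
Z = np.column_stack([X[:, :1] @ w_a, X[:, 1:] @ w_b])   # level-1 data

# level-1 "generalizer" combines the level-0 outputs
w = np.linalg.lstsq(Z, y, rcond=None)[0]
stacked_rss = ((y - Z @ w) ** 2).sum()
```

On the training data the combination can only do at least as well as either level-0 model, since using one model alone is a special case of the level-1 fit.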
Boosting
Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule
A family of methods which produce a sequence of classifiers
Each classifier is dependent on the previous one and focuses on the previous one’s errors
Examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.
Questions:
How to choose “hardest” examples?
How to combine these classifiers?
AdaBoost
Toy Example
Weak classifier (rule of thumb): vertical or horizontal half-planes
Uniform weights on all examples
Boosting round 1
Choose a rule of thumb (weak classifier)
Some data points obtain higher weights because they are classified incorrectly
Boosting round 2
Choose a new rule of thumb
Reweight again: the weights of incorrectly classified examples are increased
Boosting round 3
Repeat the same process
Now we have 3 classifiers
Boosting aggregate classifier
Final classifier is weighted combination of weak classifiers
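The rounds above can be sketched end to end; a minimal AdaBoost with axis-aligned decision stumps as the weak classifier (all names mine, not the lecture's notation):

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost with decision-stump weak learners; labels y in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                       # start with uniform weights
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):          # pick the stump with lowest weighted error
            for thr in X[:, j]:
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)    # misclassified examples gain weight
        w = w / w.sum()
        ensemble.append((alpha, j, thr, sign))
    def predict(Xq):                         # weighted combination of weak classifiers
        F = sum(a * s * np.where(Xq[:, j] > t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(F)
    return predict

# 1-D "interval" labels: no single stump is enough, but a few boosted rounds are
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.7), 1, -1)
clf = adaboost(X, y, T=10)
acc = (clf(X) == y).mean()
```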