TRANSCRIPT
Fast Kernel Methods
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Kernel low rank approximation
Incomplete Cholesky factorization of the kernel matrix K of size n × n into R of size d × n, with d ≪ n:

K ≈ R^T R

This gives a Gaussian process posterior:

f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
m_post(x) = R_x^T (R R^T + σ_noise^2 I)^{-1} R Y^T
k_post(x, x') = R_x^T R_{x'} − R_x^T (R R^T + σ_noise^2 I)^{-1} (R R^T) R_{x'}
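As a quick numerical check (my own sketch, not from the slides): the posterior-mean formula above follows from the push-through identity R (R^T R + σ² I)^{-1} = (R R^T + σ² I)^{-1} R, which replaces an n × n solve with a d × d one. With a full-rank Cholesky factor the two forms agree exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 0.1
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)              # a symmetric positive definite "kernel" matrix
y = rng.normal(size=n)

R = np.linalg.cholesky(K).T          # K = R.T @ R (full rank here, so d = n)
# low-rank form of m_post, evaluated at the training inputs: a d x d solve
m_lowrank = R.T @ np.linalg.solve(R @ R.T + sigma2 * np.eye(n), R @ y)
# standard GP posterior mean: an n x n solve
m_exact = K @ np.linalg.solve(K + sigma2 * np.eye(n), y)
assert np.allclose(m_lowrank, m_exact)
```

When R genuinely has d ≪ n rows, the same d × d solve costs O(nd² + d³) instead of O(n³).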
Incomplete Cholesky Decomposition
We have a few things to understand
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis
Q = {u_1, u_2, …, u_n}: u_i^T u_j = 0 for i ≠ j, u_i^T u_i = 1
QR decomposition
Given the orthonormal basis Q, compute the projection of V
onto Q: v_i = Σ_j r_ji u_j, R = (r_ji)
V = QR
Cholesky decomposition with pivots
V ≈ Q(:, 1:k) R(1:k, :)
Kernelization
V^T V = R^T Q^T Q R = R^T R ≈ R(1:k, :)^T R(1:k, :)
K = Φ^T Φ ≈ R(1:k, :)^T R(1:k, :)
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find an orthonormal basis
Q = {u_1, u_2, …, u_n}: u_i^T u_j = 0 for i ≠ j, u_i^T u_i = 1
u_1 can be found by picking an arbitrary v_1 and normalizing:
u_1 = v_1 / ‖v_1‖
u_2 can be found by picking a vector v_2, subtracting out the multiple of u_1, and then normalizing:
a_2 = v_2 − <v_2, u_1> u_1
u_2 = a_2 / ‖a_2‖
In general:
a_i = v_i − Σ_{j=1}^{i−1} <v_i, u_j> u_j,  u_i = a_i / ‖a_i‖
[Figure: v_1 and v_2 in the plane; the residual a_2 = v_2 − <v_2, u_1> u_1 is normalized to give u_2]
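The steps above can be written as a short sketch (function name mine):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    Q = np.zeros_like(V, dtype=float)
    for i in range(V.shape[1]):
        # a_i = v_i minus its projections onto the previous u_j
        a = V[:, i] - Q[:, :i] @ (Q[:, :i].T @ V[:, i])
        Q[:, i] = a / np.linalg.norm(a)
    return Q

V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q = gram_schmidt(V)
# Q now has orthonormal columns spanning the same space as V
```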
QR decomposition
Essentially Gram-Schmidt orthogonalization, but keep both the orthonormal basis and the weights of the projections
Given a set of vectors V = {𝑣1, 𝑣2, … , 𝑣𝑛}, find a set of orthonormal basis 𝑄 = 𝑢1, 𝑢2, … 𝑢𝑛 using Gram-Schmidt orthogonalization
The projection of v_i onto basis vector u_j is r_ji = <v_i, u_j>
v_1 = u_1 <u_1, v_1>
v_2 = u_1 <u_1, v_2> + u_2 <u_2, v_2>
v_3 = u_1 <u_1, v_3> + u_2 <u_2, v_3> + u_3 <u_3, v_3>
…
v_i = Σ_{j=1}^{i} <v_i, u_j> u_j
QR decomposition
Because we use the original data points to form the basis vectors, vector v_i has only i nonzero components in the basis expansion
v_i = Σ_{j=1}^{i} <v_i, u_j> u_j = Σ_{j=1}^{i} r_ji u_j
Collect terms into matrix format:
V = (v_1, …, v_n), v_i ∈ R^d
Q = (u_1, …, u_d),  R = (r_{:1}, …, r_{:n}) with zeros below the diagonal
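The matrix form can be checked numerically; a small sketch using numpy's built-in QR, which computes the same factorization:

```python
import numpy as np

# r_ji = <v_i, u_j>, collected as R = Q^T V; then V = Q R, with R upper triangular
V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q, R = np.linalg.qr(V)                    # numpy's QR (Householder-based)
assert np.allclose(Q @ R, V)              # V = QR
assert np.allclose(np.tril(R, -1), 0.0)   # zeros below the diagonal
```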
QR decomposition with pivots
QR decomposition
If we only choose a few basis vectors, then it is an approximation
The basis vectors are formed from the original data points:
how should we order/choose from the original data points
such that the approximation error is small?
Ordering/choosing from the data points = choosing pivots
Cholesky decomposition
If K is a symmetric and positive definite matrix, then K can be decomposed as
K = R^T R
Since K is a kernel matrix, we can find an implicit feature space:
K = Φ^T Φ, where Φ = (φ(x_1), …, φ(x_n))
QR decomposition on Φ: Φ = QR
K = R^T Q^T Q R = R^T R
Incomplete Cholesky decomposition
Use QR decomposition with pivots
K ≈ R(1:d, :)^T R(1:d, :)
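Putting the pieces together, here is a minimal sketch of greedy pivoted (incomplete) Cholesky applied directly to K (function name mine): at each step, pivot on the largest diagonal entry of the residual K − R^T R, append one row to R, and stop after d rows or when the residual is negligible.

```python
import numpy as np

def incomplete_cholesky(K, d, tol=1e-8):
    """Pivoted (incomplete) Cholesky: returns R with K ~= R.T @ R, R of size <= d x n."""
    n = K.shape[0]
    R = np.zeros((d, n))
    diag = np.diag(K).astype(float).copy()   # diagonal of the residual K - R.T @ R
    for k in range(d):
        i = int(np.argmax(diag))             # pivot: largest residual diagonal entry
        if diag[i] <= tol:
            return R[:k]                     # residual negligible, stop early
        R[k, :] = (K[i, :] - R[:k, i] @ R[:k, :]) / np.sqrt(diag[i])
        diag -= R[k, :] ** 2
    return R

# usage: Gaussian RBF kernel matrices have fast-decaying spectra, so d << n suffices
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
R = incomplete_cholesky(K, 40)
err = np.abs(K - R.T @ R).max()
```

Note the algorithm only ever reads d columns of K, so the full n × n kernel matrix never needs to be formed in practice.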
Random features
What basis to use?
e^{jω^T(x−y)} can be replaced by cos(ω^T(x − y)), since both k(x − y)
and p(ω) are real functions (here k(x − y) = ∫ p(ω) e^{jω^T(x−y)} dω by Bochner's theorem)
cos(ω^T(x − y)) = cos(ω^T x) cos(ω^T y) + sin(ω^T x) sin(ω^T y)
For each ω, use the feature [cos(ω^T x), sin(ω^T x)]
What randomness to use?
Randomly draw 𝜔 from 𝑝 𝜔
E.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian
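A minimal sketch of this construction (function and variable names mine), assuming the Gaussian RBF kernel k(x, y) = exp(−γ‖x − y‖²), whose spectral density p(ω) is a Gaussian with standard deviation √(2γ):

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    """phi(x) such that phi(x) . phi(y) ~= exp(-gamma * ||x - y||^2)."""
    # draw D frequencies omega ~ p(omega) = N(0, 2 * gamma * I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    Z = X @ W
    # per-omega feature [cos(w.x), sin(w.x)], averaged over the D draws
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Phi = random_fourier_features(X, 2000, gamma=0.5, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
err = np.abs(Phi @ Phi.T - K_exact).max()   # Monte Carlo error shrinks as 1/sqrt(D)
```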
String Kernels
Compare two sequences for similarity
Exact matching kernel
Counting all matching substrings
Flexible weighting scheme
Does not work well for noisy case
Successful applications in bio-informatics
Linear time algorithm using suffix trees
[Figure: similarity K(s, s') = 0.7 between two example DNA sequences]
Exact matching string kernels
Bag of Characters
Count single characters; set w_s = 0 for |s| > 1
Bag of Words
Substrings s are bounded by whitespace
Limited range correlations
Set w_s = 0 for all |s| > n, for a fixed n
K-spectrum kernel
Account for matching substrings of length k; set w_s = 0 for all |s| ≠ k
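The k-spectrum case has a direct (naive, quadratic-time) sketch with uniform weights w_s = 1 for |s| = k; the suffix-tree algorithm mentioned below computes the same value in linear time:

```python
from collections import Counter

def k_spectrum_kernel(s, t, k):
    """Inner product of length-k substring count vectors (w_s = 1 for |s| = k)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

# "ab" occurs twice in each string and "ba" once in each: 2*2 + 1*1 = 5
value = k_spectrum_kernel("ababc", "abab", 2)
```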
Suffix trees
Definition: a compact tree built from all the suffixes of a string
E.g., the suffix tree of ababc is denoted by S(ababc)
Node Label = unique path from the root
Suffix links are used to speed up parsing of strings: if we are at node 𝑎𝑥 then suffix links help us to jump to node 𝑥
Represent all the substrings of a given string
Can be constructed in linear time and stored in linear space
Each leaf corresponds to a unique suffix
Leaves of the subtree below a node give the number of occurrences of that node's substring
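The "leaves give occurrences" property can be illustrated without building the tree: the occurrences of a substring equal the number of suffixes that start with it (a naive stand-in for the linear-time suffix-tree query):

```python
def count_occurrences(s, sub):
    """Occurrences of sub in s = number of suffixes of s starting with sub."""
    return sum(s[i:].startswith(sub) for i in range(len(s)))

# suffixes of "ababc": ababc, babc, abc, bc, c -- two of them start with "ab"
n_ab = count_occurrences("ababc", "ab")
```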
Combining classifiers
Average results from several different models
Bagging
Stacking (meta-learning)
Boosting
Why?
Better classification performance than individual classifiers
More resilience to noise
Concerns
Takes more time to obtain the final model
Overfitting
Bagging
Bagging: Bootstrap aggregating
Generate B bootstrap samples of the training data: uniformly random sampling with replacement
Train a classifier or a regression function using each bootstrap sample
For classification: majority vote on the classification results
For regression: average on the predicted values
Advantage:
Simple
Reduce variance
Improves performance for unstable classifiers, which may vary significantly with small changes in the dataset
Bagging Example
Sample with replacement
Original        1 2 3 4 5 6 7 8
Training set 1  2 7 8 3 7 6 3 1
Training set 2  7 8 5 6 4 2 7 1
Training set 3  3 6 2 7 5 6 2 2
Training set 4  4 5 1 4 6 4 3 8
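Sampling with replacement as in the table, plus the majority-vote aggregation, can be sketched directly (a toy illustration, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(1, 9)                    # the original training set {1, ..., 8}
idx = rng.integers(0, len(original), size=(4, len(original)))
training_sets = original[idx]                 # 4 bootstrap samples, with replacement

# aggregation for classification: majority vote over B = 4 predicted labels
preds = np.array([1, 0, 1, 1])
final = np.bincount(preds).argmax()           # label 1 wins 3 votes to 1
```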
Stacking classifiers
Level-0 models are based on different learning models and use original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called “generalizer”
If you have lots of models, you can stack them into deeper hierarchies
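A toy regression sketch of the two levels (all names mine; real stacking would fit the level-1 model on held-out level-0 predictions to avoid overfitting): level-0 consists of two one-feature least-squares models, and level-1 is a least-squares combination of their outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=100)

# level-0 models, each trained on the original (level-0) data
w_a = np.linalg.lstsq(X[:, :1], y, rcond=None)[0]
w_b = np.linalg.lstsq(X[:, 1:], y, rcond=None)[0]
Z = np.column_stack([X[:, :1] @ w_a, X[:, 1:] @ w_b])   # level-1 data

# level-1 "generalizer" combines the level-0 outputs
w = np.linalg.lstsq(Z, y, rcond=None)[0]
stacked_rss = ((y - Z @ w) ** 2).sum()
```

On the training data the combination can only do at least as well as either level-0 model, since using one model alone is a special case of the level-1 fit.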
Boosting
Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule
A family of methods which produce a sequence of classifiers
Each classifier is dependent on the previous one and focuses on the previous one’s errors
Examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.
Questions:
How to choose “hardest” examples?
How to combine these classifiers?
AdaBoost
Toy Example
Weak classifier (rule of thumb): vertical or horizontal half-planes
Uniform weights on all examples
Boosting round 1
Choose a rule of thumb (weak classifier)
Some data points obtain higher weights because they are classified incorrectly
Boosting round 2
Choose a new rule of thumb
Reweight again: the weights of incorrectly classified examples are increased
Boosting round 3
Repeat the same process
Now we have 3 classifiers
Boosting aggregate classifier
Final classifier is weighted combination of weak classifiers
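The rounds above can be sketched end to end; a minimal AdaBoost with axis-aligned decision stumps as the weak classifier (all names mine, not the lecture's notation):

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost with decision-stump weak learners; labels y in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                       # start with uniform weights
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):          # pick the stump with lowest weighted error
            for thr in X[:, j]:
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)    # misclassified examples gain weight
        w = w / w.sum()
        ensemble.append((alpha, j, thr, sign))
    def predict(Xq):                         # weighted combination of weak classifiers
        F = sum(a * s * np.where(Xq[:, j] > t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(F)
    return predict

# 1-D "interval" labels: no single stump is enough, but a few boosted rounds are
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.7), 1, -1)
clf = adaboost(X, y, T=10)
acc = (clf(X) == y).mean()
```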