Efficient Data Reduction and Summarization
Ping Li
Department of Statistical Science
Faculty of Computing and Information Science
Cornell University
December 8, 2011
Hashing Algorithms for Large-Scale Learning
References:
1. Ping Li, Anshumali Shrivastava, and Christian Konig. Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW). Technical Report, 2011.
2. Ping Li, Anshumali Shrivastava, Joshua Moore, and Christian Konig. Hashing Algorithms for Large-Scale Learning. NIPS 2011.
3. Ping Li and Christian Konig. Theory and Applications of b-Bit Minwise Hashing. Research Highlight Article in Communications of the ACM, August 2011.
4. Ping Li, Christian Konig, and Wenhao Gui. b-Bit Minwise Hashing for Estimating Three-Way Similarities. NIPS 2010.
5. Ping Li and Christian Konig. b-Bit Minwise Hashing. WWW 2010.
How Large Are the Datasets?
Conceptually, consider the data as a matrix of size n × D.
In modern applications, the number of examples n = 10^6 is common and n = 10^9 is not rare, for example in images, documents, spam, and social or search click-through data. In a sense, click-through data can have an unlimited number of examples.
Very high-dimensional (image, text, biological) data are common, e.g., D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be made arbitrarily high by taking into account pairwise, 3-way, or higher-order interactions.
Massive, High-dimensional, Sparse, and Binary Data
Binary sparse data are very common in the real world:
• For many applications (such as text), binary sparse data are very natural.
• Many datasets can be quantized/thresholded to be binary without hurting the
prediction accuracy.
• In some cases, even when the “original” data are not too sparse, they often
become sparse when considering pairwise and higher-order interactions.
An Example of Binary (0/1) Sparse Massive Data
A set S ⊆ Ω = {0, 1, ..., D − 1} can be viewed as a 0/1 vector in D dimensions.
Shingling: Each document (Web page) can be viewed as a set of w-shingles.
For example, after parsing, a sentence “today is a nice day” becomes
• w = 1: “today”, “is”, “a”, “nice”, “day”
• w = 2: “today is”, “is a”, “a nice”, “nice day”
• w = 3: “today is a”, “is a nice”, “a nice day”
Previous studies used w ≥ 5, as the single-word (uni-gram) model is not sufficient.
Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w.
(10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.
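As a concrete illustration, here is a minimal Python sketch of w-shingling. The whitespace tokenizer is an assumption made only for this sketch; production systems parse HTML first and typically hash shingles into a 2^64-bin space.

```python
# Minimal w-shingling sketch (assumes whitespace tokenization).
def shingles(text, w):
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

sentence = "today is a nice day"
print(shingles(sentence, 1))  # {'today', 'is', 'a', 'nice', 'day'}
print(shingles(sentence, 2))  # {'today is', 'is a', 'a nice', 'nice day'}
print(shingles(sentence, 3))  # {'today is a', 'is a nice', 'a nice day'}
```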
Challenges of Learning with Massive High-dimensional Data
• The data often cannot fit in memory (even when the data are sparse).
• Data loading takes too long. For example, online algorithms require less memory but often need to iterate over the data for several (or many) passes to achieve sufficient accuracy. Data loading in general dominates the cost.
• Training can be very expensive, even for simple linear models such as logistic regression and linear SVM.
• Testing may be too slow to meet the demand. This is especially important for applications in search engines or interactive visual data analytics.
• The model itself can be too large to store; for example, we cannot realistically store a vector of weights for fitting logistic regression on data of 2^64 dimensions.
Another Example of Binary Massive Sparse Data
The zipcode recognition task can be formulated as a 10-class classification
problem. The well-known MNIST8m dataset contains 8 million 28 × 28 small
images. We use linear SVM (perhaps the fastest classification algorithm).
[Figure: sample handwritten digit images from MNIST8m]
Features for linear SVM
1. Original pixel intensities: 784 (= 28 × 28) dimensions.
2. All overlapping 2 × 2 (or larger) patches (absent/present): binary in 12,448 dimensions, since (28 − 1)^2 × 2^(2×2) + 784 = 12448 (see the sketch after this list).
3. All pairwise interactions on top of the 2 × 2 patches: binary in 77,482,576 dimensions, since 12448 × (12448 − 1)/2 + 12448 = 77482576.
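Below is a sketch of feature construction 2. The exact patch encoding used in the experiments is not specified on these slides, so the binarization threshold and the indexing (each 2 × 2 binary patch mapped to one of 2^(2×2) = 16 indicators per position) are assumptions that merely reproduce the dimension count:

```python
import numpy as np

# Sketch: binary 2x2-patch features for one 28x28 image.
# The binarization threshold and patch indexing are assumptions.
def patch_features(img, threshold=0.5):
    binary = (img > threshold).astype(np.int8)
    feats = np.zeros((27, 27, 16), dtype=np.int8)  # (28-1)^2 positions x 2^(2x2) patterns
    for i in range(27):
        for j in range(27):
            patch = binary[i:i + 2, j:j + 2].reshape(4)
            pattern = int("".join(map(str, patch)), 2)  # 4-bit pattern id in 0..15
            feats[i, j, pattern] = 1                    # absent/present indicator
    # append the 784 pixels: (28-1)^2 * 2^(2x2) + 784 = 12448 dimensions
    return np.concatenate([feats.reshape(-1), binary.reshape(-1)])

x = patch_features(np.random.rand(28, 28))
print(x.shape)  # (12448,)
```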
Linear SVM experiments on the first 20,000 examples (MNIST20k)
The first 10,000 for training and the remaining 10,000 for testing.
[Figure: MNIST20k test accuracy (%) vs. regularization parameter C, comparing the original pixels, the 2 × 2 patches, and all pairwise of 2 × 2 patches]
C is the usual regularization parameter in linear SVM (the only parameter).
3 × 3: all overlapping 3 × 3 patches + all 2 × 2 patches.
4 × 4: all overlapping 4 × 4 patches + ...
[Figure: MNIST20k test accuracy (%) vs. C for the 2 × 2, 3 × 3, and 4 × 4 patch features]
Increasing the number of features does not help much, possibly because the number of examples is small.
Linear SVM experiments on the first 1,000,000 examples (MNIST1m)
The first 500,000 for training and the remaining 500,000 for testing.
[Figure: MNIST1m test accuracy (%) vs. C for the original, 2 × 2, 3 × 3, and 4 × 4 features]
All pairwise on top of 2 × 2 patches would require > 1 TB of storage on MNIST1m.
Keep in mind that these are merely tiny images to start with.
A Popular Solution Based on Normal Random Projections
Random Projections: Replace the original data matrix A by B = A × R.
R ∈ R^(D×k): a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^(n×k): the projected matrix, also random.
B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = AA^T.
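A minimal numpy sketch of this construction; the sizes are arbitrary for illustration, and the 1/√k scaling is the normalization convention that makes E(BB^T) = AA^T hold exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 100, 10000, 200

A = rng.binomial(1, 0.05, size=(n, D)).astype(float)  # sparse 0/1 data matrix
R = rng.standard_normal((D, k))                       # i.i.d. N(0,1) entries
B = (A @ R) / np.sqrt(k)                              # projected data, E(B B^T) = A A^T

# inner products are approximately preserved
print(A[0] @ A[1])   # exact inner product
print(B[0] @ B[1])   # randomized approximation
```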
Using Random Projections for Linear Learning
The task is straightforward. Since the new (projected) data B preserve the inner products of the original data A, we can simply feed B into (e.g.) SVM or logistic regression solvers. These days, linear algorithms are extremely fast (if the data can fit in memory).
However, random projections may not be good for inner products, because the variance can be much larger than that of b-bit minwise hashing.
Notation for b-Bit Minwise Hashing
A binary (0/1) vector can be equivalently viewed as a set (locations of nonzeros).
Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64).
f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.
The resemblance R is a popular measure of set similarity:
R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a)
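In code, the resemblance is a one-liner; a quick sketch:

```python
def resemblance(S1, S2):
    # R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 - a)
    a = len(S1 & S2)
    return a / (len(S1) + len(S2) - a)

print(resemblance({0, 3, 4}, {1, 2, 3}))  # 0.2, i.e., R = 1/5
```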
Minwise Hashing
Suppose a random permutation π is performed on Ω, i.e.,
π : Ω −→ Ω, where Ω = {0, 1, ..., D − 1}.
An elementary probability argument shows that
Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.
An Example
D = 5. S1 = {0, 3, 4}, S2 = {1, 2, 3}, R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.
One realization of the permutation π can be
0 =⇒ 3
1 =⇒ 2
2 =⇒ 0
3 =⇒ 4
4 =⇒ 1
π(S1) = {3, 4, 1} = {1, 3, 4}, π(S2) = {2, 0, 4} = {0, 2, 4}
In this realization, min(π(S1)) ≠ min(π(S2)).
Minwise Hashing Estimator
After k independent permutations, π1, π2, ..., πk , one can estimate R without
bias, as a binomial:
R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},   Var(R̂_M) = (1/k) R(1 − R).
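A simulation sketch of this estimator, using explicit random permutations of Ω (feasible only for small D; real systems use hash functions that approximate random permutations):

```python
import random

def minwise_estimate(S1, S2, D, k, seed=0):
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)          # one random permutation pi of Omega
        if min(perm[i] for i in S1) == min(perm[i] for i in S2):
            matches += 1           # 1{min(pi(S1)) = min(pi(S2))}
    return matches / k             # unbiased estimate of R

print(minwise_estimate({0, 3, 4}, {1, 2, 3}, D=5, k=10000))  # close to R = 0.2
```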
Problems with Minwise Hashing
• Similarity search: high storage and computational cost.
In industry, each hashed value, e.g., min(π(S1)), is usually stored using 64 bits. Often k = 200 samples (hashed values) are needed. The total storage can be prohibitive: n × 64 × 200 bits = 16 TB if n = 10^10 (web pages)
=⇒ computational problems, transmission problems, etc.
• Statistical learning: high dimensionality and model size.
We recently realized that minwise hashing can be used in statistical learning. However, 64-bit hashing leads to an impractically high-dimensional model.
b-bit minwise hashing naturally solves both problems!
The Basic Probability Result for b-Bit Minwise Hashing
Consider two sets, S1 and S2,
S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1},
f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|
Define the minimum values under π : Ω → Ω to be z1 and z2:
z1 = min(π(S1)), z2 = min(π(S2)),
and their b-bit versions:
z1^(b) = the lowest b bits of z1, z2^(b) = the lowest b bits of z2
Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1 and z1^(2) = 3.
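Extracting the lowest b bits is a single mask, as in this trivial sketch:

```python
def lowest_b_bits(z, b):
    return z & ((1 << b) - 1)  # keep only the lowest b bits

z1 = 7                         # 111 in binary
print(lowest_b_bits(z1, 1))    # 1
print(lowest_b_bits(z1, 2))    # 3
```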
Theorem
Assume D is large (which is virtually always true)
P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R
——————–
Recall that (assuming infinite precision, or as many digits as needed) we have
Pr(z1 = z2) = R
Theorem
Assume D is large (which is virtually always true)
P_b = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R
where
r1 = f1/D, r2 = f2/D, f1 = |S1|, f2 = |S2|,
C1,b = A1,b × r2/(r1 + r2) + A2,b × r1/(r1 + r2),
C2,b = A1,b × r1/(r1 + r2) + A2,b × r2/(r1 + r2),
A1,b = r1 [1 − r1]^(2^b − 1) / (1 − [1 − r1]^(2^b)),
A2,b = r2 [1 − r2]^(2^b − 1) / (1 − [1 − r2]^(2^b)).
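The theorem translates directly into code; a sketch computing P_b from (R, r1, r2, b):

```python
def collision_probability(R, r1, r2, b):
    # P_b = C1,b + (1 - C2,b) * R, valid when D is large
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1 + (1 - C2) * R

# sanity check: as b grows, C1,b and C2,b vanish and P_b approaches R
print(collision_probability(R=0.5, r1=1e-4, r2=1e-4, b=32))  # ~0.5
print(collision_probability(R=0.5, r1=1e-4, r2=1e-4, b=1))   # ~0.75
```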
The Variance-Space Trade-off
Smaller b =⇒ smaller space for each sample. However,
smaller b =⇒ larger variance.
B(b; R, r1, r2) characterizes the variance-space trade-off:
B(b; R, r1, r2) = b × Var(R̂_b)
Lower B(b) is better.
The ratio B(b1; R, r1, r2) / B(b2; R, r1, r2) measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1 (e.g., b1 = 64).
B(64)/B(b) = 20 means that, to achieve the same accuracy (variance), using b = 64 bits per hashed value requires 20 times more space (in bits) than using b = 1.
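A sketch of this ratio in code. Var(R̂_b) is not spelled out on these slides; the sketch assumes the unbiased estimator R̂_b = (P̂_b − C1,b)/(1 − C2,b) from the WWW 2010 paper, whose variance is P_b(1 − P_b) / (k (1 − C2,b)^2):

```python
def var_times_k(R, r1, r2, b):
    # k * Var(R_hat_b), assuming R_hat_b = (P_hat_b - C1) / (1 - C2),
    # so that Var(R_hat_b) = P_b (1 - P_b) / (k (1 - C2)^2)
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    Pb = C1 + (1 - C2) * R
    return Pb * (1 - Pb) / (1 - C2) ** 2

def B(b, R, r1, r2):
    return b * var_times_k(R, r1, r2, b)  # B(b) = b * Var(R_hat_b)

# improvement of b = 1 over b = 64 for very sparse sets at R = 0.5
print(B(64, 0.5, 1e-10, 1e-10) / B(1, 0.5, 1e-10, 1e-10))  # ~21.3
```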
B(64)/B(b): the relative improvement of using b = 1, 2, 3, 4 bits, compared to 64 bits.
[Figure: improvement B(64)/B(b) vs. resemblance R, in four panels with r1 = r2 = 10^-10, 0.1, 0.5, and 0.9; curves for b = 1, 2, 3, 4]
Relative Improvements in the Least Favorable Situation
In the least favorable condition, the improvement is
B(64)/B(1) = 64 R / (R + 1)
If R = 0.5 (the commonly used similarity threshold), then the improvement is 64/3 ≈ 21.3-fold, which is large enough for industry.
——————
It turns out that, in terms of dimension reduction, the improvement is like a 2^64 / 2^8-fold reduction.
Experiments: Duplicate detection on Microsoft news articles
The dataset was crawled as part of the BLEWS project at Microsoft. We
computed pairwise resemblances for all documents and retrieved document
pairs with resemblance R larger than a threshold R0.
[Figure: precision vs. sample size k (0 to 500) for thresholds R0 = 0.3 and R0 = 0.4; curves for b = 1, 2, 4 and 64-bit minwise hashing (M)]
Using b = 1 or 2 is almost as good as b = 64, especially for higher thresholds R0.
[Figure: precision vs. sample size k for thresholds R0 = 0.5, 0.6, 0.7, and 0.8; curves for b = 1, 2, 4 and 64-bit minwise hashing (M)]
b-Bit Minwise Hashing for Large-Scale Linear Learning
Linear learning algorithms require the estimators to be inner products.
If we look carefully, the estimator of minwise hashing is indeed an inner product of
two (extremely sparse) vectors in D × k dimensions (infeasible when D = 2^64):
R̂_M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))}
because, with z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1},
1{z1 = z2} = Σ_{i=0}^{D−1} 1{z1 = i} × 1{z2 = i}
For example, if D = 5, z1 = 2, z2 = 3, then
1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 0, 1, 0].
Integrating b-Bit Minwise Hashing for (Linear) Learning
1. We apply k independent random permutations on each (binary) feature vector xi and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely nbk bits.
2. At run-time, we expand each new data point into a 2^b × k-length vector, i.e., we concatenate the k vectors (each of length 2^b). The new feature vector has exactly k 1's.
Consider an example with k = 3 and b = 2. For set (vector) S1:
Original hashed values (k = 3): 12013 25964 20191
Original binary representations: 010111011101101 110010101101100 100111011011111
Lowest b = 2 binary digits: 01 00 11
Corresponding decimal values: 1 0 3
Expanded 2^b = 4 binary digits: 0010 0001 1000
New feature vector fed to a solver: [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
Then we apply the same permutations and procedures on sets S2, S3, ...
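A sketch reproducing the expansion above; the one-hot position within each 2^b-length block follows the orientation of the example (the bit value is counted from the right end of the block), which is a convention, not a requirement:

```python
def expand(hashed_values, b):
    # k b-bit hashed values -> one sparse 0/1 vector of length 2^b * k
    block = 2 ** b
    vec = [0] * (block * len(hashed_values))
    for j, z in enumerate(hashed_values):
        low = z & ((1 << b) - 1)                # lowest b bits of hashed value
        vec[block * j + (block - 1 - low)] = 1  # one-hot within block j
    return vec

print(expand([12013, 25964, 20191], b=2))
# [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]  -- matches the example above
```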
Datasets and Solver for Linear Learning
This talk only presents two text datasets. Experiments on image data (1TB or
larger) will be presented in the future.
Dataset # Examples (n) # Dims (D) Avg # Nonzeros Train / Test
Webspam (24 GB) 350,000 16,609,143 3728 80% / 20%
Rcv1 (200 GB) 781,265 1,010,017,424 12062 50% / 50%
We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic and is independent of the underlying optimization procedure.
All experiments were conducted on workstations with Xeon(R) CPUs and 48 GB of RAM, under the Windows 7 system.
Experimental Results on Linear SVM and Webspam
• We conducted extensive experiments for a wide range of regularization parameter values C (from 10^-3 to 10^2), with fine spacing in [0.1, 10].
• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.
Testing Accuracy
[Figure: Spam (webspam) test accuracy (%) vs. C, linear SVM with k = 200; solid curves for b = 1, 2, 4, 6, 8, 10, 16, dashed (red) curve for the original data]
• Solid: b-bit hashing. Dashed (red): the original data.
• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracy as using the original data.
• The results are averaged over 50 runs.
[Figure: Spam test accuracy (%) vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1, 2, 4, 6, 8, 10, 16]
Stability of Testing Accuracy (Standard Deviation)
Our method produces very stable results, especially for b ≥ 4.
[Figure: standard deviation (%) of Spam test accuracy vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1 through 16]
Training Time
[Figure: Spam training time (sec) vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1 through 16]
• These timings do not include data loading time (which is small for b-bit hashing).
• The original training time is about 100 seconds.
• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).
Testing Time
[Figure: Spam testing time (sec) vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500]
Experimental Results on L2-Regularized Logistic Regression
Testing Accuracy
[Figure: Spam test accuracy (%) vs. C, L2-regularized logistic regression, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1 through 16]
Training Time
[Figure: Spam training time (sec) vs. C, L2-regularized logistic regression, panels for k = 50, 100, 150, 200, 300, 500]
Comparisons with Other Algorithms
We conducted extensive experiments with the VW algorithm, which has the same variance as random projections. Here Vowpal Wabbit (VW) refers to the algorithm in (Weinberger et al., ICML 2009).
Consider two sets S1, S2 with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and
R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a).
Then, comparing their variances,
Var(â_VW) ≈ (1/k) (f1 f2 + a^2),   Var(R̂_MINWISE) = (1/k) R(1 − R),
we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).
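Plugging in representative (hypothetical) numbers makes the gap concrete. Since â_VW estimates a while R̂_MINWISE estimates R, the sketch converts VW's estimate to the R scale by dividing by |S1 ∪ S2|, treating f1 and f2 as known; that conversion is an assumption made only for this illustration:

```python
# Hypothetical sparse sets with small overlap
f1, f2, a = 4000, 4000, 200        # |S1|, |S2|, |S1 ∩ S2|
k = 200
union = f1 + f2 - a
R = a / union

var_vw = (f1 * f2 + a ** 2) / k    # Var(a_VW), approximately
var_minwise = R * (1 - R) / k      # Var(R_MINWISE)

print(var_vw / union ** 2)  # ~1.3e-3 on the R scale
print(var_minwise)          # ~1.2e-4: roughly 10x smaller
```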
Comparing b-Bit Minwise Hashing with VW
8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about
the same test accuracy as VW with k = 10^4 ∼ 10^6.
[Figure: Spam test accuracy (%) vs. k (10^1 to 10^6), VW vs. b = 8 hashing, for linear SVM and logistic regression, at C = 0.01 to 100]
8-bit hashing is substantially faster than VW (to achieve the same accuracy).
[Figure: Spam training time (sec) vs. k, VW vs. b = 8 hashing, for linear SVM and logistic regression, at C = 0.01 to 100]
Experimental Results on Rcv1 (200GB)
Test accuracy using linear SVM (the original data cannot be trained directly).
[Figure: rcv1 test accuracy (%) vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1, 2, 4, 8, 12, 16]
Training time using linear SVM
[Figure: rcv1 training time (sec) vs. C, linear SVM, panels for k = 50, 100, 150, 200, 300, 500]
Test accuracy using logistic regression
[Figure: rcv1 test accuracy (%) vs. C, logistic regression, panels for k = 50, 100, 150, 200, 300, 500; curves for b = 1, 2, 4, 8, 12, 16]
Training time using logistic regression
[Figure: rcv1 training time (sec) vs. C, logistic regression, panels for k = 50, 100, 150, 200, 300, 500]
Comparisons with VW on Rcv1
Test accuracy using linear SVM
[Figure: rcv1 test accuracy (%) vs. k, VW vs. b-bit hashing, linear SVM, panels for b = 1, 2, 4, 8, 12, 16, at C = 0.01 to 10]
Test accuracy using logistic regression
[Figure: rcv1 test accuracy (%) vs. k, VW vs. b-bit hashing, logistic regression, panels for b = 1, 2, 4, 8, 12, 16]
Training time
[Figure: rcv1 training time (sec) vs. k, VW vs. 8-bit hashing, for linear SVM and logistic regression, at C = 0.01 to 10]
Conclusion
• Minwise hashing is the standard technique in the context of search.
• b-bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).
• Our work is the first proposal to use minwise hashing in linear learning. However, with 64-bit hashed values the resultant data dimension may be too large to be practical.
• b-bit minwise hashing provides a very simple solution, by dramatically reducing the dimensionality from (e.g.) 2^64 × k to 2^b × k.
• Thus, b-bit minwise hashing can be easily integrated with linear learning to solve extremely large problems, especially when the data do not fit in memory.
• b-bit minwise hashing has substantial advantages (in many aspects) over other methods such as random projections and VW.
Questions?