Exact data mining from in-exact data
Nick Freris
Qualcomm, San Diego
October 10, 2013
Exact data mining on in-exact data
Introduction
Information retrieval is a large industry: biology, finance, engineering, marketing, vision/graphics, etc.
...but data are hardly ever maintained in their original form.
[Figure: original vs. quantized data; sources of inexactness: compression, rights protection (watermarking)]
Preservation of the mining outcome
This talk: provable guarantees for distance-based mining
– K-means preservation
– Optimal distance estimation
Optimal distance estimation
This work presents optimal estimation of the Euclidean distance in the compressed domain. The result cannot be further improved.
Our approach is applicable to any orthonormal data compression basis (Fourier, wavelets, Chebyshev, PCA, etc.).
Our method allows up to:
– 57% better distance estimation
– 80% less computation effort
– 128:1 compression efficiency
Motivation / benefits
Time-series data are customarily compressed in order to:
– Save storage space
– Reduce transmission bandwidth
– Achieve faster processing / data analysis
– Remove noise
Distance estimation has various data analytics applications:
– Clustering / classification
– Anomaly detection
– Similarity search
Now we can do all this very efficiently, directly on the compressed data!
Previous work vs. current approach
Approaches compared: Approach 1; Approach 2 (only one sequence compressed); our approach
Representations compared: first coeffs + error; best coeffs + error; best coeffs + error + constraints; optimal (our method)
Applications: similarity search / grouping / clustering
Find semantically similar searches, deduce associations
Query: Supply Chain
Results with similar query patterns:
– integrated supply chain
– Smarter supply chain
– Business consultant supply chain management
– dynamic inventory optimization
– Five key supply chain areas
– IBM’s dynamic inventory
– Chief supply chain officer
– Retail Supply Management
Applications: advertising / recommendations
Buying keywords that are cheaper but exhibit the same demand pattern
[Figure: monthly requests (Jan to Dec) for the queries "business analytics" and "business services"]
Applications: burst detection
Discovery of important events
[Figure: monthly requests (Jan to Dec) for the query "Acquisition", with bursts annotated as follows]
– Feb ‘10: Intelliden Inc. Acquisition
– May ‘10: Sterling Commerce Acquisition (1.4B)
Underlying operation: k-NN search
K-Nearest Neighbor (k-NN) similarity search
Issues that arise:
– How to compress the data?
– How to speed up the search?
We have to estimate tight bounds on the distance metric using just the compressed representation.
Similarity search problem
Objective: compare the query with all sequences in the DB and return the k sequences most similar to the query.
Linear scan:
[Figure: the query compared against every sequence in the DB, with distances D = 7.3, 10.2, 11.8, 17 and 22]
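The linear scan described on this slide can be sketched as follows. This is a minimal illustration, assuming sequences are plain Python lists of equal length; the function names `euclidean` and `knn_linear_scan` are hypothetical and not taken from the talk.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_linear_scan(query, database, k):
    """Compare the query with every sequence and return the k closest.

    Returns a list of (distance, index) pairs sorted by distance.
    """
    scored = [(euclidean(query, seq), i) for i, seq in enumerate(database)]
    return sorted(scored)[:k]

# Tiny usage example with three 2-point sequences.
db = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
nearest = knn_linear_scan([0.0, 0.0], db, k=2)
```

Every query touches every stored sequence, which is the cost that the bound-based pruning in the rest of the talk aims to reduce.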
Speeding up similarity search
1. Simplify (compress) the original DB of keywords and the query.
2. Use upper / lower bounds on the distance between the simplified query and the simplified DB to obtain a candidate superset.
3. Verify the candidate superset against the original DB to obtain the final answer set.
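A filter-and-refine search driven by lower and upper bounds might look like the sketch below. It assumes the bounds have already been computed from the compressed data and are passed in as `(lower, upper)` pairs; the function name `knn_filter_refine` and this interface are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_filter_refine(query, originals, bounds, k):
    """k-NN search that prunes with distance bounds before verifying.

    bounds[i] is a (lower, upper) pair that brackets the true distance
    between the query and originals[i].
    """
    # The true k-th nearest distance cannot exceed the k-th smallest upper bound.
    threshold = sorted(upper for _, upper in bounds)[k - 1]
    # Anything whose lower bound is above the threshold cannot be in the answer.
    candidates = [i for i, (lower, _) in enumerate(bounds) if lower <= threshold]
    # Verify the surviving candidates against the original (uncompressed) data.
    verified = sorted((euclidean(query, originals[i]), i) for i in candidates)
    return verified[:k], len(candidates)

originals = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [10.0, 0.0]]
bounds = [(0.0, 0.5), (4.0, 6.0), (0.5, 1.5), (9.0, 11.0)]
answer, n_candidates = knn_filter_refine([0.0, 0.0], originals, bounds, k=2)
```

The tighter the bounds, the smaller the candidate superset, so fewer original sequences have to be fetched and compared.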
Compressing weblog data
Use the Euclidean distance to match time-series. But how should we compress the data?
[Figure: one-year request time-series for the queries “analytics and optimization” and “consulting services”]
The data are highly periodic, so we can use Fourier decomposition.
Instead of using the first Fourier coefficients, we can use the best ones.
Discrete Fourier Transform (DFT): decomposition of a time-series into sinusoids
[Figure: a sequence x(n) and its complex DFT coefficients X(f)]
“Every signal can be represented as a superposition of sines and cosines”
Jean Baptiste Fourier (1768-1830)
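As a small illustration of the decomposition, the sketch below computes an orthonormal DFT directly from its definition. The O(n^2) loop and the helper names `dft` and `idft` are choices made here for clarity; a real system would use an FFT library.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT: X[f] = (1/sqrt(n)) * sum_t x[t] * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def idft(X):
    """Inverse of dft(): rebuild the sequence from its coefficients."""
    n = len(X)
    return [
        sum(X[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / math.sqrt(n)
        for t in range(n)
    ]

x = [1.0, 2.0, 3.0, 4.0]
X = dft(x)
# With the orthonormal scaling the energy is the same in both domains (Parseval),
# which is why Euclidean distances can be computed on the coefficients.
energy_time = sum(v * v for v in x)
energy_freq = sum(abs(c) ** 2 for c in X)
```

For a real-valued sequence the coefficients come in conjugate pairs, X[f] and X[n - f], which is the symmetry visible in the coefficient listings of the slides.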
Spectrum
Keep just some frequencies of the periodogram
[Figure: sequence x(n), its DFT X(f) and its periodogram P(f) plotted against frequency f; the marked peaks correspond to the 7-day and 30-day periods; the final panel shows the reconstruction from the retained frequencies]
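The periodogram values listed on the slide are consistent with P(f) = |X(f)|^2. A minimal sketch of computing it and reading off the dominant period is given below; the helper names and the synthetic weekly signal are assumptions used only for illustration.

```python
import cmath
import math

def periodogram(x):
    """P(f) = |X(f)|^2, where X is the orthonormal DFT of x."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))) ** 2 / n
        for f in range(n)
    ]

def dominant_period(x):
    """Period (in samples) of the strongest non-constant frequency."""
    n = len(x)
    P = periodogram(x)
    # Only frequencies 1..n/2 are inspected; the rest mirror them for real data.
    f_best = max(range(1, n // 2 + 1), key=lambda f: P[f])
    return n / f_best

# Four weeks of a purely weekly pattern.
weekly = [math.cos(2 * math.pi * t / 7) for t in range(28)]
period = dominant_period(weekly)
```

On real weblog data the same procedure would surface several peaks, such as the weekly and monthly ones highlighted in the figure.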
Similarity search
Approximate the Euclidean distance using the DFT
[Figure: a sequence x(n) and its DFT X(f)]
Distance: 11.5517
Estimate from the first 5 coefficients + symmetric ones: 7.9234
Estimate from the best 5 coefficients + symmetric ones: 11.1624
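The difference between keeping the first and the best coefficients can be reproduced on a toy signal. The sketch below is an illustration under assumptions: "best" is taken to mean the largest-magnitude coefficients of one sequence, and the helper names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT computed from the definition (O(n^2), for illustration)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def partial_distance(X, Q, kept):
    """Distance computed only on the kept coefficient positions.

    By Parseval this never exceeds the true Euclidean distance.
    """
    return math.sqrt(sum(abs(X[i] - Q[i]) ** 2 for i in kept))

def first_k(n, k):
    """The k lowest frequencies plus their conjugate-symmetric mirrors."""
    return sorted({f for f in range(k)} | {(n - f) % n for f in range(k)})

def best_k(X, k):
    """The k largest-magnitude frequencies of X plus their mirrors."""
    n = len(X)
    half = sorted(range(n // 2 + 1), key=lambda f: -abs(X[f]))[:k]
    return sorted(set(half) | {(n - f) % n for f in half})

n = 8
x = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]  # energy sits at f = 3
q = [0.0] * n
X, Q = dft(x), dft(q)
true_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, q)))
estimate_first = partial_distance(X, Q, first_k(n, 2))
estimate_best = partial_distance(X, Q, best_k(X, 1))
```

Here the low frequencies carry no energy, so the first-coefficient estimate is close to zero while the best-coefficient estimate recovers the full distance.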
Objective
Calculate the tightest possible upper/lower bounds using the coefficients with the highest energy.
This will result in better pruning of the search space, and hence faster search.
[Figure: sequences x and q, their DFTs X and Q, and the magnitude vectors ||X|| and ||Q||]
1. Find the best k coefficients of X (k = 4).
2. Identify the smallest retained magnitude (power): minPower = 6.9035 in the example.
3. The remaining powers are less than minPower.
4. We also keep the sum of squares of the remaining powers ($e^2$).
5. Calculate the distance from the k known coefficients.
6. Optimize the distance from the remaining coefficients.
Optimization / generalization
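The compressed representation built up in these steps can be sketched as below. The function name `compress` and the returned tuple layout are assumptions; the three stored items (kept coefficients, minPower, and the energy of the discarded coefficients) follow the slides.

```python
def compress(X, k):
    """Keep the k largest-magnitude coefficients of X plus two scalars.

    Returns (kept, min_power, e2) where
      kept      maps position -> coefficient for the k retained positions,
      min_power is the smallest retained magnitude, so every discarded
                coefficient has magnitude at most min_power,
      e2        is the sum of squared magnitudes of the discarded coefficients.
    """
    order = sorted(range(len(X)), key=lambda i: -abs(X[i]))
    kept = {i: X[i] for i in sorted(order[:k])}
    min_power = min(abs(c) for c in kept.values())
    e2 = sum(abs(X[i]) ** 2 for i in order[k:])
    return kept, min_power, e2

# Toy spectrum with one dominant conjugate pair.
X = [0.0, 3 + 4j, 1.0, 2.0, 1.0, 3 - 4j]
kept, min_power, e2 = compress(X, k=2)
```

The two extra scalars cost almost nothing to store, yet they constrain what the discarded coefficients can be, which is what makes tight bounds possible.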
Water-filling
[Figure: two-dimensional illustration with the known vector Q, the unknown components d1 and d2 of X, and the direction X = -αQ]
Boundary condition on total energy: $d_1^2 + d_2^2 < e^2$
Boundary condition on each dimension: $d_1 < \text{minPower}$, $d_2 < \text{minPower}$
We move along X = -αQ until we hit one of our constraints.
We then start moving along the other dimension until we use all our energy.
In N-dimensional space
[Figure: animation of the same process over the coefficients of X and Q, distributing the available energy with each coefficient capped at minPower]
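For the case where only X is compressed and Q is fully known, the water-filling idea can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: it assumes coefficients in an orthonormal basis, a compressed form consisting of the kept coefficients, minPower and the discarded energy e2, and hypothetical function names.

```python
import math

def max_inner_product(b, energy, cap):
    """Maximize sum(a_i * b_i) subject to sum(a_i^2) <= energy, 0 <= a_i <= cap.

    Water-filling solution: a_i = min(cap, lam * b_i), so the positions with
    the largest b_i reach the cap first.
    """
    b = sorted((abs(v) for v in b), reverse=True)
    n = len(b)
    suffix = [0.0] * (n + 1)           # suffix[m] = sum of b_i^2 for i >= m
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + b[i] * b[i]
    capped = 0.0                       # sum of the b_i already pinned at the cap
    for m in range(n):
        if suffix[m] == 0.0:           # only zero entries remain
            break
        left = max(energy - m * cap * cap, 0.0)
        lam = math.sqrt(left / suffix[m])
        if lam * b[m] <= cap:          # the rest fits under the cap
            return cap * capped + lam * suffix[m]
        capped += b[m]
    return cap * capped

def distance_bounds(kept, min_power, e2, Q):
    """Lower and upper bounds on ||X - Q|| from the compressed form of X."""
    known = sum(abs(c - Q[i]) ** 2 for i, c in kept.items())
    b = [abs(Q[i]) for i in range(len(Q)) if i not in kept]
    cross = max_inner_product(b, e2, min_power)
    base = known + e2 + sum(v * v for v in b)
    lower = math.sqrt(max(base - 2 * cross, 0.0))
    upper = math.sqrt(base + 2 * cross)
    return lower, upper

# X = [0, 3+4j, 1, 2, 1, 3-4j] compressed to its two largest coefficients.
kept, min_power, e2 = {1: 3 + 4j, 5: 3 - 4j}, 5.0, 6.0
Q = [1.0, 1.0, 0.0, 2.0, 0.0, 1.0]
lower, upper = distance_bounds(kept, min_power, e2, Q)
```

The lower bound aligns the unknown coefficients of X with Q as far as the constraints allow, and the upper bound opposes them, which matches the X = -αQ direction in the figure.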
Single vs. double water-filling
When both sequences are compressed, the optimal distance estimation can be solved using a double water-filling process.
[Figure: single water-filling and double water-filling]
The bounds are optimally tight
Theorem (Freris et al. 2012): The computation of lower and upper bounds given the aforementioned compressed representations can be solved exactly using double water-filling. The lower and upper bounds are optimally tight; no tighter bounds can be provided.
Sketch of proof:
– convex re-parametrization
– KKT conditions
– properties of optimal solutions
Double water-filling algorithm
Intuition: water-fill for the discarded coefficients of the two vectors separately, using the appropriate energy allocation.
Zero-overhead algorithm
Complexity: Θ(n log n)
Experiments
Unica database: IBM web traffic for the year 2010
BUSINESS DYNAMICS IBM EINSURANCE CUSTOMER EXPERIENCE.
BUSINESS CONSULTING
YIN YANG OF FINANCIAL DISRUPTION
AMERICA MEDIA PLAYER INDUSTRY
IBM GLOBAL BUSINESS ANDREW STEVENS
GLENN FINCH IBM STRATEGIE ENTREPRISE RENTABILIT
Marketing/AdWords recommendation
– Analysis/storage of weblog queries (1 TB of data per month)
– GBS: scheduling advertising campaigns / pricing
Our analytic solution is 300x faster than convex optimizers
Our algorithm converges very fast: in 2-3 iterations, irrespective of the compression rate
The derived LB/UB are 20% tighter than the state of the art
When searching in a DB, this results in up to 80% data pruning
The previous (10-20%) improvement in distance estimation can significantly reduce the search space when searching for k-NN results.
We retrieve 20%-80% fewer sequences than other approaches.
Conclusions
Increasing data sizes are a perennial problem for data analysis.
It is important to support mining directly on the compressed data.
We have shown an exact algorithm for obtaining optimally tight Euclidean bounds on compressed representations.
Mining directly on compressed data with provable guarantees.
Many generalizations:
– Cosine similarity (text documents): $\cos(x,y) = 1 - L_2(x,y)^2/2$ (for unit-norm x, y)
– Correlation (financial analysis): $\mathrm{corr}(x,y) = 1 - L_2(x,y)^2/2$ (for normalized signals x, y)
– Dynamic Time Warping (flexible similarity metric)
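Because the cosine identity is monotone in the distance, bounds on the Euclidean distance translate directly into bounds on cosine similarity. A short sketch, assuming unit-norm vectors and using the hypothetical helper names `cosine_bounds` and `normalize`:

```python
import math

def cosine_bounds(lower, upper):
    """Turn bounds on the L2 distance of unit-norm vectors into cosine bounds.

    cos(x, y) = 1 - L2(x, y)^2 / 2 decreases as the distance grows, so the
    upper distance bound gives the lower cosine bound and vice versa.
    """
    return 1 - upper ** 2 / 2, 1 - lower ** 2 / 2

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

x = normalize([3.0, 4.0])
y = normalize([4.0, 3.0])
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
cos_xy = sum(a * b for a, b in zip(x, y))
```

The same mapping applies to correlation when the signals are normalized as stated on the slide.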
Compression with provable K-means preservation
Scope of this work: multi-bit compression and invertible transformation of high-dimensional data with provable K-means preservation
[Figure: K-means on the original data and on the quantized (compressed) data yields identical clustering results (cluster 1, cluster 2, cluster 3)]
Applications
– Reduced storage requirements
– Reduced processing time
– Data hiding / anonymization: original values are hidden via an invertible cluster contraction (encoder-decoder scheme)
– Shape preservation: high-quality data reconstruction extends applicability to other mining operations such as nearest-neighbor search, data visualization, etc.
– Tunable compression level based on the number of allocated bits, in order to achieve a good storage/quality trade-off
Overview
1. Pre-clustering
2. Quantization per cluster and dimension (one or multiple bits)
3. Data contraction transformation
4. Post-quantization clustering
5. Decoding using key (contraction parameter)
Our method effectively compresses the data
1. MMSE quantization per cluster and dimension (one or multiple bits)
Preserves the mean and reduces variance and dynamic range
Example: 1-bit MMSE quantization
Quantizer:
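The quantizer formula itself is not recoverable from this text, so the sketch below only illustrates the stated properties (mean preserved, variance reduced) with a simple 1-bit scheme. It assumes a threshold at the mean and reconstruction levels equal to the mean of each side, which is a simplification chosen here and not necessarily the quantizer used in the talk.

```python
def one_bit_quantize(values):
    """Encode each value with one bit: below-or-equal vs. above the mean.

    The two reconstruction levels are the means of the two groups, which is
    the minimum-MSE choice once the threshold is fixed.
    """
    mean = sum(values) / len(values)
    low = [v for v in values if v <= mean]
    high = [v for v in values if v > mean]
    low_level = sum(low) / len(low)
    high_level = sum(high) / len(high) if high else low_level
    bits = [1 if v > mean else 0 for v in values]
    return bits, (low_level, high_level)

def decode(bits, levels):
    """Map each bit back to its reconstruction level."""
    return [levels[b] for b in bits]

def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Values of one cluster along one dimension.
points = [1.0, 2.0, 3.0, 10.0]
bits, levels = one_bit_quantize(points)
quantized = decode(bits, levels)
```

Replacing each group by its own mean leaves the overall mean untouched, so the cluster centroid is unchanged along that dimension.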
How does our scheme apply to high-dimensional data?
These are the points for one dimension and for one cluster of objects.
The process is repeated for all dimensions and for all clusters. We have one quantizer per class.
[Figure: points along dimension d (or time instance d)]
We can also support data hiding with our scheme
2. Data contraction transformation
– Invertible
– Preserves centroids and reduces variance and dynamic range
– Better cluster separation
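One transformation with exactly these properties is a uniform shrink of each cluster toward its centroid by a factor a. The sketch below assumes that form, y = c + a(x - c) with 0 < a <= 1, purely for illustration; the exact transformation used in the talk is not given in this text, and the function names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def contract(points, a):
    """Shrink a cluster toward its centroid by factor a (0 < a <= 1)."""
    c = centroid(points)
    return [[ci + a * (x - ci) for x, ci in zip(p, c)] for p in points]

def expand(points, a):
    """Invert contract(): the factor a acts as the decoding key."""
    c = centroid(points)  # the centroid is unchanged by the contraction
    return [[ci + (x - ci) / a for x, ci in zip(p, c)] for p in points]

cluster = [[0.0, 0.0], [2.0, 0.0], [4.0, 6.0]]
hidden = contract(cluster, a=0.5)
restored = expand(hidden, a=0.5)
```

A smaller a pulls the points closer together, which reduces variance and dynamic range and leaves more room between clusters.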
Results
Theoretical guarantee on cluster preservation for sufficiently contracted clusters (sufficiently small a)
Analytical calculation of the optimal contraction coefficient (a): K(K-1) calculations, where K is the number of clusters
Lower distortion (MSE) than MPQ, which decreases with the number of allocated bits
Multi-bit allocation (per cluster):
– Generalization of MMSE quantizers for multiple bits
– Efficient algorithm for optimal allocation given storage constraints
We preserve 100% of the clustering structure
Stock values: 2100 stocks during a period of three years, clustered in 3, 5 and 8 clusters
Cluster preservation was always 100%
Shape preservation
Multi-bit quantization improves quality by up to 96%
MSE due to quantization decreases with the number of bits and the number of clusters:
– 37%, 70% and 96% over MPQ using MMSE with 1, 2, 4 bits
– 76%, 84% using 5, 8 clusters over 3 clusters
Compress > 8x without loss in clustering accuracy
Compression efficiency: 1.8x – 8x reduction
Conclusions
Efficient K-means preserving compression of high-dimensional data
– Data hiding (encoder-decoder data transformation)
– Tunable compression level (storage / distortion trade-off)
Theoretical guarantees
– Our scheme is the first one that always guarantees 100% cluster preservation
– Efficient bit allocation algorithm for optimal distortion given storage constraints
Experiments
– 100% cluster preservation
– Excellent shape preservation even for 1-bit quantizers
– High compression efficiency
Extensions
– Our approach works with any quantization scheme
– Example: Minimum Absolute Error (MAE) quantization
Thank you
Questions…