A Nonlinear Approach to Dimension Reduction

Lee-Ad Gottlieb

Weizmann Institute of Science

Joint work with Robert Krauthgamer

Data As High-Dimensional Vectors

Data is often represented by vectors in R^m:
  For images: color or intensity
  For documents: word frequency

A typical goal is Nearest Neighbor Search: preprocess the data so that, given a query vector, the closest vector in the data set can be found quickly.

Nearest neighbor search is common in various data analysis tasks: classification, learning, clustering.
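
For reference, nearest neighbor search without any preprocessing can be sketched as a linear scan. This is only a baseline illustration, not a method from the talk; the function name `nearest_neighbor` and the sample data are assumptions made for the example.

```python
import math

def nearest_neighbor(data, query):
    """Return the index of the vector in `data` closest to `query` (Euclidean distance)."""
    best_index, best_dist = -1, math.inf
    for i, point in enumerate(data):
        d = math.dist(point, query)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

# Hypothetical data set of three vectors in R^2.
data = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
print(nearest_neighbor(data, (2.5, 3.5)))
```

Each query costs time proportional to (number of vectors) x (dimension m), which is the cost that preprocessing and dimension reduction aim to lower.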


Curse of Dimensionality

The cost of many useful operations is exponential in the dimension.
  First noted by Bellman (Bel-61) in the context of PDFs
  Nearest Neighbor Search (Cla-94)

Dimension reduction: represent high-dimensional data in a low-dimensional space.

Specifically: map the given vectors into a low-dimensional space, while preserving most of the data’s “structure”.

This trades accuracy for computational efficiency.


The JL Lemma

Theorem (Johnson-Lindenstrauss, 1984): For every n-point Euclidean set X of dimension d, there is a linear map X → Y (Y Euclidean) with
  Interpoint distortion: 1±ε
  Dimension of Y: k = O(ε^{-2} log n)

The map can be realized by a trivial linear transformation: multiply the d x n point matrix by a k x d matrix of random entries from {-1,0,1} [Ach-01].
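
The random-matrix realization can be sketched as follows. The slide only states that the entries come from {-1,0,1}. The probabilities (1/6, 2/3, 1/6) and the scaling factor sqrt(3/k) are assumptions of this sketch, chosen so that squared lengths are preserved in expectation. The names `achlioptas_matrix` and `project` are hypothetical.

```python
import math
import random

def achlioptas_matrix(k, d, rng):
    """k x d matrix with entries sqrt(3/k) * {+1, 0, -1}, drawn with probabilities 1/6, 2/3, 1/6."""
    scale = math.sqrt(3.0 / k)
    # Six equally likely outcomes: one +1, four 0, one -1.
    return [[scale * rng.choice((1, 0, 0, 0, 0, -1)) for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Apply the linear map: multiply the k x d matrix by the d-dimensional vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

# Small demonstration on random points (sizes are arbitrary choices).
rng = random.Random(0)
d, k, n = 300, 200, 10
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
R = achlioptas_matrix(k, d, rng)
images = [project(R, p) for p in points]
ratios = [math.dist(images[i], images[j]) / math.dist(points[i], points[j])
          for i in range(n) for j in range(i + 1, n)]
print("distance ratios range from", min(ratios), "to", max(ratios))
```

The printed ratios show how far interpoint distances moved from 1 for this particular random draw; the lemma's guarantee needs k on the order of ε^{-2} log n.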

A near-matching lower bound was given by [Alon-03].

Applications in a host of problems in computational geometry

But can we do better?


Doubling Dimension

Definition: Ball B(x,r) = all points within distance r from x.

The doubling constant λ (of a metric M) is the minimum value λ such that every ball can be covered by λ balls of half the radius.
  First used by [Ass-83], algorithmically by [Cla-97].
  The doubling dimension is dim(M) = log_2 λ(M) [GKL-03].

Applications:
  Approximate nearest neighbor search [KL-04,CG-06]
  Distance oracles [HM-06]
  Spanners [GR-08a,GR-08b]
  Embeddings [ABN-08,BRS-07]

(Figure: a ball covered by balls of half the radius. Here λ ≤ 7.)
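
The covering condition in the definition can be explored on a finite point set with a greedy cover. This is a rough sketch under assumptions: half-radius balls are centered at data points only, the greedy cover is not necessarily minimum, and only the listed radii are tried, so the result is a heuristic estimate rather than the exact doubling constant. All function names are hypothetical.

```python
import math

def ball(points, center, radius):
    """All points within distance `radius` of `center`."""
    return [p for p in points if math.dist(p, center) <= radius]

def greedy_half_cover(points, center, radius):
    """Greedily cover B(center, radius) with balls of radius/2 centered at data points."""
    uncovered = ball(points, center, radius)
    centers = []
    while uncovered:
        c = uncovered[0]                       # pick any uncovered point as a new center
        centers.append(c)
        uncovered = [p for p in uncovered if math.dist(p, c) > radius / 2]
    return centers

def estimate_doubling_constant(points, radii):
    """Largest greedy cover size over all data-point centers and the given radii."""
    return max(len(greedy_half_cover(points, x, r)) for x in points for r in radii)

# Nine evenly spaced points on a line.
line = [(float(i),) for i in range(9)]
centers = greedy_half_cover(line, (4.0,), 4.0)
print(centers)
print(estimate_doubling_constant(line, [1.0, 2.0, 4.0, 8.0]))
```

Taking log_2 of such an estimate gives a rough value for the doubling dimension of the point set.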


The JL Lemma

Theorem (Johnson-Lindenstrauss, 1984): For every n-point Euclidean set X of dimension d, there is a linear map X → Y with
  Interpoint distortion: 1±ε
  Dimension of Y: O(ε^{-2} log n)

An almost matching lower bound was given by [Alon-03].
  This lower bound considered n roughly equidistant points.
  So it had dim(X) = log n (a ball containing all n points needs about n balls of half the radius to cover it).
  So in fact the lower bound is Ω(ε^{-2} dim(X)).


A stronger version of JL?

Open questions:
  Can the JL log n lower bound be strengthened to apply to spaces with low doubling dimension (dim(X) << log n)?
  Does there exist a JL-like embedding into O(dim(X)) dimensions? [LP-01,GKL-03]
    Even constant distortion would be interesting.
    A linear transformation cannot attain this result [IN-07].

Here, we present a partial resolution to these questions: two embeddings that use Õ(dim^2(X)) dimensions.
  Result I: (1±ε) embedding for a single scale, that is, for interpoint distances close to some r.
  Result II: (1±ε) global embedding into the snowflake metric, where every interpoint distance s is replaced by s^{1/2}.


Result I – Embedding for Single Scale

Theorem 1 [GK-09]: Fix scale r>0 and range 0<δ<1. Every finite X ⊂ ℓ_2 admits an embedding f: X → ℓ_2^k for k = Õ(log(1/δ) (dim X)^2), such that
  1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y ∈ X
  2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ Ω(||x-y||) whenever ||x-y|| ∈ [δr, r]
  3. Boundedness: ||f(x)|| ≤ r for all x ∈ X

We’ll illustrate the proof for constant range and distortion.


Result I: The construction

We begin by considering the entire point set. Take for example
  scale r=20
  range δ=½
Assume the minimum interpoint distance is 1.


Step 1: Net extraction

From the point set, we extract a net, for example a 4-net.

Net properties:
  Covering (covering radius: 4)
  Packing (packing distance: 4)

A consequence of the packing property is that a ball of radius s contains O(s^{dim(X)}) points.
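
A net with these two properties can be extracted greedily. The sketch below is one standard way to do it and is not necessarily the procedure used in the paper; the name `greedy_net` and the grid example are assumptions for illustration.

```python
import math

def greedy_net(points, r):
    """Return a subset of `points` that is an r-net.

    Packing: any two net points are at distance at least r.
    Covering: every input point is within distance less than r of some net point.
    """
    net = []
    for p in points:
        # Keep p only if it is far from every net point chosen so far.
        if all(math.dist(p, q) >= r for q in net):
            net.append(p)
    return net

# A 10 x 10 integer grid has minimum interpoint distance 1; extract a 4-net.
grid = [(float(i), float(j)) for i in range(10) for j in range(10)]
net = greedy_net(grid, 4.0)
print(len(net), "net points out of", len(grid))
```

A point is skipped only when some net point lies within distance less than r of it, which is exactly the covering property; the acceptance test gives the packing property.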


Step 1: Net extraction

We want a good embedding for just the net points. From here on, our embedding will ignore non-net points.

Why is this valid?


Step 1: Net extraction

Kirszbraun theorem (Lipschitz extension, 1934): Given a 1-Lipschitz embedding f: X → Y with X ⊂ S (Euclidean space), there exists an extension f’: S → Y such that
  The restriction of f’ to X is equal to f
  f’ is contractive on S \ X

Therefore, a good embedding just for the net points suffices.
  Smaller net radius → less distortion for the non-net points


Step 2: Padded decomposition

Decompose the space into probabilistic padded clusters.

Cluster properties for a given random partition [GKL03,ABN08]:
  Diameter: bounded by 20 dim(X)
  Size: by the doubling property, bounded by (20 dim(X))^{dim(X)}
  Padding: a point is 20-padded with probability 1-c, say 9/10
  Support: O(dim(X)) partitions
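
To make the notions of a random partition and a padded point concrete, here is a minimal sketch using random ball carving (a random radius and a random order of centers). This is a swapped-in, simpler technique and is not claimed to be the construction of [GKL03,ABN08]. It guarantees only the diameter bound, not the padding probability or the O(dim(X)) support stated above. The names `random_partition` and `is_padded` are hypothetical.

```python
import math
import random

def random_partition(points, delta, rng):
    """Ball-carving partition: every cluster has diameter at most `delta`.

    Returns a list giving, for each point, the index of the center that owns it.
    """
    radius = rng.uniform(delta / 4, delta / 2)
    order = list(range(len(points)))
    rng.shuffle(order)
    cluster_of = []
    for p in points:
        # First center (in random order) whose ball contains p; p itself always qualifies.
        owner = next(c for c in order if math.dist(p, points[c]) <= radius)
        cluster_of.append(owner)
    return cluster_of

def is_padded(points, cluster_of, i, padding):
    """True if every data point within distance `padding` of point i lies in the same cluster."""
    return all(cluster_of[j] == cluster_of[i]
               for j, q in enumerate(points)
               if math.dist(points[i], q) <= padding)

rng = random.Random(0)
pts = [(float(i), float(j)) for i in range(8) for j in range(8)]
clusters = random_partition(pts, 8.0, rng)
padded = sum(is_padded(pts, clusters, i, 1.0) for i in range(len(pts)))
print(padded, "of", len(pts), "points are 1-padded in this partition")
```

Drawing several independent partitions and counting how often a fixed point is padded gives an empirical view of the padding probability.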


Step 3: JL on individual clusters

For each partition, consider each individual cluster.

Reduce dimension using the JL Lemma:
  Constant distortion
  Target dimension logarithmic in the cluster size: O(log (20 dim(X))^{dim(X)}) = O(dim(X) log(20 dim(X))) = Õ(dim(X))

Then translate some point to the origin.


The story so far…

To review:
  Step 1: Extract net points
  Step 2: Build family of partitions
  Step 3: For each partition, apply JL to each cluster, and translate a cluster point to the origin

Embedding guarantees for a single partition:
  Intracluster distance: constant distortion
  Intercluster distance:
    Min distance: 0
    Max distance: 20 dim(X)

Not good enough. Let’s backtrack…


The story so far…

To review, with a new Step 3 inserted:
  Step 1: Extract net points
  Step 2: Build family of partitions
  Step 3: For each partition, apply the Gaussian transform to each cluster
  Step 4: For each partition, apply JL to each cluster, and translate a cluster point to the origin


Step 3: Gaussian transform

For each partition, apply the Gaussian transform to distances within each cluster (Schoenberg’s theorem, 1938):

  f(t) = (1 - e^{-t^2})^{1/2}

Threshold at s:

  f_s(t) = s (1 - e^{-t^2/s^2})^{1/2}

Properties for s=20:
  Threshold: cluster diameter is at most 20 (instead of 20 dim(X))
  Distortion: small distortion of distances in the relevant range

The transform can increase the dimension… but JL is the next step.
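
The effect of the thresholded transform on a distance t can be checked numerically. The sketch below only evaluates the formula f_s(t) on scalar distances; it does not construct the embedding whose existence Schoenberg's theorem provides. The function name `gaussian_transform` and the sample distances are assumptions.

```python
import math

def gaussian_transform(t, s):
    """f_s(t) = s * sqrt(1 - exp(-t^2 / s^2)), applied to a distance t with threshold s."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return s * math.sqrt(-math.expm1(-(t * t) / (s * s)))

s = 20.0
for t in (1.0, 5.0, 10.0, 20.0, 100.0, 400.0):
    print(t, gaussian_transform(t, s))
```

Distances far below s are nearly unchanged, while large distances are capped at s, which is what brings the cluster diameter down from 20 dim(X) to 20.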


Step 4: JL on individual cluster

Steps 3 & 4 give new embedding guarantees:
  Intracluster: constant distortion
  Intercluster:
    Min distance: 0
    Max distance: 20 (instead of 20 dim(X))

Caveat: also smooth the edges.

(Figure: the Gaussian transform yields a smaller diameter; JL yields a smaller dimension.)


Step 5: Glue partitions

We have an embedding for a single partition.
  For padded points, the guarantees are perfect
  For non-padded points, the guarantees are weak

“Glue” together the embeddings for all O(dim(X)) partitions.
  Concatenate images (and scale down)
  The non-padded case occurs 1/10 of the time, so it gets “averaged away”

Final dimension for net points:
  Number of partitions: O(dim(X))
  Dimension of each embedding: Õ(dim(X))
  Total: O(dim(X)) × Õ(dim(X)) = Õ(dim^2(X))

  f_1(x) = (1,7,2), f_2(x) = (5,2,3), f_3(x) = (4,8,5)

  F(x) = f_1(x) ⊕ f_2(x) ⊕ f_3(x) = (1,7,2,5,2,3,4,8,5)
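
The gluing step on this example can be sketched as plain concatenation. The slide says to "scale down" without giving the factor; the factor 1/sqrt(m) for m partitions is an assumption of this sketch, chosen so that the squared norm of the glued vector is the average of the squared norms of its parts. The function name `glue` is hypothetical.

```python
import math

def glue(images, scale_down=True):
    """Concatenate the per-partition images of one point, optionally scaling by 1/sqrt(m)."""
    m = len(images)
    factor = 1.0 / math.sqrt(m) if scale_down else 1.0
    return [factor * coord for image in images for coord in image]

f1, f2, f3 = (1, 7, 2), (5, 2, 3), (4, 8, 5)
print(glue([f1, f2, f3], scale_down=False))   # plain concatenation, as in the example above
print(glue([f1, f2, f3]))                     # concatenation scaled by 1/sqrt(3)
```

With this scaling, if each partition's embedding is Lipschitz then so is the glued one, and a partition in which the points are not padded contributes only a 1/m share of the squared distance.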


Step 6: Kirszbraun extension theorem

Kirszbraun’s theorem extends the embedding to non-net points without increasing the dimension.

(Figure: the embedding, and the embedding after the Kirszbraun extension.)


Result I – Review

Steps:
  Net extraction
  Padded decomposition
  Gaussian transform
  JL
  Glue partitions
  Extension theorem

Theorem 1 [GK-09]: Every finite X ⊂ ℓ_2 admits an embedding f: X → ℓ_2^k for k = Õ((dim X)^2), such that
  2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ Ω(||x-y||) whenever ||x-y|| ∈ [δr, r]



Result I – Extension

Steps:
  Net extraction: ε-nets
  Padded decomposition: larger padding, probabilistic guarantees
  Gaussian transform
  JL: already (1±ε)
  Glue partitions: higher percentage of padded points
  Extension theorem

Theorem 1 [GK-09]: Every finite X ⊂ ℓ_2 admits an embedding f: X → ℓ_2^k for k = Õ((dim X)^2), such that
  2. Gaussian at scale r: ||f(x)-f(y)|| ≥ (1±ε) G(||x-y||) whenever ||x-y|| ∈ [δr, r]



Result II – Snowflake Embedding

Theorem 2 [GK-09]: For 0<ε<1, every finite subset X ⊂ ℓ_2 admits an embedding F: X → ℓ_2^k for k = Õ(ε^{-4} (dim X)^2) with distortion (1±ε) to the snowflake: s → s^{1/2}

We’ll illustrate the construction for constant distortion.
  The constant distortion construction is due to [Assouad-83] (for non-Euclidean metrics)
  In the paper, we implement the same construction with (1±ε) distortion


Snowflake embedding

Basic idea: fix points x,y ∈ X, and suppose ||x-y|| ~ s. Now consider many single scale embeddings:

  r = 16s, 8s, 4s, 2s, s, s/2, s/4, s/8, s/16

Each single scale embedding f satisfies:
  Lipschitz: ||f(x)-f(y)|| ≤ ||x-y||
  Gaussian: ||f(x)-f(y)|| ≥ (1±ε) G(||x-y||)
  Boundedness: ||f(x)|| ≤ r


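Snowflake embedding

Now scale down each embedding by r^{1/2} (snowflake). For ||x-y|| ~ s:

  scale r     distance before scaling   distance after scaling
  r = 16s     s                         s^{1/2}/4
  r = 8s      s                         s^{1/2}/8^{1/2}
  r = 4s      s                         s^{1/2}/2
  r = 2s      s                         s^{1/2}/2^{1/2}
  r = s       s                         s^{1/2}
  r = s/2     s/2                       s^{1/2}/2^{1/2}
  r = s/4     s/4                       s^{1/2}/2
  r = s/8     s/8                       s^{1/2}/8^{1/2}
  r = s/16    s/16                      s^{1/2}/4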


Snowflake embedding

Join levels by concatenation and addition of coordinates.
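
A quick numeric check shows why joining the scaled levels gives a distance proportional to s^{1/2}. The sketch takes the idealized per-level values at face value (distance s at scales r ≥ s, distance r at scales r < s) and joins them by plain concatenation over scales r = 2^j s; the actual construction also adds coordinates to keep the dimension bounded, which is not modeled here. The function names are hypothetical.

```python
import math

def level_contribution(s, r):
    """Idealized distance between x and y (with ||x-y|| = s) at scale r, after scaling by r^(-1/2)."""
    unscaled = min(s, r)   # Lipschitz caps the distance at s, boundedness at about r
    return unscaled / math.sqrt(r)

def combined_distance(s, levels):
    """Euclidean norm of the contributions of scales r = 2^j * s for -levels <= j <= levels."""
    total_sq = sum(level_contribution(s, s * 2.0 ** j) ** 2
                   for j in range(-levels, levels + 1))
    return math.sqrt(total_sq)

for s in (1.0, 4.0, 100.0):
    print(s, combined_distance(s, 20) / math.sqrt(s))
```

The squared contributions form two geometric series around the level r = s, so the combined distance is dominated by the levels near s and the printed ratio approaches sqrt(3) regardless of s.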



Result II – Review

Steps:
  Take a collection of single scale embeddings
  Scale down the embedding at scale r by r^{1/2}
  Join the embeddings by concatenation and addition

By taking more refined scales (jump by 1±ε instead of 2), one can achieve (1±ε) distortion to the snowflake.

Theorem 2 [GK-09]: For 0<ε<1, every finite subset X ⊂ ℓ_2 admits an embedding F: X → ℓ_2^k for k = Õ(ε^{-4} (dim X)^2) with distortion (1±ε) to the snowflake: s → s^{1/2}


Conclusion

Gave two (1±ε) distortion low-dimension embeddings for doubling spaces:
  Single scale
  Snowflake

This framework can be extended to L1 and L∞.
  Dimension reduction: can’t use JL
  Extension: can’t use Kirszbraun
  Threshold: can’t use the Gaussian

Thank you!
