Efficient Similarity Queries via Lossy Compression · 2014. 3. 31.



Efficient Similarity Queries via Lossy Compression

Idoia Ochoa, Amir Ingber and Tsachy Weissman Electrical Engineering Department, Stanford University

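Introduction

Problem Formulation

Preliminaries

Given two sequences x and y, we measure their similarity with a distortion function d(x, y).

- Hamming distortion: the fraction of positions at which x and y differ. For the example sequences X and Y, d(X, Y) = 3/10.

Two sequences are D-similar if d(x, y) < D.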
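The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names `hamming_distortion` and `is_d_similar` are hypothetical, and the sequences are the example pair X and Y from the poster.

```python
def hamming_distortion(x, y):
    """Fraction of positions at which x and y differ (normalized Hamming distance)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if not x:
        raise ValueError("sequences must be non-empty")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def is_d_similar(x, y, D):
    """True when d(x, y) < D, using the strict inequality from the text."""
    return hamming_distortion(x, y) < D


X = "ACGGTTACCG"
Y = "ACTGATAACG"
print(hamming_distortion(X, Y))  # 0.3 (positions 3, 5 and 8 differ)
print(is_d_similar(X, Y, 0.4))   # True
```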

Constraints

For each sequence x in the database, store a signature T(x).

Given a query sequence y, find the sequences in the database that are D-similar to y, based only on their signatures T(x).

Apply a decision rule g(T(x), y), answering either no or maybe, that ensures no false negatives and minimizes false positives:

g(T(x), y) = maybe for all x, y s.t. d(x, y) < D
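To make the constraint concrete, the sketch below checks a candidate pair (T, g) by brute force: it reports whether any D-similar pair was answered "no" (a false negative) and how often the answer was "maybe". The helper `check_scheme` and the two toy schemes are hypothetical names introduced only for illustration. The trivial scheme that always answers "maybe" satisfies the constraint but filters nothing, which is why the fraction of "maybe" answers has to be minimized.

```python
from itertools import product


def hamming_distortion(x, y):
    """Fraction of positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def check_scheme(T, g, sequences, D):
    """Return (no_false_negatives, fraction_maybe) over all ordered pairs."""
    maybes = 0
    total = 0
    no_false_negatives = True
    for x, y in product(sequences, repeat=2):
        answer = g(T(x), y)
        total += 1
        if answer == "maybe":
            maybes += 1
        elif hamming_distortion(x, y) < D:
            # A D-similar pair was answered "no": this is a false negative.
            no_false_negatives = False
    return no_false_negatives, maybes / total


def empty_signature(x):
    """Toy signature that stores nothing about x."""
    return None


def always_maybe(signature, y):
    """Toy rule: admissible, but every query returns the whole database."""
    return "maybe"


def always_no(signature, y):
    """Toy rule: never returns anything, so it misses similar sequences."""
    return "no"


seqs = ["".join(p) for p in product("ACGT", repeat=3)]
print(check_scheme(empty_signature, always_maybe, seqs, D=0.5))  # (True, 1.0)
print(check_scheme(empty_signature, always_no, seqs, D=0.5))     # (False, 0.0)
```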

Problem Description

Compress sequences in a database so that similarity queries can still be performed efficiently on the compressed database.

We consider queries of the form: Which sequences in the database are similar to a given sequence y?

False negatives are not allowed.

Importance

The amount of data stored in databases is growing exponentially.

Executing queries is time-consuming and challenging.

Solutions

Due to the smaller size, the compressed database can be stored in several locations:

- Easier and faster access.

- More queries can be performed.

Applications

Databases consisting of genomic data:

-  GenBank: almost 200 million DNA sequences.

-  BIOZON: More than 100 million records.

Similarity queries are important in genomics.

For example, in molecular phylogenetics, relationships among species are established by the similarity between their respective DNA sequences.

X = A C G G T T A C C G

Y = A C T G A T A A C G

Theoretical Framework

For a given similarity threshold D, there is a tradeoff between compression rate and reliability.

Let X and Y be independent random vectors drawn from $P_X$.

Definitions

1.  A rate R is said to be D-achievable if there exists a sequence of rate-R admissible schemes $(T^{(n)}, g^{(n)})$ s.t. $\lim_{n \to \infty} \Pr(g^{(n)}(T^{(n)}(X), Y) = \text{maybe}) = 0$.

2.  The identification rate $R_{ID}(D)$ is the infimum of D-achievable rates.

3.  The identification exponent $E_{ID}(R)$ is defined as $\limsup_{n \to \infty} -\frac{1}{n} \log \Pr(g^{(n)}(T^{(n)}(X), Y) = \text{maybe})$.
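As a rough numerical illustration of the third definition, the snippet below evaluates the finite-n quantity -(1/n) log P[maybe] for a single block length. The function name `empirical_exponent` is hypothetical and the base-2 logarithm is an assumption. The inputs P[maybe] = 0.001 and n = 100 are taken from the simulation results reported on the poster. A single finite-n value only hints at the exponent, which is defined as a limit.

```python
import math


def empirical_exponent(p_maybe, n):
    """Finite-n value of -(1/n) * log2(P[maybe]); base 2 is an assumption."""
    if not 0.0 < p_maybe <= 1.0:
        raise ValueError("p_maybe must lie in (0, 1]")
    return -math.log2(p_maybe) / n


# P[maybe] = 0.001 at sequence length n = 100.
print(round(empirical_exponent(0.001, 100), 4))  # 0.0997
```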

Fundamental limits

For symmetric sources with Hamming distortion, $R_{ID}(D)$ and $E_{ID}(R)$ can be explicitly characterized.

Proposed Architecture

Compression Scheme: $T(x) = (i, d(x, x'))$

Based on fixed-length lossy compressors

-  Encoding function $f_n: x \to [1 : 2^{nR}]$, with $i = f_n(x)$

-  Decoding function $g_n: [1 : 2^{nR}] \to x'$, with $x' = g_n(i)$

and side-information $d(x, x')$.
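The signature can be sketched with any fixed-length lossy compressor. The poster does not say which compressor is used, so the example below assumes a random codebook with nearest-codeword encoding purely for illustration; `make_codebook` and `signature` are hypothetical names, and indices are 0-based here rather than in [1 : 2^{nR}].

```python
import random

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Fraction of positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def make_codebook(n, rate, seed=0):
    """Assumed random codebook with about 2^(n * rate) codewords of length n."""
    rng = random.Random(seed)
    size = max(1, int(2 ** (n * rate)))
    return ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(size)]


def signature(x, codebook):
    """T(x) = (i, d(x, x')), where x' = codebook[i] is the closest codeword."""
    i = min(range(len(codebook)), key=lambda j: hamming_distortion(x, codebook[j]))
    return i, hamming_distortion(x, codebook[i])


codebook = make_codebook(n=10, rate=0.5)  # 2^5 = 32 codewords
x = "ACGGTTACCG"
i, d = signature(x, codebook)
print(i, d, codebook[i])
```

Only the index i and the scalar d are stored for each sequence; the decoder recovers x' by looking up `codebook[i]`.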

Simulation Results

Decision Rule:

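$g(T(x), y) = \text{maybe}$ if $d - D < d(x', y) < d + D$, where $d = d(x, x')$ is the stored side-information; otherwise the answer is no.

For any distortion satisfying the triangle inequality, the above decision rule guarantees zero false negatives: if $d(x, y) < D$, then for a symmetric distortion such as Hamming, $|d(x', y) - d(x, x')| \le d(x, y) < D$.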

We want to minimize the probability of maybe.
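The zero-false-negative guarantee can be checked exhaustively on a toy instance. The sketch below assumes a tiny hand-picked codebook over a two-letter alphabet (an illustrative choice, not the poster's setup) and counts, over all pairs (x, y), how many D-similar pairs are answered "no". The names `nearest_codeword`, `decide` and `count_outcomes` are hypothetical.

```python
from itertools import product


def hamming_distortion(x, y):
    """Fraction of positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def nearest_codeword(x, codebook):
    """Return T(x) = (i, d(x, x')) for the closest codeword x' = codebook[i]."""
    i = min(range(len(codebook)), key=lambda j: hamming_distortion(x, codebook[j]))
    return i, hamming_distortion(x, codebook[i])


def decide(signature, codebook, y, D):
    """'maybe' if d - D < d(x', y) < d + D with d = d(x, x'), else 'no'."""
    i, d = signature
    return "maybe" if d - D < hamming_distortion(codebook[i], y) < d + D else "no"


def count_outcomes(codebook, alphabet, n, D):
    """Exhaustively count false negatives and 'maybe' answers over all pairs."""
    false_negatives = maybes = total = 0
    for xs, ys in product(product(alphabet, repeat=n), repeat=2):
        x, y = "".join(xs), "".join(ys)
        answer = decide(nearest_codeword(x, codebook), codebook, y, D)
        total += 1
        maybes += answer == "maybe"
        if answer == "no" and hamming_distortion(x, y) < D:
            false_negatives += 1
    return false_negatives, maybes, total


false_negatives, maybes, total = count_outcomes(["AAAA", "CCCC"], "AC", 4, 0.5)
print(false_negatives)  # 0: no D-similar pair is ever rejected
print(maybes / total)   # fraction of pairs that still need a full comparison
```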

Rate improvement:  

We quantize the side-information d(x´, x) by using the k-means algorithm, and modify the decision rule accordingly.
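The poster does not spell out the modified rule, so the sketch below shows one way it could work, under stated assumptions: the side-information values are clustered with plain one-dimensional Lloyd (k-means) iterations, each signature stores only the cell index, and the decision rule widens the interval to the smallest and largest value in that cell. Because the widened interval contains the exact one, zero false negatives are preserved. All names (`kmeans_1d`, `quantize`, `decide_quantized`) and the sample values are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalars; returns the k centroids."""
    vs = sorted(values)
    # Initialise at evenly spaced order statistics (an arbitrary choice).
    centroids = [vs[(2 * j + 1) * len(vs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for v in vs:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            cells[j].append(v)
        new = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(cells)]
        if new == centroids:
            break
        centroids = new
    return centroids


def quantize(values, centroids):
    """Map each value to its nearest centroid and record each cell's [lo, hi]."""
    k = len(centroids)
    cells = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    bounds = {}
    for v, j in zip(values, cells):
        lo, hi = bounds.get(j, (v, v))
        bounds[j] = (min(lo, v), max(hi, v))
    return cells, bounds


def decide_quantized(cell, bounds, dist_xhat_y, D):
    """Assumed modified rule: widen the interval to the cell's value range."""
    lo, hi = bounds[cell]
    return "maybe" if lo - D < dist_xhat_y < hi + D else "no"


# Hypothetical side-information values d(x', x) for eight database sequences.
side_info = [0.30, 0.32, 0.35, 0.50, 0.52, 0.55, 0.70, 0.72]
centroids = kmeans_1d(side_info, k=3)
cells, bounds = quantize(side_info, centroids)
print(cells)
print(bounds)
```

Storing a cell index instead of the exact distance costs fewer bits per sequence, at the price of a wider interval and therefore more "maybe" answers.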

Databases:

1.  1000 i.i.d. uniform 4-ary sequences of length 100.

2.  1000 DNA sequences of length 100 taken from BIOZON (empirical distribution: $p_A = 0.25$, $p_C = 0.23$, $p_G = 0.29$, $p_T = 0.23$).
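The first database is easy to regenerate. The sketch below draws 1000 i.i.d. uniform 4-ary sequences of length 100 and computes their empirical symbol distribution; the seed, the use of the letters A, C, G, T as the 4-ary alphabet, and the function names are assumptions made for illustration. The second database consists of real BIOZON sequences and is not reproduced here.

```python
import random


def uniform_database(num_sequences=1000, length=100, alphabet="ACGT", seed=0):
    """Draw i.i.d. uniform sequences over the given alphabet."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(alphabet) for _ in range(length))
        for _ in range(num_sequences)
    ]


def empirical_distribution(database, alphabet="ACGT"):
    """Relative frequency of each symbol across the whole database."""
    total = sum(len(s) for s in database)
    return {a: sum(s.count(a) for s in database) / total for a in alphabet}


database = uniform_database()
distribution = empirical_distribution(database)
print(len(database), len(database[0]))
print(distribution)  # each frequency should be close to 0.25
```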

Results:

We show the resulting P[maybe] for both databases and two approximations.

For D = 0.1, we get a probability of maybe of 0.001 with a reduction in size of 83.5%. For D = 0.2 and R = 0.47, the probability of maybe is 0.01.