Efficient Similarity Queries via Lossy Compression · 2014. 3. 31.



Efficient Similarity Queries via Lossy Compression

Idoia Ochoa, Amir Ingber and Tsachy Weissman

Electrical Engineering Department, Stanford University

Introduction Problem Formulation

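Preliminaries

Given two sequences x and y, we measure their similarity with a distortion function d(x, y).

- Example: under Hamming distortion (the fraction of positions at which the two sequences differ), the sequences X and Y listed under Applications have d(X, Y) = 3/10.

Two sequences are D-similar if d(x, y) < D.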

Constraints

For each sequence x in the database, store a signature T(x). Given a query sequence y, find the sequences in the database that are D-similar to y, based only on their signatures T(x). Apply a decision rule g(T(x), y), with outputs no or maybe, that ensures no false negatives and minimizes false positives: g(T(x), y) = maybe for all x, y s.t. d(x, y) < D.

Problem Description

Compress sequences in a database so that similarity queries can still be performed efficiently on the compressed database.

We consider queries of the form: Which sequences in the database are similar to a given sequence y?

False negatives are not allowed.

Importance

The amount of data stored in databases is growing exponentially.

Executing queries is time-consuming and challenging.

Solutions

Due to the smaller size, the compressed database can be stored in several locations:

- Easier and faster access.

- More queries can be performed.

Applications

Databases consisting of genomic data:

- GenBank: almost 200 million DNA sequences.

- BIOZON: More than 100 million records.

Similarity queries are important in genomics.

For example, in molecular phylogenetics, relationships among species are established by the similarity between their respective DNA sequences.

X = A C G G T T A C C G

Y = A C T G A T A A C G
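
The Hamming distortion of 3/10 quoted for these two sequences can be checked directly. The following is a minimal sketch; the function names `hamming_distortion` and `is_d_similar` are illustrative and do not come from the poster.

```python
def hamming_distortion(x, y):
    """Normalized Hamming distortion: fraction of positions where x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if not x:
        raise ValueError("sequences must be non-empty")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def is_d_similar(x, y, D):
    """Two sequences are D-similar if d(x, y) < D (strict inequality, as in the text)."""
    return hamming_distortion(x, y) < D


# The two example sequences, written without spaces.
X = "ACGGTTACCG"
Y = "ACTGATAACG"

print(hamming_distortion(X, Y))  # 0.3 (positions 3, 5 and 8 differ)
print(is_d_similar(X, Y, 0.4))   # True
print(is_d_similar(X, Y, 0.3))   # False, because the inequality is strict
```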

Theoretical Framework

For a given similarity threshold D, there is a tradeoff between compression rate and reliability.

Let X and Y be independent random vectors drawn from $P_X$.

Definitions

1. A rate R is said to be D-achievable if there exists a sequence of rate-R admissible schemes $(T^{(n)}, g^{(n)})$ s.t. $\lim_{n \to \infty} \Pr\big(g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\big) = 0$.

2. The identification rate $R_{\mathrm{ID}}(D)$ is the infimum of D-achievable rates.

3. The identification exponent $E_{\mathrm{ID}}(R)$ is defined as $\limsup_{n \to \infty} -\frac{1}{n} \log \Pr\big(g^{(n)}(T^{(n)}(X), Y) = \text{maybe}\big)$.

Fundamental limits

For symmetric sources with Hamming distortion, $R_{\mathrm{ID}}(D)$ and $E_{\mathrm{ID}}(R)$ can be explicitly characterized.

Proposed Architecture

Compression Scheme: $T(x) = (i, d(x, x'))$

Based on fixed-length lossy compressors:

- Encoding function $f_n: x \mapsto i \in [1 : 2^{nR}]$

- Decoding function $g_n: i \mapsto x'$, with $i \in [1 : 2^{nR}]$

and side-information $d(x, x')$.
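
To make the signature concrete, here is a toy sketch. The poster does not say which lossy compressor is used, so the random codebook with nearest-codeword encoding below is purely an assumption for illustration, as are all function names. Indices run from 0 rather than 1.

```python
import random

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Normalized Hamming distortion between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def make_codebook(n, rate, rng):
    """Hypothetical codebook: about 2^(n * rate) random sequences of length n."""
    size = max(1, round(2 ** (n * rate)))
    return ["".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(size)]


def encode(x, codebook):
    """Stand-in for f_n: index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda i: hamming_distortion(x, codebook[i]))


def decode(i, codebook):
    """Stand-in for g_n: reconstruction x' for index i."""
    return codebook[i]


def signature(x, codebook):
    """T(x) = (i, d(x, x')): codeword index plus the distortion side-information."""
    i = encode(x, codebook)
    return i, hamming_distortion(x, decode(i, codebook))


rng = random.Random(0)
n, rate = 10, 0.5                      # 2^(10 * 0.5) = 32 codewords
codebook = make_codebook(n, rate, rng)
x = "ACGGTTACCG"
i, d = signature(x, codebook)
print(len(codebook), i, d)
```

Only the pair (i, d) is stored per sequence; the codebook is shared by the whole database.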

Simulation Results

Decision Rule:

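$g(T(x), y) = \text{maybe}$ if $d - D < d(x', y) < d + D$, where $d = d(x, x')$ is the side-information stored in $T(x)$; otherwise $g(T(x), y) = \text{no}$.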

For any distortion satisfying the triangle inequality, the above decision rule guarantees zero false negatives.
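
The zero-false-negative guarantee follows in two lines. This is a sketch that also assumes d is symmetric, which holds for Hamming distortion:

```latex
% Assume d is symmetric, satisfies the triangle inequality, and d(x, y) < D.
\begin{align*}
d(x', y) &\le d(x', x) + d(x, y) < d(x, x') + D, \\
d(x', y) &\ge d(x, x') - d(x, y) > d(x, x') - D.
\end{align*}
```

So every D-similar pair falls inside the interval that triggers maybe, and an answer of no is never wrong.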

We want to minimize the probability of maybe.
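
The decision rule is short enough to check by brute force on tiny sequences. The sketch below uses illustrative names (`decide`, `count_false_negatives`) and fixes one arbitrary reconstruction x', since the guarantee does not depend on how x' was chosen. The example uses n = 4 and D = 0.5 so that all distortions are exact in floating point.

```python
import itertools

ALPHABET = "ACGT"


def hamming_distortion(x, y):
    """Normalized Hamming distortion between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def decide(x_hat, d, y, D):
    """Return "maybe" if d - D < d(x', y) < d + D, else "no"."""
    dy = hamming_distortion(x_hat, y)
    return "maybe" if d - D < dy < d + D else "no"


def count_false_negatives(n, D, x_hat):
    """Check every pair (x, y) of length-n sequences against one fixed x'."""
    sequences = ["".join(s) for s in itertools.product(ALPHABET, repeat=n)]
    misses = 0
    for x in sequences:
        d = hamming_distortion(x, x_hat)  # side-information stored with x
        for y in sequences:
            similar = hamming_distortion(x, y) < D
            if similar and decide(x_hat, d, y, D) == "no":
                misses += 1
    return misses


print(count_false_negatives(4, 0.5, "ACGT"))  # 0
```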

Rate improvement:

We quantize the side-information d(x´, x) by using the k-means algorithm, and modify the decision rule accordingly.
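
The poster does not spell out how the decision rule is modified. One conservative possibility, shown here only as an assumption, is to store the cluster index in place of the exact distortion and to widen the interval to the smallest and largest distortion in that cluster. The k-means routine is a plain one-dimensional Lloyd iteration, and all names are illustrative.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalars; returns one cluster label per value."""
    ordered = sorted(set(values))
    if k > len(ordered):
        raise ValueError("k exceeds the number of distinct values")
    # Initialise the centres spread evenly over the sorted distinct values.
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = sum(members) / len(members)
    return labels


def cluster_bounds(values, labels):
    """Map each cluster label to the (min, max) of the distortions assigned to it."""
    bounds = {}
    for v, lab in zip(values, labels):
        lo, hi = bounds.get(lab, (v, v))
        bounds[lab] = (min(lo, v), max(hi, v))
    return bounds


def decide_quantized(dy, bounds_c, D):
    """dy = d(x', y). Assumed rule: "maybe" if lo - D < dy < hi + D, else "no"."""
    lo, hi = bounds_c
    return "maybe" if lo - D < dy < hi + D else "no"


# Made-up side-information values d(x, x') for eight database sequences.
distortions = [0.10, 0.12, 0.11, 0.30, 0.32, 0.31, 0.50, 0.52]
labels = kmeans_1d(distortions, k=3)
bounds = cluster_bounds(distortions, labels)
print(labels)
print(bounds)
```

Because each cluster interval contains the exact distortion, the widened rule says maybe whenever the exact rule does, so it still produces no false negatives; the cost is a somewhat higher probability of maybe in exchange for fewer stored bits.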

Databases:

1. 1000 i.i.d. uniform 4-ary sequences of length 100.

2. 1000 DNA sequences of length 100 taken from BIOZON (empirical distribution: $p_A = 0.25$, $p_C = 0.23$, $p_G = 0.29$, $p_T = 0.23$).

Results:

We show the resulting P[maybe] for both databases and two approximations.

For D = 0.1, we get a probability of maybe of 0.001 with a reduction in size of 83.5%. For D = 0.2 and R = 0.47, the probability of maybe is 0.01.
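
The relation between a rate and a size reduction can be sketched as follows. This assumes, and the poster does not state, that R is measured in bits per symbol, that it covers the whole signature, and that the uncompressed baseline is log2(4) = 2 bits per symbol. The function names are illustrative.

```python
import math


def size_reduction(rate_bits_per_symbol, alphabet_size=4):
    """Fractional size reduction relative to log2(alphabet_size) bits per symbol."""
    raw_bits = math.log2(alphabet_size)
    return 1.0 - rate_bits_per_symbol / raw_bits


def rate_for_reduction(reduction, alphabet_size=4):
    """Inverse mapping: the rate (bits per symbol) implied by a given reduction."""
    return (1.0 - reduction) * math.log2(alphabet_size)


print(round(size_reduction(0.47), 3))       # 0.765
print(round(rate_for_reduction(0.835), 2))  # 0.33
```

Under these assumptions, R = 0.47 would correspond to a reduction of about 76.5%, and a reduction of 83.5% would correspond to roughly 0.33 bits per symbol.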
