Matching Geometric Structures: Geometric Hashing.

Ananth Grama. CS490B: Introduction to Bioinformatics. Mar. 20, 2002.

Parts of this lecture are based on a tutorial by Prof. Inge Jonassen at the International Conference on Intelligent Systems for Molecular Biology (ISMB), 2001. The complete tutorial is available on the web at: http://www.ii.uib.no/~inge/talks/ismb-tutorial/

What is Hashing?

“Leave similar things (things that look the same, things that perform the same function, etc.) at the same place so that you can find them all together when you need them.”

In real life:

Always leave your wallet and keys ‘together’ and ‘at the same place each day’. This way, you always find them in a hurry.

Here the rule that assigns the ‘place’ is called a hash function. It maps an entity (in this case, our keys and wallet) to a range: the locations where they can be found.

Hashing Numbers.

Say you are given a list of numbers. You are required to search this list to see if a specified number exists in the list. We can use exactly the same principle:

Say the list is: 7 9 3 6 5 8

We select a hash function that maps these numbers into a table. The hash function value indicates the entry in the table where this number should be stored.

We select our hash function to be h(x) = x % 7

(i.e., the remainder when dividing x by 7). The table now looks as follows:

0: 7
1: 8
2: 9
3: 3
4:
5: 5
6: 6

Now, to see if the number 8 exists in the list, we simply compute h(8) (= 1) and see if the corresponding entry contains the number 8!
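The lookup described above can be sketched in a few lines of Python. This is only an illustration of the table in the text: the names build_table and contains are made up for this example, and the sketch assumes that no two numbers share a slot.

```python
TABLE_SIZE = 7

def h(x):
    """Hash function from the text: remainder of x divided by 7."""
    return x % TABLE_SIZE

def build_table(numbers):
    """Store each number in the slot given by its hash value."""
    table = [None] * TABLE_SIZE
    for x in numbers:
        table[h(x)] = x  # assumes no two numbers map to the same slot
    return table

def contains(table, x):
    """One hash computation and one comparison replace a scan of the list."""
    return table[h(x)] == x

table = build_table([7, 9, 3, 6, 5, 8])
print(table)               # [7, 8, 9, 3, None, 5, 6]
print(contains(table, 8))  # True
print(contains(table, 4))  # False
```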

Hashing Numbers:

Problem: What happens if more than one entry maps to the same location in the table?

This problem is called a collision. Collisions are not necessarily a bad thing, as they indicate similarity with respect to the hash function. There are several solutions to this problem. The simplest one involves thinking of each entry in the table as a list.

In our previous example, if we further added 14, 12, and 19 to the list, we would get:

0: 7 14
1: 8
2: 9
3: 3
4:
5: 5 12 19
6: 6

This is also called chaining.

Other solutions to collisions include open addressing, double hashing, etc.
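Chaining can be sketched by making every table entry a Python list. The function names below are illustrative only, not part of the lecture.

```python
TABLE_SIZE = 7

def h(x):
    return x % TABLE_SIZE

def build_chained_table(numbers):
    """Each slot holds a list (chain) of all numbers that hash to it."""
    table = [[] for _ in range(TABLE_SIZE)]
    for x in numbers:
        table[h(x)].append(x)
    return table

def contains(table, x):
    """Only the chain selected by h(x) has to be searched."""
    return x in table[h(x)]

table = build_chained_table([7, 9, 3, 6, 5, 8, 14, 12, 19])
for slot, chain in enumerate(table):
    print(slot, chain)
# 0 [7, 14]
# 1 [8]
# 2 [9]
# 3 [3]
# 4 []
# 5 [5, 12, 19]
# 6 [6]
```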

Hashing Sequences of Characters:

Associate a number with each character. For example, A = 65, B = 66, and so on. (The computer does this internally anyway, using a code called ASCII; these are the ASCII codes of the uppercase letters.) Using this code, a sequence of characters can be converted into a number, and a hash function can then be applied to it.

For example, the word ANANTH can be hashed as follows.

A = 65, N = 78, A = 65, N = 78, T = 84, H = 72.

One way of converting ‘ANANTH’ to a number would be to simply add the codes of its characters. Therefore,

‘ANANTH’ maps to 65 + 78 + 65 + 78 + 84 + 72 = 442.

We can now apply:

h(‘ANANTH’) = h(442) = 442 % TABLE_SIZE
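A minimal sketch of this additive string hash, using Python's built-in ord, which returns the standard ASCII code of a character (65 for the uppercase letter A). The value TABLE_SIZE = 7 is an assumption made for this example; the text leaves the table size open.

```python
TABLE_SIZE = 7  # assumed for illustration; not specified in the text

def code_sum(word):
    """Add up the character codes of the word."""
    return sum(ord(ch) for ch in word)

def additive_hash(word, table_size=TABLE_SIZE):
    """Hash a word by reducing the sum of its character codes modulo the table size."""
    return code_sum(word) % table_size

print(code_sum("ANANTH"))       # 442
print(additive_hash("ANANTH"))  # 1
print(additive_hash("HTNANA"))  # 1 (same letters in another order collide)
```

Because addition ignores the order of the characters, any rearrangement of the same letters lands in the same table entry.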

Selecting the Right Hash Function:

But the modulo (%, or remainder) function causes collisions between unrelated things!

The hash function should be selected with a view to optimizing specific criteria. If we want to induce collisions between similar entities, we should design a hash function accordingly.

Revisiting the example of hashing numbers, say:

h(x) = x / 3

(i.e., the integer part of x divided by 3). The new table now becomes:

0:
1: 3 5
2: 7 8 6
3: 9
4: 14 12
5:
6: 19

Now, notice that like numbers end up close to each other in the table.
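The same numbers can be bucketed with integer division to see the effect. This is a small sketch; the helper name bucket is made up for this example.

```python
from collections import defaultdict

def h(x):
    """Integer part of x divided by 3."""
    return x // 3

def bucket(numbers):
    """Group the numbers by hash value, keeping insertion order within a bucket."""
    table = defaultdict(list)
    for x in numbers:
        table[h(x)].append(x)
    return dict(table)

table = bucket([7, 9, 3, 6, 5, 8, 14, 12, 19])
for key in sorted(table):
    print(key, table[key])
# 1 [3, 5]
# 2 [7, 6, 8]
# 3 [9]
# 4 [14, 12]
# 6 [19]
```

Numbers that share a bucket differ by at most 2, which is the kind of induced collision between similar entities that the text asks for.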

Hashing Geometric Objects:

It's all in the design of the hash function!

We want to use the hash function to cancel out the effects of translation and rotation, i.e., we want (at least) the following desirable properties from the hash function.

i) Given an object G and an object T(R(G)), where T is some (unknown) translation operator and R is some rotation operator, we want a hash function such that:

h(G) = h(T(R(G))).

ii) The function h should map structurally similar shapes to the same (or nearby) locations.

So how easy is it to design such functions anyway?

In isolation, each criterion can be easily satisfied:

For example, to satisfy the first criterion:

h(G) = number of acute angles in G.
h(G) = number of edges in G.

and, for example, to satisfy the second criterion:

h(G) = h((vector of atom positions)) = Morton key of vector.

However, together, they are a lot more difficult!
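To make the second example concrete, here is a sketch of a Morton key (also known as a Z-order key), which interleaves the bits of quantized coordinates. The restriction to two dimensions, to non-negative integer coordinates, and the 8-bit width are assumptions made to keep the example short.

```python
def morton_key_2d(x, y, bits=8):
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Points that are close in space often receive close keys ...
print(morton_key_2d(2, 3))  # 14
print(morton_key_2d(3, 3))  # 15

# ... but the key is not invariant: translating the point by (10, 10) changes it.
print(morton_key_2d(12, 13))  # 242
```

The key addresses the second criterion but changes under translation, so it does not satisfy the first one on its own.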

Hashing Geometries (2D Objects):

Given: Two figures, a model A and a query B, described by m and n points, respectively.

Objective: Find common subfigures, invariant under rotation and translation.

Approach: One simple approach is to place the query on top of the model and count how many points coincide (here ignoring the edges).

Unfortunately, this is computationally very expensive (NP-hard).

[Figure: the model A and the query B, and B placed over A]

Reference Frames

Define coordinate systems for both figures (A, B), called reference frames.

Two points (a basis pair) can define a reference frame, e.g., with the origin at one of them and one of the axes through both points.

The coordinates of the points are computed in the reference frame, constituting a reference frame system.

Count how many pairs of points (one from each figure) have the same coordinates.
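A minimal sketch of one such reference frame in 2D: the origin is placed at the first basis point and the x-axis runs through the second. The function name frame_coords and the choice of a counterclockwise y-axis are assumptions of this example; other conventions work equally well.

```python
import math

def frame_coords(p, origin, second):
    """Coordinates of p in the frame with origin `origin` and x-axis through `second`."""
    ux, uy = second[0] - origin[0], second[1] - origin[1]
    length = math.hypot(ux, uy)
    ux, uy = ux / length, uy / length  # unit vector along the x-axis
    vx, vy = -uy, ux                   # x-axis rotated 90 degrees counterclockwise
    dx, dy = p[0] - origin[0], p[1] - origin[1]
    return (dx * ux + dy * uy, dx * vx + dy * vy)

# Figure A, and figure B = A rotated by 90 degrees and translated by (5, 5).
A = [(0, 0), (2, 0), (1, 1)]
B = [(5, 5), (5, 7), (4, 6)]

print(frame_coords(A[2], A[0], A[1]))  # (1.0, 1.0)
print(frame_coords(B[2], B[0], B[1]))  # (1.0, 1.0)
```

Both third points get the same coordinates in their own frames, so the rotation and translation between A and B have been cancelled out.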

Reference Frames:

The number of coincident points depends on the resolution of the coordinate system and on the basis pairs used.

Generally, all combinations of points should be used as basis pairs, resulting in comparing O(n^2 m^2) reference frame systems.

Using all combinations might introduce redundancy. Let (a_i, a_k) and (b_j, b_l) be the basis pairs, and suppose that the point pairs (a_r, b_u) and (a_s, b_v) both coincide. Then it is likely that the same coincidence set is found if (a_r, a_s) and (b_u, b_v) are used as basis pairs.

Note, however, that similarity and not exact equality is used in this case.

Hashing Points.

To deal efficiently with all combinations, hashing is used. It is especially efficient when several queries are to be compared to one model, or to several models.

The comparison problem can be formulated as: given a query reference frame system, for each model reference frame system, in how many cells are there points from both the query and the model frame system?

Hashing makes it possible to simultaneously compare a query frame system to all model frame systems.

Preprocessing

In the first example, a 2D hash table H is used. It has a bin for each cell in the frame systems. In a preprocessing phase, the coordinates of all points in each model frame system are found. If there is a point in the cell (p, q) in the frame system with basis (a_i, a_k), then (a_i, a_k) is placed in the bin H(p, q).

Since all pairs of points from the model will (generally) act as basis pairs, a total of O(m^3) entries will be stored in H (m points in each of the O(m^2) frame systems).
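The preprocessing phase can be sketched as follows. The cell width CELL, the use of rounding to pick a cell, the use of ordered basis pairs, and the decision to skip the two basis points themselves are all assumptions of this example, as are the function names.

```python
import math
from collections import defaultdict
from itertools import permutations

CELL = 1.0  # assumed cell width (the resolution of the frame system)

def frame_coords(p, origin, second):
    """Coordinates of p with the origin at `origin` and the x-axis through `second`."""
    ux, uy = second[0] - origin[0], second[1] - origin[1]
    length = math.hypot(ux, uy)
    ux, uy = ux / length, uy / length
    dx, dy = p[0] - origin[0], p[1] - origin[1]
    return (dx * ux + dy * uy, -dx * uy + dy * ux)

def cell(coords):
    """Quantize frame coordinates to the nearest cell."""
    return (round(coords[0] / CELL), round(coords[1] / CELL))

def preprocess(model):
    """Build H: cell (p, q) -> set of basis pairs (i, k) that have a point in that cell."""
    H = defaultdict(set)
    for i, k in permutations(range(len(model)), 2):  # O(m^2) ordered basis pairs
        for r, point in enumerate(model):            # O(m) points per basis pair
            if r not in (i, k):
                H[cell(frame_coords(point, model[i], model[k]))].add((i, k))
    return H

model = [(0, 0), (4, 0), (4, 3), (1, 2)]
H = preprocess(model)
print((0, 1) in H[(4, 3)])  # True: with basis (0, 1), point 2 falls in cell (4, 3)
print((0, 1) in H[(1, 2)])  # True: with basis (0, 1), point 3 falls in cell (1, 2)
```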

Recognition

The query is compared to the model in the recognition phase.

A pair is chosen as basis, and the coordinates of the other points are calculated.

These coordinates are used as indices into H, and for each cell being indexed, a vote is given for the (model) basis pairs in the cell. The number of votes for a model basis pair is the number of points coinciding with the query (using the specified query basis pair).
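The voting step can be sketched as below. To keep the example self-contained it rebuilds a small table first; the cell width, the rounding rule, and all function names are assumptions of this sketch rather than part of the lecture.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

CELL = 1.0  # assumed cell width

def cell_of(p, origin, second):
    """Cell of p in the frame with origin `origin` and x-axis through `second`."""
    ux, uy = second[0] - origin[0], second[1] - origin[1]
    length = math.hypot(ux, uy)
    ux, uy = ux / length, uy / length
    dx, dy = p[0] - origin[0], p[1] - origin[1]
    return (round((dx * ux + dy * uy) / CELL), round((-dx * uy + dy * ux) / CELL))

def preprocess(model):
    """Hash table: cell -> set of model basis pairs with a point in that cell."""
    H = defaultdict(set)
    for i, k in permutations(range(len(model)), 2):
        for r, p in enumerate(model):
            if r not in (i, k):
                H[cell_of(p, model[i], model[k])].add((i, k))
    return H

def recognize(H, query, j, l):
    """Vote for every model basis pair that shares a cell with a query point."""
    votes = Counter()
    for q, p in enumerate(query):
        if q not in (j, l):
            for basis in H.get(cell_of(p, query[j], query[l]), ()):
                votes[basis] += 1
    return votes

model = [(0, 0), (4, 0), (4, 3), (1, 2)]
# The query is the model rotated by 90 degrees and translated by (10, -2).
query = [(10, -2), (10, 2), (7, 2), (8, -1)]

votes = recognize(preprocess(model), query, 0, 1)
print(votes[(0, 1)])  # 2: both remaining query points coincide with model points
```

A single pass over the query points scores every model basis pair at once, which is the simultaneous comparison that hashing makes possible.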

Extensions

Labels (e.g., colors and/or forms) assigned to the points might also be included, such that coinciding points must also have equal labels. This can be implemented in two ways:

Storing the labels in the table.

Hashing the table by the labels in addition to the coordinates.

It is straightforward to extend the hashing technique to include several models, such that a query is simultaneously compared to several models. The only extension in the hash table is that a model identification must be stored along with the basis pairs.

Geometric Hashing for Structure Comparison

Since comparing structures should be invariant to translation and rotation, geometric hashing is a good candidate method when the order of the residues (elements) along the chain is ignored.

Finding a coinciding set between two structures then means finding an equivalence (not necessarily an alignment).

A 3D reference (frame) system must be defined, and often the Cα coordinates of three (non-collinear) residues are used.
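One possible way to build such a 3D frame from three non-collinear points is sketched below. The particular construction (origin at the first point, x-axis toward the second, z-axis normal to the plane of the three points) is an assumption of this example; the text does not prescribe one.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def frame_3d(a, b, c):
    """Orthonormal frame: origin a, x-axis toward b, z-axis normal to the plane a, b, c."""
    x = unit(sub(b, a))
    z_raw = cross(x, sub(c, a))
    if dot(z_raw, z_raw) < 1e-12:
        raise ValueError("basis points are collinear")
    z = unit(z_raw)
    return a, (x, cross(z, x), z)

def to_frame(p, frame):
    """Coordinates of p relative to the frame."""
    origin, axes = frame
    d = sub(p, origin)
    return tuple(dot(d, axis) for axis in axes)

# The same four points before and after a 90 degree rotation about z plus a translation.
original = [(0, 0, 0), (2, 0, 0), (0, 3, 0), (1, 1, 1)]
moved = [(5, -1, 2), (5, 1, 2), (2, -1, 2), (4, 0, 3)]

print(to_frame(original[3], frame_3d(*original[:3])))  # (1.0, 1.0, 1.0)
print(to_frame(moved[3], frame_3d(*moved[:3])))        # (1.0, 1.0, 1.0)
```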

Algorithm for Preprocessing.

for each model M_u do
    for each (unordered, non-collinear) triple (a_i, a_k, a_p) do
        calculate the reference frame R_ikp
        for each residue r do
            calculate F = F(a_i, a_k, a_p, a_r, r_L)   // index in H
            H(F) := H(F) ∪ {(M_u, R_ikp)}
        end
    end
end

// The label L of the residue is included in the hashing

Algorithm for Recognition

repeat
    initialise the vote table V to 0
    choose three atoms (a_j, a_l, a_s) from the query as basis
    calculate the reference frame system R_jls
    for each residue q do
        calculate F = F(a_j, a_l, a_s, a_q, q_L)
        for each pair (M, R) in H(F) do
            V(M, R) := V(M, R) + 1
        end
    end
until (satisfactory coincidence sets are found or all query reference frames are used)
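The two algorithms can be put together in a compact Python sketch. Several choices here are assumptions and not part of the lecture: the frame construction, the cell width CELL, rounding to the nearest cell, one-letter residue labels, the toy coordinates, and the use of ordered triples in preprocessing (so that any ordering of a query basis can find its match).

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

CELL = 1.0  # assumed cell width

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def frame(a, b, c):
    """Reference frame from three points, or None if they are collinear."""
    x = unit(sub(b, a))
    z = cross(x, sub(c, a))
    if dot(z, z) < 1e-12:
        return None
    z = unit(z)
    return a, (x, cross(z, x), z)

def index(p, label, fr):
    """Hash index F: quantized frame coordinates plus the residue label."""
    origin, axes = fr
    d = sub(p, origin)
    return tuple(round(dot(d, ax) / CELL) for ax in axes) + (label,)

def preprocess(models):
    """models: name -> list of (coordinates, label). Returns the hash table H."""
    H = defaultdict(set)
    for name, residues in models.items():
        for basis in permutations(range(len(residues)), 3):
            fr = frame(*(residues[i][0] for i in basis))
            if fr is None:
                continue
            for r, (p, label) in enumerate(residues):
                if r not in basis:
                    H[index(p, label, fr)].add((name, basis))
    return H

def recognize(H, query, basis):
    """Votes V(M, R) for one query basis triple."""
    votes = Counter()
    fr = frame(*(query[i][0] for i in basis))
    if fr is None:
        return votes
    for q, (p, label) in enumerate(query):
        if q not in basis:
            for entry in H.get(index(p, label, fr), ()):
                votes[entry] += 1
    return votes

models = {"M1": [((0, 0, 0), "A"), ((2, 0, 0), "G"), ((0, 3, 0), "L"),
                 ((1, 1, 1), "K"), ((3, 1, 2), "A")]}
# The query is M1 rotated by 90 degrees about the z-axis and translated by (5, -1, 2).
query = [((5, -1, 2), "A"), ((5, 1, 2), "G"), ((2, -1, 2), "L"),
         ((4, 0, 3), "K"), ((4, 2, 4), "A")]

votes = recognize(preprocess(models), query, (0, 1, 2))
print(votes[("M1", (0, 1, 2))])  # 2: both non-basis residues coincide, labels included
```

In a full run the recognition step would be repeated over query bases until a satisfactory vote count is reached, as in the pseudocode.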

Remarks.

The result of the recognition is a list of triples (M, R_M, R_B), showing that there is a coincidence set between the model M and the query B, which becomes evident when the reference frames R_M and R_B are used.

The residues of the coincidence sets are known (or can be found), and a superposition can be done to check the substructure similarities.

Techniques for reducing the run time of geometric hashing exist.

Techniques also exist for practical adaptation (e.g., checking neighboring cells).
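Checking neighboring cells can be sketched as follows: instead of reading a single bin, the lookup also reads every adjacent bin, so that points lying just across a cell boundary are still found. The table layout (a dictionary from cell tuples to sets of basis identifiers) and the function names are assumptions of this example.

```python
from itertools import product

def neighbor_cells(cell):
    """Yield the cell itself and all adjacent cells (3^d cells in d dimensions)."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        yield tuple(c + o for c, o in zip(cell, offset))

def lookup_with_neighbors(H, cell):
    """Union of the bins of the cell and of its neighbors."""
    found = set()
    for c in neighbor_cells(cell):
        found |= H.get(c, set())
    return found

# Toy table with two occupied cells.
H = {(4, 3): {(0, 1)}, (1, 2): {(2, 3)}}
print(lookup_with_neighbors(H, (3, 3)))  # {(0, 1)}: found in the adjacent cell (4, 3)
print(lookup_with_neighbors(H, (0, 0)))  # set(): no occupied cell nearby
```

The price is a constant factor in lookups (9 bins in 2D, 27 in 3D) and possibly more false votes, which the later superposition check can filter out.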

Geometric Hashing for SSE-Representation

Typically the SSEs (secondary structure elements) are represented as sticks.

By use of two sticks, a coordinate reference frame can be defined (usually by placing one axis along one of the sticks).

An entry in the hash table might be (Holm and Sander) a list where each list element contains:

• identification of the SSE
• identification of the basis
• the type of SSE (alpha, beta)
• the midpoint of the SSE (in the reference frame)
• the direction of the SSE
• possible other information
