Matching Geometric Structures: Geometric Hashing.
Ananth Grama. CS490B: Introduction to Bioinformatics. Mar. 20, 2002.
Parts of this lecture are based on a tutorial by Prof. Inge Jonassen at the International Conference on Intelligent Systems for Molecular Biology (ISMB), 2001. The complete tutorial is available on the web at: http://www.ii.uib.no/~inge/talks/ismb-tutorial/
What is Hashing?
“Leave similar things (things that look the same, things that perform the same function, etc.) at the same place so that you can find them all together when you need them.”
In real life:
Always leave your wallet and keys ‘together’ and ‘at the same place each day’. This way, you can always find them in a hurry.
Here ‘place’ is called a hash function. It maps an entity, in this case our keys and wallet, to a range: the locations where they can be found.
Hashing Numbers.
Say you are given a list of numbers. You are required to search this list to see if a specified number exists in the list. We can use exactly the same principle:
Say the list is: 7 9 3 6 5 8
We select a hash function that maps these numbers into a table. The hash function value indicates the entry in the table where this number should be stored.
We select our hash function to be h(x) = x % 7
(i.e., the remainder when dividing x by 7). The table now looks as follows:
Entry  Number
0      7
1      8
2      9
3      3
4
5      5
6      6
Now, to see if the number 8 exists in the list, we simply compute h(8) (= 1) and see if the corresponding entry contains the number 8!
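The lookup can be sketched in a few lines of Python. This is only an illustration: it assumes no two numbers share a slot (true for this list), and the names build_table and contains are made up for the sketch.

```python
TABLE_SIZE = 7  # matches h(x) = x % 7 in the example


def h(x):
    """Hash function from the example: remainder of x divided by 7."""
    return x % TABLE_SIZE


def build_table(numbers):
    """Store each number in slot h(x); assumes no two numbers collide."""
    table = [None] * TABLE_SIZE
    for x in numbers:
        table[h(x)] = x
    return table


def contains(table, x):
    """Inspect only slot h(x) instead of scanning the whole list."""
    return table[h(x)] == x


table = build_table([7, 9, 3, 6, 5, 8])
print(table)               # slot 4 stays empty (None)
print(contains(table, 8))  # True
print(contains(table, 4))  # False
```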
Hashing Numbers:
Problem: What happens if more than one entry maps to the same location in the table?
This problem is called a collision. Collisions are not necessarily a bad thing, as they indicate similarity with respect to the hash function. There are several solutions to this problem. The simplest one involves thinking of each entry in the table as a list.
In our previous example, if we further added 14, 12, and 19 to the list, we would get:
Entry  Numbers
0      7 14
1      8
2      9
3      3
4
5      5 12 19
6      6
This is also called chaining.
Other solutions to collisions include open addressing, double hashing, etc.
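Chaining is easy to sketch in Python if every table entry is a list. The function names are illustrative only; open addressing and double hashing are not shown.

```python
TABLE_SIZE = 7


def build_chained_table(numbers):
    """Each table entry is a list (a 'chain') of all numbers hashing there."""
    table = [[] for _ in range(TABLE_SIZE)]
    for x in numbers:
        table[x % TABLE_SIZE].append(x)
    return table


def contains(table, x):
    """Search only the chain stored at slot h(x) = x % 7."""
    return x in table[x % TABLE_SIZE]


table = build_chained_table([7, 9, 3, 6, 5, 8, 14, 12, 19])
for slot, chain in enumerate(table):
    print(slot, chain)
print(contains(table, 19))  # True
```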
Hashing Sequences of Characters:
Associate a number with each character. For example, A = 65, B = 66, and so on. (The computer does this internally anyway, using a code called ASCII.) Using this code, a sequence of characters can be converted into a number and a hash function can then be applied to it.
For example, the word ‘ananth’, written in upper case, can be hashed as follows.
A = 65, N = 78, A = 65, N = 78, T = 84, H = 72.
One way of converting ‘ananth’ to a number would be to simply add the codes of its characters. Therefore,
‘ananth’ maps to 65 + 78 + 65 + 78 + 84 + 72 = 442.
We can now apply:
h(‘ananth’) = h(442) = 442 % TABLE_SIZE
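A minimal Python sketch of this additive hash follows. Two assumptions are made here and are not part of the lecture: the word is upper-cased so that the codes start at 65, and TABLE_SIZE is set to 11 purely for illustration.

```python
def additive_hash(word, table_size):
    """Sum the character codes of the upper-cased word, then take the remainder."""
    total = sum(ord(ch) for ch in word.upper())
    return total % table_size


# TABLE_SIZE is not fixed in the text; 11 is an arbitrary choice for this sketch.
TABLE_SIZE = 11
print(additive_hash("ananth", TABLE_SIZE))
print(additive_hash("nathan", TABLE_SIZE))  # same letters, so the same slot
```

Since addition ignores the order of the letters, any rearrangement of a word lands in the same slot.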
Selecting the Right Hash Function:
But the modulo (%, or remainder) function causes collisions between unrelated things!
The hash function should be selected with a view to optimizing specific criteria. If we want to induce collisions between similar entities, we should design a hash function accordingly.
Revisiting the example of hashing numbers, say:
h(x) = x / 3
(i.e., the integer part of x divided by 3). The new table now becomes:
Entry  Numbers
0
1      3 5
2      7 8 6
3      9
4      14 12
5
6      19
Now, notice that like numbers end up close to each other in the table.
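The same table can be built in Python with integer division. The table is sized from the data here, which is a convenience of this sketch rather than something the lecture specifies.

```python
def h(x):
    """Integer part of x / 3, so numbers in the same run of three share a slot."""
    return x // 3


numbers = [7, 9, 3, 6, 5, 8, 14, 12, 19]
table = [[] for _ in range(max(h(x) for x in numbers) + 1)]
for x in numbers:
    table[h(x)].append(x)

for slot, chain in enumerate(table):
    print(slot, chain)
```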
Hashing Geometric Objects:
It's all in the design of the hash function!
You want to use the hash function to cancel out the effects of translation and rotation, i.e., we want (at least) the following desirable properties from the hash function.
i) Given an object G and an object T(R(G)), where T is some (unknown) translation operator and R is some rotation operator, we want a hash function such that:
h(G) = h(T(R(G))).
ii) The function h should map structurally similar shapes to the same (or nearby) locations.
So how easy is it to design such functions anyway?
In isolation, each criterion can be easily satisfied:
For example, to satisfy the first criterion:
h(G) = number of acute angles in G.
h(G) = number of edges in G.
and, for example, to satisfy the second criterion:
h(G) = h((vector of atom positions)) = Morton key of vector.
However, together, they are a lot more difficult!
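As an illustration of the second criterion, a Morton key (Z-order key) interleaves the bits of the coordinates, so points that are close in space tend to receive nearby keys. The sketch below assumes 2D points with non-negative integer coordinates of at most 16 bits; the lecture does not give these details.

```python
def morton_key_2d(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd ones)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


print(morton_key_2d(2, 3))    # 14
print(morton_key_2d(3, 3))    # 15: a neighbouring point gets a neighbouring key
print(morton_key_2d(12, 13))  # 242: the point (2, 3) shifted by (10, 10)
```

The last line shows why the two criteria are hard to combine: the key depends on the absolute position, so it is not invariant under translation.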
Hashing Geometries (2D Objects):
Given: Two figures, a model A and a query B, described by m and n points, respectively.
Objective: Find common subfigures, invariant under rotation and translation.
Approach: One simple approach is to place the query on top of the model and count how many points coincide (here ignoring the edges).
Unfortunately, this is computationally very expensive (NP-hard).
[Figure: model A, query B, and B placed over A]
Reference Frames
Define coordinate systems for both figures (A, B), called reference frames.
Two points (a basis pair) can define a reference frame, e.g., with the origin at one of them and one of the axes through both points.
The coordinates of the points are computed in the reference frame, constituting a reference frame system.
Count how many pairs of points (one from each figure) have the same coordinates.
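A minimal Python sketch of one such comparison: both figures are expressed in the frame of a chosen basis pair, the coordinates are quantised into grid cells, and the shared cells are counted. The cell size of 0.5 and all function names are assumptions of this sketch.

```python
import math


def frame_coords(points, i, k):
    """Coordinates of all points with origin at points[i] and x-axis through points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length  # unit vector along the new x-axis
    # Project each translated point onto the new x-axis (ux, uy) and y-axis (-uy, ux).
    return [((x - ox) * ux + (y - oy) * uy,
             -(x - ox) * uy + (y - oy) * ux) for x, y in points]


def cells(coords, resolution=0.5):
    """Quantise coordinates into grid cells of the given (assumed) resolution."""
    return {(round(u / resolution), round(v / resolution)) for u, v in coords}


def rigid_move(points, angle, tx, ty):
    """Rotate by angle (radians) about the origin, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


A = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
B = rigid_move(A, math.radians(40), 5.0, -2.0)  # B is A, rotated and shifted

common = cells(frame_coords(A, 0, 1)) & cells(frame_coords(B, 0, 1))
print(len(common))  # 4: every point of B coincides with a point of A
```

With matching basis pairs the rotation and translation cancel out, which is exactly the invariance asked of the hash function earlier.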
Reference Frames:
The number of coincident points depends on the resolution of the coordinate system and on the basis pairs used.
Generally, all combinations of points should be used as basis pairs, resulting in comparing O(n^2 m^2) reference frame systems.
Using all combinations might introduce redundancy. Let (a_i, a_k) and (b_j, b_l) be the basis pairs, and let (a_r, b_u) and (a_s, b_v) both coincide. Then it is likely that the same coincidence set is found if (a_r, a_s) and (b_u, b_v) are used as bases.
Note, however, that similarity and not exact equality is used in this case.
Hashing Points.
To deal efficiently with all combinations, hashing is used. It is especially efficient when several queries are to be compared to one model, or to several models.
The comparison problem can be formulated as: given a query reference frame system, for each model reference frame system, in how many cells are there points from both the query and the model frame system?
Hashing makes it possible to simultaneously compare a query frame system to all model frame systems.
Preprocessing
In the first example, a 2D hash table H is used. It has a bin for each cell in the frame systems. In a preprocessing phase, the coordinates of all points in each model frame system are found. If there is a point in the cell (p, q) in the frame system with basis (a_i, a_k), then (a_i, a_k) is placed in the bin H(p, q).
Since all pairs of points from the model will (generally) act as basis pairs, a total of O(m^3) basis-pair entries will be in H (m points for each of the O(m^2) basis pairs).
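The preprocessing phase might be sketched as follows in Python. The sketch assumes ordered basis pairs (the frame direction depends on which point is the origin), a cell size of 0.5, and point indices in place of the points themselves; none of these choices come from the lecture.

```python
import math
from collections import defaultdict
from itertools import permutations


def cell_coords(points, i, k, resolution=0.5):
    """Grid cell (p, q) of every point in the frame with basis (points[i], points[k])."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    result = []
    for x, y in points:
        u = (x - ox) * ux + (y - oy) * uy
        v = -(x - ox) * uy + (y - oy) * ux
        result.append((round(u / resolution), round(v / resolution)))
    return result


def preprocess(model, resolution=0.5):
    """Build H: cell (p, q) -> list of model basis pairs with a point in that cell."""
    H = defaultdict(list)
    for i, k in permutations(range(len(model)), 2):  # ordered basis pairs
        for cell in cell_coords(model, i, k, resolution):
            H[cell].append((i, k))
    return H


model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
H = preprocess(model)
print(sum(len(entries) for entries in H.values()))  # 12 basis pairs x 4 points = 48
```

For m = 4 the table holds m(m-1)m = 48 entries, in line with the O(m^3) estimate.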
Recognition
The query is compared to the model in the recognition phase.
A pair is chosen as basis, and the coordinates of the other points are calculated.
These coordinates are used as indices into H, and for each cell being indexed, a vote is given for the (model) basis pairs in the cell. The number of votes for a model basis pair is the number of points coinciding with the query (using the specified query basis pair).
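A self-contained Python sketch of preprocessing followed by voting is given below. As before, the cell size, the use of ordered basis pairs, and all names are assumptions of the sketch; each model basis pair receives at most one vote per query cell.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def cell_coords(points, i, k, resolution=0.5):
    """Grid cells of all points in the frame with origin points[i], x-axis through points[k]."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    return {(round(((x - ox) * ux + (y - oy) * uy) / resolution),
             round((-(x - ox) * uy + (y - oy) * ux) / resolution))
            for x, y in points}


def preprocess(model, resolution=0.5):
    """H maps each cell to the set of model basis pairs that put a point there."""
    H = defaultdict(set)
    for i, k in permutations(range(len(model)), 2):
        for cell in cell_coords(model, i, k, resolution):
            H[cell].add((i, k))
    return H


def vote(H, query, j, l, resolution=0.5):
    """For query basis (j, l), count the cells shared with each model basis pair."""
    votes = Counter()
    for cell in cell_coords(query, j, l, resolution):
        for basis in H.get(cell, ()):
            votes[basis] += 1
    return votes


model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
# Query: the first three model points rotated by 40 degrees and shifted,
# plus one stray point that has no counterpart in the model.
c, s = math.cos(math.radians(40)), math.sin(math.radians(40))
query = [(c * x - s * y + 5.0, s * x + c * y - 2.0)
         for x, y in [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (5.0, 5.0)]]

votes = vote(preprocess(model), query, 0, 1)
print(votes.most_common(1))  # [((0, 1), 3)]
```

The winning model basis pair (0, 1) collects three votes, one for each query point that has a counterpart in the model.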
Extensions
Labels (e.g., colors and/or forms) assigned to the points might also be included, such that coinciding points must also have equal labels. This can be implemented in two ways:
• Storing the labels in the table
• Hashing on the labels in addition to the coordinates
It is straightforward to extend the hashing technique to include several models, such that a query is simultaneously compared to several models. The only extension in the hash table is that a model identification must be stored along with the basis pairs.
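Both extensions can be sketched together in Python: the label becomes part of the hash key (the second of the two ways above), and every entry carries a model identifier. The data layout, the colour labels, and the model names M1 and M2 are invented for this illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def labelled_cells(points, i, k, resolution=0.5):
    """Yield (p, q, label) for each labelled point (x, y, label) in the frame (i, k)."""
    ox, oy, _ = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    for x, y, label in points:
        u = (x - ox) * ux + (y - oy) * uy
        v = -(x - ox) * uy + (y - oy) * ux
        yield (round(u / resolution), round(v / resolution), label)


def preprocess(models, resolution=0.5):
    """H maps (p, q, label) to entries (model identifier, basis pair)."""
    H = defaultdict(set)
    for model_id, points in models.items():
        for i, k in permutations(range(len(points)), 2):
            for key in labelled_cells(points, i, k, resolution):
                H[key].add((model_id, (i, k)))
    return H


def vote(H, query, j, l, resolution=0.5):
    """One vote per query key for every (model, basis pair) stored under that key."""
    votes = Counter()
    for key in set(labelled_cells(query, j, l, resolution)):
        for entry in H.get(key, ()):
            votes[entry] += 1
    return votes


models = {
    "M1": [(0.0, 0.0, "red"), (2.0, 0.0, "blue"), (2.0, 1.0, "red")],
    "M2": [(0.0, 0.0, "red"), (2.0, 0.0, "blue"), (2.0, 1.0, "blue")],
}
H = preprocess(models)
# The query is M1 shifted by (1, 1); M2 has the same shape but one different label.
query = [(1.0, 1.0, "red"), (3.0, 1.0, "blue"), (3.0, 2.0, "red")]
votes = vote(H, query, 0, 1)
print(votes[("M1", (0, 1))], votes[("M2", (0, 1))])  # 3 2
```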
Geometric Hashing for Structure Comparison
Since comparing structures should be invariant to translation and rotation, geometric hashing is a good candidate method when the order of the residues (elements) along the chain is ignored.
Finding a coinciding set between two structures then means finding an equivalence (not necessarily an alignment).
A 3D reference (frame) system must be defined, and often the Cα-coordinates of three (non-collinear) residues are used.
Algorithm for Preprocessing.
for each model M_u do
    for each (unordered, non-collinear) triple (a_i, a_k, a_p) do
        calculate the reference frame R_ikp
        for each residue r do
            calculate F = F(a_i, a_k, a_p, a_r, r_L);   // index in H
            H(F) = H(F) ∪ {(M_u, R_ikp)}
        end
    end
end
// The label L is included in the hashing
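The preprocessing loop might look as follows in Python. The frame construction (origin at the first atom, x-axis towards the second, z-axis normal to the plane of the triple), the cell size of 1.0, and the toy coordinates are assumptions of this sketch; the lecture does not prescribe them.

```python
import math
from collections import defaultdict
from itertools import combinations


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def reference_frame(a_i, a_k, a_p):
    """Origin a_i, x-axis towards a_k, z-axis normal to the plane of the triple.

    Returns None for (nearly) collinear triples, which define no frame.
    """
    x = sub(a_k, a_i)
    normal = cross(x, sub(a_p, a_i))
    if dot(normal, normal) < 1e-9:
        return None
    x, z = unit(x), unit(normal)
    return a_i, (x, cross(z, x), z)


def index(frame, position, label, resolution=1.0):
    """F: the grid cell of the residue in the frame, extended by its label L."""
    origin, axes = frame
    d = sub(position, origin)
    return tuple(round(dot(d, axis) / resolution) for axis in axes) + (label,)


def preprocess(models, resolution=1.0):
    """H(F) collects (model, basis triple) for every residue of every frame."""
    H = defaultdict(set)
    for model_id, residues in models.items():
        for i, k, p in combinations(range(len(residues)), 3):
            frame = reference_frame(residues[i][0], residues[k][0], residues[p][0])
            if frame is None:
                continue
            for position, label in residues:
                H[index(frame, position, label, resolution)].add((model_id, (i, k, p)))
    return H


# Toy model: (C-alpha position, residue label); the coordinates are invented.
models = {"M": [((0.0, 0.0, 0.0), "A"), ((3.8, 0.0, 0.0), "G"),
                ((3.8, 3.8, 0.0), "L"), ((0.0, 0.0, 3.8), "K")]}
H = preprocess(models)
print(sum(len(entries) for entries in H.values()))  # 4 triples x 4 residues = 16
```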
Algorithm for Recognition
repeat
    initialise the vote table V to 0
    choose three atoms (a_j, a_l, a_s) from the query as basis
    calculate the reference frame system R_jls
    for each residue q do
        calculate F = F(a_j, a_l, a_s, a_q, q_L);
        for each pair (M, R) in H(F) do
            V(M, R) := V(M, R) + 1
        end
    end
until (satisfactory coincidence sets are found
       or all query reference frames are used)
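A compact, self-contained Python sketch of the recognition loop follows. It makes several assumptions not stated in the lecture: the frame construction and cell size are as simple as possible, model triples are stored unordered while the query tries ordered triples, and "satisfactory" means reaching a vote threshold min_votes.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations, permutations


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def frame_keys(residues, i, k, p, resolution=1.0):
    """Keys F (grid cell + label) of all residues in the frame of triple (i, k, p)."""
    origin = residues[i][0]
    x = sub(residues[k][0], origin)
    normal = cross(x, sub(residues[p][0], origin))
    if dot(normal, normal) < 1e-9:
        return None  # collinear triple: no frame
    x, z = unit(x), unit(normal)
    axes = (x, cross(z, x), z)
    return {tuple(round(dot(sub(pos, origin), axis) / resolution) for axis in axes)
            + (label,) for pos, label in residues}


def preprocess(models, resolution=1.0):
    H = defaultdict(set)
    for model_id, residues in models.items():
        for triple in combinations(range(len(residues)), 3):
            for F in frame_keys(residues, *triple, resolution) or ():
                H[F].add((model_id, triple))
    return H


def recognise(H, query, min_votes, resolution=1.0):
    """Try query bases until some (model, frame) collects at least min_votes votes."""
    for j, l, s in permutations(range(len(query)), 3):
        query_keys = frame_keys(query, j, l, s, resolution)
        if query_keys is None:
            continue
        V = Counter()  # vote table, reset for every query basis
        for F in query_keys:
            for entry in H.get(F, ()):
                V[entry] += 1
        hits = [(M, R_M, (j, l, s)) for (M, R_M), n in V.items() if n >= min_votes]
        if hits:
            return hits
    return []


model = [((0.0, 0.0, 0.0), "A"), ((3.8, 0.0, 0.0), "G"),
         ((3.8, 3.8, 0.0), "L"), ((0.0, 0.0, 3.8), "K")]
# Query: the model rotated by 90 degrees about the z-axis, then translated.
query = [((-y + 10.0, x + 5.0, z - 2.0), label) for (x, y, z), label in model]
H = preprocess({"M": model})
print(recognise(H, query, min_votes=4))  # [('M', (0, 1, 2), (0, 1, 2))]
```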
Remarks.
The result of the recognition is a list of (M, R_M, R_B), showing that there is a coincidence set between the model M and the query B, becoming evident when the reference frames R_M and R_B are used.
The residues of the coincidence sets are known (or can be found), and a superposition can be done for checking the substructure similarities.
Techniques for reducing the run time of Geometric Hashing exist.
Techniques also exist for practical adaptation (e.g., checking neighboring cells).
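Checking neighboring cells can be sketched as follows for a 2D table: a query point votes for every basis stored in its own cell or in one of the 8 adjacent cells, so a point that falls just across a cell boundary is not lost. The toy table and the function names are invented for this illustration.

```python
from collections import Counter
from itertools import product


def neighbourhood(cell):
    """The cell itself plus its 8 neighbours in a 2D grid."""
    p, q = cell
    return [(p + dp, q + dq) for dp, dq in product((-1, 0, 1), repeat=2)]


def vote_with_neighbours(H, query_cells):
    """Each query point gives a basis at most one vote, searching adjacent cells too."""
    votes = Counter()
    for cell in query_cells:
        seen = set()
        for near in neighbourhood(cell):
            seen.update(H.get(near, ()))
        for basis in seen:
            votes[basis] += 1
    return votes


# Toy table: cell -> model basis pairs (indices are made up for the example).
H = {(0, 0): [(0, 1)], (4, 0): [(0, 1)], (4, 2): [(0, 1)]}
# The third query point falls just across a cell boundary, into (4, 3).
query_cells = [(0, 0), (4, 0), (4, 3)]
print(vote_with_neighbours(H, query_cells)[(0, 1)])  # 3; exact-cell lookup gives 2
```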
Geometric Hashing for SSE-Representation
Typically, the SSEs (secondary structure elements) are represented as sticks.
By use of two sticks, a coordinate reference frame can be defined (usually by placing one axis along one of the sticks).
An entry in the hash table might be (Holm and Sander) a list where each list element contains:
• identification of the SSE
• identification of the basis
• the type of SSE (alpha, beta)
• the midpoint of the SSE (in the reference frame)
• the direction of the SSE
• possible other information
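One possible way to hold such a list element in Python is a small record type. The field names, the identifier format, and the example values are hypothetical; they mirror the list above and are not taken from Holm and Sander's implementation.

```python
from dataclasses import dataclass
from typing import Tuple

Vector = Tuple[float, float, float]


@dataclass(frozen=True)
class SSEEntry:
    """One list element of a hash table bin (hypothetical layout)."""
    sse_id: str                 # identification of the SSE
    basis_id: Tuple[str, str]   # identification of the basis (the two sticks)
    sse_type: str               # "alpha" or "beta"
    midpoint: Vector            # midpoint of the SSE in the reference frame
    direction: Vector           # direction of the SSE in the reference frame
    extra: Tuple = ()           # possible other information


entry = SSEEntry(sse_id="model1:H3", basis_id=("model1:H1", "model1:E2"),
                 sse_type="alpha", midpoint=(4.0, 1.5, -2.0),
                 direction=(0.0, 0.0, 1.0))
print(entry.sse_type, entry.midpoint)
```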