Fast Fingerprint Calculations Thomas Schwarz, S.J.

Post on 25-Feb-2016

Page 1: Fast Fingerprint Calculations

Fast Fingerprint Calculations

Thomas Schwarz, S.J.

Page 2: Fast Fingerprint Calculations

Fingerprints

Definition: A fingerprint (a.k.a. signature) of an object Ob is a small string f(Ob) with the following properties:

1. f is a function of Ob. In particular, if two objects are equal, then so are their fingerprints.

2. Prob(f(Ob1) = f(Ob2)) << 1 for “random” objects Ob1 ≠ Ob2.

Page 3: Fast Fingerprint Calculations

Usage

Fingerprints are used to:
• Identify Objects
• Compare Objects Remotely
• Test an Object for Changes

Since fingerprints are smaller, they are very useful as stand-ins for remote objects.

Page 4: Fast Fingerprint Calculations

Usage

Object identification example:

Software Cloning: During maintenance, the need arises for a module very similar in character to one that already exists. Because of time pressure, the existing module is simply copied, all names are systematically changed, and the copy is then modified to serve the new needs.

Maintenance Problem: If a bug is detected in a clone or in the original, it probably also resides in the original and/or in other clones. Besides, clones arise because of time pressure, but the short-cut ends up costing more in the long run. Thus, it is better to identify clones during maintenance.

Clone Identification: Systematically suppress names, then test whether the function code is identical.

Johnson, J.H. Substring Matching for Software Clone Detection, and Change Tracking. International Conference on Software Maintenance. Victoria, BC, 1994, p. 120 - 126

Page 5: Fast Fingerprint Calculations

Usage

Similarity Testing for Files

n-gram: Contiguous substring of n characters in a file.

File Similarity: Count the number of occurrences of a particular n-gram.

Use the fingerprint of an n-gram as a hash value. Count the fingerprints instead of the n-grams.

Cohen, J.D. Recursive Hashing Functions for n-grams. ACM Trans. Information Systems, p. 291 -320.

Page 6: Fast Fingerprint Calculations

Usage

Remote String Searches: Find all occurrences of a given string in files on remote servers.

Instead of sending the string to all servers, only a fingerprint and the length l of the string are sent. The servers generate running fingerprints of l-grams and compare them with the string’s fingerprint.
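A minimal sketch of the server side of such a search. It assumes a Karp-Rabin style polynomial fingerprint modulo the Mersenne prime 2^61 - 1 with base 256; the slide does not fix a fingerprint function here, so `fingerprint`, `find_candidates`, `P` and `ALPHA` are hypothetical names and choices. Reported offsets are only candidates, because distinct strings can share a fingerprint.

```python
P = 2**61 - 1      # assumed modulus (a Mersenne prime)
ALPHA = 256        # assumed base: one byte per symbol

def fingerprint(data: bytes) -> int:
    """Polynomial fingerprint of data, evaluated with Horner's rule mod P."""
    s = 0
    for a in data:
        s = (s * ALPHA + a) % P
    return s

def find_candidates(text: bytes, length: int, target: int) -> list:
    """Return every offset whose length-gram has fingerprint `target`."""
    if length <= 0 or length > len(text):
        return []
    top = pow(ALPHA, length - 1, P)      # weight of the outgoing symbol
    s = fingerprint(text[:length])
    hits = [0] if s == target else []
    for i in range(1, len(text) - length + 1):
        # Drop text[i-1], shift by one symbol, append text[i+length-1].
        s = ((s - text[i - 1] * top) * ALPHA + text[i + length - 1]) % P
        if s == target:
            hits.append(i)
    return hits

# Client side: only the fingerprint and the length travel to the server.
pattern = b"abra"
print(find_candidates(b"abracadabra", len(pattern), fingerprint(pattern)))
```

Each server does constant work per position, and a real server would still verify candidate offsets against the actual string if it is available.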

Page 7: Fast Fingerprint Calculations

Usage

Remote File Comparison

Original Problem: How to compare pages of remote replicas of a database.

Solution: Calculate fingerprints (“signatures”) of each page. Calculate a super-signature from the pages. If super-signatures coincide, conclude that the replicas are in sync. If not, run a “smart” protocol to find the non-fitting signatures.
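A rough sketch of the comparison, under loudly stated assumptions: pages are byte strings, the page signature is a polynomial fingerprint modulo 2^61 - 1, and the super-signature is the same fingerprint applied to the list of page signatures. None of these choices comes from the slide, and `differing_pages` is a naive stand-in for the “smart” protocols of the cited papers, not an implementation of them.

```python
P = 2**61 - 1      # assumed modulus
ALPHA = 256        # assumed base

def sig(symbols) -> int:
    """Polynomial fingerprint of a sequence of non-negative integers."""
    s = 0
    for a in symbols:
        s = (s * ALPHA + a) % P
    return s

def page_signatures(pages):
    return [sig(page) for page in pages]

def super_signature(page_sigs) -> int:
    # Assumption: fingerprint the sequence of page signatures.
    return sig(page_sigs)

def differing_pages(sigs_a, sigs_b):
    """Naive fallback: compare page signatures one by one."""
    return [i for i, (x, y) in enumerate(zip(sigs_a, sigs_b)) if x != y]

replica_a = [b"page0", b"page1", b"page2"]
replica_b = [b"page0", b"pageX", b"page2"]
sa, sb = page_signatures(replica_a), page_signatures(replica_b)
if super_signature(sa) != super_signature(sb):
    print("out of sync, pages:", differing_pages(sa, sb))
```

Only one super-signature crosses the network in the common case where the replicas agree.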

Abdel-Ghaffar, K. A. S., El-Abbadi, A. Efficient Detection of Corrupted Pages in a Replicated File. ACM Symp. Distributed Computing, 1993, p. 219-227.

Barbara, D., Garcia-Molina, H., Feijoo, B. Exploiting Symmetries for Low-Cost Comparison of File Copies. Proc. Int. Conf. Distributed Computing Systems, 1988, p. 471-479.

Barbara, D., Lipton, R. J. A Class of Randomized Strategies for Low-Cost Comparison of File Copies. IEEE Trans. Parallel and Distributed Systems, vol. 2(2), 1991, p. 160-170.

Fuchs, W., Wu, K. L., Abraham, J. A. Low-Cost Comparison and Diagnosis of Large, Remotely Located Files. Proc. Symp. Reliability Distributed Software and Database Systems, 1986, p. 67-73.

Schwarz, Th., Bowdidge, B., Burkhard, W. Low Cost Comparison of Files. Int. Conf. on Distr. Comp. Syst. (ICDCS 90), p. 196-201.

Page 8: Fast Fingerprint Calculations

Usage

Secure Signatures:

To identify an object, maintain its signature. If the object is altered by an adversary, the adversary cannot do so in a computationally feasible way without changing the signature.

“Cryptographically secure signature”: SHA-1, MD5

Used for authentication, e.g. in computer forensics, digital signatures, etc.

Page 9: Fast Fingerprint Calculations

SHA-1

• 20 bytes long.
• Designed for fast calculation.
• Considered unbreakable.
• Used increasingly in applications where cryptographic security is not needed.

Radia Perlman’s Law of Cryptography: “If a lot of smart people spent lots of time trying to break a scheme, and did not succeed, then it cannot be done.”

Page 10: Fast Fingerprint Calculations

Useful Properties of Fingerprints

• Fast calculation.

• Low collision rate.
If the fingerprints have length l, then the probability of a collision should be $2^{-l}$.
If there are small changes, then fingerprints should change.

• Cryptographically unbreakable.
Given a signature, one cannot construct an object with this signature.

Page 11: Fast Fingerprint Calculations

Useful Properties of Fingerprints

• Updatable
If the object changes, then we can update the signature from the old signature and the changes.

• Concatenation of Objects
If an object is made up of several objects, then we can calculate the signature of the super-object from its constituents, possibly in a way that allows us to quickly pinpoint different component objects.

Page 12: Fast Fingerprint Calculations

Karp Rabin Style Fingerprints

$$\operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{N-1}) = \sum_{i=0}^{N-1} a_i \, \alpha^{N-1-i}$$

Here, the calculation takes place in a ring R with multiplication and addition, and α is a fixed element of R.

Karp, R. M., Rabin, M. O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, Vol. 31, No. 2, March 1987.

Page 13: Fast Fingerprint Calculations

Karp Rabin Style Fingerprints

• Calculation time linear in N (one multiplication and one addition per symbol)

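$$
\begin{aligned}
\operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{N-1}, a_N) &= \sum_{i=0}^{N} a_i \, \alpha^{N-i} \\
&= \alpha \Bigl( \sum_{i=0}^{N-1} a_i \, \alpha^{N-1-i} \Bigr) + a_N \\
&= \alpha \cdot \operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{N-1}) + a_N
\end{aligned}
$$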
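The recurrence on this slide is Horner's rule. A small sketch, assuming the ring is the integers modulo a prime `p` (one of the choices discussed later) and using the hypothetical names `sig` and `sig_direct`; the direct sum is included only to check the recurrence against the definition.

```python
def sig(symbols, alpha, p):
    """Signature via the recurrence: new = alpha * old + next symbol."""
    s = 0
    for a in symbols:
        s = (s * alpha + a) % p    # one multiplication, one addition
    return s

def sig_direct(symbols, alpha, p):
    """Signature straight from the definition: sum of a_i * alpha^(N-1-i)."""
    n = len(symbols)
    return sum(a * pow(alpha, n - 1 - i, p) for i, a in enumerate(symbols)) % p

data = list(b"fingerprint")
p = 2**31 - 1                      # assumed prime modulus
print(sig(data, 257, p) == sig_direct(data, 257, p))
```

With base 10 and symbols that are decimal digits the signature simply reads off the number, which makes the weighting easy to see: `sig([1, 2, 3], 10, 1009)` is 123.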

Page 14: Fast Fingerprint Calculations

Karp Rabin Style Fingerprints

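• Easy to calculate consecutive n-grams

$$\operatorname{sig}_\alpha(a_1, \ldots, a_{n-1}, a_n) = \alpha \cdot \bigl( \operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{n-1}) - \alpha^{n-1} a_0 \bigr) + a_n$$

• Easy to calculate signatures of concatenations

$$\operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}, a_l, a_{l+1}, \ldots, a_{l+n-1}) = \alpha^n \cdot \operatorname{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}) + \operatorname{sig}_\alpha(a_l, a_{l+1}, \ldots, a_{l+n-1})$$

Possibly use a table of values of $\alpha^n$.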
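Both properties can be checked numerically. The sketch below assumes integers modulo 2^31 - 1 as the ring and 257 as the base; `slide` and `concat` are hypothetical helper names for the n-gram update and the concatenation rule.

```python
P = 2**31 - 1     # assumed modulus
ALPHA = 257       # assumed base

def sig(symbols):
    s = 0
    for a in symbols:
        s = (s * ALPHA + a) % P
    return s

def slide(old_sig, a_out, a_in, n):
    """Signature of the next n-gram from the previous one."""
    return ((old_sig - pow(ALPHA, n - 1, P) * a_out) * ALPHA + a_in) % P

def concat(sig_left, sig_right, n_right):
    """Signature of left + right, where right has n_right symbols."""
    return (pow(ALPHA, n_right, P) * sig_left + sig_right) % P

data = list(b"abracadabra")
n = 4
s = sig(data[:n])
for i in range(1, len(data) - n + 1):
    s = slide(s, data[i - 1], data[i + n - 1], n)
    assert s == sig(data[i:i + n])           # matches a fresh computation

left, right = data[:5], data[5:]
assert concat(sig(left), sig(right), len(right)) == sig(data)
```

Precomputing the powers of the base for the lengths that actually occur replaces the `pow` call with a table look-up.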

Page 15: Fast Fingerprint Calculations

Karp Rabin Style Fingerprints

• Cryptographically not secure

A cryptographically secure signature is a one-way function. In order to be efficiently calculable, it needs to process large portions of a string at a time. Thus, cryptographic security is not a desirable property in general.

Page 16: Fast Fingerprint Calculations

Choice of Ring R

1. Integers modulo a prime p

The ring is then a well-understood field. But reducing modulo p is a costly operation.

2. Integers modulo $2^f$

Powers of two are zero divisors, e.g. $2 \cdot 2^{f-1} = 0$. This excludes powers of 2 as α.
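A short illustration of why a power of 2 is a poor base in the integers modulo 2^f, assuming f = 8 and the hypothetical helper `sig_mod_2f`: once a symbol is f or more positions from the end, its weight is a multiple of 2^f and it no longer influences the signature.

```python
F = 8
M = 1 << F                      # modulus 2^f

def sig_mod_2f(symbols, alpha):
    s = 0
    for a in symbols:
        s = (s * alpha + a) % M
    return s

changed = [7] + [0] * F         # differs from all-zero only in the first symbol
zeros = [0] * (F + 1)

print(sig_mod_2f(changed, 2) == sig_mod_2f(zeros, 2))   # base 2: change is invisible
print(sig_mod_2f(changed, 3) == sig_mod_2f(zeros, 3))   # odd base: change is seen
```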

Page 17: Fast Fingerprint Calculations

Choice of Ring

3. Reduction by a polynomial.

The space of unsigned integers $0, \ldots, 2^{32}-1$ can be naturally identified with the space of all polynomials in k[t] over k = {0,1} with degree up to 31.

Select a polynomial φ(t) with degree f up to 31 and consider the ring R = k[t]/(φ(t)). Elements in this ring are naturally identified with all unsigned integers $0, \ldots, 2^f-1$.

Addition of these polynomials corresponds to the fast XOR. Multiplication is more difficult, but multiplication by t is a left shift followed by conditionally XORing with φ.

This is the most promising construction.

Page 18: Fast Fingerprint Calculations

R = k[t]/(φ(t)) Example

• Set φ = $t^4 + t + 1$, that is, φ = 10011.
• Elements of R are all bit strings of length 4.
• To add 0101 and 1100, just XOR: 0101 + 1100 = 1001.
• To multiply with t = 0010, left shift and XOR conditionally with φ.
• To multiply 0010 with 0010, left-shift the first and obtain 0,0100. The leading coefficient is zero, which is dropped. Result is 0100.
• To multiply 1100 with 0010, left-shift to obtain 1,1000. The leading coefficient is one, so XOR with φ = 10011 to obtain the result 1011.
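The arithmetic of this example fits in a few lines. In the sketch below the reduction polynomial 10011 is called `PHI`, and `add` and `mul_t` are hypothetical names for the two operations the slide describes.

```python
PHI = 0b10011      # reduction polynomial with bit pattern 10011
F = 4              # its degree: ring elements are 4-bit strings

def add(x, y):
    """Addition of polynomials over {0, 1} is bitwise XOR."""
    return x ^ y

def mul_t(x):
    """Multiply by t: left shift, then XOR with PHI if bit F became one."""
    x <<= 1
    if (x >> F) & 1:
        x ^= PHI
    return x

print(format(add(0b0101, 0b1100), "04b"))   # 1001
print(format(mul_t(0b0010), "04b"))         # 0100
print(format(mul_t(0b1100), "04b"))         # 1011
```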

Page 19: Fast Fingerprint Calculations

Galois Fields

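If φ is irreducible, then R is a Galois field.

If we use a Galois field, we can concatenate fingerprints to obtain a signature:

$$\operatorname{sig}_{\boldsymbol{\alpha}}(P) = \bigl( \operatorname{sig}_{\alpha_1}(P), \operatorname{sig}_{\alpha_2}(P), \ldots, \operatorname{sig}_{\alpha_n}(P) \bigr), \qquad \boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)$$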

Page 20: Fast Fingerprint Calculations

Galois Fields

If we use $\boldsymbol{\alpha} = (\alpha, \alpha^2, \alpha^3, \ldots, \alpha^n)$, then the signatures are the parity symbols of a generalized, non-systematic Reed-Solomon code.

Since these codes are MDS, the signature is guaranteed to change if up to n symbols of the object change.
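The guarantee can be tested exhaustively in a toy field. The sketch assumes GF(16) built from the polynomial 10011, the element t = 0010 as the base (it is primitive in this field), and objects shorter than 15 symbols; all function names are hypothetical.

```python
from itertools import combinations, product

PHI = 0b10011      # assumed reduction polynomial
F = 4

def mul_t(x):
    x <<= 1
    if (x >> F) & 1:
        x ^= PHI
    return x

def gf_mul(x, y):
    """Multiply two field elements by shift-and-add."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x = mul_t(x)
        y >>= 1
    return r

def sig(symbols, beta):
    s = 0
    for a in symbols:
        s = gf_mul(s, beta) ^ a
    return s

def nfold_sig(symbols, n, alpha=0b0010):
    sigs, beta = [], 1
    for _ in range(n):
        beta = gf_mul(beta, alpha)          # alpha, alpha^2, ..., alpha^n
        sigs.append(sig(symbols, beta))
    return tuple(sigs)

def all_changes_detected(base, n):
    """Change every choice of 1..n symbols in every way; check the signature moves."""
    ref = nfold_sig(base, n)
    for k in range(1, n + 1):
        for positions in combinations(range(len(base)), k):
            for deltas in product(range(1, 1 << F), repeat=k):
                changed = list(base)
                for pos, d in zip(positions, deltas):
                    changed[pos] ^= d       # a nonzero delta is a real change
                if nfold_sig(changed, n) == ref:
                    return False
    return True

print(all_changes_detected([1, 2, 3, 4, 5, 6], 2))
```

The bound is tight in the sense that a 1-fold signature can miss two changes: in this field `[1, 2]` and `[0, 0]` have the same signature for the base t.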

Page 21: Fast Fingerprint Calculations

Galois Field Signatures

To calculate a Galois field footprint, we only need per symbol:

• One XOR
• One left-shift
• One test whether the leading coefficient is now one.
• Conditionally one XOR.
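These four per-symbol operations map directly onto a loop. A minimal sketch, assuming 4-bit symbols, the reduction polynomial 10011 (called `PHI` here) and t as the base; `gf_fingerprint` is a hypothetical name and the input symbols are sample values.

```python
PHI = 0b10011      # assumed reduction polynomial
F = 4              # its degree

def gf_fingerprint(symbols):
    s = 0
    for a in symbols:
        s <<= 1                # one left shift (multiply by t)
        if (s >> F) & 1:       # one test of the leading coefficient
            s ^= PHI           # conditionally one XOR (reduce)
        s ^= a                 # one XOR (add the next symbol)
    return s

sample = [0b1000, 0b1100, 0b1010, 0b0111, 0b0011]
print(format(gf_fingerprint(sample), "04b"))
```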

Page 22: Fast Fingerprint Calculations

Speeding up Galois field footprints

However, we do not have to execute the reduction step each time.

Instead, left shift and XOR b times (Broder’s idea). Then use a table look-up to reduce the “overhang”.

Broder, A. Some applications of Rabin's fingerprinting method. In Capocelli, De Santis, and Vaccaro, (ed.), Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993.

Page 23: Fast Fingerprint Calculations

Speeding up Galois field footprints

String is (1000, 1100, 1010, 0111, 0011, ...). Choose φ = 10011. This is an irreducible polynomial.

Step 0: 1000
Step 1: 1,0000 + 1100 = 1,1100
Step 2: 11,1000 + 1010 = 11,0010
Step 3: 110,0100 + 0111 = 110,0011
Step 4: 1100,0110 + 0011 = 1100,0101

Now use table look-up to calculate 0101 + table[1100] = 0101 + 0111 = 0010.

8 elementary ops + 1 shift right + 1 table look-up + 1 XOR.

Page 24: Fast Fingerprint Calculations

Speeding up Galois field footprints

String is (1000, 1100, 1010, 0111, 0011, ...).

Step 0: 1000
Step 1: 1,0000 + 1100 = 1,1100 = 1,1100 + φ = 1,1100 + 1,0011 = 1111.
Step 2: 1,1110 + 1010 = 1,0100 = 0111.
Step 3: 0,1110 + 0111 = 1001.
Step 4: 1,0010 + 0011 = 1,0001 = 0010.

8 elementary ops + 4 condition evaluations + 2 elementary ops on average.

Page 25: Fast Fingerprint Calculations

Speeding up Galois field footprints

How do we calculate the table entries? Systematically reduce by t-multiples of φ = 10011. To calculate the table entry for 12 = 1100, reduce 1100,0000 in four steps.

Reduce with $t^3 \cdot$φ: 1100,0000 + $t^3 \cdot$φ = 1100,0000 + 1001,1000 = 0101,1000

Now reduce with $t^2 \cdot$φ: 0101,1000 + $t^2 \cdot$φ = 0101,1000 + 0100,1100 = 0001,0100

No step with $t \cdot$φ since the corresponding coefficient is zero.

Reduce with φ: 0001,0100 + 0001,0011 = 0000,0111 = 0111.
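The table construction and the deferred reduction can be sketched together. The code assumes the polynomial 10011 (called `PHI`), 4-bit symbols and b = 4 shifts between look-ups; `poly_mod`, `stepwise` and `broder` are hypothetical names, and `stepwise` exists only as a reference to compare against.

```python
PHI = 0b10011            # assumed reduction polynomial
F = 4                    # its degree
B = 4                    # assumed number of shifts between reductions
MASK = (1 << F) - 1

def poly_mod(v):
    """Reduce v modulo PHI by XORing in t-multiples of PHI, highest first."""
    while v.bit_length() > F:
        v ^= PHI << (v.bit_length() - 1 - F)
    return v

# TABLE[h] is the reduced value of the overhang h * t^F.
TABLE = [poly_mod(h << F) for h in range(1 << B)]

def stepwise(symbols):
    """Reference version: reduce after every shift."""
    s = 0
    for a in symbols:
        s = poly_mod(s << 1) ^ a
    return s

def broder(symbols):
    """Shift and XOR up to B times, then fold the overhang with one look-up."""
    s, rest = symbols[0], symbols[1:]       # assumes a non-empty input
    for start in range(0, len(rest), B):
        acc = s
        for a in rest[start:start + B]:
            acc = (acc << 1) ^ a            # no reduction inside the batch
        s = (acc & MASK) ^ TABLE[acc >> F]
    return s

sample = [0b1000, 0b1100, 0b1010, 0b0111, 0b0011]
print(format(TABLE[0b1100], "04b"), format(broder(sample), "04b"))
```

Both versions agree because reduction is linear: the low F bits are already reduced, so only the overhang needs the table.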

Page 26: Fast Fingerprint Calculations

Speeding up Galois field footprints

Optimal table size needs to be determined experimentally.

The table needs to fit in cache, so it cannot be much bigger than $2^{16}$.

If the table is too small, then the look-up cost does not amortize well.

Page 27: Fast Fingerprint Calculations

Galois field signatures

Galois field signatures are concatenations of Galois field fingerprints.

Broder tables work for multiplication with $t^2$, $t^3$, $t^4$ as well, but less efficiently, since now we shift two, three, or four times so that we need to use table look-up more often.

Page 28: Fast Fingerprint Calculations

Performance Results

• 1.772 msec per MB for 16 bit parity
• 2.012 msec per MB for 16 bit $\alpha^1$ power.
• 3.114 msec per MB for 16 bit $\alpha^2$ power.

Results for a 1.99GHz Pentium 4 w. 512MB memory.

Page 29: Fast Fingerprint Calculations

How Long should Signatures be?

• Key fact: There are 31,557,600 seconds in a year.

• At x calculations per second, there will be 31,557,600x incidents per year, which will lead to fewer than $2^{-31} \cdot 31{,}557{,}600\,x \approx 0.015x$ collisions per year for 32 bit signatures.

• So, minimum length should be 64 bits. We can achieve this easily.

• With larger signatures, a collision becomes less likely than hard drive failures (writing on the wrong track), software failures, etc.
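The estimate is easy to redo for other signature lengths. The sketch follows the slide's own bound of 2^-(bits - 1) per calculation, which is taken from the 32 bit case and assumed to carry over to other lengths; `collisions_per_year` is a hypothetical name.

```python
SECONDS_PER_YEAR = 31_557_600          # 365.25 days

def collisions_per_year(x, bits):
    """Upper bound on collisions per year at x calculations per second."""
    return 2.0 ** -(bits - 1) * SECONDS_PER_YEAR * x

for bits in (32, 64):
    print(bits, collisions_per_year(1, bits))
```

At one calculation per second the 32 bit bound is about 0.015 per year, while at 64 bits even a million calculations per second stay far below one expected collision per year.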

Page 30: Fast Fingerprint Calculations

Research Questions

• * property of a signature: Changing up to n symbols changes the n-fold signature for sure.

• It is known that if we change to a different vector $\boldsymbol{\alpha}$, e.g. one where the components are all primitive elements, we lose the * property. Are there other $\boldsymbol{\alpha}$ with this property?

• We can use Broder tabling with different irreducible φ and then concatenate the Galois field footprints. Can we find a condition under which the * property holds?

• What properties hold when φ is not irreducible? It seems statistically fine as long as φ has a constant coefficient.