

Near Duplicate Document Detection: Mathematical Modeling and Algorithms

Liwei Ren
Trend Micro
10101 North De Anza Boulevard, Cupertino, CA 95014, USA
1-408-850-1048
[email protected]

Qiuer Xu
Trend Micro
Building B, Soho International Plaza, Nanjing, 210012, P.R. China
86-25-52386123
[email protected]

ABSTRACT

Near-duplicate document detection is a well-known problem in the area of information retrieval, and an important one to solve for many applications in the IT industry. It has been studied extensively in the research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints and document similarity. With these concepts, the problem can be transformed into a keyword-like search problem with results ranked by document similarity. There are two major techniques: the first is to extract robust and unique fingerprints from a document; the second is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: information filtering, retrieval models, search process.

General Terms

Algorithms, Experimentation.

Keywords

Duplicate Document, Near Duplicate Detection, Document Fingerprint, Document Similarity, Retrieval Model, Information Retrieval, Asymmetric Architecture

1. INTRODUCTION

Near duplicate document detection (NDDD) is a well-known problem in the area of information retrieval. It is defined as identifying whether a given document is a near duplicate of one or more documents from a well-defined document set. This problem arises in many technical areas such as crawling and indexing optimization of web search engines, copy detection systems, email archival, spam filtering, and data leak prevention systems. There is an extensive research literature discussing this subject with numerous use cases and solutions [1-6]. Recently, Kumar et al. [7] provided a thorough review of the most significant works of recent decades, covering more than 60 papers.

We organize the following sections as problem definition, mathematical modeling and algorithmic solutions. We introduce a formal problem definition, followed by three text models that are used to represent documents. One text model is selected for constructing the algorithmic solution. By introducing concepts such as document fingerprint and document similarity, the problem can be decomposed into three independent problems: (a) document fingerprint extraction; (b) document similarity calculation; (c) fingerprint based search engine. Two algorithms are constructed to extract fingerprints from documents and measure the similarity between documents. One can use an existing keyword based search engine to solve problem (c). Finally, an architecture of asymmetric fingerprint generation is proposed to reduce the number of fingerprints. A smaller number of fingerprints is critical for the success of some special applications such as data leak prevention systems.

2. Problem Definition and Modeling

The problem proposed in the introduction section is not well-defined from the perspective of practical implementation. In practice, we need a quantitative measurement of how "near duplicated" two documents are. We need a more rigorous definition of NDDD.

Definition 1: Assume that we have a set of documents S. For any given document d and a percentile X%, one needs to identify the documents D1, D2, …, Dm from S such that SIM(d, Dj) ≥ X% for 1 ≤ j ≤ m, where SIM is a well-defined function that calculates the similarity of two documents. The result {D1, D2, …, Dm} is shown in descending order of similarity.

There are several challenges in solving this problem:

(a) The document set may be huge, on the scale of millions or even billions of documents. One certainly cannot compare d with each document of S to calculate the similarity. How can one efficiently identify the reference documents from such a huge document set?

(b) How can one construct the similarity function SIM?
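To make challenge (a) concrete, a naive baseline would compare d against every stored document. A minimal sketch in Python (with a hypothetical toy_sim standing in for a real SIM function) shows the O(|S|) cost that the fingerprint index of the later sections is designed to avoid:

```python
def ndd_brute_force(d, S, SIM, threshold):
    """Naive NDDD: score d against every document in S.

    Requires one SIM computation per stored document, which is
    infeasible when S holds millions or billions of documents.
    """
    matches = [(D, SIM(d, D)) for D in S]
    matches = [(D, s) for D, s in matches if s >= threshold]
    # Rank the results in descending order of similarity.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Toy positional similarity, for illustration only (not the SIM of this article):
def toy_sim(a, b):
    common = sum(1 for x, y in zip(a, b) if x == y)
    return common / max(len(a), len(b))

print(ndd_brute_force("abcd", ["abcd", "abxd", "zzzz"], toy_sim, 0.5))
```

Every document in S is touched on every query, which is exactly what the fingerprint-based model below avoids.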

Before we can answer these questions, we need to propose text models to represent a document. A text model allows us to exclude irrelevant textual elements so that we can focus on the essence.


Documents can be in any format such as Word, PowerPoint, Excel, PDF, PostScript and many others. Individual words or sentences can be in different styles (bold, italic, underline) and with a variety of fonts. These are not important textual elements when we discuss "near duplicate". Fundamentally, we are more interested in the textual content that carries semantic significance.

A document can be written in any language. Texts in different languages can be encoded differently; for example, English texts can be encoded in ASCII, Chinese in GB, and Japanese in SJIS. However, all languages can be encoded in the UTF-8 standard, which is able to represent all languages in one text.

For documents in English or any western language, most authors view a text as a string of words [2-6]. Words can be extracted from texts with a tokenization technique that uses spaces to separate words (or tokens) in sentences.

Some languages such as Chinese and Japanese do not use spaces between words. In those eastern languages, a sentence is a string of characters without spaces between them. All characters of different languages can be encoded as UTF-8 characters. As such, a text in any language can be considered a string of UTF-8 characters.

Depending on the language, each UTF-8 character consists of one or multiple bytes; for example, a Chinese character typically consists of three bytes while an ASCII character is one byte. Therefore, one can also view a text as a string of bytes if we convert it from its original encoding into UTF-8.

Definition 2: We have three text models to represent a document:

Model 1: A text is a string of tokens (or a sequence of tokens).

Model 2: A text is a string of UTF-8 characters.

Model 3: A text is a string of bytes when the text is encoded in UTF-8.

In summary, a text is a string of basic textual units, where a basic textual unit means a token, UTF-8 character or byte.

Besides these three models, there exist other text models whose basic textual units are sentences [5], textual lines, or even pages. Those models are not of interest to the authors of this article.

Numerous articles study NDDD using text model 1. While this model is good enough to study NDDD for documents in western languages, it has obstacles when dealing with non-western languages. Model 1 needs tokenization techniques, and tokenization is a daunting task for processing documents in Chinese and Japanese, especially.

There are few works adopting text models 2 and 3 in the academic world. Manber [1] discussed duplicate detection in terms of pairwise file matching of ASCII files. This is a special case of models 2 and 3. In contrast, it has become a common practice in industry to apply text model 2 or 3 to many document management problems such as DLP [8-10], spam filtering and e-Discovery. In this article, we use text model 2 to extract fingerprints from documents and calculate the similarity between two documents. Both text models 2 and 3 are language independent while model 1 is not. Therefore, the techniques developed in this article apply equally to documents in any language, and even apply to a document written in multiple languages.

Definition 3: Document normalization is a process that consists of three sub-processes applied sequentially:

(a) Converting a document in any format, such as Word, Excel and PDF, into a plain text encoded in UTF-8;

(b) Converting any plain text in another encoding into a plain text encoded in UTF-8;

(c) Removing trivial characters such as white spaces, delimiters and control characters from the UTF-8 text.
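Sub-processes (b) and (c) can be sketched in a few lines of Python; step (a), format conversion, needs external tools and is omitted. Which characters count as "trivial" is an assumption of this sketch (here: white space plus the Unicode control, separator and punctuation categories):

```python
import unicodedata

def normalize(raw: bytes, encoding: str = "utf-8") -> str:
    """Sketch of normalization steps (b) and (c) from Definition 3."""
    text = raw.decode(encoding)  # (b): decode from the original encoding
    # (c): drop white spaces, delimiters and control characters; the
    # exact set of trivial characters is an assumption for illustration.
    return "".join(
        c for c in text
        if not (c.isspace() or unicodedata.category(c)[0] in ("C", "Z", "P"))
    )

print(normalize("Hello,\t near\nduplicate!".encode("utf-8")))
```

The result is the normalized text: a bare string of the semantically significant UTF-8 characters.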

Definition 4: The result of document normalization is a string of UTF-8 characters that contains the most significant information of the original document. It is called a normalized text or normalized document.

There are many software tools available for document normalization. Without loss of generality, we can consider all documents as normalized texts in the rest of this article unless we specify otherwise.

It is now time to tackle the two challenges of Definition 1. To meet the first challenge, let us introduce the concept of document fingerprint.

Definition 5: A document fingerprint is an integer or a binary string of fixed length. Fingerprints can be generated from documents by a function GEN. The fingerprints have the following characteristics:

(a) A document D has multiple fingerprints {F1, F2, …, Fn}, i.e., GEN(D) = {F1, F2, …, Fn}.

(b) Two irrelevant documents d and D do not have a common fingerprint, that is, GEN(d) ∩ GEN(D) = ϕ. This is called uniqueness.

(c) A fingerprint can survive moderate document changes. That means GEN(d) ∩ GEN(D) ≠ ϕ if d is a near duplicate of D. This is called robustness.

(d) In summary, a fingerprint is a unique invariant across document variants.

A document D can thus be represented by multiple fingerprints; let us denote this relationship as D ↔ {F1, F2, …, Fn}. For any document D from the document set S in Definition 1, we can assign a unique document ID so that we establish a mapping between the ID and the fingerprints. We denote this as ID ↔ {F1, F2, …, Fn}. This is reminiscent of the keyword based search problem, as we can index the relationship ID ↔ {F1, F2, …, Fn} into index files by treating the fingerprints as keywords. We can restate the NDDD problem of Definition 1 with the following model, supported by two procedures, an indexer and a searcher.

NDDD Model: Assume we have two functions: (a) a fingerprint generation function GEN; (b) a document similarity function SIM. The NDDD problem is then reduced to a fingerprint based indexing and searching problem:

Indexer: Given a set of documents S, each document is assigned a unique ID. We extract multiple fingerprints {F1, F2, …, Fn} from each document D with the function GEN. The indexer indexes them together with the document ID, i.e., ID ↔ {F1, F2, …, Fn}. The indexing results are saved into index files.

Searcher: For any query document d and percentile X%, we extract multiple fingerprints {f1, f2, …, fn} from d with the function GEN. The searcher uses them to retrieve relevant document IDs from the index files. If a reference document contains any of {f1, f2, …, fn}, its ID will be retrieved. With the ID, the reference document D is retrieved as a result. Then, we calculate SIM(d, D) to measure the similarity. There may be multiple reference documents retrieved; we calculate the similarity for all of them, and rank the results in descending order of similarity.
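The indexer/searcher pair can be sketched as a small in-memory Python class, with a plain dict serving as the inverted index and hypothetical gen and sim callables standing in for GEN and SIM (the toy 3-gram versions below are for illustration only, not the algorithms of this article):

```python
from collections import defaultdict

class FingerprintIndex:
    """Sketch of the NDDD model: an inverted index mapping
    fingerprint -> document IDs, mirroring ID <-> {F1, ..., Fn}."""

    def __init__(self, gen, sim):
        self.gen = gen              # GEN: document -> set of fingerprints
        self.sim = sim              # SIM: (query, reference) -> [0, 1]
        self.index = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, doc):
        """Indexer: store every fingerprint of doc under its ID."""
        self.docs[doc_id] = doc
        for fp in self.gen(doc):
            self.index[fp].add(doc_id)

    def query(self, d, threshold):
        """Searcher: fingerprint lookup first, then SIM on candidates only."""
        candidates = set()
        for fp in self.gen(d):
            candidates |= self.index.get(fp, set())
        scored = [(doc_id, self.sim(d, self.docs[doc_id]))
                  for doc_id in candidates]
        return sorted([(i, s) for i, s in scored if s >= threshold],
                      key=lambda p: p[1], reverse=True)

# Toy GEN/SIM based on character 3-grams, for illustration only:
def toy_gen(doc):
    return {doc[i:i + 3] for i in range(len(doc) - 2)}

def toy_sim(d, D):
    return len(toy_gen(d) & toy_gen(D)) / max(len(toy_gen(D)), 1)

idx = FingerprintIndex(toy_gen, toy_sim)
idx.add(1, "abcdefgh")
idx.add(2, "zzzzzzzz")
print(idx.query("abcdefgX", 0.5))
```

Only candidate documents sharing at least one fingerprint with d are ever passed to SIM, which is what lets the model scale to a large S.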

With the model above, the NDDD problem is decomposed into three independent problems.

Three Sub-Problems:

1. Fingerprint generation --- generate multiple fingerprints from a given document D by a fingerprint generation function GEN(D).

2. Similarity measurement --- calculate the similarity between two documents d and D by the similarity function SIM(d, D).

3. Indexing/searching --- the indexer indexes a document ID and its fingerprints {F1, F2, …, Fn}; the searcher retrieves document IDs against the indices with given fingerprints. This is similar to a keyword based search engine such as Google or Lucene.

One can use a general search engine framework or even a relational database system to solve the third problem. Therefore, we propose algorithmic solutions to the first and second problems only.

3. Algorithms

This section provides algorithms to construct the two functions GEN and SIM respectively.

The function GEN extracts fingerprints from a given normalized document. A fingerprint is a possible invariant of the text that can survive document changes. What can survive changes? Changes of text are caused by document modification with editing operations such as insertion, deletion, copy/paste, etc. However, many pieces remain unchanged in the new text; these pieces merely shift their relative positions. If we can identify some unchanged text pieces, we can use them as text invariants to generate fingerprints. How do we locate these unchanged yet shifting pieces?

First of all, we use text model 2 to represent a text as a string of UTF-8 characters; let us denote this as T = c1 c2 … cL, where L is the string length. Hence, we can discuss strings of characters instead of texts or documents. Secondly, we introduce the concept of "anchoring points", which is briefly discussed in [1] without implementation suggestions. An anchoring point is a character in the string that remains the same relative to its neighborhood when the string changes. One can use the neighborhood around the anchoring point to generate a fingerprint with a good hash function H. With multiple anchoring points, we have multiple fingerprints for the document. There are two issues to be solved. The first is how to select robust anchoring points, since the string can change. The second is that there may be too many anchoring points, so that we generate too many fingerprints from a given string. We propose algorithm 1 to construct the function GEN, which handles both issues.

Definition 6: We need some notation for writing up algorithm 1:

- The alphabet A of UTF-8 characters appearing in the string.

- Two numbers N and M that select the most robust anchoring points for generating fingerprints. M can be fixed for any text string while N is selected according to the string size. Table 1 shows an example of how M and N are configured.

- The width W of the anchoring neighborhoods.

- A hash function H that generates a fingerprint from a sub-string of size W. There is no specific requirement for the hash function.

- A character score function defined, for a character with n occurrences at offsets P1, P2, …, Pn, as

  score = n · (Pn − P1) / Σ_{1≤i<n} (Pi+1 − Pi)²

Table 1: M and N configured by text size

Text Size Range    M    N
0-10K              4    128
10-20K             4    256
20-30K             4    256
30-50K             4    512
50-70K             4    1024
70-80K             4    1024
80-100K            4    1024
100-500K           4    1024
> 500K             4    1024
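The character score function of Definition 6 can be read as n·(Pn − P1)/Σ(Pi+1 − Pi)²: frequency times span over the sum of squared gaps, so that frequent, evenly distributed characters score highest. A sketch in Python, assuming this reading of the formula:

```python
def char_score(offsets):
    """Score a character from its occurrence offsets P1 < P2 < ... < Pn.

    Assumed reading of the Definition 6 score: n * (Pn - P1) divided by
    the sum of squared gaps, rewarding frequent and evenly spread
    characters over the same span.
    """
    n = len(offsets)
    if n < 2:
        return 0.0
    span = offsets[-1] - offsets[0]
    gaps_sq = sum((offsets[i + 1] - offsets[i]) ** 2 for i in range(n - 1))
    return n * span / gaps_sq

# Evenly spread occurrences beat clustered ones over the same span:
print(char_score([0, 30, 60, 90]))   # even gaps of 30
print(char_score([0, 1, 2, 90]))     # clustered near the start
```

For a fixed span and occurrence count, the sum of squared gaps is minimized when the gaps are equal, so even distribution maximizes the score.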

Algorithm 1:

Input: String T as c1 c2 … cL.

Output: Fingerprint set.

Procedure:

Step 1: Select the number N from Table 1 according to the string length L.

Step 2: Run through the string T, counting the occurrences of each unique UTF-8 character in A and saving the offsets.

Step 3: For each C ∈ A, the character C has one or multiple occurrences in T. Denote their offsets as P1, P2, …, Pn. Use the score function to calculate the score of C.

Step 4: Pick the M characters from A that have the highest scores. That is B = {C1, C2, …, CM}.

Step 5: For each C ∈ B, do steps 6 to 9.

Step 6: For each occurrence of C in T, we have an anchoring neighborhood which has C as its center. Each neighborhood is a sub-string of size W. Denote these neighborhoods as S1, S2, …, Sn with respect to the occurrence offsets P1, P2, …, Pn.

Step 7: Sort the list of sub-strings S1, S2, …, Sn. Without loss of generality, we can still denote the sorted list as S1, S2, …, Sn.

Step 8: Select the first K items from the sorted list, where K = MIN(N, n). They are {S1, S2, …, SK}.

Step 9: Apply the hash function H to {S1, S2, …, SK} to generate K fingerprints and add them to the fingerprint set.

The algorithm is stated in terms of text model 2. However, it works for the other two models as well by replacing "character" with either "token" or "byte". The idea of the algorithm is straightforward. First, it selects the most significant characters from the alphabet of the input string, using a scoring function to measure significance. When calculating the score of a given character, we consider both the frequency and the distribution of the character across the string; this is reflected in the score function. Second, for each picked character, it chooses the robust anchoring points by sorting the neighborhoods and picking the top items from the list. Sorting is a mechanism to turn randomness into order. The result is a set of at most M*N fingerprints. For example, when the normalized text size is less than 10KB, which is typical in the real world, we get at most 4*128 = 512 fingerprints.
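A compact sketch of algorithm 1 in Python. The inline score follows the same assumed reading of the Definition 6 formula, and the parameter values M, N, W as well as the choice of SHA-1 for H are illustrative, not prescribed by the algorithm:

```python
import hashlib
from collections import defaultdict

def gen_fingerprints(T, M=4, N=128, W=8):
    """Sketch of algorithm 1 (GEN): anchoring-point fingerprints."""
    # Steps 2-3: collect offsets per character, then score each character.
    offsets = defaultdict(list)
    for i, c in enumerate(T):
        offsets[c].append(i)

    def score(P):  # assumed reading of the Definition 6 score function
        if len(P) < 2:
            return 0.0
        gaps_sq = sum((P[i + 1] - P[i]) ** 2 for i in range(len(P) - 1))
        return len(P) * (P[-1] - P[0]) / gaps_sq

    # Step 4: pick the M highest-scoring characters as anchors.
    B = sorted(offsets, key=lambda c: score(offsets[c]), reverse=True)[:M]

    # Steps 5-9: for each anchor character, hash the first N sorted
    # neighborhoods of width W centered on its occurrences.
    fps = set()
    for c in B:
        hoods = [T[max(0, p - W // 2): p + W // 2] for p in offsets[c]]
        for s in sorted(hoods)[:N]:
            fps.add(hashlib.sha1(s.encode("utf-8")).hexdigest()[:16])
    return fps

# Robustness: a lightly edited copy still shares fingerprints,
# because neighborhoods away from the edit are unchanged sub-strings.
text = "the quick brown fox jumps over the lazy dog " * 5
edited = text[:40] + "INSERTED" + text[40:]
print(len(gen_fingerprints(text) & gen_fingerprints(edited)) > 0)
```

The insertion shifts the absolute offsets of all later characters, but the neighborhoods around surviving anchoring points are identical sub-strings, so their hashes survive the edit.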

The function SIM calculates the similarity between two normalized documents. We use text model 2 to represent a document, so that we actually compare two strings of characters. What does similarity mean for them? If there are common sub-strings between two strings and their total length is long enough, we consider the strings similar to each other. We also expect the similarity to be measured as a percentile. We propose algorithm 2 to calculate the similarities between one given document and a set of reference documents. The main idea is to identify common sub-strings with a hash based greedy matching strategy.

Definition 7: We need some notation to present algorithm 2:

- A number M that defines the minimum length of common sub-strings. Common sub-strings must have a minimum length to avoid triviality; otherwise, a single character could count as a common sub-string.

- A hash function H that generates a hash value from a sub-string of size M. There is no specific requirement for the hash function; however, due to the nature of the algorithm, a rolling hash function is recommended for good performance.

- A hash table HT with chaining to resolve collisions.

- For a string T, its sub-string is denoted T[s, …, e], where s and e are the starting and ending offsets.

The algorithm is stated with text model 2; however, it can be applied to the other two models as well.

Algorithm 2:

Input: Query string d, and multiple reference strings {D1, D2, …, Dm}.

Output: The similarities {SIM1, SIM2, …, SIMm}.

Procedure:

Step 1: Create the hash table HT based on L, the size of the input string d.

Step 2: For j = 0 to L-M: apply the hash function H to the sub-string d[j, …, j+M-1] of d to calculate the hash value h, and store the offset j in HT[h] or its chained linked list.

Step 3: For each k in {1, 2, …, m}, do steps 4 to 12.

Step 4: Let Lk be the length of Dk; set P = 0 and SUM = 0.

Step 5: Let h = H(Dk[P, …, P+M-1]).

Step 6: If HT[h] is empty, there is no matching sub-string at offset P; let P = P+1 and go to step 11.

Step 7: For each sub-string offset s stored in the chained linked list at HT[h], do step 8.

Step 8: If d[s, …, s+M-1] ≠ Dk[P, …, P+M-1], set V(s) = 0; otherwise, extend the two equal sub-strings forward with as many common characters as possible, arriving at the maximum common sub-string size V(s).

Step 9: Let V be the largest of all V(s) obtained from step 8.

Step 10: If V > 0, let SUM = SUM + V and P = P + V; otherwise let P = P + 1.

Step 11: If P < Lk - M, go to step 5.

Step 12: Let SIMk = SUM / Lk.

Algorithm 2 calculates all of SIM(d, D1), SIM(d, D2), …, SIM(d, Dm) in one construction. Steps 1 and 2 pre-process d, and steps 4 to 12 calculate an individual SIM(d, Dk) at a time. For the normalized query document d and reference document D, algorithm 2 identifies a set of common sub-strings and sums up their lengths as SUM. The similarity SIM is then measured as SUM / Length(D). One may ask why we do not include the length of d in the similarity. This is because we care more about how much of D is duplicated in the query document d than about how much of d consists of content from D. One can certainly design another formula that calculates the similarity from SUM and both lengths. Finally, we need to make sure SIM measures similarity meaningfully. This is guaranteed by the following theorem.

Theorem 1: The function SIM defined by algorithm 2 satisfies the following properties for two normalized documents d and D:

1. 0 ≤ SIM(d, D) ≤ 1.

2. If d and D are the same document, SIM(d, D) = 1.

3. If d and D have no common sub-strings at all, SIM(d, D) = 0.

Proof: From steps 4 to 11 of algorithm 2, we have 0 ≤ SUM ≤ Length(D), which proves 0 ≤ SIM(d, D) ≤ 1. If d = D, it is not difficult to prove that SUM = Length(D), i.e., SIM(d, D) = 1. The last assertion is trivial.
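Algorithm 2 can be sketched in Python as follows. A plain dict of offset lists stands in for the chained hash table HT, and dictionary lookup on the sub-string itself stands in for the rolling hash H; M = 4 is an illustrative minimum match length:

```python
from collections import defaultdict

def sim_all(d, refs, M=4):
    """Sketch of algorithm 2: SIM(d, Dk) for every reference Dk."""
    # Steps 1-2: index every length-M sub-string of d by its offset.
    table = defaultdict(list)
    for j in range(len(d) - M + 1):
        table[d[j:j + M]].append(j)

    sims = []
    for Dk in refs:                           # steps 3-12
        P, total = 0, 0
        while P <= len(Dk) - M:
            window = Dk[P:P + M]
            best = 0
            for s in table.get(window, []):   # steps 7-9: greedy extension
                v = M
                while (s + v < len(d) and P + v < len(Dk)
                       and d[s + v] == Dk[P + v]):
                    v += 1
                best = max(best, v)
            if best > 0:                      # step 10
                total += best
                P += best
            else:
                P += 1
        sims.append(total / len(Dk))          # step 12
    return sims

print(sim_all("abcdefgh", ["abcdefgh", "abcdXefgh", "zzzz"]))
```

An identical reference scores 1.0, a reference with one inserted character still scores 8/9, and an unrelated reference scores 0.0, matching the three properties of Theorem 1.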

4. Asymmetric Fingerprint Generation

For some special applications such as DLP (data loss prevention) endpoint products, indexed fingerprint files created on servers must be delivered to remote machines which host the searchers. It is necessary to use fewer fingerprints to represent a document, in order to save network bandwidth and cost. In algorithm 1, there are two important parameters for generating the fingerprints: the numbers M and N, where M is fixed and N is configured according to the text size defined by a table.

Based on recent experimental research, we can reduce the number of fingerprints and keep almost the same recall rate if we apply a smaller number N to the function GEN at the indexer side while the N at the searcher side is kept the same. In other words, we can solve the NDDD problem even if the indexer generates far fewer fingerprints than the searcher. Table 2 is an example of different N's for the indexer and the searcher.

Table 2: Different N for Indexer and Searcher

Text Size Range    M    N for Indexer    N for Searcher
0-10K              4    8                128
10-20K             4    16               256
20-30K             4    32               256
30-50K             4    32               512
50-70K             4    64               1024
70-80K             4    128              1024
80-100K            4    256              1024
100-500K           4    512              1024
> 500K             4    1024             1024

This method is referred to as asymmetric fingerprint generation, while algorithm 1 with equal N's is symmetric fingerprint generation. Its capability to keep almost the same recall rate is supported by the following theoretical results.

Definition 8: Let us assume M is a constant. For any normalized document T, denote by S(T, N) the set of fingerprints extracted from T with the number N.

Theorem 2: Let T be any normalized document, and n and m be two positive integers. If n < m, then S(T, n) ⊆ S(T, m), i.e., the set S(T, n) is a subset of S(T, m).

Proof: This is a natural outcome of step 8 of algorithm 1.

Theorem 3: Let D and d be two versions of the same normalized document, and n and m be two positive integers. If n < m, we have S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m).

Proof: Since n < m, we have S(d, n) ⊆ S(d, m) and S(D, n) ⊆ S(D, m) by theorem 2. Therefore, S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) and S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). Together, S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). This completes the proof.

Theorem 3 implies that the recall rate of asymmetric fingerprint generation lies between the two cases of symmetric fingerprint generation with the smaller and the larger number of fingerprints. As a matter of fact, the experimental data shows that it is closer to the second case while generating far fewer fingerprints at the indexer.
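Because step 8 of algorithm 1 takes a prefix of the same sorted neighborhood list for every N, the subset property of theorem 2 holds by construction. A toy sketch in Python, with hypothetical sorted neighborhoods standing in for the output of step 7:

```python
def select_fingerprints(sorted_hoods, N, H=hash):
    """Step 8 of algorithm 1: hash the first min(N, n) sorted neighborhoods.

    Taking a prefix of the same sorted list for every N is what makes
    S(T, n) a subset of S(T, m) whenever n < m (theorem 2).
    """
    return {H(s) for s in sorted_hoods[:N]}

# Hypothetical sorted neighborhoods for one anchor character:
hoods = sorted(["aquickbr", "brownfox", "lazydogs", "overjump", "quickqui"])
small = select_fingerprints(hoods, 2)   # indexer side, smaller N
large = select_fingerprints(hoods, 4)   # searcher side, larger N
print(small <= large)                   # theorem 2 in action
```

The indexer's smaller fingerprint set is always contained in the searcher's larger one, so any fingerprint match the asymmetric setup can make is also a match the fully symmetric setup would make.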

5. Experiments

In this section, we report a data experiment that we implemented with the asymmetric architecture of fingerprint generation defined by the parameters of Table 2. Both the indexer and the searcher reside on a server with Windows Server 2003, an Intel Xeon [email protected] and 8GB of RAM.

We prepared the experimental data sets as follows:

Normalized documents for indexing:

- Corpus 1: this set consists of 1 million plain text files in UTF-8 encoding. Denote corpus 1 as S1.

- Corpus 2: this set consists of 2115 plain text files in many different languages and with different file sizes. They are totally irrelevant to the files in S1. Denote corpus 2 as S2.

Let S = S1 ∪ S2. All files in S are registered for fingerprint generation and indexing.

Normalized documents for querying:

- Corpus 3: this set consists of 6*6*2115 = 76140 files. This corpus consists of documents made from S2 with 6 editing operations and 6 levels of changes expressed in percentiles. Corpus 3 is used for the querying experiment. The 6 levels of changes are defined as 5%, 10%, 20%, 30%, 40% and 50%; for example, level 1 means we alter 5% of the content of an original file. The 6 editing operations are ADD, ADH, ADE, DEL, CHG and MOV.

The 6 editing operations are defined as follows:

- ADD: add a randomly generated block of chars at a random position in the file.

- ADH: add a randomly generated block of chars at a random position in the file; also add a randomly generated block of chars, with block size randomly selected between 50 and 100, at the beginning of the file.

- ADE: add a randomly generated block of chars at a randomly selected position in the file; also add a randomly generated block of chars, with block size randomly selected between 50 and 100, at the end of the file.

- DEL: delete a block of chars from the file; the starting point of the deletion is randomly selected.

- CHG: replace a randomly selected block of chars in the file with a randomly generated block of chars.

- MOV: move a randomly selected block of chars in the file to a random position in the file.

Table 3: Querying time in seconds

Change level    Total file number    Total time    Sec per file in average
5%              12690                1727          0.136
10%             12690                1776          0.139
20%             12690                1680          0.132
30%             12690                1709          0.134
40%             12690                1699          0.133
50%             12690                1649          0.129

Table 4: Numbers of files matched at each change level

Change level    ADD     ADH     ADE     DEL     CHG     MOV
5%              2080    2079    2082    2074    2071    2055
10%             2079    2069    2079    2073    2067    2055
20%             2045    2047    2055    2063    2029    2046
30%             2027    2019    2023    2058    1979    2041
40%             1993    2000    1998    2021    1924    2049
50%             1969    1977    1978    2020    1894    2049

Table 5: Total recall rate at each change level

Change level    Total files    Recall rate
5%              12441          98.03%
10%             12422          97.88%
20%             12285          96.80%
30%             12147          95.72%
40%             11985          94.44%
50%             11887          93.67%

Figure 1: Recall vs change level for different operations.

Experiment steps:

1. Fingerprint and index all the files in S.

2. Set X% = 20%. Each file from corpus 3 is used as a query document for the NDDD problem. The recall and precision are measured from the query results, and the querying speed is measured in seconds.

The experimental results are shown in Table 3, Table 4, Table 5 and Figure 1.

Table 3 shows the performance when executing searches for the 6*2115 = 12690 query files at each change level, giving the total time and the average time per file. For example, for change level 5%, the total time is 1727 seconds, which means 0.136 seconds per file on average. This is quite fast considering that the set S has more than 1 million fingerprinted documents.

Table 4 shows the number of successful queries for each change level and editing operation. For example, for change level 5% and the ADD operation, 2080 of the 2115 query files were matched successfully, a recall of 98.3%. Figure 1 illustrates the recall rate vs. change level for each operation.

Table 5 shows the recall rates for all change levels. As the

document changes increase, the recall rate drops. The worst recall

rate is 93.67% when the change is around 50%.

We should mention that there were no false positives among all 76140 query files. This is a natural outcome for the following reasons:

- GEN and SIM are two string matching functions that are constructed independently.

- Even if fingerprint matching yields false positives, the similarity threshold X% filters them out.
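The filtering role of X% can be sketched as follows. The names are hypothetical: `sim` stands in for the paper's SIM function, and `candidates` for the documents returned by the GEN-based fingerprint index lookup; the Jaccard-style `sim` used in the example is only for illustration.

```python
def near_duplicates(query_doc, candidates, sim, threshold=0.2):
    """Stage 2 filter: fingerprint matching (stage 1) may return
    spurious candidates, but only those whose similarity reaches
    the threshold X% are reported as near duplicates."""
    return [doc for doc in candidates if sim(query_doc, doc) >= threshold]

# A candidate that shares a fingerprint by chance but has low
# similarity to the query is dropped by the threshold.
sim = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
print(near_duplicates("abcd", ["abce", "wxyz"], sim))  # → ['abce']
```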

6. CONCLUSION

This article has examined and solved the problem of near duplicate document detection. What we have studied can be summarized as follows:

- A formal definition of the NDDD problem.

- Text models are discussed for effective representation. A language independent text model is selected to represent the documents.

- An NDDD model is proposed to refine the problem definition, decomposing the NDDD problem into three separate sub-problems that can be solved independently.

- Algorithms are introduced to extract document fingerprints and calculate document similarity.

- An architecture of asymmetric fingerprint generation is introduced to reduce the number of fingerprints for some special applications.

- The data experiments show that our algorithmic solution has good performance, near-zero false positives, and high recall rates even when documents change by up to 50%.

The problem definition and algorithmic solution in this article have advantages over other approaches. The solution yields near-zero false positives since the similarity calculation is independent of fingerprint generation. The recall rate is good because the fingerprints are robust under moderate document changes. Finally, the solution is language independent: it can be applied to documents written in any language, and even to documents written in multiple languages.

7. REFERENCES

[1] Manber, U. 1994. Finding Similar Files in a Large File System. Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, California.

[2] Shivakumar, N. and Garcia-Molina, H. 1999. Finding near-replicas of documents on the web. Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 1590, 204-212.

[3] Lopresti, D. P. 1999. Models and Algorithms for Duplicate Document Detection. Proceedings of the Fifth International Conference on Document Analysis and Recognition, Bangalore, India, 297-300, September 1999.

[4] Broder, A. Z. 2000. Identifying and Filtering Near-Duplicate Documents. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, UK. Springer-Verlag, 1-10, 2000.

[5] Campbell, D. M., Chen, W. R. and Smith, R. D. 2000. Copy detection systems for digital documents. Proceedings of Advances in Digital Libraries, 78-88, 2000.

[6] Ignatov, D. I. and Jánosi-Rancz, K. T. 2009. Towards a framework for near-duplicate detection in document collections based on closed sets of attributes. Acta Univ. Sapientiae, Informatica, 1, 2 (2009), 215-233.

[7] Kumar, J. P. and Govindarajulu, P. 2009. Duplicate and Near Duplicate Documents Detection: A Review. European Journal of Scientific Research, 32, 4 (2009), 514-527.

[8] Ren, L., Tan, D., Huang, F., Huang, S. and Dong, A. 2009. Matching engine with signature generation. US patent 7,516,130.

[9] Ren, L., Huang, S., Huang, F., Dong, A. and Tan, D. 2010. Matching engine for querying relevant documents. US patent 7,747,642.

[10] Ren, L., Huang, S., Huang, F. and Lin, Y. 2010. Document matching engine using asymmetric signature generation. US patent 7,860,853.