
Semi-Automatic Reconstruction of Cross-Cut Shredded Documents

Zachary Daniels, Mentor: Haroon Idrees

Abstract—We propose a new approach for cross-cut shredded document reconstruction and evaluate it on the DARPA Shredder Challenge dataset. We begin by pre-processing chads. A set of costs based on shape (gaps, overlaps, edge similarity), graphical content (ruling line alignment, text line alignment), and semantic content (character and letter combinations) is calculated and used to rank putative chad matches. Documents are then reconstructed chad-by-chad. We introduce the concept of an oracle which knows the ground truth for puzzles one and two of the DARPA Shredder Challenge dataset, replacing the need for human verification of matches and adding the capability to evaluate the efficiency of algorithms for reassembling cross-cut shredded documents in a standard, quantitative way, two issues which have not been addressed by previous attempts at solving this problem.

Keywords—Document Image Analysis, Shredded Document Reconstruction, Optical Character Recognition

I. INTRODUCTION

The principal purpose of shredding a document is to ensure that the information concealed within cannot be viewed by those for whom it was not intended. There are, however, instances when shredded and torn documents need to be reconstructed. For example, shredded documents might contain evidence of criminal wrongdoing, and thus reconstructing these documents is of great interest to law enforcement. Likewise, archeologists are often faced with the task of reassembling historic documents that have been damaged by age. Reassembling torn and shredded documents strictly by hand is often a difficult and sometimes impossible task. In recent years, the problem of reassembling shredded documents with the help of a computer has become a topic of growing interest.

There are three widely used procedures for shredding documents. The first is known as strip-shredding: strip-shredded documents are cut purely in the vertical direction and, as the name implies, form strips. Torn documents are ripped at random intervals, with each piece differing in shape and size. Lastly, there is cross-cut shredding. Cross-cut shredded documents are cut both vertically and horizontally, with most pieces being approximately the same size and shape. Because cross-cut shredding produces more pieces than strip-shredding, and the pieces produced are not as distinct from one another as those produced by tearing documents, cross-cut shredded documents are generally regarded as the most difficult type of shredded document to reassemble. This paper proposes a

Z. Daniels is with the Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, 18015 USA. E-mail: [email protected]

H. Idrees is with the Center for Research in Computer Vision, University of Central Florida, Orlando, FL, 32816 USA.

method for reconstructing cross-cut shredded documents with the aid of a computer.

In 2011, DARPA issued a challenge to reassemble a set of cross-cut shredded documents [1]. DARPA released images of five shredded documents, and participants were awarded points based on the number of questions relating to the content of the shredded documents that they were able to answer. The method for reassembling cross-cut shredded documents proposed in this paper is evaluated using the DARPA Shredder Challenge dataset.

II. RELATED WORK

Deever and Gallagher’s method for reassembling shredded documents was among the most successful of those that were tested in the DARPA Shredder Challenge [2]. It was able to fully reassemble the first two puzzles of the challenge and partially reassemble the third and fourth. First, pre-processing occurred, which involved automatically segmenting individual chads (individual pieces of the shredded document), rotating the chads in the direction of their principal components, and finally correcting the orientation of the chads (i.e. flipping upside-down pieces) by making use of characteristics specific to the chads in the DARPA dataset. Next, edge-based features such as ruling line positions and foreground pixel positions were automatically extracted from each chad, and these features were used to match pairs of chads. Lastly, a GUI allowed humans to select correct matches and make corrections to chad positions and orientation.

Butler et al. also attempted to reconstruct the puzzles of the DARPA Shredder Challenge [3]. Their attempt, ‘The Deshredder’, modeled edge shapes and brightness as time series and used these time series to match and align chads. A visual analytic system allowed users to find, verify, and make fine-grained adjustments to matches. 50% of the correct matches were found in the first 20% of recommendations.

Zhang et al.’s attempt at solving the cross-cut document reconstruction problem, “Hallucination”, combined a human’s ability to fill in missing details with a computer’s ability to perform fast pattern recognition [4]. A human subject was given a single chad and asked to draw what he or she felt was the expected content of a neighboring chad. These drawings were used as templates to match neighboring pieces. Zhang’s method reduced the number of pieces that needed to be reviewed by 44% when compared with a random baseline. Hallucination was tested on the DARPA Shredder Challenge dataset.

The problem of reassembling torn documents was investigated by Richter et al. [5]. Their work involved using a dataset composed of 120 pages from two magazines where each page was torn into between 16 and 32 pieces. They represented chads as sets of support points along the chad edges and extracted features relating to the shape and content of the chad. An SVM was used to match corresponding points.

Fig. 1: A visual representation of the cross-cut shredded document reconstruction problem: to reassemble a document that has been cut into chads of approximately equal shape and size.

Schauer, Prandtstetter, and Raidl devised heuristics for reconstructing cross-cut shredded documents and then applied evolutionary and genetic algorithms to find valid chad combinations [6], [7]. Two of the algorithms explored include ant colony optimization and variable neighborhood search. Full documents were able to be reconstructed in some but not all instances. In another experiment, Prandtstetter and Raidl were able to prove that the reconstruction of strip-shredded documents is an NP-hard problem, implying that cross-cut reconstruction is at least as difficult a problem [8].

An alternative approach to edge matching is probabilistic scoring. Ranca built a probabilistic model for estimating the likelihood of two edges matching [9]. Additionally, he proposed a modular approach for combining chads inspired by Kruskal’s algorithm for minimum spanning trees.

III. PROBLEM

The problem of cross-cut shredded document reconstruction can be defined as follows:

Definition 1: A document D is cut in both the vertical and horizontal directions, resulting in a set of chads c of approximately equal size and shape. c is given to a reconstruction system which produces a mapping c → D. A visual representation of the problem statement appears in Fig. 1.

In this paper, reconstruction of two documents is attempted. Both documents originate from the DARPA Shredder Challenge dataset. The first document is composed of 198 usable chads containing only handwritten text in black ink and ruling lines. The second document is composed of 350 usable chads and contains handwritten text in both black and red ink as well as ruling lines.

IV. METHODOLOGY

The reconstruction system is composed of four key components:

• Data pre-processing and storage
• Oracle
• Pairwise and multi-chad cost extraction
• Solution expansion and pruning

Each component is discussed in greater detail below. The general architecture of the system is presented in Fig. 2.

A. Pre-processing

Given color images of chads, c1:n, we first orient each chad such that its major axis is vertical. We do so by rotating the chad until the second moments of the chad’s binary mask are vertically aligned. We denote the binary mask of the oriented chad as Γ.

For each chad, the minimum rectangle that inscribes Γi is found, and the position of each chad in the solution is given with respect to the top-left corner of this rectangle. We also compute the derivative magnitude, ρ, derivative orientation, θ, boundary, Ω, blurred boundary, Υ, and Hough transform, H, for each chad.
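The orientation step above can be sketched with image moments. The helper below is a simplified illustration (the function name and the use of NumPy eigen-decomposition are our own, not the paper's implementation):

```python
import numpy as np

def major_axis_angle(mask):
    """Angle (radians) between the mask's major axis and the vertical,
    computed from the second central moments of the foreground pixels."""
    ys, xs = np.nonzero(mask)
    xs = xs - xs.mean()
    ys = ys - ys.mean()
    cov = np.cov(np.vstack([xs, ys]))      # 2x2 second-moment matrix
    evals, evecs = np.linalg.eigh(cov)
    major = evecs[:, np.argmax(evals)]     # direction of largest spread
    if major[1] < 0:                       # resolve the sign ambiguity
        major = -major
    return np.arctan2(major[0], major[1])  # 0 when already vertical
```

A chad would then be rotated by the negative of this angle so that its major axis becomes vertical.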

For textual documents, whether handwritten or printed, we cluster the RGB pixels from all chads into a small number of clusters (30) using the k-means algorithm. The background is identified as the clusters with the largest number of pixels in the document; clusters with high correlation to peaks in the Hough transform are selected as lines, while the rest of the clusters are selected as foreground. Detection of lines is important as it significantly reduces the size of the solution space, allowing chads to be matched at only discrete intervals. Lines with the same orientation and in close proximity to each other are merged to obtain longer, thinner lines. The chads are further rotated so that all lines in the textual document become horizontal.

Fig. 2: Architecture for the reconstruction process, where red boxes denote the input and output, blue boxes denote processes, and green boxes denote data. (Boxes: Input Image; Pre-Processing; Chad Information; Cost Extraction; Matching and Alignment; Match Validation; Oracle Information; Reconstructed Document.)
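The pixel-clustering step can be sketched as follows. The tiny k-means and the three-cluster toy are illustrations only; the paper uses 30 clusters and adds a Hough-transform test to split line clusters from foreground:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal k-means (a library implementation would normally be used).
    Deterministic init: k points evenly spaced through the array."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def background_label(rgb_pixels, k=3):
    """The cluster holding the most pixels is taken to be the background."""
    labels, _ = kmeans(rgb_pixels.astype(float), k)
    return labels, np.bincount(labels, minlength=k).argmax()
```

On a toy page whose pixels are mostly near-white paper with a little black and red ink, the most populous cluster comes out near white, matching the background rule described above.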

Next, a human is asked to annotate the character information on each chad. The human specifies the spin of the chad (whether or not it is upside down) and the number of text lines that appear on the chad. The chad’s spin is corrected, and the chad is binarized. The connected components for black ink text are extracted, and the centroids of the connected components are clustered into the specified number of text lines using the k-means algorithm. For each text line, the baselines and top lines are computed, and the human is asked to verify that the text is properly clustered. Finally, the human annotates the characters that appear on each text line and has the option to specify whether each character is indistinguishable, distinguishable but cut off, or completely present.
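The connected-component step can be illustrated with a small BFS labeling; in practice a library routine would be used, and the resulting centroids' y-coordinates would then be clustered into the annotated number of text lines:

```python
import numpy as np
from collections import deque

def component_centroids(binary):
    """Centroids of 4-connected foreground components (simple BFS labeling;
    an image-processing library routine would normally be used)."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    cents = []
    for sy, sx in zip(*np.nonzero(binary)):
        if seen[sy, sx]:
            continue
        q, pix = deque([(sy, sx)]), []
        seen[sy, sx] = True
        while q:
            y, x = q.popleft()
            pix.append((y, x))
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    q.append((ny, nx))
        cents.append(np.mean(pix, axis=0))  # (mean row, mean col)
    return np.array(cents)
```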

Since there are two possible alignments with the vertical axis (y and −y) for each chad, we keep both possibilities and denote them with spin 1 and −1. Both possibilities are treated as different chads for the purpose of finding putative matches, but for the complete solution, selection of one spin implies that matches for the other spin become unavailable. Note that while information about the spin of some chads was previously collected during the text annotation stage of pre-processing, this information is not used extensively in the actual reconstruction of the document. In fact, it is only used for aligning text lines when calculating several text-related costs. There are two reasons for restricting the use of this information. First, for some chads with only minuscule amounts of text, it is difficult to determine the correct spin of the chad, and as a means of preventing the unnecessary propagation of error, the annotated spin is not used to correct chad orientation. Second, the system is designed to be as automated as possible and to generalize in such a way that it should be able to handle unseen document types. For example, annotating text for some documents might be too costly or simply impossible, and thus the system should be able to adapt in such a way that it does not depend too heavily on the information obtained during the annotation stage. Similarly, future versions of the system might replace human annotation of text with machine annotation using optical character recognition. The implementation of automatic spin detection will likely result in some additional error.

B. Oracle

Every cross-cut shredded document reconstruction system requires a human-in-the-loop for match verification, correction, and evaluation. The oracle replaces the need for humans for these functions. For every chad extracted during the pre-processing stage, the oracle knows the following information with respect to the global solution:

• Position in the x-direction
• Position in the y-direction
• Spin
• Rotation
• Size
• Neighboring chads to the right that share large (major) boundaries
• Neighboring chads to the right that share small (minor) boundaries

We annotated ground truth for the first two puzzles of the DARPA Shredder Challenge. The methodology for doing so is as follows:

1) Clean the solution image using the Textcleaner script for ImageMagick [10].
2) Binarize the solution such that only text and lines remain.
3) Apply Gaussian smoothing to the binarized solution.
4) Find several chads whose content is distinct and locate the corresponding image patches in the solution image.
5) Binarize the chads such that only text and lines remain.
6) Correlate the chads against their respective patches in the solution image at different scales. Find the maximum correlation values and take note of the scale at which these values were achieved. Use this information to determine the difference in scale between the solution image and chads.
7) Scale each chad.
8) Correlate each chad across the solution image and use the peak value to find its approximately correct location.
9) Correlate the binary masks of the remaining chads across the mask image of the current oracle solution to fill gaps.
10) Manually correct the oracle solution by adding missing pieces, adjusting positions on a pixel level, and rotating chads for better alignment.
11) Use geometric information to find neighboring chads.

Fig. 3: Various stages in the oracle-building process. The first image shows the binarization of the real solution; the second, the results of correlation; the third, the final ground truth.

Several steps of the process are shown in Fig. 3.
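Step 8 of the list can be sketched as a correlation search. A brute-force scan is shown below for clarity; at full document scale an FFT-based correlation would be used instead:

```python
import numpy as np

def locate(template, image):
    """Find the (row, col) where a binary template best correlates with a
    binary solution image (step 8 of the oracle-building procedure)."""
    th, tw = template.shape
    ih, iw = image.shape
    best, pos = -1.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = np.sum(template * image[r:r + th, c:c + tw])
            if score > best:
                best, pos = score, (r, c)
    return pos
```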

The oracle is used in two ways. First, it replaces the need for a human for match verification and correction. This means it is no longer necessary to have humans sit at computers for hours on end trying to determine if chads are placed correctly. While our system performs chad placement automatically and the results are typically within acceptable error, this is not the case for all other methods of document reconstruction, where a human is sometimes needed to correct chad alignment, rotation, and spin. Since the oracle knows all of this information, it can be used to check whether two chads’ relative positions are within a certain error and automatically correct their placements.
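A minimal sketch of this use of the oracle, assuming a ground-truth table of per-chad positions (the chad IDs, field names, and tolerance below are hypothetical):

```python
# Hypothetical ground-truth records; the real oracle also stores spin,
# rotation, size, and neighbor lists.
ORACLE = {
    "c17": {"x": 340, "y": 1205, "spin": 1},
    "c42": {"x": 402, "y": 1198, "spin": 1},
}

def verify_match(a, b, dx, dy, tol=15):
    """Accept a putative placement of chad b at offset (dx, dy) from chad a
    if it agrees with the ground truth within tol pixels; on success the
    oracle can also return the corrected offset."""
    ga, gb = ORACLE[a], ORACLE[b]
    true_dx, true_dy = gb["x"] - ga["x"], gb["y"] - ga["y"]
    ok = abs(dx - true_dx) <= tol and abs(dy - true_dy) <= tol
    return ok, (true_dx, true_dy)
```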

The oracle is also used for evaluation purposes. The oracle is able to determine whether or not a match is correct, and as a result, it can be used to measure the number of guesses needed to reconstruct the entire document or a subset of the puzzle. Its ability to do so has two consequences.

First, it allows researchers to evaluate the current state of their system without much effort. For example, our method for document reconstruction involves minimizing a set of costs with varying weights. The oracle allows us to quickly determine whether adding a specific cost is beneficial to the system, as well as whether the weights of costs are properly tuned and balanced.

Second, the oracle and ground truth provide a standardized, quantitative benchmark for the problem of cross-cut shredded document reconstruction. In every paper on the subject, the metric used for evaluating the given methodology differs. Some take a qualitative approach, e.g. descriptions and images of the reassembled document, while others take a more quantitative approach, e.g. the number of correct matches that appear in the top-n percent of matches. The oracle enables more precise metrics. The metrics used to evaluate the framework proposed by this paper are described in depth in Section VI.

C. Pairwise and Multi-Chad Matching

In order to find putative matches, we compute a set of different costs, described below.

Shape (CS). For pairwise matching between chads i and j, we first find the shape-matching score using 2D correlation:

Ψi,j = corr(Υi, (Υj > 0) + Υj − Γj). (1)

Correct matches have chads with boundaries that lie adjacent to one another. By filling the interior of one of the chads with −1, overlap is discouraged. Local maxima are obtained using first and second derivatives of the correlation surface, and this gives us putative matches. It is highly likely that incorrect chads will have high values for this cost. Fig. 4 shows an example of two chads whose edges match fairly well. Red and blue denote edge segments that are specific to a single chad, purple denotes edge segments that are good matches, and green denotes the opposite.

Fig. 4: Costs make use of various properties that exist between and within chads. Some properties are a result of how the shapes of chads fit together, such as gap and overlap, while others are content-based, such as ruling lines and text lines. (Panels: Reference; Overlap; Gap; Edge Shape; Ruling Lines; Text Lines.)
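A sketch of the correlation in Eq. (1), using an FFT-based 2D cross-correlation. The helper names are ours; Υ is a blurred boundary image and Γ a binary mask, as defined in the pre-processing section:

```python
import numpy as np

def xcorr2(a, b):
    """Full 2-D cross-correlation of a with b via FFT (flip + convolve)."""
    sh = (a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1)
    fa = np.fft.rfft2(a, sh)
    fb = np.fft.rfft2(b[::-1, ::-1], sh)  # flipping turns convolution into correlation
    return np.fft.irfft2(fa * fb, sh)

def shape_score_surface(bnd_i, bnd_j, mask_j):
    """Eq. (1) sketch: correlate chad i's blurred boundary with chad j's
    boundary, with chad j's interior filled negatively to punish overlap."""
    target = (bnd_j > 0).astype(float) + bnd_j - mask_j
    return xcorr2(bnd_i, target)
```

Local maxima of the returned surface give the putative relative placements of the two chads.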

Line Alignment (CL). The line alignment cost measures how well the horizontal ruling lines of two chads align. Fig. 4 shows an example of a chad with ruling lines, where the ruling lines are marked in green. This cost is computed by finding the Euclidean distance between all pairs of horizontal lines from the two chads. The cost is the number of pairs within a loose threshold (20 pixels), h, minus the number within a strict threshold (5 pixels), l:

CL = |h| − |l|

The following costs are computed only if there is at least one aligned pair of lines in the two chads.
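The line-alignment cost follows directly from its definition; line positions here are the vertical coordinates of detected ruling lines, and the thresholds are those quoted above:

```python
import numpy as np

def line_alignment_cost(lines_a, lines_b, loose=20, strict=5):
    """C_L = |h| - |l|: line pairs within the loose threshold minus
    line pairs within the strict threshold."""
    d = np.abs(np.subtract.outer(np.asarray(lines_a), np.asarray(lines_b)))
    h = np.count_nonzero(d <= loose)
    l = np.count_nonzero(d <= strict)
    return h - l
```

A pair of ruling lines that aligns exactly contributes nothing, while a loosely-but-not-strictly aligned pair raises the cost.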

Gap (CG). Gaps are areas between two or more chads where there exists no material from any chad. A visual representation of gaps between two chads can be seen in Fig. 4; the gaps are marked in blue. In an ideal shredding, i.e. one where shredding the document results in chads with no added noise or degradation, chads that neighbor each other will have no gap between them. With real shredded documents, such ideal conditions cannot be expected, but gap size can be a useful metric for helping to identify neighboring chads. Let Γrec denote the mask of the current reconstruction, and let δ denote the mask image of the current reconstruction combined with a potential chad at the putative location, given by Γrec + Γi. The gap cost is obtained by subtracting δ ≥ 1 from its filled version δfill, obtained by dilating δ ≥ 1, filling any holes, and then eroding it:

CG = δfill − (δ ≥ 1)
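A sketch of the gap computation, with tiny stand-ins for the morphological operations. A real implementation would use library routines; the 4-connected dilation and border-seeded hole filling are our simplifications:

```python
import numpy as np
from collections import deque

def dilate(m):
    """One step of binary dilation with a 4-connected structuring element."""
    out = m.copy()
    out[1:, :] |= m[:-1, :]; out[:-1, :] |= m[1:, :]
    out[:, 1:] |= m[:, :-1]; out[:, :-1] |= m[:, 1:]
    return out

def erode(m):
    """Erosion as the dual of dilation (border treated as foreground)."""
    return ~dilate(~m)

def fill_holes(m):
    """Flood-fill background from the border; unreached background = holes."""
    h, w = m.shape
    reach = np.zeros_like(m)
    q = deque()
    for y in range(h):
        for x in (0, w - 1):
            if not m[y, x] and not reach[y, x]:
                reach[y, x] = True; q.append((y, x))
    for x in range(w):
        for y in (0, h - 1):
            if not m[y, x] and not reach[y, x]:
                reach[y, x] = True; q.append((y, x))
    while q:
        y, x = q.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and not m[ny, nx] and not reach[ny, nx]:
                reach[ny, nx] = True; q.append((ny, nx))
    return m | ~reach

def gap_cost(delta):
    """C_G: pixels enclosed between chads but covered by none of them."""
    m = delta >= 1
    filled = erode(fill_holes(dilate(m)))
    return int(np.count_nonzero(filled & ~m))
```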

Overlap (CO). Overlap is the opposite of a gap: it is the area where material from multiple chads exists simultaneously. Reassembling an ideally shredded document would result in zero overlap between chads. Two overlapping chads can be seen in Fig. 4; the overlap is colored red. Of course, zero overlap is unlikely for real shredded documents, but the amount of overlap provides a good heuristic for determining which pieces are unlikely to border one another. The overlap cost is easily calculated as:

CO = δ > 1
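With δ defined as the sum of the placed masks, the overlap cost reduces to a count:

```python
import numpy as np

def overlap_cost(masks):
    """C_O: number of pixels claimed by more than one chad. `masks` are
    binary masks already painted onto a common canvas."""
    delta = np.sum([np.asarray(m, dtype=int) for m in masks], axis=0)
    return int(np.count_nonzero(delta > 1))
```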

Text Alignment (CA). This cost measures how well the foreground pixels of two chads align with each other. It is obtained as the number of pixels where foreground pixels occur on both chads divided by the number of pixels where foreground pixels occur on either of the chads.

Histogram of Overlap Sizes (CH). There are two instances where using the overlap cost might be an inaccurate representation of the goodness-of-match between a chad and the current reconstruction. In the first instance, two chads minimally overlap, but the overlap occurs along a long boundary, resulting in a fairly high overlap score. This type of overlap might simply be caused by noise added during shredding or inexact segmentation, and thus should not be heavily penalized. In the second instance, two chads match well along a segment of their boundaries, resulting in low overlap at these points, but overlap heavily along other segments of their boundaries. In this case, the overlap cost might under-penalize the match. The histogram of overlap sizes cost takes into account the number of overlap pixels that occur in a row. It is computed by forming a histogram of overlap sizes in the horizontal direction, m, counting overlap sizes from one to some specified n and penalizing larger overlaps more heavily:

CH = ∑_{i=1}^{n} m_i ∗ i^2
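Both histogram costs share the same run-length form; a sketch follows. The run-length scan and the treatment of runs longer than n are our reading of the description:

```python
import numpy as np

def run_length_histogram(mask, n):
    """Histogram of horizontal run lengths (1..n) of True pixels."""
    hist = np.zeros(n + 1, dtype=int)
    for row in mask:
        run = 0
        for v in list(row) + [False]:  # sentinel closes a trailing run
            if v:
                run += 1
            else:
                if 0 < run <= n:
                    hist[run] += 1
                run = 0
    return hist

def squared_run_cost(mask, n=10):
    """Sum of m_i * i^2: long contiguous runs are penalized quadratically.
    The same form serves the overlap and gap histogram costs."""
    hist = run_length_histogram(mask, n)
    return int(sum(hist[i] * i * i for i in range(1, n + 1)))
```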

Histogram of Gap Sizes (CI). Like the overlap cost, the gap cost also fails to accurately represent the goodness-of-match between a chad and the current reconstruction in two instances. In the first instance, a gap with a small width between two chads occurs along a long boundary, resulting in a fairly high gap score. This type of gap might simply be caused by degradation of chad borders during the shredding process or errors in alignment, and thus should not be heavily penalized. In the second instance, two chads match well along a segment of their boundaries, resulting in small gaps at these points, but large gaps occur along other segments of their boundaries, in which case the gap cost might under-penalize the match. The histogram of gap sizes cost takes into account the number of gap pixels that occur in a row. It is computed by forming a histogram of gap sizes in both the horizontal and vertical directions, p, counting gap sizes from one to some specified n and penalizing larger gaps more heavily:

CI = ∑_{i=1}^{n} p_i ∗ i^2

Character Combination (CC). When the character information of each chad is being annotated, the annotator has the option of specifying the level of certainty for each character: indistinguishable, distinguishable but cut off, or fully present. A series of rules is created regarding combinations of characters at chad boundaries using this information and some additional information (e.g. a cut-off character should not be followed by a space).

Text Line Overlap (CT). This cost is similar to CA but differs in that it uses the top lines, t, and baselines, b, of text lines to measure the amount of shared text between a chad and its neighboring chads. A top line marks the highest point in a text line where ink exists, while a baseline marks the lowest. Top lines are marked in light blue and baselines in purple in Fig. 4. The text line overlap cost is calculated as follows, where nj denotes the number of text lines paired with each neighboring chad:

CT = 1 − (1/nj) ∑_j [min(b_i, b_j) − max(t_i, t_j)] / [max(b_i, b_j) − min(t_i, t_j)]
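A sketch of the text-line-overlap cost, assuming text lines are paired in order between a chad and its neighbor (the pairing convention is our assumption):

```python
def text_line_overlap_cost(tops_i, bases_i, tops_j, bases_j):
    """C_T: one minus the mean vertical overlap ratio of paired text
    lines; t = top line, b = baseline, pairing by index (assumed)."""
    ratios = []
    for ti, bi, tj, bj in zip(tops_i, bases_i, tops_j, bases_j):
        inter = min(bi, bj) - max(ti, tj)   # shared vertical extent
        union = max(bi, bj) - min(ti, tj)   # combined vertical extent
        ratios.append(inter / union if union > 0 else 0.0)
    return 1.0 - sum(ratios) / len(ratios)
```

Perfectly aligned text lines give a cost of 0; a half-line vertical offset drives the cost toward 1.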

Text Line Alignment (CX). The text line alignment cost examines how many of the text lines of a chad, ci, align with those of each neighboring chad. It is calculated as the sum, over each neighbor cj, of the difference between the minimum number of text lines, q, of ci and its neighbor and the number of text lines ci and cj share, qij, where alignment is determined using top lines and baselines:

CX = ∑_j [min(q_i, q_j) − q_ij]

Fig. 6: Computation of the relative location of B with respect to A with spin -1.
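The alignment cost reduces to a few lines; q values are text-line counts and q_ij the number of lines the two chads share:

```python
def text_line_alignment_cost(q_i, neighbors):
    """C_X = sum_j [min(q_i, q_j) - q_ij]. `neighbors` holds, per
    neighboring chad, its line count q_j and the shared count q_ij."""
    return sum(min(q_i, q_j) - q_ij for q_j, q_ij in neighbors)
```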

Bigram Lookup (CB). Many documents worth reconstructing contain valid text. This fact can be exploited to aid in the matching process. Bigrams provide occurrence counts or probabilities of two characters occurring in a given order. To compute the bigram lookup cost, first, for a chad and its neighbors, find all bigrams. Then, look up the occurrence count for each bigram and find the mean. Finally, if l is the set of means, standardize the values of the set to be between 0 and 1 using:

(max(l) − l_i) / (max(l) − min(l))

Trigram Lookup (CR). The trigram lookup cost is similar to the bigram lookup but uses sets of three letters (trigrams) instead of bigrams. As such, to calculate the trigram lookup cost, perform the same computations as for the bigram lookup cost, but instead of finding all bigrams, find all trigrams.
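The n-gram lookup can be sketched as follows; the toy occurrence table is invented, and a real system would load corpus counts:

```python
# A toy occurrence table; a real system would load corpus bigram counts.
BIGRAM_COUNTS = {"th": 100, "he": 90, "qz": 1}

def ngram_costs(grams_per_match):
    """C_B sketch: mean occurrence count per putative match, rescaled so
    the most plausible match costs 0 and the least plausible costs 1."""
    means = [sum(BIGRAM_COUNTS.get(g, 0) for g in grams) / len(grams)
             for grams in grams_per_match]
    lo, hi = min(means), max(means)
    if hi == lo:
        return [0.0] * len(means)
    return [(hi - m) / (hi - lo) for m in means]
```

The trigram cost C_R follows the same recipe with a three-character table.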

V. RECONSTRUCTION

An important part of the system is automatically finding the location of each chad in relation to the current reconstruction. In order to do so, we first align chads based on the maxima of the shape correlation score. Next, we use the detected ruling lines to correct errors in vertical alignment. We also compute the relative location of two chads when one chad has a spin of -1, as seen in Fig. 6. The algorithm for finding the relative locations of all chads is presented in Alg. 1.

Reconstruction begins by calculating the costs for each pairwise match. These costs are then combined as follows:

C1 = 0.01 ∗ CS + 0.5 ∗ CL + 0.0005 ∗ CG + 0.0005 ∗ CO + CA

The matches are sorted in ascending order with respect to the final cost, C1. The top ten putative pairwise matches based on C1 are shown in Fig. 5, where green denotes a correct match and red denotes an incorrect match. Six of the top ten matches are valid, suggesting that the pairwise costs are relatively effective.

Fig. 5: Top 10 putative pairwise matches: In the reconstruction process, matches are first made between pairs of chads, which helps to limit the solution space. Later, matches are made between the current reconstruction and a single chad.
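The pairwise ranking can be sketched as a weighted sum and a sort; the weights are those of C1, while the cost values in the usage below are invented:

```python
# Weights from the C1 combination above.
W1 = {"CS": 0.01, "CL": 0.5, "CG": 0.0005, "CO": 0.0005, "CA": 1.0}

def c1(costs):
    """Combined pairwise cost C1 for one putative match."""
    return sum(W1[name] * costs[name] for name in W1)

def rank_matches(matches):
    """Sort putative pairwise matches by ascending C1."""
    return sorted(matches, key=lambda m: c1(m["costs"]))
```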

Next, the reconstruction is built chad-by-chad. The top n putative matches that meet the following requirements become potential future selections for the next chad to add to the reconstruction:

• One chad in the putative match must exist in the current reconstruction. If no pieces exist in the reconstruction (i.e. the very first match has not been made yet), this requirement is ignored.

• The overlap cost between the match and current solution must remain below a specified threshold. This helps to quickly discard poor matches. An additional benefit of using the overlap cost to restrict potential matches is that it lessens the number of times the other costs must be calculated and, as a result, reduces computation time.

• The match must not have been previously seen and discarded by the oracle.

The values of n used are specified in Table I.

Minimum Number of Pieces    Number of Putative Matches
1                           20
5                           40
10                          60
20                          80
25                          100

TABLE I: Number of putative matches to examine given the number of pieces in the reconstruction

After a match is deemed plausible, the costs for eachpotential new reconstruction are calculated and standardizedbetween 0 and 1 where values closer to zero are lower cost.A new final cost is calculated for each putative match as:

C2 = 3 ∗ CS + 2.5 ∗ CO + 2.5 ∗ CG + 1.5 ∗ CI + 1.5 ∗ CH + 0.5 ∗ CL + 0.5 ∗ CC + 1.5 ∗ CT + 0.5 ∗ CX + CB + 0.5 ∗ CR (2)
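The standardization and weighted combination might be sketched as follows (the keys of `W2` mirror the subscripts in Eq. 2; the min-max scaling is our reading of "standardized between 0 and 1", and the candidate cost vectors are illustrative):

```python
# Weights from the multi-chad cost combination (Eq. 2), keyed by subscript.
W2 = {"S": 3.0, "O": 2.5, "G": 2.5, "I": 1.5, "H": 1.5, "L": 0.5,
      "C": 0.5, "T": 1.5, "X": 0.5, "B": 1.0, "R": 0.5}

def standardize(values):
    """Min-max scale a list of raw costs to [0, 1] (lower is better)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # all candidates tie on this cost
    return [(v - lo) / (hi - lo) for v in values]

def combined_c2(raw_costs):
    """raw_costs: dict mapping cost subscript -> list of raw values,
    one per candidate match. Returns the list of combined C2 scores."""
    scaled = {k: standardize(v) for k, v in raw_costs.items()}
    n = len(next(iter(raw_costs.values())))
    return [sum(W2[k] * scaled[k][i] for k in W2) for i in range(n)]
```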

The costs are sorted in ascending order, and the oracle is used to check whether each match agrees with the ground truth. If a match does not agree with the ground truth, it is discarded, and information about the match is stored so that it never appears again as a potential solution. Once the first correct match is found, the remaining potential matches are discarded.

The cycle of calculating costs, ranking matches based on the values of the costs, and adding valid matches to the reconstruction repeats until all pieces are in the reconstruction or some specified limit is reached.

The reconstruction process involves using two different cost measures. The first, pairwise matching (C1), is done as a means of restricting the solution space. As the reconstruction grows, the images used for determining costs grow. Examining only the n lowest-cost pairwise matches before recomputing costs on the entire reconstruction gives a good idea of which chad matches have the potential to be correct. Pairwise matching alone is not sufficient for accurately reconstructing the document, however. Multi-chad matching (C2) incorporates richer information and should produce better chad matches. Using pairwise matching as a filter and multi-chad matching for final decisions means that the problem of finding a valid reconstruction can be both tractable to compute and efficient in the number of guesses needed.
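The two-stage cycle described above can be sketched as a greedy, oracle-verified loop (heavily simplified; `pairwise_rank`, `c2_score`, `oracle`, and `n_fn` are hypothetical callables standing in for the components described in this section):

```python
def reconstruct(chads, pairwise_rank, c2_score, oracle, n_fn, max_guesses=10000):
    """Greedy oracle-verified reconstruction.

    pairwise_rank(solution) -> candidate matches sorted by C1
    c2_score(solution, match) -> combined multi-chad cost C2
    oracle(match) -> True if the match agrees with the ground truth
    n_fn(size) -> how many top pairwise candidates to examine (Table I)
    """
    solution, rejected, guesses = [], set(), 0
    while len(solution) < len(chads) and guesses < max_guesses:
        top_n = n_fn(len(solution))
        # Stage 1: filter with the cheap pairwise cost C1.
        candidates = [m for m in pairwise_rank(solution) if m not in rejected][:top_n]
        if not candidates:
            break
        # Stage 2: re-rank the survivors with the richer multi-chad cost C2.
        candidates.sort(key=lambda m: c2_score(solution, m))
        for match in candidates:
            guesses += 1
            if oracle(match):
                solution.append(match)  # accept the first verified match
                break
            rejected.add(match)         # never present this match again
        else:
            break                       # no candidate verified; stop
    return solution, guesses
```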

VI. RESULTS

We tested our algorithm on puzzle one of the DARPA Shredder Challenge and intend to test it on puzzle two in the near future. After computation of the pairwise match costs, the system was given approximately 48 hours to run and was able to reconstruct an 82-piece sub-solution of the document. Had the system been given more time, we expect that most of the document could have been reassembled. The partially reassembled document can be seen in Fig. 9.

We define two measures for evaluation. The first is the number of guesses needed to reconstruct n pieces of the document. A guess is defined as the presentation of a potential chad match to the oracle for verification, i.e., the number of times a human is shown a putative match for verification. The number of guesses using each individual cost and the combined costs is shown in Table II.

Algorithm 1 Function to find the relative location and spin of all chads in a solution given a list of chad pairs, c1:k, where a chad pair is given by [idi, spini, offseti] and [idj, spinj, offsetj], and offsetj is adjusted such that offseti is [0 0]. The size of a chad is a 1x2 vector containing width and height.

function RelativeLocationAndSpin
    relloc ← [ ]                        ▷ id, spin, offset
    processed ← [0 0 0 ... 0]           ▷ k zeros
    relloc(1) ← [c1.idi, c1.spini, c1.offseti]
    relloc(2) ← [c1.idj, c1.spinj, c1.offsetj]
    processed(1) ← 1, count ← 3
    while ∃ l | processed(l) = 0 do
        l ← min x | processed(x) = 0 ∧ (∃ y | cx.idi = relloc(y).id ∨ cx.idj = relloc(y).id)
        if ∃ y | cl.idi = relloc(y).id then
            if cl.spini = relloc(y).spin then
                relloc(count) ← [cl.idj, cl.spinj, relloc(y).offset + cl.offsetj]
            else
                relloc(count) ← [cl.idj, −cl.spinj, relloc(y).offset − cl.offsetj + cl.sizei − cl.sizej]
            end if
        else if ∃ y | cl.idj = relloc(y).id then
            if cl.spinj = relloc(y).spin then
                relloc(count) ← [cl.idi, cl.spini, relloc(y).offset − cl.offsetj]
            else
                relloc(count) ← [cl.idi, −cl.spini, relloc(y).offset + cl.offsetj − cl.sizei + cl.sizej]
            end if
        end if
        processed(l) ← 1
        count ← count + 1
    end while
end function
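A runnable Python translation of Algorithm 1 (our rendering; chad pairs are dicts, offsets and sizes are (x, y) tuples, and the pair list is assumed to form a connected graph of chads):

```python
def relative_locations(pairs):
    """pairs: list of dicts with keys id_i, spin_i, off_i, id_j, spin_j,
    off_j, size_i, size_j; off_* and size_* are (x, y) tuples, and off_i
    is always (0, 0). Returns {chad_id: (spin, offset)} placing every
    chad in one common coordinate frame."""
    def add(a, b): return (a[0] + b[0], a[1] + b[1])
    def sub(a, b): return (a[0] - b[0], a[1] - b[1])

    first = pairs[0]
    relloc = {first["id_i"]: (first["spin_i"], first["off_i"]),
              first["id_j"]: (first["spin_j"], first["off_j"])}
    processed = [False] * len(pairs)
    processed[0] = True
    while not all(processed):
        # Next unprocessed pair sharing a chad with the current frame
        # (raises StopIteration if the pair graph is disconnected).
        l = next(i for i, p in enumerate(pairs)
                 if not processed[i] and (p["id_i"] in relloc or p["id_j"] in relloc))
        p = pairs[l]
        if p["id_i"] in relloc:
            spin, off = relloc[p["id_i"]]
            if p["spin_i"] == spin:
                relloc[p["id_j"]] = (p["spin_j"], add(off, p["off_j"]))
            else:  # spins disagree: flip the spin and mirror the offset
                relloc[p["id_j"]] = (-p["spin_j"],
                                     add(sub(off, p["off_j"]),
                                         sub(p["size_i"], p["size_j"])))
        else:
            spin, off = relloc[p["id_j"]]
            if p["spin_j"] == spin:
                relloc[p["id_i"]] = (p["spin_i"], sub(off, p["off_j"]))
            else:
                relloc[p["id_i"]] = (-p["spin_i"],
                                     sub(add(off, p["off_j"]),
                                         sub(p["size_i"], p["size_j"])))
        processed[l] = True
    return relloc
```

Chaining two pairs A-B and B-C places C at the sum of the two offsets, as expected.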

It should be noted that no single cost performs as well as all costs combined. It should also be noted that the results for each cost are not entirely comparable with one another, because different costs lead to different sub-solutions of the document as well as different paths for arriving at the same sub-solution. Since some matches at various points in the reconstruction process might be more difficult to find than others, the selection of the first match can largely affect the subsequent solution path and the number of guesses needed.

While some costs seem to be highly discriminating, this is not always the case. For example, the text line alignment cost is a relatively non-discriminating cost because it tends to produce many ties. These ties get sorted by C1, and thus it is actually the combination of costs used in C1, with some filtering provided by the text line alignment cost, that results in a lower number of guesses.

The fact that the bigram lookup cost is the individual cost which leads to the lowest number of guesses for the 25-piece reconstruction is slightly surprising, since the document used for evaluation has few chads with distinguishable characters. Our hypothesis is that for this document, the bigram lookup cost is likely acting similarly to the text line alignment cost, in that it is able to indirectly identify when text lines align across chads by examining top lines and baselines, while adding additional discriminating ability by penalizing some instances where unusual and unlikely letter combinations appear on the chads. This hypothesis is further supported by the fact that, among individual costs, the text line alignment and character combination costs result in the second and third lowest numbers of guesses for the 25-piece reconstruction, and the first step in computing both costs is to align text lines using top and base lines.

Four of the bottom five costs are related to gaps and overlaps. This is surprising because these costs appear to make sense intuitively (see Section IV-C for discussion). While using these costs individually produces poor results, we hypothesize that they may be helpful when combined with other costs for addressing some of the more difficult cases of chad matching.

Trigram lookup also performs poorly; however, this is expected for the document used. Few chads with distinguishable characters exist in the document used for evaluation. While enough information exists on each chad to make the bigram lookup a potentially useful cost, the same cannot be said about using this information alongside the trigram lookup cost. We expect that the trigram lookup cost will produce better results on puzzle two of the DARPA Shredder Challenge, where character information is more plentiful.

Fig. 7: Number of guesses needed to reconstruct n pieces of the document as each cost is added

Number of Pieces    Shape        Line Alignment    Overlap      Gap          Histogram of Overlap Sizes    Histogram of Gap Sizes
5                   17/3.4       39/7.8            29/5.8       37/7.4       27/5.4                        37/7.4
10                  105/10.5     228/22.8          220/22       197/19.7     242/24.2                      185/18.5
15                  213/14.2     281/18.7          320/21.3     430/28.7     385/25.7                      863/57.3
20                  257/12.85    374/18.7          454/22.7     671/33.55    611/30.55                     1016/50.8
25                  438/17.5     423/16.9          621/24.8     742/29.7     754/30.2                      1258/50.3

Number of Pieces    Character Combination    Text Line Overlap    Text Line Alignment    Bigram Lookup    Trigram Lookup    Combined
5                   71/14.2                  37/7.4               31/6.2                 47/9.4           210/42            9/1.8
10                  210/21                   140/14               90/9                   95/9.5           354/35.4          18/1.8
15                  231/15.4                 267/17.8             140/9.3                151/10.1         502/33.5          75/5
20                  339/16.9                 399/20               265/13.25              211/10.55        710/35.5          151/7.55
25                  411/16.4                 471/18.8             383/15.3               320/12.8         802/32.1          235/9.4

TABLE II: Number of guesses needed to reconstruct n pieces of the document for each cost, with the average number of guesses needed to add another chad to the reconstruction

In addition to examining how well costs work in isolation, we also explore how the number of guesses decreases (or in some cases increases) as each new cost is added, as shown in Fig. 7. We start with the cost that produced the lowest number of guesses for a 25-piece reconstruction and add costs based on the results of Table II, from lowest to highest.

In general, there is a downward trend in the number of guesses needed as costs are added. It should, however, be noted that as the first few costs are added, the number of guesses rises. There are several likely reasons why this occurs. First, the weights given to each cost are manually tuned for the final combination and not for the combinations leading up to it. As a result, it is likely that the number of guesses required can be lowered further with more precise tuning. Second, as costs are added, the reconstruction paths may differ, resulting in different reconstructions, as was previously discussed. Lastly, some costs are similar to others. As previously discussed, the bigram lookup, text line alignment, and character combination costs all share the same first step of matching text lines; as a result, it is possible that they share some common characteristic at a fundamental level and thus do not complement each other well.

Fig. 8: Number of times n or fewer guesses are needed to reconstruct the next piece of the document as each cost is added

As the final few costs are added, the number of guesses needed seems to plateau and even slightly increase. While an increase in the number of guesses might seem to be a negative result, this is not necessarily the case. If adding a cost results in being able to discriminate specific, difficult cases at the expense of adding a few guesses, then the system generalizes better and should scale better as the size of the reconstruction increases. As such, trading a lower number of guesses for generalization will not necessarily have a negative impact. To explore how discriminative the costs are, we introduce a second metric.

The second metric we explore is the number of guesses needed to add the next chad to the reconstruction as each cost is added, in the same order previously specified. We examine the number of top-n guesses where n is 1, 2, 5, and 10. Highly discriminating costs should push the number of guesses needed to add a chad as close to one as possible. This will result in many chads added with low numbers of guesses needed and a few with high numbers of guesses needed. Using this information, new costs can be created that specifically target problem cases, further driving the number of guesses down. The evaluation of these metrics can be seen in Fig. 8.
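Tallying this second metric from a per-chad guess log is straightforward; a sketch (`guesses_per_chad` is a hypothetical list recording how many guesses each accepted chad required):

```python
def top_n_counts(guesses_per_chad, thresholds=(1, 2, 5, 10)):
    """Count how many chads were added within n or fewer guesses,
    for each threshold n in `thresholds`."""
    return {n: sum(1 for g in guesses_per_chad if g <= n) for n in thresholds}
```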

For all evaluations of the second metric, an upward trend is apparent. As more costs are added, the system generalizes better. When n is low, it should be noted that the pattern is more chaotic. For example, the numbers of top-1 and top-2 guesses peak when the shape correlation cost is added, in addition to when the gap histogram cost is added. While one might conclude that no additional costs should be added after the shape correlation cost, this is a fallacy: examining Fig. 7, we see that a minimum is not achieved at this point. In an ideal situation, the total number of guesses should decrease as costs are added and the number of top-n guesses should increase as costs are added.

It is interesting to note that the peaks also inform us that the shape correlation cost is a highly discriminating cost that complements the other costs well. Conversely, costs related to gaps and overlaps tend to reduce the number of top-1 and top-2 guesses, which indicates that they may not be highly discriminating. When one examines top-5 and top-10 guesses, however, costs related to gaps and overlaps significantly improve the system, suggesting that there may exist small modifications to how these costs are computed that could make them more discriminating.


VII. FUTURE WORK

We intend to continue exploring the problem of cross-cut document reconstruction. We have begun looking into using optical character recognition (OCR) systems for automatically annotating chad content. We ran preliminary experiments in which we synthetically shredded scanned pages obtained from Google Books and added noise and deformities to the edges of the produced chads. Using OCRopus, a free, open-source OCR engine that performs text line segmentation, binarization, and character recognition, we were able to achieve relatively high levels of accuracy on machine-printed text [11]. However, we were unable to achieve similar accuracy on handwritten text. Another problem encountered is that using an OCR system to annotate information about chad content would not produce the information needed for the character combination cost.

The results of our experiment show that human calibration of a large number of weights is imperfect and time-consuming. We intend to implement an algorithm for determining these weights automatically, for example, using genetic or evolutionary algorithms.
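As one illustration of such automatic tuning, a toy random-mutation hill climber (not the authors' planned implementation; `evaluate` is a hypothetical callback that would run a reconstruction with the given weights and return the number of guesses needed):

```python
import random

def tune_weights(evaluate, n_weights, iters=200, seed=0):
    """Randomized hill climbing over cost weights.

    evaluate(weights) -> score to minimize (e.g., number of guesses).
    Returns the best weight vector found.
    """
    rng = random.Random(seed)
    best = [1.0] * n_weights          # start from uniform weights
    best_score = evaluate(best)
    for _ in range(iters):
        # Perturb every weight with small Gaussian noise, clipped at zero.
        cand = [max(0.0, w + rng.gauss(0, 0.2)) for w in best]
        score = evaluate(cand)
        if score < best_score:        # keep only strict improvements
            best, best_score = cand, score
    return best
```

A genetic algorithm would replace the single candidate with a mutated and recombined population, but the fitness function is the same.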

To reduce the number of guesses needed to reconstruct documents, we intend to implement new costs. One relatively easy cost to implement, which has been shown to be effective in related work, is matching color along chad boundaries. Another new cost we plan to explore involves using contextual text information (the semantics of the document). This method has the added benefit of correcting unknown and mislabeled characters on chads.

Our system works well for documents composed of a few hundred chads but is unlikely to scale well for documents consisting of a thousand or more chads. It is worth investigating other methods for the reconstruction process, e.g., reconstructing many smaller pieces of the document and then later combining them.

VIII. CONCLUSION

We introduce a new framework for reconstructing cross-cut shredded documents. We first pre-process each chad and then perform pairwise matching between the chads. Next, we reconstruct the solution chad-by-chad and verify putative matches using an oracle. Following this procedure, we are able to reconstruct large portions of the original document in a relatively low number of guesses.

Cross-cut shredded document reconstruction is very much still an open problem. We intend to continue exploring ways to reduce the amount of human interaction needed, as well as to find more discriminating costs. Finally, we would like to investigate ways to reduce the computational complexity of our system.

ACKNOWLEDGMENTS

This research is made possible by NSF Grant #1156990. Special thanks to Professors Mubarak Shah and Niels da Vitoria Lobo.

Fig. 9: 82-piece reconstruction

REFERENCES

[1] “DARPA Shredder Challenge,” http://archive.darpa.mil/shredderchallenge/, Dec. 2011, U.S. Department of Defense.

[2] A. Deever and A. Gallagher, “Semi-automatic assembly of real cross-cut shredded documents,” in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 233–236.

[3] P. Butler, P. Chakraborty, and N. Ramakrishnan, “The deshredder: A visual analytic approach to reconstructing shredded documents,” in Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on. IEEE, 2012, pp. 113–122.

[4] H. Zhang, J. K. Lai, and M. Bacher, “Hallucination: A mixed-initiative approach for efficient document reconstruction,” in Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[5] F. Richter, C. Ries, N. Cebron, and R. Lienhart, “Learning to reassemble shredded documents,” 2013.

[6] M. Prandtstetter and G. R. Raidl, “Meta-heuristics for reconstructing cross cut shredded text documents,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. ACM, 2009, pp. 349–356.

[7] C. Schauer, M. Prandtstetter, and G. R. Raidl, “A memetic algorithm for reconstructing cross-cut shredded text documents,” in Hybrid Metaheuristics. Springer, 2010, pp. 103–117.

[8] M. Prandtstetter and G. R. Raidl, “Combining forces to reconstruct strip shredded text documents,” in Hybrid Metaheuristics. Springer, 2008, pp. 175–189.

[9] R. Ranca, “A modular framework for the automatic reconstruction of shredded documents,” in Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.

[10] F. Weinhaus, “Textcleaner for ImageMagick,” http://www.fmwconcepts.com/imagemagick/textcleaner/, Jul. 2013.

[11] T. M. Breuel, “The OCRopus open source OCR system,” DRR, vol. 6815, p. 68150, 2008.