Computers & Operations Research 38 (2011) 1826–1835
doi:10.1016/j.cor.2011.02.008
A heuristic block coordinate descent approach for controlled tabular adjustment
Jose A. Gonzalez, Jordi Castro
Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, c. Jordi Girona 1–3, 08034 Barcelona, Catalonia, Spain
ARTICLE INFO
Available online 19 February 2011
Keywords:
Statistical confidentiality
Statistical disclosure control
Controlled tabular adjustment
Mixed integer linear programming
Heuristics
Block coordinate descent
Decomposition techniques
ABSTRACT
One of the main concerns of national statistical agencies (NSAs) is to publish tabular data. NSAs have to
guarantee that no private information from specific respondents can be disclosed from the released
tables. The purpose of the statistical disclosure control field is to avoid such a leak of private
information. Most protection techniques for tabular data rely on the formulation of a large mathematical programming problem, whose solution is computationally expensive even for tables of moderate
size. One of the emerging techniques in this field is controlled tabular adjustment (CTA). Although CTA
is more efficient than other protection methods, the resulting mixed integer linear problems (MILP) are
still challenging. In this work a heuristic approach based on block coordinate descent decomposition is
designed and applied to large hierarchical and general CTA instances. This approach is compared with
CPLEX, a state-of-the-art MILP solver. Our results, from both synthetic and real tables with up to
1,200,000 cells, 100,000 of them being sensitive (resulting in MILP instances of up to 2,400,000
continuous variables, 100,000 binary variables, and 475,000 constraints) show that the heuristic block
coordinate descent has a better practical behavior than a state-of-the-art solver: for large hierarchical
instances it provides significantly better solutions within a specified realistic time limit, as required by
NSAs in real-world practice.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
National statistical agencies (NSAs) routinely disseminate both disaggregated (i.e., microdata or microfiles) and aggregated (i.e., tabular data) information. Tables are generated by crossing two or more categorical variables of a particular microfile (e.g., a census), which results in sets of tables, usually with a large number of cells. NSAs are obliged by law to guarantee that no particular information from any respondent can be disclosed from the released information. The goal of the statistical disclosure control field is to protect such sensitive information [18]. In this work we focus on tabular data. Some of the state-of-the-art research in this field can be found in the recent monographs [13–15,17,27].
Although cell tables show aggregated data for several respondents, there is a risk of disclosing individual information. The example of Fig. 1 shows a simple case. The left table (a) reports the overall net profit of companies by number of employees (row variable) and region (column variable), while table (b) provides the number of companies. If there were only one company with 51–99 employees in region r2, then any external attacker would
know the net profit of this company. For two companies, either one of them could disclose the other's net profit, becoming an internal attacker. Even if the number of companies in this cell was larger, but one of them could obtain a good estimator of another's net profit (for instance, by subtracting its own contribution from the cell value), the cell would be considered unsafe. These unsafe cells which require protection are named sensitive cells. Rules for detecting sensitive cells are beyond the scope of this work (see, e.g., [16,23] for a recent discussion).
There are several available techniques for the protection of tabular data. The most widely used nonperturbative method (i.e., one which does not change cell values) is named the cell suppression problem (CSP) [5,24]. NSAs have been considering CSP as one of the preferred options. Recently, a new perturbative approach (i.e., one in which cell values are slightly adjusted or modified) named controlled tabular adjustment (CTA) was introduced [4,11]. Although it is a recent approach, CTA is gaining recognition among NSAs. For instance, CTA appears in [23] among the list of currently available techniques for tabular data, and it is considered one of the emerging methods. In addition, we recently developed in collaboration with the NSAs of Germany and the Netherlands [21] a package for CTA in the scope of two projects funded by Eurostat, the Statistical Office of the European Communities. The goal of those projects was the safe dissemination of European business and animal production statistics by
(a)               r1       r2          (b)               r1    r2
51–99          380 M€    35 M€         51–99             20    1 or 2
100–199        700 M€   800 M€         100–199           30    35

Fig. 1. Example of disclosure in tabular data. (a) Net profit per number of employees and region. (b) Number of companies per number of employees and region. If there is only one company with 51–99 employees in region r2, then any attacker knows the net profit of this company. For two companies, any of them can deduce the other's net profit, becoming an internal attacker.
Adjusted table,                        Adjusted table,
lower protection direction             upper protection direction
                  r1       r2                            r1       r2
51–99          385 M€    30 M€         51–99          375 M€    40 M€
100–199        695 M€   805 M€         100–199        705 M€   795 M€

Fig. 2. Protected (or adjusted) tables with "lower protection direction" and "upper protection direction", considering lower and upper protection levels of five (i.e., the value of the sensitive cell is, respectively, reduced and increased by five units) for the original table of Fig. 1.
Eurostat, and CTA was used within the overall protection scheme. CTA is thus a method of actual practical relevance. It is worth noting that in recent specialized workshops on statistical disclosure control, relevant members of NSAs stated that perturbative methods, like CTA, are gaining acceptance [28], and some perturbative approaches are being used for the protection of national census tables (e.g., for Germany [20]).
Fig. 2 illustrates CTA on the small two-dimensional table of Fig. 1, considering cell (51–99, r2) as the only sensitive cell, with lower and upper protection levels of 5. Note that protection levels are either directly obtained from the rules that provide the set of sensitive cells or just computed as a percentage of the cell value. Depending on the "protection direction" of the sensitive cell, either "downward" or "upward", which has to be decided, the value to be published for this cell will be, respectively, less than or equal to the original cell value minus the lower protection level, or greater than or equal to the original cell value plus the upper protection level. In the example of Fig. 2, if the protection direction is "down", then the value published for the sensitive cell should be less than or equal to 30; a possible adjusted table for this case is shown in the left table of Fig. 2. If the protection direction is "up", then the value must be greater than or equal to 40, as shown in the right table of Fig. 2. Note that the remaining cells have to be adjusted to preserve the value of marginal cells. CTA aims at finding the protection direction of each sensitive cell (binary decisions), and the adjustments to be made to each cell (continuous variables), preserving the value of total cells (or some subset of particular cells, possibly empty), and minimizing the distance between the original and the released table.
Although CTA formulates mixed integer linear problems (MILPs) with a number of variables and constraints that is linear in the size of the table, and thus its solution can be attempted with generic state-of-the-art solvers, real-world instances are challenging and may require many hours of execution for an optimal solution (or quasi-optimal solution, e.g., with optimality gaps below 5%). In practice, however, NSAs do not require such optimal solutions, and good, feasible approximate solutions are enough. Indeed, a big effort to reach an optimal solution may not make sense, since some of the parameters of the CTA model are estimated by NSAs (as, e.g., the amount of protection to be provided to a certain table). Moreover, in practice tabular data protection is the last stage of the data cycle, and, in an attempt to meet publication deadlines, NSAs require methods that find fast solutions to protect large tables [10]. This work presents a heuristic procedure based on block coordinate descent (BCD) [3,9] for the solution of CTA formulated as a MILP. As will be shown, given a practical time limit (i.e., from some minutes to 1 h), for some kinds of tables, namely the hierarchical tables discussed below, which are of great practical interest for NSAs, BCD consistently provided better solutions than a general state-of-the-art solver such as CPLEX. For general tables, although the benefits of BCD were not so significant, it was also the most efficient approach for the largest, real-world instances.
As far as we know, this is the first heuristic approach for real-world large CTA instances. A previous work [22] applied some metaheuristics and learning heuristics only to small tables (two-dimensional tables of up to 625 cells), while we consider more complex hierarchical tables of about 1,200,000 cells (100,000 of them sensitive). We also tried other general metaheuristics, such as genetic algorithms, without success. Indeed, combinations or slight modifications of solutions are not expected to satisfy the large number of linear equations with no particular structure of CTA. Therefore, for the purpose of guaranteeing this large number of linear constraints, any practical heuristic should rely on MILP technology, as BCD does.
Briefly, BCD is a simple strategy that decomposes the problem in blocks (or clusters) of variables, and sequentially solves the MILP subproblem for each block. Therefore, in practice BCD behaves best when the MILP variables may be clustered, and the clusters are loosely coupled. For this reason in this work we mainly focus on hierarchical tables. Hierarchical tables are obtained by crossing a number of categorical variables, some of which have a hierarchical structure, i.e., some tables are subtables of other tables. Hierarchical tables are of very practical interest for NSAs. The simplest hierarchical table is obtained by crossing two categorical variables, one of them being hierarchical; this particular case is known as 1H2D tables. Fig. 3 illustrates a small 1H2D table: rows R1 and R3 (related, e.g., to state-level information) are decomposed into two subtables (whose rows are, for example, related to region-level information); rows of
[Fig. 3 tables: a parent table with rows R1–R3 and columns C1–C3, and the two subtables into which rows R1 and R3 split (rows R11–R13 and R31–R33, with the same columns).]

Fig. 3. Example of 1H2D table: the row factor in the left table splits into one subtable per row (only two are shown). The number of rows may be different at each subtable, but the number of columns is the same as in the parent table.
these two subtables could be subsequently decomposed (e.g., for city-level information). 1H2D tables can be viewed as a tree of subtables. Note, however, that BCD is not tailored to hierarchical tables. Indeed, in this work it was also applied to general complex tables, though the most successful results were obtained for hierarchical tables. This is partly explained by the inherent block structure of hierarchical tables. Some attempts at extracting the block structure of general tables were made [6] but, in general, the resulting blocks were far too coupled for complex instances. Note that, in practice, tables (either 1H2D or general) are represented by a set of linear equations, each of them imposing that the sum of some set of cells equals a given marginal cell. This model will be used in the CTA formulation of the next section.
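As a concrete illustration of this representation (the indexing scheme below is our own choice, not the paper's), the linear relations of a plain two-dimensional table with row, column and grand totals can be generated as follows:

```python
def table_relations(nrows, ncols):
    """Relations Aa = 0 of a 2D table with row totals, column totals and a
    grand total.  Internal cells are indexed 0..nrows*ncols-1 (row-major),
    followed by the row-total cells, the column-total cells and the
    grand-total cell.  Each relation maps cell index -> coefficient, with
    +1 on the summed cells and a single -1 on the marginal cell."""
    def cell(i, j):
        return i * ncols + j
    def row_tot(i):
        return nrows * ncols + i
    def col_tot(j):
        return nrows * ncols + nrows + j
    grand = nrows * ncols + nrows + ncols
    rels = []
    for i in range(nrows):                      # each row sums to its row total
        r = {cell(i, j): 1 for j in range(ncols)}
        r[row_tot(i)] = -1
        rels.append(r)
    for j in range(ncols):                      # each column sums to its column total
        r = {cell(i, j): 1 for i in range(nrows)}
        r[col_tot(j)] = -1
        rels.append(r)
    r = {row_tot(i): 1 for i in range(nrows)}   # row totals sum to the grand total
    r[grand] = -1
    rels.append(r)
    return rels
```

Each relation is one row of the matrix A of the next section: coefficients of 1 and a single −1 on the marginal, exactly the structure the paper describes.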
There are other approaches similar to BCD, namely fix-and-relax and branch-and-fix, which have been used in other large MILPs arising, respectively, in scheduling and stochastic programming problems [1,12]. Like BCD, fix-and-relax decomposes the problem in $k$ blocks of binary variables. It iteratively optimizes each block $i$ of binary variables, fixing the variables of blocks $j < i$, and relaxing the variables of blocks $j > i$, unlike BCD, which also fixes the variables of blocks $j > i$. Moreover, fix-and-relax performs one single iteration, whereas BCD keeps on iterating while the solution is improved. Branch-and-fix is a significantly different procedure, but it has in common with the other approaches the idea of decomposing a large MILP in smaller subproblems, optimizing all of them, in a coordinated way, by fixing some common binary variables. Full details can be found in [1].
The structure of the paper is as follows. Section 2 outlines the general MILP formulation of CTA. Section 3 introduces the BCD procedure for CTA, and describes an approach based on the Boolean satisfiability problem (SAT for short, hereafter) for obtaining good initial feasible solutions. Section 4 provides some implementation details and reports the computational results using a set of synthetic and real-world tables.
2. Formulation of CTA as a MILP
Any instance of CTA, either with one table or a number of tables, can be represented by the following elements:
- An array of cells $a_i$, $i = 1,\dots,n$, satisfying a set of $m$ linear relations $Aa = b$, $a \in \mathbb{R}^n$ being the vector of $a_i$'s, $b \in \mathbb{R}^m$ the right-hand-side term of the linear relations (usually $b = 0$), and $A \in \mathbb{R}^{m \times n}$ the cell relations matrix. Note that the rows of $A$ are made of 1's, 0's and a single $-1$ (associated to the marginal or total cell value).
- A vector $w \in \mathbb{R}^n$ of positive weights for the deviations of cell values, used in the definition of the objective function.
- Lower and upper bounds for each cell $i = 1,\dots,n$, respectively $l_{x_i}$ and $u_{x_i}$, which are considered to be known by any data attacker. If no previous knowledge is assumed for cell $i$, then $l_{x_i} = 0$ ($l_{x_i} = -\infty$ if $a \ge 0$ is not required) and $u_{x_i} = +\infty$ can be used.
- A set $\mathcal{S} = \{i_1, i_2, \dots, i_s\} \subseteq \{1,\dots,n\}$ of indices of the $s$ sensitive or confidential cells.
- Lower and upper protection levels for each confidential cell $i \in \mathcal{S}$, respectively $lpl_i$ and $upl_i$, such that the released values $x \in \mathbb{R}^n$ must satisfy either $x_i \ge a_i + upl_i$ or $x_i \le a_i - lpl_i$.
CTA attempts to find the closest safe values $x$, according to some distance $L$, which make the released table safe. This involves the solution of the following optimization problem:

$\min_x \ \|x - a\|_L$
subject to $Ax = b$
$l_x \le x \le u_x$
$x_i \le a_i - lpl_i$ or $x_i \ge a_i + upl_i$, $\quad i \in \mathcal{S}$.  (1)
Problem (1) can also be formulated in terms of deviations from the current cell values. Defining

$z = x - a$, $\quad l_z = l_x - a$, $\quad u_z = u_x - a$,  (2)

and using the $L_1$ distance weighted by $w$, (1) can be recast as

$\min_z \ \sum_{i=1}^{n} w_i |z_i|$
subject to $Az = 0$
$l_z \le z \le u_z$
$z_i \le -lpl_i$ or $z_i \ge upl_i$, $\quad i \in \mathcal{S}$,  (3)

$z \in \mathbb{R}^n$ being the vector of deviations. Since $w > 0$, introducing variables $z^+, z^- \in \mathbb{R}^n$ so that $z = z^+ - z^-$, the absolute values may be written as $|z| = z^+ + z^-$. Considering a vector of binary variables $y \in \{0,1\}^s$ for the "or" constraints, problem (3) is finally written as an MILP:
$\min_{z^+, z^-, y} \ \sum_{i=1}^{n} w_i (z_i^+ + z_i^-)$  (4a)
subject to $A(z^+ - z^-) = 0$  (4b)
$0 \le z^+ \le u_z$, $\quad 0 \le z^- \le -l_z$  (4c)
$y \in \{0,1\}^s$  (4d)
$upl_i\, y_i \le z_i^+ \le u_{z_i}\, y_i$ and $lpl_i (1 - y_i) \le z_i^- \le -l_{z_i} (1 - y_i)$, $\quad i \in \mathcal{S}$.  (4e)

When $y_i = 1$ the constraints mean $upl_i \le z_i^+ \le u_{z_i}$ and $z_i^- = 0$, thus the protection direction is "upward"; when $y_i = 0$ we get $z_i^+ = 0$ and $lpl_i \le z_i^- \le -l_{z_i}$, thus the protection direction is "downward".
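To make the formulation concrete, the sketch below solves the tiny instance of Figs. 1 and 2 by brute force; the encoding (unit weights, all margins preserved, protection levels $lpl = upl = 5$ for the single sensitive cell) is our own reading of the example, not code from the paper.

```python
def solve_toy_cta(lpl=5, upl=5, search=20):
    """Minimize the weighted L1 deviation subject to Az = 0 and the
    disjunctive protection constraint of the sensitive cell (51-99, r2)."""
    best = None
    # Preserving every row and column total of a 2x2 table forces the
    # deviation pattern (-d, d, d, -d), where d is the deviation of the
    # sensitive cell; so a one-dimensional scan over d is exhaustive.
    for d in range(-search, search + 1):
        if not (d <= -lpl or d >= upl):   # "or" protection constraint
            continue
        z = (-d, d, d, -d)                # (z11, z12, z21, z22)
        cost = sum(abs(v) for v in z)     # L1 distance with unit weights
        if best is None or cost < best[0]:
            best = (cost, z)
    return best
```

For $lpl = upl = 5$ the minimum cost is 20, with the sensitive cell moved by 5 and every other cell compensating by 5: exactly the adjusted tables of Fig. 2.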
3. The heuristic block coordinate descent approach for CTA
Block coordinate descent (BCD) solves a sequence of subproblems, each of them optimizing the objective function over a subset of variables while the remaining variables are kept fixed. This is iteratively repeated until no improvement in the objective function is achieved, e.g., the difference between some (for instance, two) consecutive objective functions is less than a specified optimality tolerance. Convergence of this algorithm is only guaranteed for convex problems where each optimization subproblem has a unique optimizer [3, Prop. 2.7.1] (note that strict convexity satisfies this requirement). Although MILP problems are not convex, and thus convergence is not guaranteed, BCD usually behaves properly in
practical complex applications [9,26], and it can be used as a heuristic approach.
For the particular case of CTA, if the number of sensitive cells $s$ is very large, optimal solutions may require computationally prohibitive executions. BCD may provide good approximate solutions by optimizing at each iteration the protection directions (either "downward" or "upward") of a subset of sensitive cells, and the deviations for all the cells. The protection directions of the remaining sensitive cells are kept constant at the optimal values of previous iterations. Note that the continuous variables of the problem (the deviations for all the cells) are never fixed; blocking and fixing is only performed for the binary variables, which is a significant difference with the standard BCD method. Partitioning the binary variables $y$ of (4a)–(4e) into $k$ blocks, and denoting $y^{j,i}$ as the fixed values of block $j$ at inner iteration $i$, the algorithm is roughly as follows:
Step 0 (Initialization): Set the outer iteration counter t = 0. Set initial values, hopefully feasible, to y.
Step 1: t = t + 1. Set the inner iteration counter i = 0. Divide y into k blocks, $y = \{y^{1,i}, \dots, y^{k,i}\}$, not necessarily of the same size.
  Step 1.1: i := i + 1. Solve (4a)–(4e) with respect to block $y^{i,i}$, taking into account that $y^{j,i}$ is fixed for $j \ne i$. Let $y^{i,i+1} = (y^{i,i})^*$ (the point at the optimum). Let $y^{j,i+1} = y^{j,i}$ for $j \ne i$.
  Step 1.2: If i < k, go to Step 1.1.
Step 2: Check the end conditions: if they apply, stop and return the current best solution. Otherwise, go to Step 1.

Note that the original problem (4a)–(4e) is solved if only one block of variables is considered. Therefore, although BCD is a heuristic approach for MILP problems, it is easily switched to an optimal approach by setting k = 1 at Step 1 for some advanced t. The subproblems of Step 1.1 may be solved by any MILP method; we used the branch-and-cut solver of CPLEX.
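The steps above can be sketched in a minimal, self-contained form (this is not the authors' C++ implementation: the MILP subproblem of Step 1.1 is replaced by exhaustive enumeration of one binary block, which is enough to show the fix-and-reoptimize mechanics):

```python
from itertools import product

def block_coordinate_descent(f, n, k, y0, max_outer=20):
    """Heuristically minimize f over y in {0,1}^n using k blocks."""
    y = list(y0)
    # Simple interleaved partition into k blocks (a stand-in for the
    # random-BCD / tree-BCD partitions discussed in the text).
    blocks = [list(range(j, n, k)) for j in range(k)]
    best = f(y)
    for _ in range(max_outer):               # outer iterations (t)
        prev = best
        for block in blocks:                 # inner iterations (i)
            # "Solve" the subproblem: enumerate this block, others fixed.
            for assign in product((0, 1), repeat=len(block)):
                trial = y[:]
                for idx, val in zip(block, assign):
                    trial[idx] = val
                cost = f(trial)
                if cost < best:              # keep the block's optimizer
                    best, y = cost, trial
        if best >= prev:                     # no improvement: stop
            break
    return y, best
```

On a separable objective (e.g., Hamming distance to a target pattern) this loop reaches the optimum in one outer pass; on coupled objectives it is only a heuristic, as the text notes.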
Different strategies can be considered for the division of the variables y between consecutive major iterations, e.g., reversing the order of blocks, changing the blocks, etc. Using a true partition of the sensitive cells means that each cell has just one chance of being "upper" or "lower" protected at each major iteration. However, this is not strictly needed, and in practice some sensitive cells could belong to more than one block. In this case the decision variables associated to these cells would be determined more than once for some t.
Two strategies have been tested, which can be viewed as a framework whose implementation would admit further possibilities. The first strategy (named random-BCD) divides $\mathcal{S}$ randomly into a number of blocks, keeping their sizes as similar as possible. The partition is obtained by shuffling the variables such that different blocks are considered at major iterations.
The second strategy (that will be referred to as tree-BCD) is tailored for 1H2D hierarchical tables, exploiting the tree structure
[Fig. 4 diagram: a tree of subtables with root 0 at level 1, children 0.1 and 0.2 at level 2, and grandchildren 0.1.1, 0.1.2, 0.2.1 and 0.2.2 at level 3.]

Fig. 4. Example of the tree structure of a 1H2D table, variables being partitioned by levels. Boxes refer to the set of sensitive variables in a subtable. Groups of boxes closed by dashed or dotted lines form a block of variables to be optimized together in the first major iteration.
of these kinds of tables. In the first major iteration, the sensitive cells are partitioned according to their level: the first block is composed of all the variables of cells in the main table (level 1, or table 0, as seen in Fig. 4); then, the second block takes the variables in the next level (0.1, 0.2, and so on); the third block considers all the variables in level 3 (0.1.1, 0.1.2, etc.), and so forth until the deepest level. Once the first major iteration is finished, the second one builds overlapping blocks of cells: the first block now includes level 1 and level 2; the second one, level 2 and level 3, etc. A third major iteration makes blocks from three consecutive levels. The last major iteration would include all the levels in one block, as a pure CTA problem; if the time limit has not been reached yet, it could benefit from a warm-start from previous solutions.
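The block construction just described can be sketched as follows (a simplification of tree-BCD under our own assumptions: sensitive variables are given as one flat list per hierarchy level, and major iteration t takes sliding windows of t consecutive levels):

```python
def level_blocks(levels, t):
    """Blocks of sensitive variables for major iteration t of tree-BCD:
    sliding windows of t consecutive hierarchy levels (levels[0] holds the
    root table's variables).  When t reaches the number of levels, a single
    block remains, i.e., the full CTA problem."""
    if t >= len(levels):
        return [[v for lvl in levels for v in lvl]]
    return [[v for lvl in levels[j:j + t] for v in lvl]
            for j in range(len(levels) - t + 1)]
```

For the three-level tree of Fig. 4, iteration 1 yields one block per level, iteration 2 yields the overlapping blocks {level 1, level 2} and {level 2, level 3}, and iteration 3 collapses everything into one block.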
The previous breadth-first strategy was compared with a depth-first strategy, which was also implemented. In the latter, blocks were formed with all the subtables from the root to a leaf, taking only one subtable per level. Note that the derived blocks are significantly overlapped. This strategy was not satisfactory, and its results are not reported in the computational results of Section 4.
One of the main drawbacks of BCD is that dual information for the whole problem (4a)–(4e) is not obtained, and thus the stopping rule only focuses on improvements between consecutive iterations. Note, however, that CTA is a minimum distance problem (1) and that zero is a readily available lower bound. Another main drawback of BCD is that it may not be able to obtain a feasible solution, unless an initial set of feasible "up" or "down" protection directions for y is set at Step 0. For complex and large instances, looking for an initial feasible pattern of protection directions for y is a hard problem (theoretically, as hard as finding the optimal solution). The next subsection describes a heuristic strategy that in practice was very useful. Indeed, as will be noted in the computational results of Section 4, this heuristic for CTA instances provided better initial solutions than the several strategies tested with CPLEX.
3.1. Finding a feasible starting point. The SAT method
There are several general approaches for finding an initial point in MILP problems (e.g., variable and node selection, feasibility pump, etc.). Many of them can be found in [8], and are implemented in state-of-the-art solvers. An obvious approach for starting BCD with a feasible point would be to run any of those solvers until the first feasible solution for constraints (4a)–(4e) is found. However, this approach would not exploit the particular structure of CTA. To avoid this lack of exploitation, a particular strategy was developed, which has proven to be successful in most instances. Briefly, this approach consists of two phases: first, the set of constraints Az = 0 is scanned to locate those involving sensitive cells, and each constraint of this subset is analyzed to identify possible infeasible combinations for the protection directions y of sensitive cells; and second, all the forbidden combinations are
compiled together, and an assignment is sought so that none of these combinations is present. For example, assuming cell values are nonnegative, if one of the original table relations is

$1_2 + 3_4 + 4_5 + 12_7 = 20_{10}$,

where the subindex denotes the cell index, and the lower and upper protection levels of sensitive cells 4 and 7 are, respectively, $lpl_4 = upl_4 = 2$ and $lpl_7 = upl_7 = 4$, then $y_4 = y_7 = 1$ (i.e., protection direction "up" for sensitive cells 4 and 7) is a forbidden solution. Indeed, note that in the protected table this relation would then be

$x_2 + 5 + x_5 + 16 = 20$,

which is infeasible since $(x_2, x_5) \ge 0$. The above two phases are outlined below.
The first phase exploits the particular structure of the table relations. Since a table can be described through linear combinations of the cell contents, after rearranging terms, constraint $j$ of $Az = 0$ of (3) can be recast as

$\sum_{i \in I_j} m_{ij} z_i = \sum_{i \in I'_j} -m_{ij} z_i$,  (5)

where $I_j \subseteq \mathcal{S}$ and $I'_j \subseteq \{1,\dots,n\} \setminus \mathcal{S}$ are, respectively, the sets of sensitive and nonsensitive cells in constraint $j$, and the coefficients $m_{ij}$ are elements of the matrix $A$, either 1 or $-1$. Next, bounds are computed for each side of (5). The right-hand side does not depend on $y$, and therefore its bounds are

$\sum_{i \in I'_j:\, m_{ij} > 0} -m_{ij}(u_{x_i} - a_i) \;+\; \sum_{i \in I'_j:\, m_{ij} < 0} m_{ij}(a_i - l_{x_i}) \;\le\; \sum_{i \in I'_j} -m_{ij} z_i \;\le\; \sum_{i \in I'_j:\, m_{ij} > 0} m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I'_j:\, m_{ij} < 0} -m_{ij}(u_{x_i} - a_i)$.  (6)
For some assignment of $y$, the bounds of the left-hand side are

$\sum_{i \in I_j:\, y_i = 0,\, m_{ij} > 0} -m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} < 0} -m_{ij}\, lpl_i \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} > 0} m_{ij}\, upl_i \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} < 0} m_{ij}(u_{x_i} - a_i) \;\le\; \sum_{i \in I_j} m_{ij} z_i \;\le\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} > 0} -m_{ij}\, lpl_i \;+\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} < 0} -m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} > 0} m_{ij}(u_{x_i} - a_i) \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} < 0} m_{ij}\, upl_i$.  (7)
The procedure consists of searching for any y such that the bounds of both sides mismatch (i.e., either the lower bound of the left-hand side is greater than the upper bound of the right-hand side, or the upper bound of the left-hand side is less than the lower bound of the right-hand side).
Provided that the number of sensitive variables in each constraint, $|I_j|$, is small (say, less than 20), exhaustive search could be used to find every infeasible combination. However, the cost is prohibitive even for moderate $|I_j|$, so we have developed a simple but very efficient method to detect the infeasible combinations in a constraint. At first, the procedure determines which combination of y gives the highest value of the lower bound of (7). If this value is greater than the upper bound of (6), one infeasible combination has been found, and others may be around. Therefore, starting from the first one, we go through each combination that arises by changing only one direction and that has not been visited before. The combination is pruned, and hence discarded, if the resulting lower bound of (7) falls below the upper bound of (6); otherwise it is included in a list of combinations to be explored, until all of them have been pruned. The same process is repeated for combinations under the lower bound of (6). Usually only a few combinations appear at each constraint, so the search is very efficient, whatever the number of sensitive cells in it.
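The bound test of (6) and (7) is easy to state in code. The sketch below uses plain enumeration of all $2^{|I_j|}$ direction assignments rather than the pruned search described above, and it encodes the worked example under the assumption that the marginal cell is held fixed (so the nonsensitive side contains only cells 2 and 5):

```python
import math
from itertools import product

INF = math.inf

def lhs_bounds(sens, y):
    """Bounds (7) of the sensitive side for a fixed direction vector y.
    sens maps cell -> (m, lpl, upl, lz, uz)."""
    lo = hi = 0.0
    for i, (m, lpl, upl, lz, uz) in sens.items():
        if y[i] == 1:                    # "up": z_i in [upl, uz]
            lo += m * (upl if m > 0 else uz)
            hi += m * (uz if m > 0 else upl)
        else:                            # "down": z_i in [lz, -lpl]
            lo += m * (lz if m > 0 else -lpl)
            hi += m * (-lpl if m > 0 else lz)
    return lo, hi

def rhs_bounds(nonsens):
    """Bounds (6) of the nonsensitive side, a sum of -m * z_i with
    z_i in [lz, uz].  nonsens maps cell -> (m, lz, uz)."""
    lo = hi = 0.0
    for m, lz, uz in nonsens.values():
        lo += -m * (uz if m > 0 else lz)
        hi += -m * (lz if m > 0 else uz)
    return lo, hi

def forbidden(sens, nonsens):
    """All direction assignments whose interval (7) cannot meet interval (6)."""
    rlo, rhi = rhs_bounds(nonsens)
    cells = sorted(sens)
    bad = []
    for bits in product((0, 1), repeat=len(cells)):
        y = dict(zip(cells, bits))
        llo, lhi = lhs_bounds(sens, y)
        if llo > rhi or lhi < rlo:       # intervals cannot intersect
            bad.append(y)
    return bad
```

On the worked example (cells 2, 4, 5, 7 with values 1, 3, 4, 12, nonnegative and unbounded above), the only assignment detected as forbidden is $y_4 = y_7 = 1$.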
In the second phase, all the infeasible combinations for y detected in the first phase are collected, and a solution that avoids all of them is sought. Note that avoiding all these infeasible combinations is a necessary but not a sufficient condition for an initial feasible point. However, in practice, the resulting solution was generally feasible. This problem is known as Boolean satisfiability (SAT), and it consists of determining a satisfying variable assignment for a Boolean function, or determining that no such assignment exists. The subject of practical SAT solvers has received considerable research attention, with many algorithms proposed and implemented, e.g., GRASP [25] and SATO [29]. The solver used in this work was MiniSAT [19], an open-source implementation of SAT techniques.
In general, a solver operates on problems specified in conjunctive normal form. This form consists of the logical "and" of one or more clauses (each related to one infeasible combination), which themselves consist of the logical "or" of one or more literals (related to the y variables). For instance, suppose that three combinations have been detected as infeasible:

(1) $y_1 = 1, y_2 = 0, y_3 = 1, y_4 = 1 \Rightarrow y_1 \wedge \neg y_2 \wedge y_3 \wedge y_4$
(2) $y_3 = 1, y_2 = 1, y_4 = 1 \Rightarrow y_3 \wedge y_2 \wedge y_4$
(3) $y_5 = 1, y_2 = 0, y_1 = 1 \Rightarrow y_5 \wedge \neg y_2 \wedge y_1$.

None of these combinations is desired, so we seek an assignment of y such that all of them are false. Equivalently, we can negate them to find an assignment satisfying (as true) all clauses:

$(\neg y_1 \vee y_2 \vee \neg y_3 \vee \neg y_4) \wedge (\neg y_3 \vee \neg y_2 \vee \neg y_4) \wedge (\neg y_5 \vee y_2 \vee \neg y_1)$.

This expression can be satisfied with $\neg y_1 \wedge \neg y_3$, or numerically $y_1 = 0$ and $y_3 = 0$. This condition is sufficient for the SAT problem, so the other variables can take any value.
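The translation from forbidden combinations to CNF clauses, and a toy satisfiability check, can be sketched as follows (a brute-force stand-in for MiniSAT, usable only for a handful of variables):

```python
from itertools import product

def forbidden_to_cnf(combos):
    """Negate each forbidden partial assignment into one CNF clause.
    combos: list of {var: 0/1}; a positive literal +v means v is true,
    a negative literal -v means v is false."""
    return [[-v if val == 1 else v for v, val in sorted(c.items())]
            for c in combos]

def brute_force_sat(clauses, nvars):
    """Try every assignment until one satisfies all clauses (or fail)."""
    for bits in product((0, 1), repeat=nvars):
        assign = {v: bits[v - 1] for v in range(1, nvars + 1)}
        if all(any((lit > 0) == bool(assign[abs(lit)]) for lit in cl)
               for cl in clauses):
            return assign
    return None
```

Feeding in the three infeasible combinations above yields exactly the three clauses of the text, and any satisfying assignment returned by the solver avoids every forbidden combination.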
Once a solution of the corresponding SAT problem is available, problem (4a)–(4e) with y fixed can be checked to determine whether a feasible solution for $z^+$ and $z^-$ exists. Though the assignment obtained for y does not guarantee the feasibility of the constraints without sensitive variables, it was usually found that SAT returned a feasible point, so that BCD was ready to start. In those few cases where the SAT assignment failed, other starting y values were found by a MILP solver, stopping at the first feasible point. The unsuccessful SAT values of y were used as a warm-start for the MILP solver.
4. Computational results
4.1. Implementation
The BCD approach, including the SAT heuristic for the binary initial point, was implemented in C++. The code allows the user to switch between random-BCD, tree-BCD (only the breadth-first strategy, the most efficient option), and the solution of CTA by branch-and-cut. Note that BCD subproblems are solved with this same branch-and-cut for a fair comparison. The SAT heuristic may be activated to provide an initial point for the branch-and-cut method as well, such that the benefits of either the SAT heuristic, or the BCD approach, or both of them together, can be isolated. The optimizer used was CPLEX version 11.
The BCD variants may be tuned with some parameters chosen by the user. The most significant parameters are the total time limit (also applicable to the branch-and-cut option); the time limit for each subproblem; the optimality gap for each subproblem; and the number of blocks (subproblems) to be considered. Note that a time limit is required by NSAs for data publication deadlines in the real world. The random-BCD variant is also
affected by the randomness of the block selection. After exploring the effect of these factors with several instances, it can be concluded that BCD is barely sensitive to both random variability and the user adjustments of parameters. No significant differences were found if the optimality gap of subproblems is set to values between 1% and 10% (with smaller gaps, subproblems may achieve better solutions, but they take more time, so fewer subproblems can be solved due to the overall time limit). For the total time limit, the larger it is, the better the solution found; the subproblem time limit should be large enough to improve the initial provided solution, but not so large that most of the total time is spent on too few subproblems.
With regard to the number of blocks in the random strategy, it was observed that it is inadvisable to take a large number, since each subproblem would then consider too few sensitive variables at once, which is unlikely to improve the previous solution. In our tests, values from 2 to 40 blocks were chosen, noticing that low numbers (say, below 10) are preferable, with no significant differences between them. However, in general it is recommended to take more than three or four blocks (or even more, depending on the table size), since lower numbers lead to large subproblems that could take as much time as the original problem.
4.2. Test instances
The BCD approach was tested using both 1H2D (hierarchical) and general complex tables. Real-world tables are provided by NSAs as a set of equations, without information about the particular inherent structure (which is difficult to extract by general procedures [6]). Hierarchical tables were thus obtained by a generator of 1H2D synthetic tables.
Some of the parameters of the generator controlling the dimensions of the table are: mean number of rows in a table; number of columns per subtable; depth of the hierarchical tree; minimum and maximum number of rows with hierarchies for
Table 1Characteristics of instances.
Instance n s m
Table 11 20,280 973 1560
Table 12 21,476 2062 1684
Table 13 36,806 3345 5832
Table 14 5388 224 1474
Table 15 26,884 2443 3368
Table 16 52,063 4732 6900
Table 17 16,852 1531 2390
Table 18 8316 755 1339
Table 21 126,362 12,324 5501
Table 22 43,365 4128 3577
Table 23 71,640 10,442 3430
Table 24 166,248 12,927 7966
Table 25 55,620 4323 2877
Table 26 209,456 16,110 12,164
Table 27 65,241 4744 7240
Table 28 88,164 6854 4321
Table 31 155,841 14,744 6876
Table 32 443,169 44,096 18,230
Table 33 116,841 14,083 4586
Table 34 180,999 26,432 6456
Table 35 499,298 55,527 20,747
Table 36 1,200,439 107,743 45,638
Table 37 296,004 42,652 10,904
Table 38 572,373 81,359 18,873
hier13�13�13d 2197 108 3549
hier16 3564 224 5484
ninenew 6546 858 7340
nine12 10,399 1178 11,362
each table; and probability for a cell to be marked as sensitive.The random generator is available from /http://www-eio.upc.es/� jcastro/generators_csp.htmlS. A set of 24 representative andlarge 1H2D tables was generated. Their main dimensions arereported in Table 1. Columns n, s, m and ‘‘N. coef.’’ show,respectively, the number of cells, sensitive cells, table linearrelations (i.e., rows of matrix A), and nonzero coefficients ofmatrix A. Columns ‘Cont.’’, ‘‘Bin.’’ and ‘‘Constr.’’ show the numberof continuous variables, binary variables, and constraints of theresulting MILP problem (4a)–(4e). The generator parameters wereset such that tables 31–38 are large, 21–28 are medium-sizetables, and 11–18 are the smaller ones. The depth of thehierarchical tree of each instance is six, but tables 16 and 17(depth is five) and table 18 (depth is four).
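As a consistency check (my observation, not stated explicitly in the text), the MILP sizes in Table 1 appear to follow directly from formulation (4a)–(4e): each cell contributes two continuous variables (Cont. = 2n), each sensitive cell one binary (Bin. = s), and Constr. = m + 4s, i.e., the m table relations plus four protection constraints per sensitive cell. A quick script verifying this on a few rows:

```python
# (n, s, m, cont, bin_, constr) taken from rows of Table 1
rows = {
    "Table 11": (20280, 973, 1560, 40560, 973, 5452),
    "Table 36": (1200439, 107743, 45638, 2400878, 107743, 476610),
    "hier16":   (3564, 224, 5484, 7128, 224, 6380),
    "nine12":   (10399, 1178, 11362, 20798, 1178, 16074),
}

for name, (n, s, m, cont, bin_, constr) in rows.items():
    assert cont == 2 * n          # two continuous deviation variables per cell
    assert bin_ == s              # one binary per sensitive cell
    assert constr == m + 4 * s    # table relations + 4 constraints per sensitive cell
```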
For the general tables, whose dimensions are also reported in Table 1, we considered four complex instances of CSPLIB (a library of tabular data protection instances, available from http://webpages.ull.es/users/casc/#CSPlib).
4.3. Results
Both BCD variants, random-BCD and tree-BCD, and the state-of-the-art branch-and-cut of CPLEX were compared using the previous set of instances. For the 1H2D tables, the sequence of feasible solutions obtained by each method was recorded until the time limit was reached. Our goal was to show that, in the case of early interruption of the optimization process, the quality of the best solution reached using BCD was similar to or better than that of the best solution provided by CPLEX branch-and-cut. All the executions were carried out on a Linux Dell PowerEdge 6950 server with four dual-core AMD Opteron 8222 3.0 GHz processors (without exploiting parallelism) and 64 GB of RAM.
Figs. 5 and 6 show the evolution of the feasible solutions obtained for, respectively, the medium-size and the large instances. Results for the smallest tables, not shown, revealed a
[Fig. 5 about here: eight panels (cases 21–28) plotting the best objective value against CPU time for tree-BCD, BCD, SAT B&C and B&C.]

Fig. 5. Sequences of feasible solutions for tables 21–28. The horizontal axis shows the CPU time (in seconds) and the vertical axis the best objective function value achieved. The time limit was 3 h. The branch-and-cut solution for table 26 is not shown because the best value of its objective function was 5.33×10^9.
similar pattern, though the differences between the methods were less significant. In those figures, lines "BCD" refer to random-BCD, "B&C" to standard branch-and-cut solutions, and "SAT B&C" to branch-and-cut started from the SAT solution. This latter option allows comparing BCD and branch-and-cut from the same starting point. For these test tables the SAT technique returned feasible points in all cases, and they were of better quality than the first feasible points provided by CPLEX, which were usually of very poor quality.
Five blocks of cells were considered for instances 11–18 with random-BCD, while ten blocks were used for instances 21–28 and 31–38. For the instances of Figs. 5 and 6, the total time limit was set to 3 h. For both BCD variants, the subproblem time limit was 1 h. In all cases except table 14, branch-and-cut exhausted the time limit without reaching an optimality gap of 5% (table 14 took less than 15 min to be solved, but it is significantly smaller than the rest). In some cases the total running time was somewhat larger than 3 h, as some BCD subproblems exhausted the subproblem time limit, thus exceeding the total time. This may happen occasionally with random-BCD (as in table 38), but is more usual with tree-BCD, because it solves a short sequence of subproblems of increasing difficulty. In the first major iterations an optimal solution to the
[Fig. 6 about here: eight panels (cases 31–38) plotting the best objective value against CPU time for tree-BCD, BCD, SAT B&C and B&C.]

Fig. 6. Sequences of feasible solutions for tables 31–38. The horizontal axis shows the CPU time (in seconds) and the vertical axis the best objective function value achieved. The branch-and-cut solutions for all tables but numbers 31 and 33 are not shown because the best values of their objective functions were about 1000 times the values presented here.
subproblems may quickly be found; in the last major iterations, the subproblems may take a considerable amount of time. On the other hand, random-BCD usually consists of a longer sequence of subproblems, which are usually solved faster, though new solutions do not always improve on the previous one.
With regard to the smallest instances 11–18, the objective function reaches an almost steady state relatively soon, and the BCD variants were faster in achieving an acceptable solution. Only table 14 could be solved by the branch-and-cut method within an optimality tolerance of 5%, and table 18 finished with a gap of just 5.20%. Longer computation times were not of much help: for instance, tables 13 and 16 were run with a time limit of two days of CPU time. The best objective function value for table 13 was 453,459, with an optimality gap of 45%; for table 16 the best objective function value was 608,223, with a gap of 55%. Although both solutions are slightly better than those reached by BCD within 2 h, the improvement in the objective function is not worth the computational effort.
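For reference (my assumption about the convention, since the paper does not define it), the relative optimality gap reported by MILP solvers is typically (incumbent − best bound) / |incumbent|; for example, the 45% gap for table 13 implies a best lower bound of roughly 0.55 × 453,459 ≈ 249,000:

```python
def rel_gap(incumbent, best_bound):
    # Relative MIP optimality gap, as commonly reported by solvers.
    return (incumbent - best_bound) / abs(incumbent)

# Table 13: incumbent 453,459 with a 45% gap -> bound near 453,459 * 0.55
bound_13 = 453459 * (1 - 0.45)
```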
Table 2. Results for general tables.

Instance        B&C         BCD         Time limit (s)
hier13×13×13d   416,604     477,024     350
hier16          5.47×10^8   6.94×10^8   7200
ninenew         9.35×10^8   5.90×10^8   7200
nine12          1.41×10^9   9.32×10^8   7200
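Reading Table 2 quantitatively (my arithmetic on the reported values): for the two largest general instances, random-BCD improves on the branch-and-cut incumbent by roughly a third within the same time limit.

```python
# Objective values from Table 2: (B&C, random-BCD)
results = {"ninenew": (9.35e8, 5.90e8), "nine12": (1.41e9, 9.32e8)}

# Relative improvement of BCD over B&C; about 37% for ninenew, 34% for nine12
improvements = {name: 1 - bcd / bc for name, (bc, bcd) in results.items()}
```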
With larger instances, the advantage of BCD over a standard branch-and-cut approach is even more pronounced: the latter often gets stuck (at least within the three-hour time limit considered), whereas the objective of the BCD sequences tends to decrease. The branch-and-cut sequence for table 26 is not shown in Fig. 5, since the time limit was reached with a (bad) solution of objective equal to 5.33×10^9. The poor initial point obtained by the CPLEX strategy may be at the root of the weak performance of the standard branch-and-cut: in Fig. 6, only instances 31 and 33 obtain a solution of the same order of magnitude as that of BCD. The SAT proposal is clearly superior in all cases, and the evolution of BCD is better than the branch-and-cut sequence, which usually does not progress within the time window considered.
It is also observed, for instances of any size, that random-BCD and tree-BCD perform very similarly, the former being slightly better. This was not expected a priori, since tree-BCD was supposed to take advantage of the tree structure of the tables.
Table 2 shows the results obtained for the four complex general tables of Table 1. Columns "B&C" and "BCD" report the objective function value reached within the CPU time limit shown in column "Time limit" for, respectively, the CPLEX branch-and-cut and random-BCD. From these results, it cannot be concluded that BCD is always competitive against a state-of-the-art branch-and-cut for general complex tables. This is partly explained by the complexity of the interrelations between the table cells, which result in highly coupled clusters of cells in the BCD approach. However, in the two significantly largest instances (ninenew and nine12) BCD returned a much better solution within the time limit.
It is worth noting that the above results were obtained with the default options of the CPLEX MILP solver, both for the standard branch-and-cut and the BCD approaches. To test whether the performance of CPLEX could be improved by tuning some of its MILP parameters, the group of largest tables was solved again with the following changes: first, parameter MIP emphasis set to "optimality" (i.e., emphasize optimality over feasibility); second, MIP emphasis set to "bestbound" (i.e., greater emphasis on optimality through moving the best bound); third, parameter FPHEUR (feasibility pump [8]) activated. The random-BCD method was also executed with both choices of the MIP emphasis parameter. It can be concluded that the default parameters of CPLEX performed better, since worse objective functions were obtained with the manual tuning. BCD always obtained better solutions than the branch-and-cut with manual tuning, though not as good as those with the default parameters. As for the feasibility pump heuristic, it did not outperform the automatic CPLEX strategy for initial points (which may select the feasibility pump, if necessary); in any case, points obtained with either the automatic or the feasibility pump procedures were always poorer than those computed by the SAT approach.
5. Conclusions
From our experience with real-world instances, it can be stated that CTA problems can be extremely difficult for large and complex tables, even for state-of-the-art MILP solvers.
The BCD approach presented and tested in this work was able to obtain good solutions within a CPU time limit of 2 or 3 h for large (up to 1,200,000 cells, 100,000 of them sensitive) 1H2D tables. Moreover, the solution obtained was comparable to, and usually better than, the incumbent provided by state-of-the-art branch-and-cut solvers within the same time limit, even if the initial point found by the SAT procedure is used as a warm start for the branch-and-cut solver. Even after tuning some MILP parameters to improve the CPLEX performance, the BCD approach consistently returned the best solutions.
For general tables (with unknown internal structure), BCD did not outperform branch-and-cut for all the tables tested, but it did for the largest ones. We are thus optimistic about the possibilities of the method for even more difficult real-world tables (with a higher number of cells and sensitive cells). This is partly supported by the better observed behavior of random-BCD compared with tree-BCD: random-BCD can be immediately applied to more general (other than 1H2D) classes of tables, without the need to exploit the particular internal structure of the table relations.
The (increasing) ability of NSAs to create huge and ever more complex tables from collected data is an incentive to develop powerful tools for CTA. Among them we find Benders' reformulation of CTA [2]; some preliminary testing with a prototype showed that the approach is efficient for two-dimensional tables [7], but deeper cuts are needed for more complex tables. Other data protection approaches, like interval protection (which results in a massive linear programming problem) and its efficient solution by structured interior-point methods, are also among the remaining tasks to be addressed in this challenging field.
Acknowledgments
The authors thank Daniel Baena (from the NSA of Catalonia) for generating the tables used in the computational results. This work has been supported by grants MTM2009-08747 of the Spanish Ministry of Science and Innovation, and SGR-2009-1122 of the Government of Catalonia.
References

[1] Alonso-Ayuso A, Escudero LF, Ortuño MT. BFC, a branch-and-fix coordination algorithmic framework for solving some types of stochastic pure and mixed 0–1 programs. European Journal of Operational Research 2003;151:503–19.
[2] Benders JF. Partitioning procedures for solving mixed-variables programming problems. Computational Management Science 2005;2:3–19 [English translation of the original paper that appeared in Numerische Mathematik 4 (1962) 238–52].
[3] Bertsekas DP. Nonlinear programming. 2nd ed. Belmont: Athena Scientific; 1999.
[4] Castro J. Minimum-distance controlled perturbation methods for large-scale tabular data protection. European Journal of Operational Research 2006;171:39–52.
[5] Castro J. A shortest paths heuristic for statistical disclosure control in positive tables. INFORMS Journal on Computing 2007;19:520–33.
[6] Castro J, Baena D. Automatic structure detection in constraints of tabular data. In: Lecture notes in computer science, vol. 4302; 2006. p. 12–24.
[7] Castro J, Baena D. Using a mathematical programming modeling language for optimal CTA. In: Lecture notes in computer science, vol. 5262; 2008. p. 1–12.
[8] Chinneck JW. Feasibility and infeasibility in optimization. New York: Springer; 2008.
[9] Conejo AJ, Castillo E, Mínguez R, García-Bertrand R. Decomposition techniques in mathematical programming: engineering and science applications. Berlin: Springer; 2006.
[10] Dandekar RA. Energy Information Administration, Department of Energy, USA; 2003, personal communication.
[11] Dandekar RA, Cox LH. Synthetic tabular data: an alternative to complementary cell suppression, manuscript. Energy Information Administration, U.S. Department of Energy, available from the first author on request ([email protected]); 2002.
[12] Dillenberger Ch, Escudero LF, Wollensak A, Zhang W. On practical resource allocation for production planning and scheduling with period overlapping setups. European Journal of Operational Research 1994;75:275–86.
[13] Domingo-Ferrer J, Franconi L, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 4302. Berlin: Springer; 2006.
[14] Domingo-Ferrer J, Magkos E, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 6344. Berlin: Springer; 2010.
[15] Domingo-Ferrer J, Saygin Y, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 5262. Berlin: Springer; 2008.
[16] Domingo-Ferrer J, Torra V. A critique of the sensitivity rules usually employed for statistical table protection. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2002;10:545–56.
[17] Domingo-Ferrer J, Torra V, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 3050. Berlin: Springer; 2004.
[18] Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical data protection. Journal of Computational and Applied Mathematics 2004;164–165:285–93.
[19] Eén N, Sörensson N. An extensible SAT-solver. In: Proceedings of the sixth international conference on theory and applications of satisfiability testing; 2003.
[20] Giessing S, Höhne J. Eliminating small cells from census counts tables: some considerations on transition probabilities. In: Lecture notes in computer science, vol. 6344; 2010. p. 52–65.
[21] Giessing S, Hundepool A, Castro J. Rounding methods for protecting EU-aggregates. In: Worksession on statistical data confidentiality. Eurostat methodologies and working papers. Luxembourg: Eurostat-Office for Official Publications of the European Communities; 2009. p. 255–64.
[22] Glover F, Cox LH, Kelly JP, Patil R. Exact, heuristic and metaheuristic methods for confidentiality protection by controlled tabular adjustment. International Journal of Operations Research 2008;5:117–28.
[23] Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Lenz R, Naylor J, et al. Handbook on statistical disclosure control. Network of excellence in the European statistical system in the field of statistical disclosure control. Available online at http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf; 2010.
[24] Kelly JP, Golden BL, Assad AA. Cell suppression: disclosure protection for sensitive tabular data. Networks 1992;22:28–55.
[25] Marques-Silva JP, Sakallah KA. GRASP: a search algorithm for propositional satisfiability. IEEE Transactions on Computers 1999;48:506–21.
[26] Plazas MA. Multistage stochastic model for bidding in electrical markets (in Spanish). PhD thesis, Universidad de Castilla-La Mancha; 2006.
[27] Willenborg L, de Waal T. Elements of statistical disclosure control. Lecture notes in statistics, vol. 155. New York: Springer; 2000.
[28] Zayatz L. U.S. Census Bureau, communication at the Joint UNECE/Eurostat Work Session on statistical data confidentiality, Bilbao (Basque Country, Spain); 2009.
[29] Zhang H. SATO: an efficient propositional prover. In: Proceedings of the international conference on automated deduction; July 1997. p. 272–5.