Computers & Operations Research 38 (2011) 1826–1835
doi:10.1016/j.cor.2011.02.008
A heuristic block coordinate descent approach for controlled tabular adjustment
Jose A. Gonzalez, Jordi Castro
Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, c. Jordi Girona 1–3, 08034 Barcelona, Catalonia, Spain
ARTICLE INFO
Available online 19 February 2011
Keywords:
Statistical confidentiality
Statistical disclosure control
Controlled tabular adjustment
Mixed integer linear programming
Heuristics
Block coordinate descent
Decomposition techniques
ABSTRACT
One of the main concerns of national statistical agencies (NSAs) is to publish tabular data. NSAs have to
guarantee that no private information from specific respondents can be disclosed from the released
tables. The purpose of the statistical disclosure control field is to avoid such a leak of private
information. Most protection techniques for tabular data rely on the formulation of a large mathematical programming problem, whose solution is computationally expensive even for tables of moderate
size. One of the emerging techniques in this field is controlled tabular adjustment (CTA). Although CTA
is more efficient than other protection methods, the resulting mixed integer linear problems (MILP) are
still challenging. In this work a heuristic approach based on block coordinate descent decomposition is
designed and applied to large hierarchical and general CTA instances. This approach is compared with
CPLEX, a state-of-the-art MILP solver. Our results, from both synthetic and real tables with up to
1,200,000 cells, 100,000 of them being sensitive (resulting in MILP instances of up to 2,400,000
continuous variables, 100,000 binary variables, and 475,000 constraints) show that the heuristic block
coordinate descent has a better practical behavior than a state-of-the-art solver: for large hierarchical
instances it provides significantly better solutions within a specified realistic time limit, as required by
NSAs in real-world practice.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
National statistical agencies (NSAs) routinely disseminate both disaggregated (i.e., microdata or microfiles) and aggregated (i.e., tabular data) information. Tables are generated by crossing two or more categorical variables of a particular microfile (e.g., a census), which results in sets of tables, usually with a large number of cells. NSAs are obliged by law to guarantee that no particular information from any respondent can be disclosed from the released information. The goal of the statistical disclosure control field is to protect such sensitive information [18]. In this work we focus on tabular data. Some of the state-of-the-art research in this field can be found in the recent monographs [13–15,17,27].
Although cell tables show aggregated data for several respondents, there is a risk of disclosing individual information. The example of Fig. 1 shows a simple case. The left table (a) reports the overall net profit of companies by number of employees (row variable) and region (column variable), while table (b) provides the number of companies. If there were only one company with 51–99 employees in region r2, then any external attacker would
know the net profit of this company. For two companies, either one of them could disclose the other's net profit, becoming an internal attacker. Even if the number of companies in this cell was larger, but one of them could obtain a good estimator of another's net profit (for instance, by subtracting its own contribution from the cell value), the cell would be considered unsafe. These unsafe cells which require protection are named sensitive cells. Rules for detecting sensitive cells are beyond the scope of this work (see, e.g., [16,23] for a recent discussion).
There are several available techniques for the protection of tabular data. The most widely used nonperturbative method (i.e., one which does not change cell values) is named the cell suppression problem (CSP) [5,24]. NSAs have been considering CSP as one of the preferred options. Recently, a new perturbative approach (i.e., one in which cell values are slightly adjusted or modified) named controlled tabular adjustment (CTA) was introduced [4,11]. Although it is a recent approach, CTA is gaining recognition among NSAs. For instance, CTA appears in [23] among the list of currently available techniques for tabular data, and it is considered one of the emerging methods. In addition, we recently developed in collaboration with the NSAs of Germany and the Netherlands [21] a package for CTA in the scope of two projects funded by Eurostat, the Statistical Office of the European Communities. The goal of those projects was the safe dissemination of European business and animal production statistics by
(a)               r1       r2          (b)               r1    r2
51–99          380 M€    35 M€         51–99             20    1 or 2
100–199        700 M€   800 M€         100–199           30    35

Fig. 1. Example of disclosure in tabular data. (a) Net profit per number of employees and region. (b) Number of companies per number of employees and region. If there is only one company with 51–99 employees in region r2, then any attacker knows the net profit of this company. For two companies, any of them can deduce the other's net profit, becoming an internal attacker.
Adjusted table,                        Adjusted table,
lower protection direction             upper protection direction
                  r1       r2                            r1       r2
51–99          385 M€    30 M€         51–99          375 M€    40 M€
100–199        695 M€   805 M€         100–199        705 M€   795 M€

Fig. 2. Protected (or adjusted) tables with "lower protection direction" and "upper protection direction", considering lower and upper protection levels of five (i.e., the value of the sensitive cell is, respectively, reduced and increased by five units) for the original table of Fig. 1.
Eurostat, and CTA was used within the overall protection scheme. CTA is thus a method of actual practical relevance. It is worth noting that in recent specialized workshops on statistical disclosure control, relevant members of NSAs stated that perturbative methods, like CTA, are gaining acceptance [28], and some perturbative approaches are being used for the protection of national census tables (e.g., for Germany [20]).
Fig. 2 illustrates CTA on the small two-dimensional table of Fig. 1, considering cell (51–99, r2) as the only sensitive cell, with lower and upper protection levels of 5. Note that protection levels are either directly obtained from the rules that provide the set of sensitive cells or just computed as a percentage of the cell value. Depending on the "protection direction" of the sensitive cell, either "downward" or "upward", which has to be decided, the value to be published for this cell will be, respectively, less than or equal to the original cell value minus the lower protection level, or greater than or equal to the original cell value plus the upper protection level. In the example of Fig. 2, if the protection direction is "down", then the value published for the sensitive cell should be less than or equal to 30; a possible adjusted table for this case is shown in the left table of Fig. 2. If the protection direction is "up", then the value must be greater than or equal to 40, as shown in the right table of Fig. 2. Note that the remaining cells have to be adjusted to preserve the value of marginal cells. CTA aims at finding the protection direction of each sensitive cell (binary decisions), and the adjustments to be made to each cell (continuous variables), preserving the value of total cells (or some subset of particular cells, possibly empty), and minimizing the distance between the original and the released table.
Although CTA formulates mixed integer linear problems (MILPs) with a number of variables and constraints that is linear in the size of the table, and thus its solution can be attempted with generic state-of-the-art solvers, real-world instances are challenging and may require many hours of execution for an optimal solution (or quasi-optimal solution, e.g., with optimality gaps below 5%). In practice, however, NSAs do not require such optimal solutions, and good, feasible approximate solutions are enough. Indeed, a big effort to reach an optimal solution may not make sense, since some of the parameters of the CTA model are estimated by NSAs (as, e.g., the amount of protection to be provided to a certain table). Moreover, in practice tabular data protection is the last stage of the data cycle, and, in an attempt to meet publication deadlines, NSAs require methods that find fast solutions to protect large tables [10]. This work presents a heuristic procedure based on block coordinate descent (BCD) [3,9] for the solution of CTA formulated as a MILP. As will be shown, given a practical time limit (i.e., from some minutes to 1 h), for some kinds of tables, namely the hierarchical tables discussed below, which are of great practical interest for NSAs, BCD consistently provided better solutions than a general state-of-the-art solver such as CPLEX. For general tables, although the benefits of BCD were not so significant, it was also the most efficient approach for the largest, real-world instances.
As far as we know, this is the first heuristic approach for real-world large CTA instances. A previous work [22] applied some metaheuristics and learning heuristics only to small tables (two-dimensional tables of up to 625 cells), while we consider more complex hierarchical tables of about 1,200,000 cells (100,000 of them sensitive). We also tried other general metaheuristics, such as genetic algorithms, without success. Indeed, combinations or slight modifications of solutions are not expected to satisfy the large number of linear equations with no particular structure of CTA. Therefore, for the purpose of guaranteeing this large number of linear constraints, any practical heuristic should rely on MILP technology, as BCD does.
Briefly, BCD is a simple strategy that decomposes the problem in blocks (or clusters) of variables, and sequentially solves the MILP subproblem for each block. Therefore, in practice BCD behaves best when the MILP variables may be clustered, and the clusters are loosely coupled. For this reason in this work we mainly focus on hierarchical tables. Hierarchical tables are obtained by crossing a number of categorical variables, some of which have a hierarchical structure, i.e., some tables are subtables of other tables. Hierarchical tables are of very practical interest for NSAs. The simplest hierarchical table is obtained by crossing two categorical variables, one of them being hierarchical; this particular case is known as 1H2D tables. Fig. 3 illustrates a small 1H2D table: rows R1 and R3 (related, e.g., to state-level information) are decomposed into two subtables (whose rows are, for example, related to region-level information); rows of
[Fig. 3 tables: a parent table with rows R1–R3 and columns C1–C3, and the two subtables into which rows R1 and R3 split (rows R11–R13 and R31–R33, with the same columns).]

Fig. 3. Example of 1H2D table: the row factor in the left table splits into one subtable per row (only two are shown). The number of rows may be different at each subtable, but the number of columns is the same as in the parent table.
these two subtables could be subsequently decomposed (e.g., for city-level information). 1H2D tables can be viewed as a tree of subtables. Note, however, that BCD is not tailored to hierarchical tables. Indeed, in this work it was also applied to general complex tables, though the most successful results were obtained for hierarchical tables. This is partly explained by the inherent block structure of hierarchical tables. Some attempts at extracting the block structure of general tables were made [6] but, in general, the resulting blocks were far too coupled for complex instances. Note that, in practice, tables (either 1H2D or general) are represented by a set of linear equations, each of them imposing that the sum of some set of cells equals a given marginal cell. This model will be used in the CTA formulation of the next section.
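As a concrete illustration of this representation (the indexing scheme below is our own choice, not the paper's), the linear relations of a plain two-dimensional table with row, column and grand totals can be generated as follows:

```python
def table_relations(nrows, ncols):
    """Relations Aa = 0 of a 2D table with row totals, column totals and a
    grand total.  Internal cells are indexed 0..nrows*ncols-1 (row-major),
    followed by the row-total cells, the column-total cells and the
    grand-total cell.  Each relation maps cell index -> coefficient, with
    +1 on the summed cells and a single -1 on the marginal cell."""
    def cell(i, j):
        return i * ncols + j
    def row_tot(i):
        return nrows * ncols + i
    def col_tot(j):
        return nrows * ncols + nrows + j
    grand = nrows * ncols + nrows + ncols
    rels = []
    for i in range(nrows):                      # each row sums to its row total
        r = {cell(i, j): 1 for j in range(ncols)}
        r[row_tot(i)] = -1
        rels.append(r)
    for j in range(ncols):                      # each column sums to its column total
        r = {cell(i, j): 1 for i in range(nrows)}
        r[col_tot(j)] = -1
        rels.append(r)
    r = {row_tot(i): 1 for i in range(nrows)}   # row totals sum to the grand total
    r[grand] = -1
    rels.append(r)
    return rels
```

Each relation is one row of the matrix A of the next section: coefficients of 1 and a single −1 on the marginal, exactly the structure the paper describes.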
There are other approaches similar to BCD, namely fix-and-relax and branch-and-fix, which have been used in other large MILPs arising, respectively, in scheduling and stochastic programming problems [1,12]. Like BCD, fix-and-relax decomposes the problem in $k$ blocks of binary variables. It iteratively optimizes each block $i$ of binary variables, fixing the variables of blocks $j < i$, and relaxing the variables of blocks $j > i$, unlike BCD, which also fixes the variables of blocks $j > i$. Moreover, fix-and-relax performs one single iteration, whereas BCD keeps on iterating while the solution is improved. Branch-and-fix is a significantly different procedure, but it has in common with the other approaches the idea of decomposing a large MILP in smaller subproblems, optimizing all of them, in a coordinated way, by fixing some common binary variables. Full details can be found in [1].
The structure of the paper is as follows. Section 2 outlines the general MILP formulation of CTA. Section 3 introduces the BCD procedure for CTA, and describes an approach based on the Boolean satisfiability problem (SAT for short, hereafter) for obtaining good initial feasible solutions. Section 4 provides some implementation details and reports the computational results using a set of synthetic and real-world tables.
2. Formulation of CTA as a MILP
Any instance of CTA, either with one table or a number of tables, can be represented by the following elements:
- An array of cells $a_i$, $i = 1,\dots,n$, satisfying a set of $m$ linear relations $Aa = b$, $a \in \mathbb{R}^n$ being the vector of $a_i$'s, $b \in \mathbb{R}^m$ the right-hand-side term of the linear relations (usually $b = 0$), and $A \in \mathbb{R}^{m \times n}$ the cell relations matrix. Note that the rows of $A$ are made of 1's, 0's and a single $-1$ (associated to the marginal or total cell value).
- A vector $w \in \mathbb{R}^n$ of positive weights for the deviations of cell values, used in the definition of the objective function.
- Lower and upper bounds for each cell $i = 1,\dots,n$, respectively $l_{x_i}$ and $u_{x_i}$, which are considered to be known by any data attacker. If no previous knowledge is assumed for cell $i$, then $l_{x_i} = 0$ ($l_{x_i} = -\infty$ if $a \ge 0$ is not required) and $u_{x_i} = +\infty$ can be used.
- A set $\mathcal{S} = \{i_1, i_2, \dots, i_s\} \subseteq \{1,\dots,n\}$ of indices of the $s$ sensitive or confidential cells.
- Lower and upper protection levels for each confidential cell $i \in \mathcal{S}$, respectively $lpl_i$ and $upl_i$, such that the released values $x \in \mathbb{R}^n$ must satisfy either $x_i \ge a_i + upl_i$ or $x_i \le a_i - lpl_i$.
CTA attempts to find the closest safe values $x$, according to some distance $L$, which make the released table safe. This involves the solution of the following optimization problem:

$\min_x \ \|x - a\|_L$
subject to $Ax = b$
$l_x \le x \le u_x$
$x_i \le a_i - lpl_i$ or $x_i \ge a_i + upl_i$, $\quad i \in \mathcal{S}$.  (1)
Problem (1) can also be formulated in terms of deviations from the current cell values. Defining

$z = x - a$, $\quad l_z = l_x - a$, $\quad u_z = u_x - a$,  (2)

and using the $L_1$ distance weighted by $w$, (1) can be recast as

$\min_z \ \sum_{i=1}^{n} w_i |z_i|$
subject to $Az = 0$
$l_z \le z \le u_z$
$z_i \le -lpl_i$ or $z_i \ge upl_i$, $\quad i \in \mathcal{S}$,  (3)

$z \in \mathbb{R}^n$ being the vector of deviations. Since $w > 0$, introducing variables $z^+, z^- \in \mathbb{R}^n$ so that $z = z^+ - z^-$, the absolute values may be written as $|z| = z^+ + z^-$. Considering a vector of binary variables $y \in \{0,1\}^s$ for the "or" constraints, problem (3) is finally written as an MILP:
$\min_{z^+, z^-, y} \ \sum_{i=1}^{n} w_i (z_i^+ + z_i^-)$  (4a)
subject to $A(z^+ - z^-) = 0$  (4b)
$0 \le z^+ \le u_z$, $\quad 0 \le z^- \le -l_z$  (4c)
$y \in \{0,1\}^s$  (4d)
$upl_i\, y_i \le z_i^+ \le u_{z_i}\, y_i$ and $lpl_i (1 - y_i) \le z_i^- \le -l_{z_i} (1 - y_i)$, $\quad i \in \mathcal{S}$.  (4e)

When $y_i = 1$ the constraints mean $upl_i \le z_i^+ \le u_{z_i}$ and $z_i^- = 0$, thus the protection direction is "upward"; when $y_i = 0$ we get $z_i^+ = 0$ and $lpl_i \le z_i^- \le -l_{z_i}$, thus the protection direction is "downward".
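To make the formulation concrete, the sketch below solves the tiny instance of Figs. 1 and 2 by brute force; the encoding (unit weights, all margins preserved, protection levels $lpl = upl = 5$ for the single sensitive cell) is our own reading of the example, not code from the paper.

```python
def solve_toy_cta(lpl=5, upl=5, search=20):
    """Minimize the weighted L1 deviation subject to Az = 0 and the
    disjunctive protection constraint of the sensitive cell (51-99, r2)."""
    best = None
    # Preserving every row and column total of a 2x2 table forces the
    # deviation pattern (-d, d, d, -d), where d is the deviation of the
    # sensitive cell; so a one-dimensional scan over d is exhaustive.
    for d in range(-search, search + 1):
        if not (d <= -lpl or d >= upl):   # "or" protection constraint
            continue
        z = (-d, d, d, -d)                # (z11, z12, z21, z22)
        cost = sum(abs(v) for v in z)     # L1 distance with unit weights
        if best is None or cost < best[0]:
            best = (cost, z)
    return best
```

For $lpl = upl = 5$ the minimum cost is 20, with the sensitive cell moved by 5 and every other cell compensating by 5: exactly the adjusted tables of Fig. 2.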
3. The heuristic block coordinate descent approach for CTA
Block coordinate descent (BCD) solves a sequence of subproblems, each of them optimizing the objective function over a subset of variables while the remaining variables are kept fixed. This is iteratively repeated until no improvement in the objective function is achieved, e.g., the difference between some (for instance, two) consecutive objective functions is less than a specified optimality tolerance. Convergence of this algorithm is only guaranteed for convex problems where each optimization subproblem has a unique optimizer [3, Prop. 2.7.1] (note that strict convexity satisfies this requirement). Although MILP problems are not convex, and thus convergence is not guaranteed, BCD usually behaves properly in
practical complex applications [9,26], and it can be used as a heuristic approach.
For the particular case of CTA, if the number of sensitive cells $s$ is very large, optimal solutions may require computationally prohibitive executions. BCD may provide good approximate solutions by optimizing at each iteration the protection directions (either "downward" or "upward") of a subset of sensitive cells, and the deviations for all the cells. The protection directions of the remaining sensitive cells are kept constant at the optimal values of previous iterations. Note that the continuous variables of the problem (the deviations for all the cells) are never fixed; blocking and fixing is only performed for the binary variables, which is a significant difference with the standard BCD method. Partitioning the binary variables $y$ of (4a)–(4e) into $k$ blocks, and denoting $y^{j,i}$ as the fixed values of block $j$ at inner iteration $i$, the algorithm is roughly as follows:
Step 0 (Initialization): Set the outer iteration counter t = 0. Set initial values, hopefully feasible, to y.
Step 1: t = t + 1. Set the inner iteration counter i = 0. Divide y into k blocks, $y = \{y^{1,i}, \dots, y^{k,i}\}$, not necessarily of the same size.
  Step 1.1: i := i + 1. Solve (4a)–(4e) with respect to block $y^{i,i}$, taking into account that $y^{j,i}$ is fixed for $j \ne i$. Let $y^{i,i+1} = (y^{i,i})^*$ (the point at the optimum). Let $y^{j,i+1} = y^{j,i}$ for $j \ne i$.
  Step 1.2: If i < k, go to Step 1.1.
Step 2: Check the end conditions: if they apply, stop and return the current best solution. Otherwise, go to Step 1.

Note that the original problem (4a)–(4e) is solved if only one block of variables is considered. Therefore, although BCD is a heuristic approach for MILP problems, it is easily switched to an optimal approach by setting k = 1 at Step 1 for some advanced t. The subproblems of Step 1.1 may be solved by any MILP method; we used the branch-and-cut solver of CPLEX.
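The steps above can be sketched in a minimal, self-contained form (this is not the authors' C++ implementation: the MILP subproblem of Step 1.1 is replaced by exhaustive enumeration of one binary block, which is enough to show the fix-and-reoptimize mechanics):

```python
from itertools import product

def block_coordinate_descent(f, n, k, y0, max_outer=20):
    """Heuristically minimize f over y in {0,1}^n using k blocks."""
    y = list(y0)
    # Simple interleaved partition into k blocks (a stand-in for the
    # random-BCD / tree-BCD partitions discussed in the text).
    blocks = [list(range(j, n, k)) for j in range(k)]
    best = f(y)
    for _ in range(max_outer):               # outer iterations (t)
        prev = best
        for block in blocks:                 # inner iterations (i)
            # "Solve" the subproblem: enumerate this block, others fixed.
            for assign in product((0, 1), repeat=len(block)):
                trial = y[:]
                for idx, val in zip(block, assign):
                    trial[idx] = val
                cost = f(trial)
                if cost < best:              # keep the block's optimizer
                    best, y = cost, trial
        if best >= prev:                     # no improvement: stop
            break
    return y, best
```

On a separable objective (e.g., Hamming distance to a target pattern) this loop reaches the optimum in one outer pass; on coupled objectives it is only a heuristic, as the text notes.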
Different strategies can be considered for the division of the variables y between consecutive major iterations, e.g., reversing the order of blocks, changing the blocks, etc. Using a true partition of the sensitive cells means that each cell has just one chance of being "upper" or "lower" protected at each major iteration. However, this is not strictly needed, and in practice some sensitive cells could belong to more than one block. In this case the decision variables associated to these cells would be determined more than once for some t.
Two strategies have been tested, which can be viewed as a framework whose implementation would admit further possibilities. The first strategy (named random-BCD) divides $\mathcal{S}$ randomly into a number of blocks, keeping their sizes as similar as possible. The partition is obtained by shuffling the variables such that different blocks are considered at major iterations.
The second strategy (that will be referred to as tree-BCD) is tailored for 1H2D hierarchical tables, exploiting the tree structure
[Fig. 4 diagram: a tree of subtables with root 0 at level 1, children 0.1 and 0.2 at level 2, and grandchildren 0.1.1, 0.1.2, 0.2.1 and 0.2.2 at level 3.]

Fig. 4. Example of the tree structure of a 1H2D table, variables being partitioned by levels. Boxes refer to the set of sensitive variables in a subtable. Groups of boxes closed by dashed or dotted lines form a block of variables to be optimized together in the first major iteration.
of these kinds of tables. In the first major iteration, the sensitive cells are partitioned according to their level: the first block is composed of all the variables of cells in the main table (level 1, or table 0, as seen in Fig. 4); then, the second block takes the variables in the next level (0.1, 0.2, and so on); the third block considers all the variables in level 3 (0.1.1, 0.1.2, etc.), and so forth until the deepest level. Once the first major iteration is finished, the second one builds overlapping blocks of cells: the first block now includes level 1 and level 2; the second one, level 2 and level 3, etc. A third major iteration makes blocks from three consecutive levels. The last major iteration would include all the levels in one block, as a pure CTA problem; if the time limit has not been reached yet, it could benefit from a warm-start from previous solutions.
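The block construction just described can be sketched as follows (a simplification of tree-BCD under our own assumptions: sensitive variables are given as one flat list per hierarchy level, and major iteration t takes sliding windows of t consecutive levels):

```python
def level_blocks(levels, t):
    """Blocks of sensitive variables for major iteration t of tree-BCD:
    sliding windows of t consecutive hierarchy levels (levels[0] holds the
    root table's variables).  When t reaches the number of levels, a single
    block remains, i.e., the full CTA problem."""
    if t >= len(levels):
        return [[v for lvl in levels for v in lvl]]
    return [[v for lvl in levels[j:j + t] for v in lvl]
            for j in range(len(levels) - t + 1)]
```

For the three-level tree of Fig. 4, iteration 1 yields one block per level, iteration 2 yields the overlapping blocks {level 1, level 2} and {level 2, level 3}, and iteration 3 collapses everything into one block.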
The previous breadth-first strategy was compared with a depth-first strategy, which was also implemented. In the latter, blocks were formed with all the subtables from the root to a leaf, taking only one subtable per level. Note that the derived blocks are significantly overlapped. This strategy was not satisfactory, and its results are not reported in the computational results of Section 4.
One of the main drawbacks of BCD is that dual information for the whole problem (4a)–(4e) is not obtained, and thus the stopping rule only focuses on improvements between consecutive iterations. Note, however, that CTA is a minimum distance problem (1) and that zero is a readily available lower bound. Another main drawback of BCD is that it may not be able to obtain a feasible solution, unless an initial set of feasible "up" or "down" protection directions for y is set at Step 0. For complex and large instances, looking for an initial feasible pattern of protection directions for y is a hard problem (theoretically, as hard as finding the optimal solution). The next subsection describes a heuristic strategy that in practice was very useful. Indeed, as will be noted in the computational results of Section 4, this heuristic for CTA instances provided better initial solutions than the several strategies tested with CPLEX.
3.1. Finding a feasible starting point. The SAT method
There are several general approaches for finding an initial point in MILP problems (e.g., variable and node selection, feasibility pump, etc.). Many of them can be found in [8], and are implemented in state-of-the-art solvers. An obvious approach for starting BCD with a feasible point would be to run any of those solvers until the first feasible solution for constraints (4a)–(4e) is found. However, this approach would not exploit the particular structure of CTA. To avoid this lack of exploitation, a particular strategy was developed, which has proven to be successful in most instances. Briefly, this approach consists of two phases: first, the set of constraints Az = 0 is scanned to locate those involving sensitive cells, and each constraint of this subset is analyzed to identify possible infeasible combinations for the protection directions y of sensitive cells; and second, all the forbidden combinations are
compiled together, and an assignment is sought so that none of these combinations is present. For example, assuming cell values are nonnegative, if one of the original table relations is

$1_2 + 3_4 + 4_5 + 12_7 = 20_{10}$,

where the subindex denotes the cell index, and the lower and upper protection levels of sensitive cells 4 and 7 are, respectively, $lpl_4 = upl_4 = 2$ and $lpl_7 = upl_7 = 4$, then $y_4 = y_7 = 1$ (i.e., protection direction "up" for sensitive cells 4 and 7) is a forbidden solution. Indeed, note that in the protected table this relation would then be

$x_2 + 5 + x_5 + 16 = 20$,

which is infeasible since $(x_2, x_5) \ge 0$. The above two phases are outlined below.
The first phase exploits the particular structure of the table relations. Since a table can be described through linear combinations of the cell contents, after rearranging terms, constraint $j$ of $Az = 0$ of (3) can be recast as

$\sum_{i \in I_j} m_{ij} z_i = \sum_{i \in I'_j} -m_{ij} z_i$,  (5)

where $I_j \subseteq \mathcal{S}$ and $I'_j \subseteq \{1,\dots,n\} \setminus \mathcal{S}$ are, respectively, the sets of sensitive and nonsensitive cells in constraint $j$, and the coefficients $m_{ij}$ are elements of the matrix $A$, either 1 or $-1$. Next, bounds are computed for each side of (5). The right-hand side does not depend on $y$, and therefore its bounds are

$\sum_{i \in I'_j:\, m_{ij} > 0} -m_{ij}(u_{x_i} - a_i) \;+\; \sum_{i \in I'_j:\, m_{ij} < 0} m_{ij}(a_i - l_{x_i}) \;\le\; \sum_{i \in I'_j} -m_{ij} z_i \;\le\; \sum_{i \in I'_j:\, m_{ij} > 0} m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I'_j:\, m_{ij} < 0} -m_{ij}(u_{x_i} - a_i)$.  (6)
For some assignment of $y$, the bounds of the left-hand side are

$\sum_{i \in I_j:\, y_i = 0,\, m_{ij} > 0} -m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} < 0} -m_{ij}\, lpl_i \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} > 0} m_{ij}\, upl_i \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} < 0} m_{ij}(u_{x_i} - a_i) \;\le\; \sum_{i \in I_j} m_{ij} z_i \;\le\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} > 0} -m_{ij}\, lpl_i \;+\; \sum_{i \in I_j:\, y_i = 0,\, m_{ij} < 0} -m_{ij}(a_i - l_{x_i}) \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} > 0} m_{ij}(u_{x_i} - a_i) \;+\; \sum_{i \in I_j:\, y_i = 1,\, m_{ij} < 0} m_{ij}\, upl_i$.  (7)
The procedure consists of searching for any y such that the bounds of both sides mismatch (i.e., either the lower bound of the left-hand side is greater than the upper bound of the right-hand side, or the upper bound of the left-hand side is less than the lower bound of the right-hand side).
Provided that the number of sensitive variables in each constraint, $|I_j|$, is small (say, less than 20), exhaustive search could be used to find every infeasible combination. However, the cost is prohibitive even for moderate $|I_j|$, so we have developed a simple but very efficient method to detect the infeasible combinations in a constraint. At first, the procedure determines which combination of y gives the highest value of the lower bound of (7). If this value is greater than the upper bound of (6), one infeasible combination has been found, and others may be around. Therefore, starting from the first one, we go through each combination that arises by changing only one direction and that has not been visited before. The combination is pruned, and hence discarded, if the resulting lower bound of (7) falls below the upper bound of (6); otherwise it is included in a list of combinations to be explored, until all of them have been pruned. The same process is repeated for combinations under the lower bound of (6). Usually only a few combinations appear at each constraint, so the search is very efficient, whatever the number of sensitive cells in it.
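The bound test of (6) and (7) is easy to state in code. The sketch below uses plain enumeration of all $2^{|I_j|}$ direction assignments rather than the pruned search described above, and it encodes the worked example under the assumption that the marginal cell is held fixed (so the nonsensitive side contains only cells 2 and 5):

```python
import math
from itertools import product

INF = math.inf

def lhs_bounds(sens, y):
    """Bounds (7) of the sensitive side for a fixed direction vector y.
    sens maps cell -> (m, lpl, upl, lz, uz)."""
    lo = hi = 0.0
    for i, (m, lpl, upl, lz, uz) in sens.items():
        if y[i] == 1:                    # "up": z_i in [upl, uz]
            lo += m * (upl if m > 0 else uz)
            hi += m * (uz if m > 0 else upl)
        else:                            # "down": z_i in [lz, -lpl]
            lo += m * (lz if m > 0 else -lpl)
            hi += m * (-lpl if m > 0 else lz)
    return lo, hi

def rhs_bounds(nonsens):
    """Bounds (6) of the nonsensitive side, a sum of -m * z_i with
    z_i in [lz, uz].  nonsens maps cell -> (m, lz, uz)."""
    lo = hi = 0.0
    for m, lz, uz in nonsens.values():
        lo += -m * (uz if m > 0 else lz)
        hi += -m * (lz if m > 0 else uz)
    return lo, hi

def forbidden(sens, nonsens):
    """All direction assignments whose interval (7) cannot meet interval (6)."""
    rlo, rhi = rhs_bounds(nonsens)
    cells = sorted(sens)
    bad = []
    for bits in product((0, 1), repeat=len(cells)):
        y = dict(zip(cells, bits))
        llo, lhi = lhs_bounds(sens, y)
        if llo > rhi or lhi < rlo:       # intervals cannot intersect
            bad.append(y)
    return bad
```

On the worked example (cells 2, 4, 5, 7 with values 1, 3, 4, 12, nonnegative and unbounded above), the only assignment detected as forbidden is $y_4 = y_7 = 1$.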
In the second phase, all the infeasible combinations for y detected in the first phase are collected, and a solution that avoids all of them is sought. Note that avoiding all these infeasible combinations is a necessary but not a sufficient condition for an initial feasible point. However, in practice, the resulting solution was generally feasible. This problem is known as Boolean satisfiability (SAT), and it consists of determining a satisfying variable assignment for a Boolean function, or determining that no such assignment exists. The subject of practical SAT solvers has received considerable research attention, with many algorithms proposed and implemented, e.g., GRASP [25] and SATO [29]. The solver used in this work was MiniSAT [19], an open-source implementation of SAT techniques.
In general, a solver operates on problems specified in conjunctive normal form. This form consists of the logical "and" of one or more clauses (each related to one infeasible combination), which themselves consist of the logical "or" of one or more literals (related to the y variables). For instance, suppose that three combinations have been detected as infeasible:

(1) $y_1 = 1, y_2 = 0, y_3 = 1, y_4 = 1 \Rightarrow y_1 \wedge \neg y_2 \wedge y_3 \wedge y_4$
(2) $y_3 = 1, y_2 = 1, y_4 = 1 \Rightarrow y_3 \wedge y_2 \wedge y_4$
(3) $y_5 = 1, y_2 = 0, y_1 = 1 \Rightarrow y_5 \wedge \neg y_2 \wedge y_1$.

None of these combinations is desired, so we seek an assignment of y such that all of them are false. Equivalently, we can negate them to find an assignment satisfying (as true) all clauses:

$(\neg y_1 \vee y_2 \vee \neg y_3 \vee \neg y_4) \wedge (\neg y_3 \vee \neg y_2 \vee \neg y_4) \wedge (\neg y_5 \vee y_2 \vee \neg y_1)$.

This expression can be satisfied with $\neg y_1 \wedge \neg y_3$, or numerically $y_1 = 0$ and $y_3 = 0$. This condition is sufficient for the SAT problem, so the other variables can take any value.
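The translation from forbidden combinations to CNF clauses, and a toy satisfiability check, can be sketched as follows (a brute-force stand-in for MiniSAT, usable only for a handful of variables):

```python
from itertools import product

def forbidden_to_cnf(combos):
    """Negate each forbidden partial assignment into one CNF clause.
    combos: list of {var: 0/1}; a positive literal +v means v is true,
    a negative literal -v means v is false."""
    return [[-v if val == 1 else v for v, val in sorted(c.items())]
            for c in combos]

def brute_force_sat(clauses, nvars):
    """Try every assignment until one satisfies all clauses (or fail)."""
    for bits in product((0, 1), repeat=nvars):
        assign = {v: bits[v - 1] for v in range(1, nvars + 1)}
        if all(any((lit > 0) == bool(assign[abs(lit)]) for lit in cl)
               for cl in clauses):
            return assign
    return None
```

Feeding in the three infeasible combinations above yields exactly the three clauses of the text, and any satisfying assignment returned by the solver avoids every forbidden combination.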
Once a solution of the corresponding SAT problem is available, problem (4a)–(4e) with y fixed can be checked to determine whether a feasible solution for $z^+$ and $z^-$ exists. Though the assignment obtained for y does not guarantee the feasibility of the constraints without sensitive variables, it was usually found that SAT returned a feasible point, so that BCD was ready to start. In those few cases where the SAT assignment failed, other starting y values were found by a MILP solver, stopping at the first feasible point. The unsuccessful SAT values of y were used as a warm-start for the MILP solver.
4. Computational results
4.1. Implementation
The BCD approach, including the SAT heuristic for the binary initial point, was implemented in C++. The code allows the user to switch between random-BCD, tree-BCD (only the breadth-first strategy, the most efficient option), and the solution of CTA by branch-and-cut. Note that BCD subproblems are solved with this same branch-and-cut for a fair comparison. The SAT heuristic may be activated to provide an initial point for the branch-and-cut method as well, such that the benefits of either the SAT heuristic, or the BCD approach, or both of them together, can be isolated. The optimizer used was CPLEX version 11.
The BCD variants may be tuned with some parameters chosen by the user. The most significant parameters are the total time limit (also applicable to the branch-and-cut option); the time limit for each subproblem; the optimality gap for each subproblem; and the number of blocks (subproblems) to be considered. Note that a time limit is required by NSAs for data publication deadlines in the real world. The random-BCD variant is also
affected by the randomness of the block selection. After exploring the effect of these factors with several instances, it can be concluded that BCD is barely sensitive to both random variability and the user adjustments of parameters. No significant differences were found if the optimality gap of subproblems is set to values between 1% and 10% (with smaller gaps, subproblems may achieve better solutions, but they take more time, so fewer subproblems can be solved due to the overall time limit). For the total time limit, the larger it is, the better the solution found; the subproblem time limit should be large enough to improve the initial provided solution, but not so large that most of the total time is spent on too few subproblems.
With regard to the number of blocks in the random strategy, it was observed that it is inadvisable to take a large number, since each subproblem would then consider too few sensitive variables at once, which is unlikely to improve the previous solution. In our tests, values from 2 to 40 blocks were chosen, noticing that low numbers (say, below 10) are preferable, with no significant differences between them. However, in general it is recommended to take more than three or four blocks (or even more, depending on the table size), since lower numbers lead to large subproblems that could take as much time as the original problem.
4.2. Test instances
The BCD approach was tested using both 1H2D (hierarchical) and general complex tables. Real-world tables are provided by NSAs as a set of equations, without information about the particular inherent structure (which is difficult to extract by general procedures [6]). Hierarchical tables were thus obtained by a generator of 1H2D synthetic tables.
Some of the parameters of the generator controlling the dimensions of the table are: mean number of rows in a table; number of columns per subtable; depth of the hierarchical tree; minimum and maximum number of rows with hierarchies for
Table 1Characteristics of instances.
Instance n s m
Table 11 20,280 973 1560
Table 12 21,476 2062 1684
Table 13 36,806 3345 5832
Table 14 5388 224 1474
Table 15 26,884 2443 3368
Table 16 52,063 4732 6900
Table 17 16,852 1531 2390
Table 18 8316 755 1339
Table 21 126,362 12,324 5501
Table 22 43,365 4128 3577
Table 23 71,640 10,442 3430
Table 24 166,248 12,927 7966
Table 25 55,620 4323 2877
Table 26 209,456 16,110 12,164
Table 27 65,241 4744 7240
Table 28 88,164 6854 4321
Table 31 155,841 14,744 6876
Table 32 443,169 44,096 18,230
Table 33 116,841 14,083 4586
Table 34 180,999 26,432 6456
Table 35 499,298 55,527 20,747
Table 36 1,200,439 107,743 45,638
Table 37 296,004 42,652 10,904
Table 38 572,373 81,359 18,873
hier13�13�13d 2197 108 3549
hier16 3564 224 5484
ninenew 6546 858 7340
nine12 10,399 1178 11,362
each table; and probability for a cell to be marked as sensitive.The random generator is available from /http://www-eio.upc.es/� jcastro/generators_csp.htmlS. A set of 24 representative andlarge 1H2D tables was generated. Their main dimensions arereported in Table 1. Columns n, s, m and ‘‘N. coef.’’ show,respectively, the number of cells, sensitive cells, table linearrelations (i.e., rows of matrix A), and nonzero coefficients ofmatrix A. Columns ‘Cont.’’, ‘‘Bin.’’ and ‘‘Constr.’’ show the numberof continuous variables, binary variables, and constraints of theresulting MILP problem (4a)–(4e). The generator parameters wereset such that tables 31–38 are large, 21–28 are medium-sizetables, and 11–18 are the smaller ones. The depth of thehierarchical tree of each instance is six, but tables 16 and 17(depth is five) and table 18 (depth is four).
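As a consistency check (my observation, not stated explicitly in the text), the MILP sizes in Table 1 appear to follow directly from formulation (4a)–(4e): each cell contributes two continuous variables (Cont. = 2n), each sensitive cell one binary (Bin. = s), and Constr. = m + 4s, i.e., the m table relations plus four protection constraints per sensitive cell. A quick script verifying this on a few rows:

```python
# (n, s, m, cont, bin_, constr) taken from rows of Table 1
rows = {
    "Table 11": (20280, 973, 1560, 40560, 973, 5452),
    "Table 36": (1200439, 107743, 45638, 2400878, 107743, 476610),
    "hier16":   (3564, 224, 5484, 7128, 224, 6380),
    "nine12":   (10399, 1178, 11362, 20798, 1178, 16074),
}

for name, (n, s, m, cont, bin_, constr) in rows.items():
    assert cont == 2 * n          # two continuous deviation variables per cell
    assert bin_ == s              # one binary per sensitive cell
    assert constr == m + 4 * s    # table relations + 4 constraints per sensitive cell
```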
For the general tables, whose dimensions are also reported in Table 1, we considered four complex instances of CSPLIB (a library of tabular data protection instances, available from http://webpages.ull.es/users/casc/#CSPlib).
4.3. Results
Both BCD variants, random-BCD and tree-BCD, and the state-of-the-art branch-and-cut of CPLEX were compared using the previous set of instances. For the 1H2D tables, the sequence of feasible solutions obtained by each method was recorded until the time limit was reached. Our goal was to show that, in the case of early interruption of the optimization process, the quality of the best solution reached using BCD was similar to or better than that of the best solution provided by CPLEX branch-and-cut. All the executions were carried out on a Linux Dell PowerEdge 6950 server with four dual-core AMD Opteron 8222 3.0 GHz processors (without exploiting parallelism) and 64 GB of RAM.
Figs. 5 and 6 show the evolution of the feasible solutions obtained for, respectively, the medium-size and the large instances. Results for the smallest tables, not shown, revealed a
[Fig. 5 about here: eight panels (cases 21–28) plotting the best objective value against CPU time for tree-BCD, BCD, SAT B&C and B&C.]

Fig. 5. Sequences of feasible solutions for tables 21–28. The horizontal axis shows the CPU time (in seconds) and the vertical axis the best objective function value achieved. The time limit was 3 h. The branch-and-cut solution for table 26 is not shown because the best value of its objective function was 5.33×10^9.
similar pattern, though the differences between the methods were less significant. In those figures, lines "BCD" refer to random-BCD, "B&C" to standard branch-and-cut solutions, and "SAT B&C" to branch-and-cut started from the SAT solution. This latter option allows comparing BCD and branch-and-cut from the same starting point. For these test tables the SAT technique returned feasible points in all cases, and they were of better quality than the first feasible points provided by CPLEX, which were usually of very poor quality.
Five blocks of cells were considered for instances 11–18 with random-BCD, while ten blocks were used for instances 21–28 and 31–38. For the instances of Figs. 5 and 6, the total time limit was set to 3 h. For both BCD variants, the subproblem time limit was 1 h. In all cases except table 14, branch-and-cut exhausted the time limit without reaching an optimality gap of 5% (table 14 took less than 15 min to be solved, but it is significantly smaller than the rest). In some cases the total running time was somewhat larger than 3 h, as some BCD subproblems exhausted the subproblem time limit, thus exceeding the total time. This may happen occasionally with random-BCD (as in table 38), but is more usual with tree-BCD, because it solves a short sequence of subproblems of increasing difficulty. In the first major iterations an optimal solution to the
[Fig. 6 about here: eight panels (cases 31–38) plotting the best objective value against CPU time for tree-BCD, BCD, SAT B&C and B&C.]

Fig. 6. Sequences of feasible solutions for tables 31–38. The horizontal axis shows the CPU time (in seconds) and the vertical axis the best objective function value achieved. The branch-and-cut solutions for all tables but numbers 31 and 33 are not shown because the best values of their objective functions were about 1000 times the values presented here.
subproblems may quickly be found; in the last major iterations, the subproblems may take a considerable amount of time. On the other hand, random-BCD usually consists of a longer sequence of subproblems, which are usually solved faster, though new solutions do not always improve on the previous one.
With regard to the smallest instances 11–18, the objective function reaches an almost steady state relatively soon, and the BCD variants were faster in achieving an acceptable solution. Only table 14 could be solved by the branch-and-cut method within an optimality tolerance of 5%, and table 18 finished with a gap of just 5.20%. Longer computation times were not of much help: for instance, tables 13 and 16 were run with a time limit of two days of CPU time. The best objective function value for table 13 was 453,459, with an optimality gap of 45%; for table 16 the best objective function value was 608,223, with a gap of 55%. Although both solutions are slightly better than those reached by BCD within 2 h, the improvement in the objective function is not worth the computational effort.
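For reference (my assumption about the convention, since the paper does not define it), the relative optimality gap reported by MILP solvers is typically (incumbent − best bound) / |incumbent|; for example, the 45% gap for table 13 implies a best lower bound of roughly 0.55 × 453,459 ≈ 249,000:

```python
def rel_gap(incumbent, best_bound):
    # Relative MIP optimality gap, as commonly reported by solvers.
    return (incumbent - best_bound) / abs(incumbent)

# Table 13: incumbent 453,459 with a 45% gap -> bound near 453,459 * 0.55
bound_13 = 453459 * (1 - 0.45)
```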
Table 2. Results for general tables.

Instance        B&C         BCD         Time limit (s)
hier13×13×13d   416,604     477,024     350
hier16          5.47×10^8   6.94×10^8   7200
ninenew         9.35×10^8   5.90×10^8   7200
nine12          1.41×10^9   9.32×10^8   7200
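Reading Table 2 quantitatively (my arithmetic on the reported values): for the two largest general instances, random-BCD improves on the branch-and-cut incumbent by roughly a third within the same time limit.

```python
# Objective values from Table 2: (B&C, random-BCD)
results = {"ninenew": (9.35e8, 5.90e8), "nine12": (1.41e9, 9.32e8)}

# Relative improvement of BCD over B&C; about 37% for ninenew, 34% for nine12
improvements = {name: 1 - bcd / bc for name, (bc, bcd) in results.items()}
```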
With larger instances, the advantage of BCD over a standard branch-and-cut approach is even more pronounced: the latter often gets stuck (at least within the three-hour time limit considered), whereas the objective of the BCD sequences tends to decrease. The branch-and-cut sequence for table 26 is not shown in Fig. 5, since the time limit was reached with a (bad) solution of objective equal to 5.33×10^9. The poor initial point obtained by the CPLEX strategy may be at the root of the weak performance of the standard branch-and-cut: in Fig. 6, only instances 31 and 33 obtain a solution of the same order of magnitude as that of BCD. The SAT proposal is clearly superior in all cases, and the evolution of BCD is better than the branch-and-cut sequence, which usually does not progress within the time window considered.
It is also observed, for instances of any size, that random-BCD and tree-BCD perform very similarly, the former being slightly better. This was not expected a priori, since tree-BCD was supposed to take advantage of the tree structure of the tables.
Table 2 shows the results obtained for the four complex general tables of Table 1. Columns "B&C" and "BCD" report the objective function value reached within the CPU time limit shown in column "Time limit" for, respectively, the CPLEX branch-and-cut and random-BCD. From these results, it cannot be concluded that BCD is always competitive against a state-of-the-art branch-and-cut for general complex tables. This is partly explained by the complexity of the interrelations between the table cells, which result in highly coupled clusters of cells in the BCD approach. However, in the two significantly largest instances (ninenew and nine12) BCD returned a much better solution within the time limit.
It is worth noting that the above results were obtained with the default options of the CPLEX MILP solver, both for the standard branch-and-cut and the BCD approaches. To test whether the performance of CPLEX could be improved by tuning some of its MILP parameters, the group of largest tables was solved again with the following changes: first, parameter MIP emphasis set to "optimality" (i.e., emphasize optimality over feasibility); second, MIP emphasis set to "bestbound" (i.e., greater emphasis on optimality through moving the best bound); third, parameter FPHEUR (feasibility pump [8]) activated. The random-BCD method was also executed with both choices of the MIP emphasis parameter. It can be concluded that the default parameters of CPLEX performed better, since worse objective functions were obtained with the manual tuning. BCD always obtained better solutions than the branch-and-cut with manual tuning, though not as good as those with the default parameters. As for the feasibility pump heuristic, it did not outperform the automatic CPLEX strategy for initial points (which may select the feasibility pump, if necessary); in any case, points obtained with either the automatic or the feasibility pump procedures were always poorer than those computed by the SAT approach.
5. Conclusions
From our experience with real-world instances, it can be stated that CTA problems can be extremely difficult for large and complex tables, even for state-of-the-art MILP solvers.
The BCD approach presented and tested in this work was able to obtain good solutions within a CPU time limit of 2 or 3 h for large (up to 1,200,000 cells, 100,000 of them sensitive) 1H2D tables. Moreover, the solution obtained was comparable to, and usually better than, the incumbent provided by state-of-the-art branch-and-cut solvers within the same time limit, even if the initial point found by the SAT procedure is used as a warm start for the branch-and-cut solver. Even after tuning some MILP parameters to improve the CPLEX performance, the BCD approach consistently returned the best solutions.
For general tables (with unknown internal structure), BCD did not outperform branch-and-cut for all the tables tested, but it did for the largest ones. We are thus optimistic about the possibilities of the method for even more difficult real-world tables (with a higher number of cells and sensitive cells). This is partly supported by the better observed behavior of random-BCD compared with tree-BCD: random-BCD can be immediately applied to more general (other than 1H2D) classes of tables, without the need to exploit the particular internal structure of the table relations.
The (increasing) ability of NSAs to create huge and ever more complex tables from collected data is an incentive to develop powerful tools for CTA. Among them we find Benders' reformulation of CTA [2]; some preliminary testing with a prototype showed that the approach is efficient for two-dimensional tables [7], but deeper cuts are needed for more complex tables. Other data protection approaches, like interval protection (which results in a massive linear programming problem) and its efficient solution by structured interior-point methods, are also among the remaining tasks to be addressed in this challenging field.
Acknowledgments
The authors thank Daniel Baena (from the NSA of Catalonia) for generating the tables used in the computational results. This work has been supported by grants MTM2009-08747 of the Spanish Ministry of Science and Innovation, and SGR-2009-1122 of the Government of Catalonia.
References

[1] Alonso-Ayuso A, Escudero LF, Ortuño MT. BFC, a branch-and-fix coordination algorithmic framework for solving some types of stochastic pure and mixed 0–1 programs. European Journal of Operational Research 2003;151:503–19.
[2] Benders JF. Partitioning procedures for solving mixed-variables programming problems. Computational Management Science 2005;2:3–19 [English translation of the original paper that appeared in Numerische Mathematik 4 (1962) 238–52].
[3] Bertsekas DP. Nonlinear programming. 2nd ed. Belmont: Athena Scientific; 1999.
[4] Castro J. Minimum-distance controlled perturbation methods for large-scale tabular data protection. European Journal of Operational Research 2006;171:39–52.
[5] Castro J. A shortest paths heuristic for statistical disclosure control in positive tables. INFORMS Journal on Computing 2007;19:520–33.
[6] Castro J, Baena D. Automatic structure detection in constraints of tabular data. In: Lecture notes in computer science, vol. 4302; 2006. p. 12–24.
[7] Castro J, Baena D. Using a mathematical programming modeling language for optimal CTA. In: Lecture notes in computer science, vol. 5262; 2008. p. 1–12.
[8] Chinneck JW. Feasibility and infeasibility in optimization. New York: Springer; 2008.
[9] Conejo AJ, Castillo E, Mínguez R, García-Bertrand R. Decomposition techniques in mathematical programming: engineering and science applications. Berlin: Springer; 2006.
[10] Dandekar RA. Energy Information Administration, Department of Energy, USA; 2003, personal communication.
[11] Dandekar RA, Cox LH. Synthetic tabular data: an alternative to complementary cell suppression, manuscript. Energy Information Administration, U.S. Department of Energy, available from the first author on request ([email protected]); 2002.
[12] Dillenberger Ch, Escudero LF, Wollensak A, Zhang W. On practical resource allocation for production planning and scheduling with period overlapping setups. European Journal of Operational Research 1994;75:275–86.
[13] Domingo-Ferrer J, Franconi L, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 4302. Berlin: Springer; 2006.
[14] Domingo-Ferrer J, Magkos E, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 6344. Berlin: Springer; 2010.
[15] Domingo-Ferrer J, Saygin Y, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 5262. Berlin: Springer; 2008.
[16] Domingo-Ferrer J, Torra V. A critique of the sensitivity rules usually employed for statistical table protection. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2002;10:545–56.
[17] Domingo-Ferrer J, Torra V, editors. Privacy in statistical databases. Lecture notes in computer science, vol. 3050. Berlin: Springer; 2004.
[18] Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical data protection. Journal of Computational and Applied Mathematics 2004;164–165:285–93.
[19] Eén N, Sörensson N. An extensible SAT-solver. In: Proceedings of the sixth international conference on theory and applications of satisfiability testing; 2003.
[20] Giessing S, Höhne J. Eliminating small cells from census counts tables: some considerations on transition probabilities. In: Lecture notes in computer science, vol. 6344; 2010. p. 52–65.
[21] Giessing S, Hundepool A, Castro J. Rounding methods for protecting EU-aggregates. In: Worksession on statistical data confidentiality. Eurostat methodologies and working papers. Luxembourg: Eurostat-Office for Official Publications of the European Communities; 2009. p. 255–64.
[22] Glover F, Cox LH, Kelly JP, Patil R. Exact, heuristic and metaheuristic methods for confidentiality protection by controlled tabular adjustment. International Journal of Operations Research 2008;5:117–28.
[23] Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Lenz R, Naylor J, et al. Handbook on statistical disclosure control. Network of excellence in the European statistical system in the field of statistical disclosure control. Available online at http://neon.vb.cbs.nl/casc/SDC_Handbook.pdf; 2010.
[24] Kelly JP, Golden BL, Assad AA. Cell suppression: disclosure protection for sensitive tabular data. Networks 1992;22:28–55.
[25] Marques-Silva JP, Sakallah KA. GRASP: a search algorithm for propositional satisfiability. IEEE Transactions on Computers 1999;48:506–21.
[26] Plazas MA. Multistage stochastic model for bidding in electrical markets (in Spanish). PhD thesis, Universidad de Castilla-La Mancha; 2006.
[27] Willenborg L, de Waal T. Elements of statistical disclosure control. Lecture notes in statistics, vol. 155. New York: Springer; 2000.
[28] Zayatz L. U.S. Census Bureau, communication at the Joint UNECE/Eurostat Work Session on statistical data confidentiality, Bilbao (Basque Country, Spain); 2009.
[29] Zhang H. SATO: an efficient propositional prover. In: Proceedings of the international conference on automated deduction; July 1997. p. 272–5.