Boosting Verification by Automatic Tuning of Decision Procedures

Domagoj Babić, joint work with Frank Hutter, Holger H. Hoos, and Alan J. Hu (University of British Columbia)

TRANSCRIPT

Page 1: Boosting Verification by Automatic Tuning of Decision Procedures

Domagoj Babić
joint work with Frank Hutter, Holger H. Hoos, and Alan J. Hu
University of British Columbia

Page 2: Decision procedures

• Core technology for formal reasoning

• Trend towards completely automated verification
  – Scalability is problematic
  – Better (more scalable) decision procedures are needed
  – Possible direction: application-specific tuning

[Diagram: formula → decision procedure → SAT (solution) / UNSAT]
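In other words, a decision procedure maps a formula to a satisfying assignment (SAT) or to UNSAT. Below is a minimal sketch of that interface with a toy DPLL backtracking search behind it; this is illustrative only, not Spear's implementation:

```python
from typing import Dict, List, Optional

Clause = List[int]   # DIMACS-style literals: 3 means x3, -3 means ¬x3
Formula = List[Clause]

def decide(formula: Formula, assignment=None) -> Optional[Dict[int, bool]]:
    """A tiny DPLL decision procedure: a satisfying assignment if SAT, None if UNSAT.
    Real solvers add learning, decision heuristics, restarts, etc."""
    if assignment is None:
        assignment = {}
    # Simplify: drop satisfied clauses, strip falsified literals from the rest.
    simplified = []
    for clause in formula:
        if any((lit > 0) == assignment.get(abs(lit))
               for lit in clause if abs(lit) in assignment):
            continue                      # clause already satisfied
        rest = [lit for lit in clause if abs(lit) not in assignment]
        if not rest:
            return None                   # empty clause: conflict
        simplified.append(rest)
    if not simplified:
        return assignment                 # all clauses satisfied: SAT
    var = abs(simplified[0][0])           # naive decision: first unassigned variable
    for value in (True, False):
        result = decide(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

# Example: (x1 ∨ x2) ∧ (¬x1 ∨ x3) ∧ (¬x2 ∨ ¬x3)
print(decide([[1, 2], [-1, 3], [-2, -3]]))
```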

Page 3: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 4: Performance of Decision Procedures

• Heuristics

• Learning (avoiding repeated redundant work)

• Algorithms

Page 5: Heuristics and search parameters

• The brain of every decision procedure
  – Determine performance

• Numerous heuristics:
  – Learning, clause database cleanup, variable/phase decision, ...

• Numerous parameters:
  – Restart period, variable decay, priority increment, ...

• Significantly influence the performance

• Parameters/heuristics perform differently on different benchmarks

Page 6: Spear bit-vector decision procedure parameter space

• Spear 1.9:
  – 4 heuristics × 22 optimization functions
  – 2 heuristics × 3 optimization functions
  – 12 double parameters
  – 4 unsigned parameters
  – 4 bool parameters
  (26 parameters in total)

• Large number of combinations:
  – After limiting the ranges of the double & unsigned parameters and discretizing the double parameters: 3.78 × 10^18 combinations
  – After exploiting dependencies: 8.34 × 10^17 combinations
  – Finding a good combination is hard! (see the back-of-the-envelope sketch below)
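The count itself is just a product of domain sizes, so the order of magnitude is easy to reproduce. A back-of-the-envelope sketch; the per-parameter domain sizes are assumptions, since the slide gives only the totals:

```python
from math import prod

# Hypothetical per-parameter domain sizes; Spear's actual discretization is
# not on the slide, so these are placeholders chosen only to land in the
# same order of magnitude as the quoted 3.78 × 10^18.
domain_sizes = (
    [22] * 4 +  # 4 heuristic slots with 22 optimization functions each
    [3] * 2 +   # 2 heuristic slots with 3 optimization functions each
    [5] * 12 +  # 12 double parameters, discretized to 5 values (assumed)
    [5] * 4 +   # 4 unsigned parameters, limited to 5 values (assumed)
    [2] * 4     # 4 boolean parameters
)
assert len(domain_sizes) == 26

total = prod(domain_sizes)
print(f"{total:.2e} raw combinations")  # ~5.1e18 with these assumed sizes
# Exploiting dependencies (parameters that only matter when a parent
# heuristic is selected) shrinks the space further, to 8.34e17 on the slide.
```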

Page 7: Goal

• Find a good combination of parameters (and heuristics):
  – Optimize for different problem sets (minimizing the average runtime)

• Avoid time-consuming manual optimization

• Learn from the found parameter sets
  – Apply that knowledge to the design of decision procedures

Page 8: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 9: Manual optimization

• The standard way of finding parameter sets

• Developers pick a small set of easy benchmarks (hard benchmarks = slow development cycle)
  – Hard to achieve robustness
  – Easy to over-fit (to small and specific benchmarks)

• Spear manual tuning:
  – Approximately one week of tedious work

Page 10: When to give up manual optimization?

• Depends mainly on the sensitivity of the decision procedure to parameter modifications

• Decision procedures for NP-hard problems are extremely sensitive to parameter modifications:
  – Changes of 1–2 orders of magnitude in performance are usual
  – Sometimes up to 4 orders of magnitude

Page 11: Sensitivity Example

• Example: same instance, same parameters, same machine, same solver
  – Spear compiled with 80-bit floating-point precision: 0.34 s
  – Spear compiled with 64-bit floating-point precision: times out after 6000 s
  – First ~55,000 decisions equal, one mismatch, next ~100 equal, then complete divergence (illustrated below)

• Manual optimization for NP-hard problems is ineffective.
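The mechanism is generic: iterated floating-point updates accumulate different rounding errors at different precisions, so two runs agree for a long stretch and then take a different branch. A toy reproduction of the effect, not Spear's code; it assumes numpy's longdouble is the 80-bit x87 type, which holds on most x86 builds but not on all platforms:

```python
import numpy as np

def trajectory(dtype, steps=500):
    """Iterate a MiniSat-style activity update (decay, then bump) at a
    given floating-point precision."""
    x = dtype(0)
    decay, bump = dtype(0.999), dtype(1)
    vals = []
    for _ in range(steps):
        x = x * decay + bump        # each step rounds differently per precision
        vals.append(np.float64(x))  # round to a common width for comparison
    return vals

t64 = trajectory(np.float64)
t80 = trajectory(np.longdouble)  # 80-bit extended on most x86 builds (assumption)

# First step at which the precisions disagree even after rounding both back
# to 64 bits (typically within a few dozen steps). In a solver, one flipped
# comparison like this sends the search down a completely different branch.
first = next((i for i, (a, b) in enumerate(zip(t64, t80)) if a != b), None)
print("first differing step:", first)
```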

Page 12: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 13: Automatic tuning

• Loop until happy (with the found parameters), as sketched below:
  – Perturb the existing set of parameters
  – Perform hill-climbing:
    • Modify one parameter at a time
    • Keep the modification if it improves performance
    • Stop when a local optimum is found
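This is iterated local search over the configuration space. A minimal sketch under assumptions: `domains` maps each parameter name to its list of allowed values, and `evaluate` is a hypothetical callback that runs the solver on the training set and returns the average runtime.

```python
import random

def one_exchange_neighbors(config, domains):
    """All configurations differing from `config` in exactly one parameter."""
    for name, values in domains.items():
        for v in values:
            if v != config[name]:
                yield {**config, name: v}

def hill_climb(config, domains, evaluate):
    """First-improvement descent: modify one parameter at a time, keep the
    change if it helps, stop at a local optimum."""
    cost = evaluate(config)
    improved = True
    while improved:
        improved = False
        for cand in one_exchange_neighbors(config, domains):
            c = evaluate(cand)
            if c < cost:
                config, cost, improved = cand, c, True
                break
    return config, cost

def tune(init, domains, evaluate, rounds=20, kick=3):
    """Loop until happy: perturb the incumbent, hill-climb, keep the better."""
    best, best_cost = hill_climb(init, domains, evaluate)
    for _ in range(rounds):
        pert = dict(best)
        for name in random.sample(sorted(domains), min(kick, len(domains))):
            pert[name] = random.choice(domains[name])  # the perturbation step
        cand, cost = hill_climb(pert, domains, evaluate)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```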

Page 14: Implementation: FocusedILS [Hutter, Hoos, Stützle, ’07]

• Used for Spear tuning

• Adaptively chooses training instances (see the sketch below):
  – Quickly discards poor parameter settings
  – Evaluates better ones more thoroughly

• Any scalar metric can be optimized:
  – Runtime, precision, number of false positives, ...

• Can optimize the median, average, ...
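The adaptive-evaluation idea can be sketched as a comparison that spends runs on a challenger only while it keeps up with the incumbent. This is a simplified paraphrase of the mechanism, not the published FocusedILS algorithm; `run_solver` is a hypothetical callback returning one runtime.

```python
def challenger_survives(incumbent_runs, challenger, instances, run_solver):
    """Spend runs on a challenger only while it keeps up on the shared
    prefix of training instances; poor settings are discarded after a
    handful of runs, good ones get evaluated more thoroughly."""
    challenger_runs = []
    for inst in instances[:len(incumbent_runs)]:
        challenger_runs.append(run_solver(challenger, inst))
        k = len(challenger_runs)
        if sum(challenger_runs) > sum(incumbent_runs[:k]):
            return False, challenger_runs   # rejected cheaply
    return True, challenger_runs            # matched the incumbent's evidence
```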

Page 15: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 16: Experimental Setup - Benchmarks

• 2 experiments:
  – General-purpose tuning (Spear v0.9)
    • Industrial instances from previous SAT competitions
  – Application-specific tuning (Spear v1.8)
    • Bounded model checking (BMC) instances
    • Calysto software checking instances

• Machines:
  – Cluster of 55 dual 3.2 GHz Intel Xeon PCs with 2 GB RAM

• Benchmark sets divided:
  – Training & test, disjoint
  – Test timeout: 10 hrs

Page 17: Tuning 1: General-purpose optimization

• Training:
  – Timeout: 10 sec
  – Risky, but no experimental evidence of over-fitting
  – 3 days of computation on the cluster

• Very heterogeneous training set:
  – Industrial instances from previous competitions

• 21% geometric-mean speedup on the industrial test set over the manual settings (see below)

• ~3X on bounded model checking
• ~78X on Calysto software checking
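For reference, the geometric-mean speedup aggregates per-instance ratios multiplicatively, so no single slow instance dominates the way it would in an arithmetic mean. A small sketch with made-up runtimes; the actual per-instance data is not on the slide:

```python
from math import exp, log

def geometric_mean_speedup(baseline, tuned):
    """Geometric mean of the per-instance speedup ratios baseline/tuned."""
    ratios = [b / t for b, t in zip(baseline, tuned)]
    return exp(sum(log(r) for r in ratios) / len(ratios))

# Hypothetical runtimes in seconds, for illustration only.
manual = [1.2, 30.0, 4.5, 250.0]
auto   = [1.0, 22.0, 4.1, 200.0]
# Prints roughly "22%" for these made-up numbers.
print(f"{geometric_mean_speedup(manual, auto) - 1:.0%} geometric-mean speedup")
```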

Page 18: Tuning 1: Bounded model checking instances

Page 19: Tuning 1: Calysto instances

Page 20: Tuning 2: Application-specific optimization

• Training:
  – Timeout: 300 sec
  – Bounded model checking optimization: 2 days on the cluster
  – Calysto instances: 3 days on the cluster

• Homogeneous training set

• Speedups over the SAT-competition settings:
  – ~2X on BMC
  – ~20X on SWV

• Speedups over the manual settings:
  – ~4.5X on BMC
  – ~500X on SWV

Page 21: Tuning 2: Bounded model checking instances

[Runtime plot; annotation: ~4.5X speedup]

Page 22: Tuning 2: Calysto instances

[Runtime plot; annotation: ~500X speedup]

Page 23: Overall Results

Solver                          | BMC #solved | BMC avg. runtime (solved) [s] | SWV #solved | SWV avg. runtime (solved) [s]
--------------------------------|-------------|-------------------------------|-------------|------------------------------
Minisat                         | 289/377     | 360.9                         | 302/302     | 161.3
Spear manual                    | 287/377     | 340.8                         | 298/302     | 787.1
Spear SAT comp.                 | 287/377     | 223.4                         | 302/302     | 35.9
Spear auto-tuned, app-specific  | 291/377     | 113.7                         | 302/302     | 1.5


Page 28: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 29: Software verification parameters

– Greedy activity-based heuristic
  • Probably helps focus on the most frequently used sub-expressions

– Aggressive restarts
  • Probably the standard heuristics and the initial ordering do not work well for SWV problems

– Phase selection: always false
  • Probably related to the checked property (NULL-pointer dereference)

– No randomness
  • Spear & Calysto are highly optimized

Page 30: Bounded model checking parameters

– Less aggressive activity heuristic

– Infrequent restarts
  • Probably the initial ordering (as encoded) works well

– Phase selection: fewer watched clauses
  • Minimizes the amount of work

– A small amount of randomness helps
  • 5% random variable and phase decisions

– Simulated annealing works well (schedule sketched below)
  • Decrease randomness by 30% after each restart
  • Focuses the solver on hard chunks of the design
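The annealing-style schedule above is simple to state: start with a 5% chance of a random decision and multiply that probability by 0.7 at every restart. A minimal sketch of such a decision loop; `decide` and the restart loop are hypothetical stand-ins for solver internals:

```python
import random

def decide(variables, activity, rand_freq):
    """Branching decision: random with probability rand_freq, otherwise the
    highest-activity variable (stand-in for the solver's decision heuristic)."""
    if random.random() < rand_freq:
        return random.choice(variables)
    return max(variables, key=lambda v: activity[v])

rand_freq = 0.05           # 5% random variable/phase decisions initially
for restart in range(10):
    # ... search until the restart limit, calling decide(...) at each branch ...
    rand_freq *= 0.70      # decrease randomness by 30% after each restart
```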

Page 31: Outline

• Problem definition
• Manual tuning
• Automatic tuning
• Experimental results
• Found parameter sets
• Future work

Page 32: Future Work

• Per-instance tuning (machine-learning-based techniques)

• Analysis of the relative importance of parameters
  – Simplify the solver

• Tons of data, little analysis done... Correlations between parameters and runtime statistics could reveal important dependencies.

Page 33: Take-away messages

• Automatic tuning is effective
  – Especially application-specific tuning

• Avoids time-consuming manual tuning

• Sensitivity to parameter modifications
  – Few benchmarks = inconclusive results?