Submodularity in Machine Learning
Andreas Krause, Stefanie Jegelka
4
MAP inference
How do we find the MAP labeling in discrete graphical models efficiently?
[Figure: image segmented into regions labeled sky, treehouse, grass]
5
What’s common?
Formalization:
Optimize a set function F(S) under constraints
generally very hard
but: structure helps! … if F is submodular, we can …
solve optimization problems with strong guarantees
solve some learning problems
7
Outline
What is submodularity?
Optimization
• Minimization: new algorithms, constraints
• Maximization: new algorithms (unconstrained)
Learning
• Learning for Optimization: new settings
Part I / Part II
… and many new applications, many new results!
10
Set functions
Finite ground set V = {1, 2, …, n}; set function F: 2^V → R
We will assume F(∅) = 0 (w.l.o.g.)
We assume a black box that can evaluate F(A) for any A ⊆ V
11
Example: placing sensors
Utility F(A) of having sensors at a subset A of all locations:
A = {1,2,3} (sensors X1, X2, X3): very informative → high value F(A)
A = {1,4,5} (sensors X1, X4, X5): redundant info → low value F(A)
13
Decreasing gains: submodularity
placement A = {1,2}: adding a new sensor s helps a lot → big gain
placement B = {1,…,5}: adding s doesn't help much → small gain
F(A ∪ {s}) − F(A)  ≥  F(B ∪ {s}) − F(B)   for A ⊆ B
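For small ground sets, the decreasing-gains condition can be checked by brute force; a minimal Python sketch (the helper name and the toy coverage model are my own, not from the tutorial):

```python
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check of decreasing gains:
    F(A | {s}) - F(A) >= F(B | {s}) - F(B) for all A <= B, s not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for s in set(V) - B:
                gain_A = F(A | {s}) - F(A)
                gain_B = F(B | {s}) - F(B)
                if gain_A < gain_B - 1e-9:
                    return False
    return True

# Toy sensor coverage: each location covers a set of points.
cover = {1: {"p1", "p2"}, 2: {"p2", "p3"}, 3: {"p3", "p4"}}
F = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
print(is_submodular(F, {1, 2, 3}))  # True
```

A supermodular function such as |A|² fails the same check, since its gains grow with the set.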
Example: Set cover
16
A node predicts values of positions within some radius.
[Figure: floorplan with possible sensor locations]
Goal: place sensors in the building to cover the floorplan with discs.
F(A) = “area covered by sensors placed at A”
Formally: finite set W, collection of n subsets S1, …, Sn ⊆ W; for A ⊆ V = {1, …, n} define F(A) = |∪_{i∈A} Si|
Set cover is submodular
17
[Figure: two covers of the floorplan]
A = {s1, s2};  B = {s1, s2, s3, s4};  new disc s′
F(A ∪ {s′}) − F(A)  ≥  F(B ∪ {s′}) − F(B)
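The inequality on this slide can be replayed numerically with a toy coverage model (the regions below are invented for illustration):

```python
# Each sensor location covers a set of floor cells (invented regions);
# "sp" plays the role of the new disc s'.
regions = {
    "s1": {1, 2, 3}, "s2": {3, 4, 5},
    "s3": {5, 6},    "s4": {6, 7},
    "sp": {2, 3, 6},
}

def area(A):
    """F(A) = |area covered by sensors placed at A|."""
    covered = set()
    for s in A:
        covered |= regions[s]
    return len(covered)

A = {"s1", "s2"}
B = {"s1", "s2", "s3", "s4"}
gain_A = area(A | {"sp"}) - area(A)
gain_B = area(B | {"sp"}) - area(B)
print(gain_A, gain_B)  # 1 0
```

The new disc adds a fresh cell to the small placement A but is fully redundant given B, so the gain over A dominates the gain over B.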
18
Example: Sensor placement
A more complex model for sensing:
Joint probability distribution P(X1,…,Xn, Y1,…,Yn) = P(Y1,…,Yn) · P(X1,…,Xn | Y1,…,Yn)   (prior · likelihood)
Ys: temperature at location s;  Xs: sensor value at location s;  Xs = Ys + noise
[Figure: graphical model over hidden temperatures Y1,…,Y6 and observations X1,…,X6]
Utility of having sensors at subset A of all locations: F(A)
19
A = {1,2,3}: high value F(A)
A = {1,4,5}: low value F(A)
F(A) = uncertainty about temperature Y before sensing − uncertainty about temperature Y after sensing
Submodularity of Information Gain
Y1,…,Ym, X1,…,Xn discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
F(A) is NOT always submodular
But if the Xi are all conditionally independent given Y, then F(A) is submodular! [Krause & Guestrin `05]
20
[Figure: Y1, Y2, Y3 with conditionally independent observations X1,…,X4]
Proof idea: “information never hurts”, i.e. H(Y | X_A) ≥ H(Y | X_B) for A ⊆ B.
Example: costs
21
breakfast??
cost: time to reach shop + price of items
[Figure: Market 1, Market 2, Market 3 at travel times t1, t2, t3; each item 1 $]
ground set
Example: costs
22
breakfast??
cost: time to shop + price of items
F( ) = cost( ) + cost( , )
submodular?
= t1 + 1 + t2 + 2
= #shops + #items
[Figure: Market 1, Market 2, Market 3]
Shared fixed costs
23
[Figure: shopping sets A ⊆ B]
marginal cost: #new shops + #new items
• shops: shared fixed cost
• economies of scale
decreasing marginal cost: the cost function is submodular!
24
Another example: Cut functions
[Figure: graph on V = {a,b,c,d,e,f,g,h} with edge weights 1, 2, 3]
Cut function: F(S) = total weight of edges between S and V \ S
The cut function is submodular!
Why are cut functions submodular?
25
Consider a single edge (a, b) with weight w:

S      F_ab(S)
{}     0
{a}    w
{b}    w
{a,b}  0

F_ab is submodular if w ≥ 0: F_ab({a}) + F_ab({b}) ≥ F_ab({}) + F_ab({a,b}).
The full cut function is a sum of the cut functions of the subgraphs {i, j}, so it is submodular (for w ≥ 0!).
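The edge-by-edge argument extends to whole graphs; a brute-force check on a small invented graph:

```python
from itertools import combinations

# Undirected weighted graph (invented): edge -> weight, all weights >= 0.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): 3.0, ("c", "d"): 2.0}
V = {"a", "b", "c", "d"}

def cut(S):
    """F(S) = total weight of edges with exactly one endpoint in S."""
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

# Decreasing gains: F(A+s) - F(A) >= F(B+s) - F(B) for A <= B, s not in B.
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
ok = all(
    cut(A | {s}) - cut(A) >= cut(B | {s}) - cut(B) - 1e-9
    for A in subsets for B in subsets if A <= B
    for s in V - B
)
print(ok)  # True
```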
Closedness properties
F1, …, Fm submodular functions on V and λ1, …, λm > 0
Then: F(A) = Σi λi Fi(A) is submodular
Submodularity is closed under nonnegative linear combinations!
Extremely useful fact: Fθ(A) submodular for each θ ⇒ Σθ P(θ) Fθ(A) submodular!
Multicriterion optimization. A basic proof technique!
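A quick brute-force confirmation of closedness under nonnegative combinations (the functions and weights below are invented):

```python
from itertools import combinations

V = {1, 2, 3}
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def submodular(F):
    """Check decreasing gains over all A <= B and s outside B."""
    return all(
        F(A | {s}) - F(A) >= F(B | {s}) - F(B) - 1e-9
        for A in subsets for B in subsets if A <= B for s in V - B
    )

cover = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z"}}
F1 = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
F2 = lambda A: min(len(A), 2)        # truncation: concave of |A|, submodular

# A nonnegative combination stays submodular:
G = lambda A: 0.5 * F1(A) + 3.0 * F2(A)
print(submodular(F1), submodular(F2), submodular(G))  # True True True
```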
26
Other closedness properties
Restriction: F(S) submodular on V, W ⊆ V. Then F′(S) = F(S ∩ W) is submodular.
Conditioning: F(S) submodular on V, W ⊆ V. Then F′(S) = F(S ∪ W) is submodular.
Reflection: F(S) submodular on V. Then F′(S) = F(V \ S) is submodular.
29
34
Maximum of submodular functions
F1, F2 submodular. What about F(A) = max(F1(A), F2(A))?
[Figure: F1(A), F2(A), and their pointwise maximum plotted against |A|]
max(F1,F2) not submodular in general!
Minimum of submodular functions
Well, maybe F(A) = min(F1(A),F2(A)) instead?
35
A      F1(A)  F2(A)  F(A)
{}     0      0      0
{a}    1      0      0
{b}    0      1      0
{a,b}  1      1      1

F({b}) - F({}) = 0  <  1 = F({a,b}) - F({a})
min(F1,F2) not submodular in general!
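The counterexample in the table can be checked mechanically (a minimal sketch):

```python
# F1 counts whether a is present, F2 whether b is present (both modular,
# hence submodular); their pointwise min is the table from the slide.
F1 = lambda A: int("a" in A)
F2 = lambda A: int("b" in A)
F = lambda A: min(F1(A), F2(A))

# Decreasing gains fails when adding b:
gain_empty = F({"b"}) - F(set())          # 0
gain_super = F({"a", "b"}) - F({"a"})     # 1
print(gain_empty, gain_super)  # 0 1 -> violates gain_empty >= gain_super
```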
What to do with submodular functions
37
Optimization
Minimization
Maximization
Learning
Online/adaptive optimization
Minimization and maximization not the same??
39
Submodular minimization
structured sparsity / regularization
clustering
MAP inference ↔ minimum cut
[Figure: s-t cut]
Set functions and energy functions
Any set function F: 2^V → R with F(∅) = 0 … is a function on binary vectors: identify A ⊆ V with its indicator vector x_A ∈ {0,1}^V (e.g. V = {a,b,c,d}, A = {a,b} ↦ x_A = (1,1,0,0)), and set f(x_A) = F(A).
41
f is a pseudo-boolean function
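The set ↔ binary-vector correspondence is mechanical to code up; a minimal sketch:

```python
V = ["a", "b", "c", "d"]   # fixed order of the ground set

def to_vector(A):
    """Indicator vector x_A of a subset A."""
    return [1 if v in A else 0 for v in V]

def to_set(x):
    return {v for v, bit in zip(V, x) if bit}

A = {"a", "b"}
x = to_vector(A)
print(x)                # [1, 1, 0, 0]
print(to_set(x) == A)   # True

# A set function F induces a pseudo-boolean function f(x) = F(A_x):
F = lambda A: len(A)               # example set function
f = lambda x: F(to_set(x))
print(f([1, 1, 0, 0]))  # 2
```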
42
Submodularity and convexity
The Lovász extension f of F is convex if and only if F is submodular [Lovász, 1982].
A minimum of f yields a minimum of F, so submodular minimization can be cast as convex minimization: polynomial time! [Grötschel, Lovász, Schrijver 1981]
44
The submodular polyhedron P_F
P_F = { x ∈ R^V : x(A) ≤ F(A) for all A ⊆ V }, where x(A) = Σ_{i∈A} x_i
Example: V = {a,b}

A      F(A)
{}     0
{a}    -1
{b}    2
{a,b}  0

Constraints: x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})
[Figure: P_F in the (x_{a}, x_{b}) plane]
Evaluating the Lovász extension
45
[Figure: linear maximization over P_F in direction x, maximizer y*]
f(x) = max_{y ∈ P_F} ⟨x, y⟩: linear maximization over P_F
Exponentially many constraints, yet computable in O(n log n) time [Edmonds `70]
• also yields a subgradient
• and a separation oracle
Greedy algorithm:
• sort x: x_{π(1)} ≥ … ≥ x_{π(n)}
• the order defines sets S_k = {π(1), …, π(k)}
• set y*_{π(k)} = F(S_k) − F(S_{k−1})
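The greedy evaluation takes only a few lines; the example values of F below are the ones from the running example (F({}) = 0, F({a}) = -1, F({b}) = 2, F({a,b}) = 0):

```python
def lovasz_extension(x, F, V):
    """Evaluate f(x) = max_{y in P_F} <x, y> via Edmonds' greedy algorithm:
    sort the coordinates of x in decreasing order; along that order,
    y*_{pi(k)} = F(S_k) - F(S_{k-1})."""
    order = sorted(V, key=lambda i: -x[i])
    y, prev, S = {}, 0.0, []
    for i in order:
        S.append(i)
        val = F(frozenset(S))
        y[i] = val - prev
        prev = val
    return sum(x[i] * y[i] for i in V), y

# The running example:
table = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda S: table[S]

# On indicator vectors the extension agrees with F:
f_val, y_star = lovasz_extension({"a": 1.0, "b": 0.0}, F, ["a", "b"])
print(f_val)  # -1.0 (= F({a}))
```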
Submodular minimization
Combinatorial algorithms
Fulkerson prize: Iwata, Fujishige, Fleischer `01 & Schrijver `00
State of the art: O(n^4 T + n^5 log M) [Iwata `03], O(n^6 + n^5 T) [Orlin `09], where T = time for evaluating F
Minimizing the convex extension
• ellipsoid algorithm [Grötschel et al. `81]
• subgradient method, smoothing [Stobbe & Krause `10]
Duality: minimum norm point algorithm [Fujishige & Isotani `11]
47
48
[Figure: base polytope B_F of the example, the segment u({a,b}) = F({a,b}), and the minimum norm point u*]
Regularized problem: minimize f(x) + (1/2)‖x‖², where f is the Lovász extension; thresholding the solution at 0 minimizes F [Fujishige `91, Fujishige & Isotani `11]
Base polytope: B_F = P_F ∩ { u : u(V) = F(V) }
Example: V = {a,b}

A      F(A)
{}     0
{a}    -1
{b}    2
{a,b}  0
The minimum-norm-point algorithm
49
Dual: minimum norm problem
1. find u* = argmin_{u ∈ B_F} ‖u‖²
2. return S* = {i : u*_i < 0}
Can we solve this? Yes! Recall: we can solve linear optimization over P_F; similarly over B_F. So u* can be found (Frank-Wolfe algorithm).
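A minimal Frank-Wolfe sketch for this dual (the step-size schedule and iteration count are arbitrary choices of mine); on the two-element running example it recovers the minimizer {a}:

```python
def greedy_vertex(w, F, V):
    """argmax_{s in B_F} <w, s>: sort V by decreasing w, take marginals."""
    order = sorted(V, key=lambda i: -w[i])
    s, prev, S = {}, 0.0, []
    for i in order:
        S.append(i)
        val = F(frozenset(S))
        s[i] = val - prev
        prev = val
    return s

def min_norm_point(F, V, iters=500):
    """Frank-Wolfe for u* = argmin_{u in B_F} ||u||^2;
    the minimizer of F is S* = {i : u*_i < 0}."""
    u = greedy_vertex({i: 0.0 for i in V}, F, V)   # any vertex of B_F
    for t in range(1, iters + 1):
        # gradient of ||u||^2 is 2u; LP step: maximize <-u, s> over B_F
        s = greedy_vertex({i: -u[i] for i in V}, F, V)
        gamma = 2.0 / (t + 2.0)
        u = {i: (1 - gamma) * u[i] + gamma * s[i] for i in V}
    return {i for i in V if u[i] < 0}, u

table = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda S: table[S]
S_star, u = min_norm_point(F, ["a", "b"])
print(S_star)  # {'a'}  (F({a}) = -1 is the minimum)
```

Here the min-norm point is u* = (-1, 1), and thresholding at 0 returns {a}. In practice Wolfe's algorithm with exact line search is used instead of this plain Frank-Wolfe loop.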
The minimum-norm-point algorithm
[Figure: minimum norm point u* on the base polytope B_F of the example] [Fujishige `91, Fujishige & Isotani `11]
50
Empirical comparison
Minimum norm point algorithm: usually orders of magnitude faster
Cut functions from the DIMACS Challenge
[Figure: running time in seconds vs. problem size, both log-scale, sizes 64-1024: minimum norm point algorithm vs. combinatorial algorithms] [Fujishige & Isotani `11]
52
Example I: Sparsity
[Figure: pixels → large wavelet coefficients; wideband signal samples → large Gabor (time-frequency) coefficients]
Many natural signals are sparse in a suitable basis. This can be exploited for learning / regularization / compressive sensing...
Sparse reconstruction
53
• explain y with few columns of M: few nonzero x_i
• discrete regularization on the support S of x; relax to the convex envelope
• in nature: the sparsity pattern is often not random…
[Figure: subset selection, S = {1,3,4,7}]
Structured sparsity
54
[Figure: signal x; wavelet-tree of coefficients m1,…,m7; support S]
Incorporate tree preference in regularizer?
Set function: F(T) < F(S) if T is a tree and S is not, with |S| = |T|
56
Structured sparsity
Incorporate tree preference in regularizer?
Set function: F(T) < F(S) if T is a tree and S is not, with |S| = |T|
The function F is … submodular!
Sparsity
57
• explain y with few columns of M: few nonzero x_i
• prior knowledge: patterns of nonzeros → submodular function F
• discrete regularization on the support S of x; relax to the convex envelope: the Lovász extension
• Optimization: submodular minimization [Bach`10]
Further connections: Dictionary Selection
58
Where does the dictionary M come from?
Want to learn it from data:
Selecting a dictionary with near-maximal variance reduction: maximization of an approximately submodular function
[Krause & Cevher ‘10; Das & Kempe ’11]
60
Example: MAP inference
Recall the equivalence: function on binary vectors ↔ set function
[Figure: grid of variables a, b, c, d with labeling x = (1,1,0,0), i.e. A = {a,b}]
If the energy is submodular (attractive potentials), then MAP inference = submodular minimization: polynomial time!
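For a tiny attractive model the claim can be checked exhaustively (the potential values are invented); brute-forcing all labelings is exactly minimizing the corresponding set function:

```python
from itertools import product

# Attractive pairwise energy on binary variables a, b, c (values invented):
# E(x) = sum_i u_i(x_i) + sum_{ij} w_ij * [x_i != x_j], with w_ij >= 0,
# so each pairwise term, viewed as a set function, is submodular.
unary = {"a": (1.0, 0.0), "b": (0.8, 0.2), "c": (0.2, 0.55)}
pair = {("a", "b"): 1.0, ("b", "c"): 0.5}

def energy(x):
    e = sum(unary[i][x[i]] for i in x)
    e += sum(w * (x[i] != x[j]) for (i, j), w in pair.items())
    return e

# MAP = minimize E over labelings = minimize the set function F(A) = E(1_A):
Vars = ["a", "b", "c"]
best = min(product([0, 1], repeat=len(Vars)),
           key=lambda bits: energy(dict(zip(Vars, bits))))
A_star = {v for v, bit in zip(Vars, best) if bit}
print(A_star, energy(dict(zip(Vars, best))))
```

On larger grids this enumeration is replaced by the min-cut construction from the next slides.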
Special cases
Minimizing general submodular functions: poly-time, but not very scalable
Special structure → faster algorithms:
• symmetric functions
• graph cuts
• concave functions
• sums of functions with bounded support
• ...
61
62
[Figure: binary labelings as cuts]
MAP inference
If each potential is submodular, then the energy is a graph cut function, and
MAP inference = minimum cut: fast!
Pairwise is not enough…
63
Enforcing label consistency [Kohli et al.`09]
color + pairwise: pixels in one tile should have the same label
color + pairwise + higher-order term: pixels in a superpixel should have the same label
[Figure: segmentation results]
64
E(x): a concave function of the cardinality is submodular
But with more than 2 arguments: graph cut??
Works well for some particular functions [Billionet & Minoux `85, Freedman & Drineas `05, Živný & Jeavons `10, ...]
However, the necessary conditions are complex, and not all submodular functions equal such graph cuts [Živný et al. `09]
Higher-order functions as graph cuts?
65
General strategy: reduce to the pairwise case by adding auxiliary variables
66
Not all submodular functions can be optimized as graph cuts; even if they can, possibly many extra nodes in the graph.
Other options?
• minimum norm algorithm
• other special cases, e.g. parametric maxflow [Fujishige & Iwata`99]
• Approximate! Every submodular function can be approximated by a series of graph cut functions [Jegelka, Lin & Bilmes `11]
Fast approximate minimization
speech corpus selection [Lin&Bilmes `11]
67
Decompose:
• represent as much as possible exactly by a graph
• rest: approximate iteratively by changing edge weights
→ solve a series of cut problems
Other special cases
• Symmetric: Queyranne's algorithm, O(n^3) [Queyranne, 1998]
• Concave of modular [Stobbe & Krause `10, Kohli et al. `09]
• Sum of submodular functions, each with bounded support [Kolmogorov `12]
68
Submodular minimization
69
Optimization
• Minimization: unconstrained / constrained
• Maximization
Learning
Online/adaptive optimization
70
Submodular minimization
Unconstrained: nontrivial algorithms, but polynomial time
Constrained: limited cases doable, e.g. odd/even cardinality, inclusion/exclusion of a given set
General case: NP hard
• hard to approximate within polynomial factors!
• but: special cases often still work well
[Lower bounds: Goel et al.`09, Iwata & Nagano `09, Jegelka & Bilmes `11]
Special case: balanced cut
72
Recall: MAP and cuts
binary labeling: x ∈ {0,1}^n
pairwise random field: E(x) = Σ_i E_i(x_i) + Σ_{ij} E_ij(x_i, x_j)
What's the problem?
minimum cut prefers a short cut = a short object boundary
[Figure: aim vs. reality]
73
MAP and cuts
Minimum cut: minimize a sum of edge weights; implicit criterion: short cut = short boundary
Minimum cooperative cut: minimize a submodular function of the cut edges (not a sum of edge weights!); new criterion: the boundary may be long if it is homogeneous
Reward co-occurrence of edges
74
sum of weights: use few edges
submodular cost function: use few groups S_i of edges
[Figure: 25 edges, 1 type vs. 7 edges, 4 types]
Constrained optimization
77
Constraints: cut, matching, path, spanning tree, …
Approaches: convex relaxation / minimize a surrogate function → approximate optimization
[Goel et al.`09, Iwata & Nagano `09, Goemans et al. `09, Jegelka & Bilmes `11, Iyer et al. ICML `13, Kohli et al `13, ...]
Approximation bounds depend on F: polynomial, constant, FPTAS
Efficient constrained optimization
78
cut, matching, path, spanning tree
Minimize a series of surrogate functions [Jegelka & Bilmes `11, Iyer et al. ICML `13]:
1. compute a linear upper bound
2. solve the easy sum-of-weights problem, and repeat
• efficient
• only need to solve sum-of-weights problems
• unifying viewpoint of submodular min and max
see Wednesday's best student paper talk
79
Submodular min in practice
Does a special algorithm apply? (symmetric function? graph cut? … approximately?)
Continuous methods (convexity): minimum norm point algorithm
Combinatorial algorithms: relatively high complexity
Constraints: hard → majorize-minimize or relaxation
Other techniques [not addressed here]: LP, column generation, …