Submodularity in Machine Learning
Andreas Krause, Stefanie Jegelka
4
MAP inference
How do we find the MAP labeling in discrete graphical models efficiently?
[Figure: image segmented into regions labeled sky, treehouse, grass]
5
What’s common?
Formalization:
Optimize a set function F(S) under constraints
generally very hard
but: structure helps! … if F is submodular, we can …
solve optimization problems with strong guarantees
solve some learning problems
7
Outline
What is submodularity?
Optimization
• Minimization: new algorithms, constraints
• Maximization: new algorithms (unconstrained)
Learning
• Learning for Optimization: new settings
Part I / Part II
… and many new applications, many new results!
10
Set functions
Finite ground set V = {1, 2, …, n}; set function F: 2^V → R
We will assume F(∅) = 0 (w.l.o.g.)
We assume a black box that can evaluate F(A) for any A ⊆ V
11
Example: placing sensors
Utility F(A) of having sensors at a subset A of all locations:
A = {1,2,3} (sensors X1, X2, X3): very informative → high value F(A)
A = {1,4,5} (sensors X1, X4, X5): redundant info → low value F(A)
13
Decreasing gains: submodularity
placement A = {1,2}: adding a new sensor s helps a lot → big gain
placement B = {1,…,5}: adding s doesn't help much → small gain
F(A ∪ {s}) − F(A)  ≥  F(B ∪ {s}) − F(B)   for A ⊆ B
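For small ground sets, the decreasing-gains condition can be checked by brute force; a minimal Python sketch (the helper name and the toy coverage model are my own, not from the tutorial):

```python
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check of decreasing gains:
    F(A | {s}) - F(A) >= F(B | {s}) - F(B) for all A <= B, s not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for s in set(V) - B:
                gain_A = F(A | {s}) - F(A)
                gain_B = F(B | {s}) - F(B)
                if gain_A < gain_B - 1e-9:
                    return False
    return True

# Toy sensor coverage: each location covers a set of points.
cover = {1: {"p1", "p2"}, 2: {"p2", "p3"}, 3: {"p3", "p4"}}
F = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
print(is_submodular(F, {1, 2, 3}))  # True
```

A supermodular function such as |A|² fails the same check, since its gains grow with the set.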
Example: Set cover
16
A node predicts values of positions within some radius.
[Figure: floorplan with possible sensor locations]
Goal: place sensors in the building to cover the floorplan with discs.
F(A) = “area covered by sensors placed at A”
Formally: finite set W, collection of n subsets S1, …, Sn ⊆ W; for A ⊆ V = {1, …, n} define F(A) = |∪_{i∈A} Si|
Set cover is submodular
17
[Figure: two covers of the floorplan]
A = {s1, s2};  B = {s1, s2, s3, s4};  new disc s′
F(A ∪ {s′}) − F(A)  ≥  F(B ∪ {s′}) − F(B)
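The inequality on this slide can be replayed numerically with a toy coverage model (the regions below are invented for illustration):

```python
# Each sensor location covers a set of floor cells (invented regions);
# "sp" plays the role of the new disc s'.
regions = {
    "s1": {1, 2, 3}, "s2": {3, 4, 5},
    "s3": {5, 6},    "s4": {6, 7},
    "sp": {2, 3, 6},
}

def area(A):
    """F(A) = |area covered by sensors placed at A|."""
    covered = set()
    for s in A:
        covered |= regions[s]
    return len(covered)

A = {"s1", "s2"}
B = {"s1", "s2", "s3", "s4"}
gain_A = area(A | {"sp"}) - area(A)
gain_B = area(B | {"sp"}) - area(B)
print(gain_A, gain_B)  # 1 0
```

The new disc adds a fresh cell to the small placement A but is fully redundant given B, so the gain over A dominates the gain over B.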
18
Example: Sensor placement
A more complex model for sensing:
Joint probability distribution P(X1,…,Xn, Y1,…,Yn) = P(Y1,…,Yn) · P(X1,…,Xn | Y1,…,Yn)   (prior · likelihood)
Ys: temperature at location s;  Xs: sensor value at location s;  Xs = Ys + noise
[Figure: graphical model over hidden temperatures Y1,…,Y6 and observations X1,…,X6]
Utility of having sensors at subset A of all locations: F(A)
19
A = {1,2,3}: high value F(A)
A = {1,4,5}: low value F(A)
F(A) = uncertainty about temperature Y before sensing − uncertainty about temperature Y after sensing
Submodularity of Information Gain
Y1,…,Ym, X1,…,Xn discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
F(A) is NOT always submodular
But if the Xi are all conditionally independent given Y, then F(A) is submodular! [Krause & Guestrin `05]
20
[Figure: Y1, Y2, Y3 with conditionally independent observations X1,…,X4]
Proof idea: “information never hurts”, i.e. H(Y | X_A) ≥ H(Y | X_B) for A ⊆ B.
Example: costs
21
breakfast??
cost: time to reach shop + price of items
[Figure: Market 1, Market 2, Market 3 at travel times t1, t2, t3; each item 1 $]
ground set
Example: costs
22
breakfast??
cost: time to shop + price of items
F( ) = cost( ) + cost( , )
submodular?
= t1 + 1 + t2 + 2
= #shops + #items
[Figure: Market 1, Market 2, Market 3]
Shared fixed costs
23
[Figure: shopping sets A ⊆ B]
marginal cost: #new shops + #new items
• shops: shared fixed cost
• economies of scale
decreasing marginal cost: the cost function is submodular!
24
Another example: Cut functions
[Figure: graph on V = {a,b,c,d,e,f,g,h} with edge weights 1, 2, 3]
Cut function: F(S) = total weight of edges between S and V \ S
The cut function is submodular!
Why are cut functions submodular?
25
Consider a single edge (a, b) with weight w:

S      F_ab(S)
{}     0
{a}    w
{b}    w
{a,b}  0

F_ab is submodular if w ≥ 0: F_ab({a}) + F_ab({b}) ≥ F_ab({}) + F_ab({a,b}).
The full cut function is a sum of the cut functions of the subgraphs {i, j}, so it is submodular (for w ≥ 0!).
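The edge-by-edge argument extends to whole graphs; a brute-force check on a small invented graph:

```python
from itertools import combinations

# Undirected weighted graph (invented): edge -> weight, all weights >= 0.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): 3.0, ("c", "d"): 2.0}
V = {"a", "b", "c", "d"}

def cut(S):
    """F(S) = total weight of edges with exactly one endpoint in S."""
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

# Decreasing gains: F(A+s) - F(A) >= F(B+s) - F(B) for A <= B, s not in B.
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
ok = all(
    cut(A | {s}) - cut(A) >= cut(B | {s}) - cut(B) - 1e-9
    for A in subsets for B in subsets if A <= B
    for s in V - B
)
print(ok)  # True
```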
Closedness properties
F1, …, Fm submodular functions on V and λ1, …, λm > 0
Then: F(A) = Σi λi Fi(A) is submodular
Submodularity is closed under nonnegative linear combinations!
Extremely useful fact: Fθ(A) submodular for each θ ⇒ Σθ P(θ) Fθ(A) submodular!
Multicriterion optimization. A basic proof technique!
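A quick brute-force confirmation of closedness under nonnegative combinations (the functions and weights below are invented):

```python
from itertools import combinations

V = {1, 2, 3}
subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def submodular(F):
    """Check decreasing gains over all A <= B and s outside B."""
    return all(
        F(A | {s}) - F(A) >= F(B | {s}) - F(B) - 1e-9
        for A in subsets for B in subsets if A <= B for s in V - B
    )

cover = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z"}}
F1 = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
F2 = lambda A: min(len(A), 2)        # truncation: concave of |A|, submodular

# A nonnegative combination stays submodular:
G = lambda A: 0.5 * F1(A) + 3.0 * F2(A)
print(submodular(F1), submodular(F2), submodular(G))  # True True True
```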
26
Other closedness properties
Restriction: F(S) submodular on V, W ⊆ V. Then F′(S) = F(S ∩ W) is submodular.
Conditioning: F(S) submodular on V, W ⊆ V. Then F′(S) = F(S ∪ W) is submodular.
Reflection: F(S) submodular on V. Then F′(S) = F(V \ S) is submodular.
29
34
Maximum of submodular functions
F1, F2 submodular. What about F(A) = max(F1(A), F2(A))?
[Figure: F1(A), F2(A), and their pointwise maximum plotted against |A|]
max(F1,F2) not submodular in general!
Minimum of submodular functions
Well, maybe F(A) = min(F1(A),F2(A)) instead?
35
A      F1(A)  F2(A)  F(A)
{}     0      0      0
{a}    1      0      0
{b}    0      1      0
{a,b}  1      1      1

F({b}) - F({}) = 0  <  1 = F({a,b}) - F({a})
min(F1,F2) not submodular in general!
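The counterexample in the table can be checked mechanically (a minimal sketch):

```python
# F1 counts whether a is present, F2 whether b is present (both modular,
# hence submodular); their pointwise min is the table from the slide.
F1 = lambda A: int("a" in A)
F2 = lambda A: int("b" in A)
F = lambda A: min(F1(A), F2(A))

# Decreasing gains fails when adding b:
gain_empty = F({"b"}) - F(set())          # 0
gain_super = F({"a", "b"}) - F({"a"})     # 1
print(gain_empty, gain_super)  # 0 1 -> violates gain_empty >= gain_super
```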
What to do with submodular functions
37
Optimization
Minimization
Maximization
Learning
Online/adaptive optimization
Minimization and maximization not the same??
39
Submodular minimization
structured sparsity / regularization
clustering
MAP inference ↔ minimum cut
[Figure: s-t cut]
Set functions and energy functions
Any set function F: 2^V → R with F(∅) = 0 … is a function on binary vectors: identify A ⊆ V with its indicator vector x_A ∈ {0,1}^V (e.g. V = {a,b,c,d}, A = {a,b} ↦ x_A = (1,1,0,0)), and set f(x_A) = F(A).
41
f is a pseudo-boolean function
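The set ↔ binary-vector correspondence is mechanical to code up; a minimal sketch:

```python
V = ["a", "b", "c", "d"]   # fixed order of the ground set

def to_vector(A):
    """Indicator vector x_A of a subset A."""
    return [1 if v in A else 0 for v in V]

def to_set(x):
    return {v for v, bit in zip(V, x) if bit}

A = {"a", "b"}
x = to_vector(A)
print(x)                # [1, 1, 0, 0]
print(to_set(x) == A)   # True

# A set function F induces a pseudo-boolean function f(x) = F(A_x):
F = lambda A: len(A)               # example set function
f = lambda x: F(to_set(x))
print(f([1, 1, 0, 0]))  # 2
```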
42
Submodularity and convexity
The Lovász extension f of F is convex if and only if F is submodular [Lovász, 1982].
A minimum of f yields a minimum of F, so submodular minimization can be cast as convex minimization: polynomial time! [Grötschel, Lovász, Schrijver 1981]
44
The submodular polyhedron P_F
P_F = { x ∈ R^V : x(A) ≤ F(A) for all A ⊆ V }, where x(A) = Σ_{i∈A} x_i
Example: V = {a,b}

A      F(A)
{}     0
{a}    -1
{b}    2
{a,b}  0

Constraints: x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})
[Figure: P_F in the (x_{a}, x_{b}) plane]
Evaluating the Lovász extension
45
[Figure: linear maximization over P_F in direction x, maximizer y*]
f(x) = max_{y ∈ P_F} ⟨x, y⟩: linear maximization over P_F
Exponentially many constraints, yet computable in O(n log n) time [Edmonds `70]
• also yields a subgradient
• and a separation oracle
Greedy algorithm:
• sort x: x_{π(1)} ≥ … ≥ x_{π(n)}
• the order defines sets S_k = {π(1), …, π(k)}
• set y*_{π(k)} = F(S_k) − F(S_{k−1})
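The greedy evaluation takes only a few lines; the example values of F below are the ones from the running example (F({}) = 0, F({a}) = -1, F({b}) = 2, F({a,b}) = 0):

```python
def lovasz_extension(x, F, V):
    """Evaluate f(x) = max_{y in P_F} <x, y> via Edmonds' greedy algorithm:
    sort the coordinates of x in decreasing order; along that order,
    y*_{pi(k)} = F(S_k) - F(S_{k-1})."""
    order = sorted(V, key=lambda i: -x[i])
    y, prev, S = {}, 0.0, []
    for i in order:
        S.append(i)
        val = F(frozenset(S))
        y[i] = val - prev
        prev = val
    return sum(x[i] * y[i] for i in V), y

# The running example:
table = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda S: table[S]

# On indicator vectors the extension agrees with F:
f_val, y_star = lovasz_extension({"a": 1.0, "b": 0.0}, F, ["a", "b"])
print(f_val)  # -1.0 (= F({a}))
```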
Submodular minimization
Combinatorial algorithms
Fulkerson prize: Iwata, Fujishige, Fleischer `01 & Schrijver `00
State of the art: O(n^4 T + n^5 log M) [Iwata `03], O(n^6 + n^5 T) [Orlin `09], where T = time for evaluating F
Minimizing the convex extension
• ellipsoid algorithm [Grötschel et al. `81]
• subgradient method, smoothing [Stobbe & Krause `10]
Duality: minimum norm point algorithm [Fujishige & Isotani `11]
47
48
[Figure: base polytope B_F of the example, the segment u({a,b}) = F({a,b}), and the minimum norm point u*]
Regularized problem: minimize f(x) + (1/2)‖x‖², where f is the Lovász extension; thresholding the solution at 0 minimizes F [Fujishige `91, Fujishige & Isotani `11]
Base polytope: B_F = P_F ∩ { u : u(V) = F(V) }
Example: V = {a,b}

A      F(A)
{}     0
{a}    -1
{b}    2
{a,b}  0
The minimum-norm-point algorithm
49
Dual: minimum norm problem
1. find u* = argmin_{u ∈ B_F} ‖u‖²
2. return S* = {i : u*_i < 0}
Can we solve this? Yes! Recall: we can solve linear optimization over P_F; similarly over B_F. So u* can be found (Frank-Wolfe algorithm).
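A minimal Frank-Wolfe sketch for this dual (the step-size schedule and iteration count are arbitrary choices of mine); on the two-element running example it recovers the minimizer {a}:

```python
def greedy_vertex(w, F, V):
    """argmax_{s in B_F} <w, s>: sort V by decreasing w, take marginals."""
    order = sorted(V, key=lambda i: -w[i])
    s, prev, S = {}, 0.0, []
    for i in order:
        S.append(i)
        val = F(frozenset(S))
        s[i] = val - prev
        prev = val
    return s

def min_norm_point(F, V, iters=500):
    """Frank-Wolfe for u* = argmin_{u in B_F} ||u||^2;
    the minimizer of F is S* = {i : u*_i < 0}."""
    u = greedy_vertex({i: 0.0 for i in V}, F, V)   # any vertex of B_F
    for t in range(1, iters + 1):
        # gradient of ||u||^2 is 2u; LP step: maximize <-u, s> over B_F
        s = greedy_vertex({i: -u[i] for i in V}, F, V)
        gamma = 2.0 / (t + 2.0)
        u = {i: (1 - gamma) * u[i] + gamma * s[i] for i in V}
    return {i for i in V if u[i] < 0}, u

table = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda S: table[S]
S_star, u = min_norm_point(F, ["a", "b"])
print(S_star)  # {'a'}  (F({a}) = -1 is the minimum)
```

Here the min-norm point is u* = (-1, 1), and thresholding at 0 returns {a}. In practice Wolfe's algorithm with exact line search is used instead of this plain Frank-Wolfe loop.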
The minimum-norm-point algorithm
[Figure: minimum norm point u* on the base polytope B_F of the example] [Fujishige `91, Fujishige & Isotani `11]
50
Empirical comparison
Minimum norm point algorithm: usually orders of magnitude faster
Cut functions from the DIMACS Challenge
[Figure: running time in seconds vs. problem size, both log-scale, sizes 64-1024: minimum norm point algorithm vs. combinatorial algorithms] [Fujishige & Isotani `11]
52
Example I: Sparsity
[Figure: pixels → large wavelet coefficients; wideband signal samples → large Gabor (time-frequency) coefficients]
Many natural signals are sparse in a suitable basis. This can be exploited for learning / regularization / compressive sensing...
Sparse reconstruction
53
• explain y with few columns of M: few nonzero x_i
• discrete regularization on the support S of x; relax to the convex envelope
• in nature: the sparsity pattern is often not random…
[Figure: subset selection, S = {1,3,4,7}]
Structured sparsity
54
[Figure: signal x; wavelet-tree of coefficients m1,…,m7; support S]
Incorporate tree preference in regularizer?
Set function: F(T) < F(S) if T is a tree and S is not, with |S| = |T|
56
Structured sparsity
Incorporate tree preference in regularizer?
Set function: F(T) < F(S) if T is a tree and S is not, with |S| = |T|
The function F is … submodular!
Sparsity
57
• explain y with few columns of M: few nonzero x_i
• prior knowledge: patterns of nonzeros → submodular function F
• discrete regularization on the support S of x; relax to the convex envelope: the Lovász extension
• Optimization: submodular minimization [Bach`10]
Further connections: Dictionary Selection
58
Where does the dictionary M come from?
Want to learn it from data:
Selecting a dictionary with near-maximal variance reduction: maximization of an approximately submodular function
[Krause & Cevher ‘10; Das & Kempe ’11]
60
Example: MAP inference
Recall the equivalence: function on binary vectors ↔ set function
[Figure: grid of variables a, b, c, d with labeling x = (1,1,0,0), i.e. A = {a,b}]
If the energy is submodular (attractive potentials), then MAP inference = submodular minimization: polynomial time!
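For a tiny attractive model the claim can be checked exhaustively (the potential values are invented); brute-forcing all labelings is exactly minimizing the corresponding set function:

```python
from itertools import product

# Attractive pairwise energy on binary variables a, b, c (values invented):
# E(x) = sum_i u_i(x_i) + sum_{ij} w_ij * [x_i != x_j], with w_ij >= 0,
# so each pairwise term, viewed as a set function, is submodular.
unary = {"a": (1.0, 0.0), "b": (0.8, 0.2), "c": (0.2, 0.55)}
pair = {("a", "b"): 1.0, ("b", "c"): 0.5}

def energy(x):
    e = sum(unary[i][x[i]] for i in x)
    e += sum(w * (x[i] != x[j]) for (i, j), w in pair.items())
    return e

# MAP = minimize E over labelings = minimize the set function F(A) = E(1_A):
Vars = ["a", "b", "c"]
best = min(product([0, 1], repeat=len(Vars)),
           key=lambda bits: energy(dict(zip(Vars, bits))))
A_star = {v for v, bit in zip(Vars, best) if bit}
print(A_star, energy(dict(zip(Vars, best))))
```

On larger grids this enumeration is replaced by the min-cut construction from the next slides.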
Special cases
Minimizing general submodular functions: poly-time, but not very scalable
Special structure → faster algorithms:
• symmetric functions
• graph cuts
• concave functions
• sums of functions with bounded support
• ...
61
62
[Figure: binary labelings as cuts]
MAP inference
If each potential is submodular, then the energy is a graph cut function, and
MAP inference = minimum cut: fast!
Pairwise is not enough…
63
Enforcing label consistency [Kohli et al.`09]
color + pairwise: pixels in one tile should have the same label
color + pairwise + higher-order term: pixels in a superpixel should have the same label
[Figure: segmentation results]
64
E(x): a concave function of the cardinality is submodular
But with more than 2 arguments: graph cut??
Works well for some particular functions [Billionet & Minoux `85, Freedman & Drineas `05, Živný & Jeavons `10, ...]
However, the necessary conditions are complex, and not all submodular functions equal such graph cuts [Živný et al. `09]
Higher-order functions as graph cuts?
65
General strategy: reduce to the pairwise case by adding auxiliary variables
66
Not all submodular functions can be optimized as graph cuts; even if they can, possibly many extra nodes in the graph.
Other options?
• minimum norm algorithm
• other special cases, e.g. parametric maxflow [Fujishige & Iwata`99]
• Approximate! Every submodular function can be approximated by a series of graph cut functions [Jegelka, Lin & Bilmes `11]
Fast approximate minimization
speech corpus selection [Lin&Bilmes `11]
67
Decompose:
• represent as much as possible exactly by a graph
• rest: approximate iteratively by changing edge weights
→ solve a series of cut problems
Other special cases
• Symmetric: Queyranne's algorithm, O(n^3) [Queyranne, 1998]
• Concave of modular [Stobbe & Krause `10, Kohli et al. `09]
• Sum of submodular functions, each with bounded support [Kolmogorov `12]
68
Submodular minimization
69
Optimization
• Minimization: unconstrained / constrained
• Maximization
Learning
Online/adaptive optimization
70
Submodular minimization
Unconstrained: nontrivial algorithms, but polynomial time
Constrained: limited cases doable, e.g. odd/even cardinality, inclusion/exclusion of a given set
General case: NP hard
• hard to approximate within polynomial factors!
• but: special cases often still work well
[Lower bounds: Goel et al.`09, Iwata & Nagano `09, Jegelka & Bilmes `11]
Special case: balanced cut
72
Recall: MAP and cuts
binary labeling: x ∈ {0,1}^n
pairwise random field: E(x) = Σ_i E_i(x_i) + Σ_{ij} E_ij(x_i, x_j)
What's the problem?
minimum cut prefers a short cut = a short object boundary
[Figure: aim vs. reality]
73
MAP and cuts
Minimum cut: minimize a sum of edge weights; implicit criterion: short cut = short boundary
Minimum cooperative cut: minimize a submodular function of the cut edges (not a sum of edge weights!); new criterion: the boundary may be long if it is homogeneous
Reward co-occurrence of edges
74
sum of weights: use few edges
submodular cost function: use few groups S_i of edges
[Figure: 25 edges, 1 type vs. 7 edges, 4 types]
Constrained optimization
77
Constraints: cut, matching, path, spanning tree, …
Approaches: convex relaxation / minimize a surrogate function → approximate optimization
[Goel et al.`09, Iwata & Nagano `09, Goemans et al. `09, Jegelka & Bilmes `11, Iyer et al. ICML `13, Kohli et al `13, ...]
Approximation bounds depend on F: polynomial, constant, FPTAS
Efficient constrained optimization
78
cut, matching, path, spanning tree
Minimize a series of surrogate functions [Jegelka & Bilmes `11, Iyer et al. ICML `13]:
1. compute a linear upper bound
2. solve the easy sum-of-weights problem, and repeat
• efficient
• only need to solve sum-of-weights problems
• unifying viewpoint of submodular min and max
see Wednesday's best student paper talk
79
Submodular min in practice
Does a special algorithm apply? (symmetric function? graph cut? … approximately?)
Continuous methods (convexity): minimum norm point algorithm
Combinatorial algorithms: relatively high complexity
Constraints: hard → majorize-minimize or relaxation
Other techniques [not addressed here]: LP, column generation, …