
Page 1: Noise Robustness of Optimization Algorithms via Statistical Queries

Cristóbal Guzmán ([email protected])

CWI Scientific Meeting, 27-11-2015

Page 5: Motivation

• In algorithms: Faster is better, right?

• Sometimes, other objectives are as important

• This time, we care about robustness

Page 10: Example: Gradient Methods

Goal: Find $f^* = \min_{x \in X} f(x)$, where $f : X \to \mathbb{R}$ is a smooth convex function and $X \subseteq \mathbb{R}^d$ is a compact convex set.

• Gradient Descent [Euler:1770]

$$x_{t+1} = x_t - \gamma_t \nabla f(x_t) \;\Rightarrow\; f(x_T) - f^* = O(1/T)$$

• Accelerated Gradient Method [Nesterov:1983]

$$x_t = y_t - \gamma_t \nabla f(y_t), \qquad y_{t+1} = x_t + \frac{\alpha_t - 1}{\alpha_{t+1}}\,(x_t - x_{t-1}) \;\Rightarrow\; f(x_T) - f^* = O(1/T^2)$$
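To make the two rates concrete, here is a minimal sketch (a toy setup of my own, not from the slides): both methods run on a random smooth convex quadratic with step size $1/L$, where $L$ is the smoothness constant.

```python
import numpy as np

# Minimal sketch (toy setup, not from the slides): gradient descent vs.
# Nesterov's accelerated method on f(x) = 0.5 x^T A x - b^T x, step 1/L.
d = 50
rng = np.random.default_rng(0)
Q = rng.standard_normal((d, d))
A = Q.T @ Q / d + 0.01 * np.eye(d)       # positive definite: smooth and convex
b = rng.standard_normal(d)
L = float(np.linalg.eigvalsh(A).max())   # smoothness constant (top eigenvalue)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
f_star = f(np.linalg.solve(A, b))        # optimal value f*

def gradient_descent(T):
    x = np.zeros(d)
    for _ in range(T):
        x = x - (1 / L) * grad(x)        # x_{t+1} = x_t - gamma * grad f(x_t)
    return f(x) - f_star                 # decays like O(1/T)

def accelerated(T):
    x_prev = x = np.zeros(d)
    alpha = 1.0
    for _ in range(T):
        alpha_next = (1 + np.sqrt(1 + 4 * alpha ** 2)) / 2
        y = x + (alpha - 1) / alpha_next * (x - x_prev)  # momentum step
        x_prev, x = x, y - (1 / L) * grad(y)             # gradient step at y
        alpha = alpha_next
    return f(x) - f_star                 # decays like O(1/T^2)

for T in (50, 200, 800):
    print(T, gradient_descent(T), accelerated(T))
```

Doubling $T$ should roughly halve the gradient-descent gap and quarter the accelerated one, matching the two rates above.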

Page 11: Example: Gradient Methods

[Figure: convergence with an exact gradient oracle, $g(x) = \nabla f(x)$. Source: Moritz Hardt's blog]

Page 12: Example: Gradient Methods

[Figure: convergence with a noisy gradient oracle, $g(x) = \nabla f(x) + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 0.1$. Source: Moritz Hardt's blog]
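A hedged sketch of the kind of experiment the two plots contrast (assumptions: toy diagonal quadratic and fixed steps are mine; only the noise model $\xi \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1$ is from the slide):

```python
import numpy as np

# Run gradient descent and the accelerated method with the noisy oracle
# g(x) = grad f(x) + xi, xi ~ N(0, sigma^2 I), sigma = 0.1 (from the slide).
# Objective f(x) = 0.5 * sum_i diag_i * x_i^2 is a toy choice; f* = 0 at x = 0.
rng = np.random.default_rng(1)
d, sigma, T = 20, 0.1, 500
diag = np.linspace(0.1, 1.0, d)
f = lambda x: 0.5 * np.sum(diag * x * x)
gamma = 1.0 / diag.max()                 # step 1/L

def run(use_momentum):
    x_prev = x = np.full(d, 5.0)
    alpha = 1.0
    for _ in range(T):
        alpha_next = (1 + np.sqrt(1 + 4 * alpha ** 2)) / 2
        beta = (alpha - 1) / alpha_next if use_momentum else 0.0
        y = x + beta * (x - x_prev)                    # momentum (zero for GD)
        g = diag * y + sigma * rng.standard_normal(d)  # noisy gradient oracle
        x_prev, x = x, y - gamma * g
        alpha = alpha_next
    return f(x)

print("gradient descent, noisy oracle:", run(False))
print("accelerated,      noisy oracle:", run(True))
```

On typical runs the accelerated method settles at a higher error level: momentum feeds past oracle errors back into the iterates, which is the robustness concern the deck is raising.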

Page 16: Why Noise Robustness?

• There are other important reasons why acceleration could be harmful

• Overfitting, privacy leakage, etc.

• Goal: A theory of optimization algorithms that incorporates noise sensitivity

[Figure: trade-offs among accuracy $\varepsilon$, iterations $T$, and data]

Page 21: Statistical Queries

• Unknown distribution $D$ supported on $W$

• Algorithm has oracle access to $D$: given a query $\phi : W \to [-1, 1]$, the oracle returns $\mathbb{E}_{w \sim D}[\phi(w)] \pm \tau$

• An oracle query can be emulated by $1/\tau^2$ samples

• Goals

  • Efficiency ~ small number of queries

  • Noise tolerance ~ as large a $\tau$ as possible
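A small sketch of the emulation fact (the constant and the example distribution are illustrative; a rigorous bound adds a $\log(1/\delta)$ factor for confidence $1 - \delta$):

```python
import numpy as np

# Emulate one statistical query phi : W -> [-1, 1] at tolerance tau using
# on the order of 1/tau^2 i.i.d. samples from D; since |phi| <= 1, the
# empirical mean concentrates within tau with high probability.
def sq_oracle(sample_D, phi, tau, rng):
    n = int(np.ceil(4.0 / tau ** 2))            # ~1/tau^2 samples
    return float(np.mean(phi(sample_D(n, rng))))

# Hypothetical example: D = N(0.3, 1) on W = R, query phi(w) = clip(w, -1, 1).
rng = np.random.default_rng(2)
sample_D = lambda n, r: r.normal(0.3, 1.0, size=n)
phi = lambda w: np.clip(w, -1.0, 1.0)
print(sq_oracle(sample_D, phi, tau=0.05, rng=rng))  # E[phi(w)] +/- 0.05 w.h.p.
```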

Page 23: Why Statistical Queries?

SQ algorithms have additional properties, or provide the basis for designing algorithms with additional features:

• Noise tolerance [Kearns:1994]

• Differential privacy [Blum, Dwork, McSherry, Nissim:2005], [Kasiviswanathan, Lee, Nissim, Raskhodnikova, Smith:2011]

• Distributed computation [Balcan, Blum, Fine, Mansour:2012]

• Generalization in adaptive data analysis [Dwork, Feldman, Hardt, Pitassi, Reingold, Roth:2014]

Page 27: Statistical Query Convex Opt.

• Stochastic convex optimization: $\min_{x \in X} \mathbb{E}_{w \sim D}[f(x, w)]$

• Statistical queries

  • Function value: $\phi(\cdot) = f(x, \cdot)$

  • Gradient: $\phi(\cdot) = \dfrac{\partial f(x, \cdot)}{\partial x_i}$, for $i = 1, \ldots, d$

• Difficulty: estimation in 2-norm accumulates errors

$$g_i = \mathbb{E}_{w \sim D}\!\left[\frac{\partial f(x, w)}{\partial x_i}\right] \pm \tau \;\Rightarrow\; \|\nabla \mathbb{E}_w[f(x, w)] - g\|_2 \approx \sqrt{d}\,\tau$$
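An illustrative check of the accumulation (a numeric demonstration, not the paper's algorithm): even when every coordinate estimate $g_i$ is within tolerance $\tau$, the 2-norm error of the full gradient estimate can reach $\sqrt{d}\,\tau$.

```python
import numpy as np

# Take the true gradient to be 0 for simplicity; then any per-coordinate
# error within [-tau, tau] is a valid oracle answer, and the worst case
# puts the 2-norm error at exactly sqrt(d) * tau.
d, tau = 10_000, 0.01
rng = np.random.default_rng(3)
worst_case = np.full(d, tau)                 # every coordinate off by +tau
typical = tau * rng.uniform(-1, 1, size=d)   # arbitrary in-tolerance errors
print(np.linalg.norm(worst_case), np.sqrt(d) * tau)  # both equal 1.0
print(np.linalg.norm(typical))               # still Theta(sqrt(d) * tau)
```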

Page 31: Optimal Estimation Algorithm [Feldman, G., Vempala:2015]

Q: Is it possible to avoid noise accumulation in gradient estimation?

[Figure: algorithm querying the statistical query oracle for $D$ over $W$]

With high probability, $\|\nabla \mathbb{E}_w[f(x, w)] - g\|_2 \approx O(\sqrt{\log d})\,\tau$

Page 34: Further Consequences [Feldman, G., Vempala:2015]

New statistical query algorithms for optimization and machine learning:

• Gradient methods: mirror descent, accelerated method, strongly convex minimization

• Polynomial-time algorithms: center of gravity, interior-point method

• Improved algorithms for high-dimensional classification (Perceptron) and regression

• Improved differentially private convex optimization algorithms

We provide new structural lower bounds for convex optimization.

Page 35: Thank you