

Page 1: Structure is all you need - learning compressed RL policies via gradient sensing

Google Brain Robotics: Krzysztof Choromanski, Vikas Sindhwani

University of Cambridge and Alan Turing Institute: Mark Rowland, Richard E. Turner, Adrian Weller

Page 2: Gradient sensing - a blackbox approach to RL


Page 5: Gradient sensing - a blackbox approach to RL

[Figure: finite difference gradient sensing; gradient sensing via randomly rotated basis]

Page 6: Gradient sensing - a blackbox approach to RL

[Figure: finite difference gradient sensing; gradient sensing via randomly rotated basis; gradient sensing via independent Gaussian vectors (Salimans et al. 2017)]

Page 7: Gradient sensing - a blackbox approach to RL

Towards smooth relaxations: pick a family of probability distributions on the parameter space and average the objective over it - Gaussian smoothings.

● infinitely differentiable objective (Nesterov & Spokoiny, 2017)
● the optimal value of the smoothing lower-bounds that of the original problem
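The definition itself is an image on the slide; the standard Gaussian smoothing of a blackbox objective F : R^d -> R (per Nesterov & Spokoiny, 2017) is

    F_\sigma(\theta) \;=\; \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}\big[ F(\theta + \sigma g) \big],

so maximizing F_sigma is a smooth relaxation of maximizing F, with sigma > 0 controlling the amount of smoothing.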

Page 8: Gradient sensing as Monte Carlo estimation

● ES-style gradient estimator (Salimans et al. 2017): used in many evolutionary strategies papers, no control variate
● Gradient estimator with antithetic pairs (Salimans et al. 2017): used for variance reduction
● FD-style gradient estimator: a randomized version of the standard finite difference method
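The estimator formulas are images in the original deck and do not survive in this transcript. Standard reconstructions matching the descriptions above (the normalization on the slides may differ), with \varepsilon_1, \dots, \varepsilon_N \sim \mathcal{N}(0, I_d) i.i.d. and N the number of samples:

    \hat{\nabla}^{ES}_N F_\sigma(\theta) \;=\; \frac{1}{N\sigma} \sum_{i=1}^{N} F(\theta + \sigma\varepsilon_i)\,\varepsilon_i

    \hat{\nabla}^{AT}_N F_\sigma(\theta) \;=\; \frac{1}{2N\sigma} \sum_{i=1}^{N} \big( F(\theta + \sigma\varepsilon_i) - F(\theta - \sigma\varepsilon_i) \big)\,\varepsilon_i

    \hat{\nabla}^{FD}_N F_\sigma(\theta) \;=\; \frac{1}{N\sigma} \sum_{i=1}^{N} \big( F(\theta + \sigma\varepsilon_i) - F(\theta) \big)\,\varepsilon_i

In the FD-style estimator the extra F(theta) evaluation plays the role of a control variate.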



Page 11: Gradient sensing as Monte Carlo estimation

● all three estimators are unbiased
● empirically, none of them dominates the others
● how can we learn optimal control variates?
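A minimal NumPy sketch of the three estimators above (function names and default values are mine, not from the slides; F is any blackbox objective):

    import numpy as np

    def es_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # Vanilla ES-style estimator: no control variate.
        eps = rng.standard_normal((N, theta.shape[0]))
        rewards = np.array([F(theta + sigma * e) for e in eps])
        return (rewards[:, None] * eps).mean(axis=0) / sigma

    def antithetic_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # Antithetic pairs: evaluate +/- each direction for variance reduction.
        eps = rng.standard_normal((N, theta.shape[0]))
        diffs = np.array([F(theta + sigma * e) - F(theta - sigma * e) for e in eps])
        return (diffs[:, None] * eps).mean(axis=0) / (2.0 * sigma)

    def fd_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # FD-style: the shared F(theta) evaluation acts as a control variate.
        eps = rng.standard_normal((N, theta.shape[0]))
        base = F(theta)
        diffs = np.array([F(theta + sigma * e) - base for e in eps])
        return (diffs[:, None] * eps).mean(axis=0) / sigma

For RL, F(theta) would run one or more rollouts of the policy with parameters theta and return the total reward; replacing the i.i.d. rows of eps with orthogonal or HD rows gives the structured variants discussed later.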

Page 12: Gradient sensing as Monte Carlo estimation

Some recent success stories of structured random estimators (very biased selection!):

● “Initialization Matters: Orthogonal Predictive State Recurrent Neural Networks”, Choromanski, Downey, Boots (to appear at ICLR’18)
● “Optimizing Simulations with Noise-Tolerant Structured Exploration”, Choromanski, Iscen, Sindhwani, Tan, Coumans (to appear at ICRA’18)
● “The Geometry of Random Features”, Choromanski, Rowland, Sarlos, Sindhwani, Turner, Weller (to appear at AISTATS’18)
● “The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings”, Choromanski, Rowland, Weller (NIPS’17)
● “Structured adaptive and random spinners for fast machine learning computations”, Bojarski, Choromanska, Choromanski, Fagan, Gouy-Pailler, Morvan, Sakr, Sarlos, Atif (AISTATS’17)
● “Orthogonal Random Features”, Yu, Suresh, Choromanski, Holtmann-Rice, Kumar (NIPS’16)
● “Recycling Randomness with Structure for Sublinear Time Kernel Expansion”, Choromanski, Sindhwani (ICML’16)
● “Binary Embeddings with Structured Hashed Projections”, Choromanska, Choromanski, Bojarski, Jebara, Kumar, LeCun (ICML’16)

Page 13: Gradient sensing as Monte Carlo estimation

Theorem (Choromanski, Rowland, Sindhwani, Turner, Weller ’18). The orthogonal gradient estimator is unbiased and yields lower MSE than the unstructured estimator.
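The inequality itself is an image on the slide; schematically (a reconstruction of the statement's form, not the slide's exact expression), with MSE measured against \nabla F_\sigma(\theta):

    \mathrm{MSE}\big( \hat{\nabla}^{ort}_N F_\sigma(\theta) \big) \;\le\; \mathrm{MSE}\big( \hat{\nabla}^{iid}_N F_\sigma(\theta) \big),

where the orthogonal estimator uses exactly orthogonal directions with Gaussian marginals and the unstructured one uses i.i.d. Gaussian directions. The exact size of the gap stated on the slide is not recoverable from this transcript.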

Page 14: Efficiency of orthogonal space exploration for gradient sensing
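The slide content here is graphical. As a minimal sketch of how exactly orthogonal exploration directions with Gaussian marginals can be constructed (the QR-based construction is my assumption, consistent with the "Gaussian orthogonal matrices" named on page 23; the function name is mine):

    import numpy as np

    def gaussian_orthogonal_directions(N, d, rng=np.random):
        # Sample N <= d exploration directions that are exactly orthogonal
        # while each row is still marginally distributed as N(0, I_d):
        # QR-decompose a Gaussian matrix to get orthonormal rows, then
        # rescale each row by an independent chi-distributed norm.
        assert N <= d, "at most d mutually orthogonal directions exist"
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                        # orthonormal columns
        rows = Q.T[:N]                                # N orthonormal rows
        norms = np.sqrt(rng.chisquare(df=d, size=N))  # ||g|| for g ~ N(0, I_d)
        return rows * norms[:, None]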


Page 17: Efficiency of orthogonal space exploration for gradient sensing - discrete space sensing

Page 18: Efficiency of orthogonal space exploration for gradient sensing - structured neural networks
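The figures on these pages do not survive in the transcript. Since page 24 names Toeplitz matrices as the structured architecture, here is a minimal sketch of a Toeplitz-parameterized weight matrix (the helper name and example input size are mine):

    import numpy as np

    def toeplitz_matrix(params, d_out, d_in):
        # Build a d_out x d_in Toeplitz weight matrix from just
        # d_out + d_in - 1 free parameters: entry (i, j) depends only on
        # the offset i - j, so every diagonal is constant.
        assert params.shape[0] == d_out + d_in - 1
        offsets = np.arange(d_out)[:, None] - np.arange(d_in)[None, :]
        return params[offsets + d_in - 1]   # shift offsets to start at 0

    # e.g. a hidden layer of size h=41 over a 17-dimensional input needs
    # 41 + 17 - 1 = 57 parameters instead of 41 * 17 = 697 (the input
    # size here is illustrative).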


Page 21: Experimental results: orthogonal blackbox gradient sensing for low-dimensional problems

[Figures: performance under stochastic noise and under deterministic noise]

53 different optimization problems from the DFO benchmarking suite (Moré & Wild, 2009), four settings:
● stochastic noise
● deterministic noise
● smooth problems
● non-differentiable problems

● we compute the average ranking of the methods against each other in terms of the quality of the final objective value in each optimization task
● we then compare these average rankings using multiple hypothesis testing, as described in Demšar (2006)
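A hedged sketch of this ranking protocol (the array and method names below are illustrative stand-ins, not the paper's data):

    import numpy as np
    from scipy.stats import rankdata

    # final_values[t, m]: final objective value of method m on task t
    # (lower is better on these minimization benchmarks).
    final_values = np.random.rand(53, 4)
    methods = ["orthogonal", "iid Gaussian", "antithetic", "finite diff"]

    ranks = np.apply_along_axis(rankdata, 1, final_values)  # rank 1 = best
    avg_ranks = ranks.mean(axis=0)                          # average over tasks
    for name, r in sorted(zip(methods, avg_ranks), key=lambda x: x[1]):
        print(f"{name}: average rank {r:.2f}")

Demšar (2006) then compares such average ranks with a Friedman test followed by post-hoc analysis.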

Page 22: Experimental results: orthogonal blackbox gradient sensing for low-dimensional problems

[Figures: performance on smooth problems and on non-differentiable problems]

Same benchmark suite and ranking protocol as the previous page.

Page 23: Experimental results: learning compressed policies for RL tasks

Tested variants:
● structured architectures + structured exploration (STST)
● structured architectures + unstructured exploration (STUN)
● unstructured architectures + unstructured exploration (UNUN)

Tested environments (solved by STST):
● Swimmer
● Ant
● HalfCheetah
● Hopper
● Humanoid
● Walker2d
● Pusher
● Reacher
● Striker
● Thrower
● ContMountainCar
● Pendulum
● Minitaur walking

Structured space exploration strategies:
● Gaussian orthogonal matrices
● HD matrices (k=1), sketched below; 256- or 512-dimensional parameter vectors

Optimization setting:
● AdamOptimizer
● no heuristics used (e.g. no fitness shaping)
● base compressed setting: two hidden layers of size h=41

Distributed TensorFlow training on at most 400 machines.
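A minimal sketch of the HD (k=1) construction, assuming the usual Hadamard-times-random-diagonal form from the structured random embeddings papers cited on page 12 (the function name is mine; any row normalization used on the slides is not recoverable from this transcript):

    import numpy as np
    from scipy.linalg import hadamard

    def hd_directions(d, rng=np.random):
        # One HD block (k=1): a Hadamard matrix with randomly sign-flipped
        # columns. Rows are mutually orthogonal with norm sqrt(d), roughly
        # the length of a Gaussian d-dimensional vector. Requires d to be
        # a power of two (cf. the 256- or 512-dimensional settings above).
        assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
        H = hadamard(d).astype(float)         # entries +-1, H @ H.T = d * I
        D = rng.choice([-1.0, 1.0], size=d)   # random diagonal of signs
        return H * D[None, :]                 # equals H @ np.diag(D)

For k > 1 one typically multiplies k independent HD blocks.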

Page 24: Experimental results: learning compressed policies for RL tasks

Same variants, environments, and optimization setting as the previous page; in the base compressed setting the two hidden layers of size h=41 use Toeplitz matrices.

● a 279-dimensional neural network learns reward 4.83 for rollouts of length 500
● learns reward 1842

Page 26: Experimental results: learning curves - Ant

STST(GaussOrt): 509.2
STST(Hadamard): 1144.57
STUN: 566.42
UNUN: -12.78

Page 27: Experimental results: learning curves - Cont. Mountain Car

STST(GaussOrt): 92.05
STST(Hadamard): 93.06
STUN: 90.69
UNUN: -0.09

Page 28: Experimental results: learning curves - HalfCheetah

STST(GaussOrt): 3619.92
STST(Hadamard): 3273.94
STUN: 1948.26
UNUN: 2660.76

Page 29: Experimental results: learning curves - Hopper

STST(GaussOrt): 99883.29
STST(Hadamard): 99969.47
STUN: 99464.74
UNUN: 1540.10

Page 30: Experimental results: learning curves - Humanoid

STST(GaussOrt): 1842.70
STST(Hadamard): 84.13
STUN: 1425.85
UNUN: 509.56

Page 31: Experimental results: learning curves - Pendulum

STST(GaussOrt): -128.48
STST(Hadamard): -127.06
STUN: -3278.60
UNUN: -5026.27

Page 32: Experimental results: learning curves - Pusher

STST(GaussOrt): -48.07
STST(Hadamard): -43.04
STUN: -36.71
UNUN: -46.91

Page 33: Experimental results: learning curves - Reacher

STST(GaussOrt): -4.29
STST(Hadamard): -10.31
STUN: -73.04
UNUN: -145.39

Page 34: Experimental results: learning curves - Striker

STST(GaussOrt): -112.78
STST(Hadamard): -87.27
STUN: -52.56
UNUN: -63.35

Page 35: Experimental results: learning curves - Swimmer

STST(GaussOrt): 370.36
STST(Hadamard): 370.53
STUN: 368.38
UNUN: 150.90

Page 36: Experimental results: learning curves - Thrower

STST(GaussOrt): -349.72
STST(Hadamard): -243.88
STUN: -265.60
UNUN: -192.76

Page 37: Experimental results: learning curves - Walker2d

STST(GaussOrt): 9998.12
STST(Hadamard): 9974.60
STUN: 9756.91
UNUN: 456.51

Page 38: Success stories - teaching “Smoky” to walk (ICRA’18)

Joint work with Atil Iscen, Vikas Sindhwani, Jie Tan and Erwin Coumans.
