

Page 1: Structure is all you need - learning compressed RL policies via gradient sensing

Google Brain Robotics: Krzysztof Choromanski, Vikas Sindhwani

University of Cambridge and Alan Turing Institute: Mark Rowland, Richard E. Turner, Adrian Weller

Page 2: Gradient sensing - a blackbox approach to RL


Page 5: Gradient sensing - a blackbox approach to RL

[Figure: finite difference gradient sensing; gradient sensing via randomly rotated basis]

Page 6: Gradient sensing - a blackbox approach to RL

[Figure: finite difference gradient sensing; gradient sensing via randomly rotated basis; gradient sensing via independent Gaussian vectors (Salimans et al. 2017)]

Page 7: Gradient sensing - a blackbox approach to RL

Towards smooth relaxations: pick a family of probability distributions on the parameter space and average the objective over it - Gaussian smoothings.

● infinitely differentiable objective (Nesterov & Spokoiny, 2017)
● the optimal value of the smoothing lower-bounds that of the original problem
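The definition itself is an image on the slide; the standard Gaussian smoothing of a blackbox objective F : R^d -> R (per Nesterov & Spokoiny, 2017) is

    F_\sigma(\theta) \;=\; \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}\big[ F(\theta + \sigma g) \big],

so maximizing F_sigma is a smooth relaxation of maximizing F, with sigma > 0 controlling the amount of smoothing.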

Page 8: Gradient sensing as Monte Carlo estimation

● ES-style gradient estimator (Salimans et al. 2017): used in many evolutionary strategies papers, no control variate
● Gradient estimator with antithetic pairs (Salimans et al. 2017): used for variance reduction
● FD-style gradient estimator: a randomized version of the standard finite difference method
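The estimator formulas are images in the original deck and do not survive in this transcript. Standard reconstructions matching the descriptions above (the normalization on the slides may differ), with \varepsilon_1, \dots, \varepsilon_N \sim \mathcal{N}(0, I_d) i.i.d. and N the number of samples:

    \hat{\nabla}^{ES}_N F_\sigma(\theta) \;=\; \frac{1}{N\sigma} \sum_{i=1}^{N} F(\theta + \sigma\varepsilon_i)\,\varepsilon_i

    \hat{\nabla}^{AT}_N F_\sigma(\theta) \;=\; \frac{1}{2N\sigma} \sum_{i=1}^{N} \big( F(\theta + \sigma\varepsilon_i) - F(\theta - \sigma\varepsilon_i) \big)\,\varepsilon_i

    \hat{\nabla}^{FD}_N F_\sigma(\theta) \;=\; \frac{1}{N\sigma} \sum_{i=1}^{N} \big( F(\theta + \sigma\varepsilon_i) - F(\theta) \big)\,\varepsilon_i

In the FD-style estimator the extra F(theta) evaluation plays the role of a control variate.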



Page 11: Gradient sensing as Monte Carlo estimation

● all three estimators are unbiased
● empirically, none of them dominates the others
● how can we learn optimal control variates?
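A minimal NumPy sketch of the three estimators above (function names and default values are mine, not from the slides; F is any blackbox objective):

    import numpy as np

    def es_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # Vanilla ES-style estimator: no control variate.
        eps = rng.standard_normal((N, theta.shape[0]))
        rewards = np.array([F(theta + sigma * e) for e in eps])
        return (rewards[:, None] * eps).mean(axis=0) / sigma

    def antithetic_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # Antithetic pairs: evaluate +/- each direction for variance reduction.
        eps = rng.standard_normal((N, theta.shape[0]))
        diffs = np.array([F(theta + sigma * e) - F(theta - sigma * e) for e in eps])
        return (diffs[:, None] * eps).mean(axis=0) / (2.0 * sigma)

    def fd_grad(F, theta, sigma=0.1, N=100, rng=np.random):
        # FD-style: the shared F(theta) evaluation acts as a control variate.
        eps = rng.standard_normal((N, theta.shape[0]))
        base = F(theta)
        diffs = np.array([F(theta + sigma * e) - base for e in eps])
        return (diffs[:, None] * eps).mean(axis=0) / sigma

For RL, F(theta) would run one or more rollouts of the policy with parameters theta and return the total reward; replacing the i.i.d. rows of eps with orthogonal or HD rows gives the structured variants discussed later.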

Page 12: Gradient sensing as Monte Carlo estimation

Some recent success stories of structured random estimators (very biased selection!):

● “Initialization Matters: Orthogonal Predictive State Recurrent Neural Networks”, Choromanski, Downey, Boots (to appear at ICLR’18)
● “Optimizing Simulations with Noise-Tolerant Structured Exploration”, Choromanski, Iscen, Sindhwani, Tan, Coumans (to appear at ICRA’18)
● “The Geometry of Random Features”, Choromanski, Rowland, Sarlos, Sindhwani, Turner, Weller (to appear at AISTATS’18)
● “The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings”, Choromanski, Rowland, Weller (NIPS’17)
● “Structured adaptive and random spinners for fast machine learning computations”, Bojarski, Choromanska, Choromanski, Fagan, Gouy-Pailler, Morvan, Sakr, Sarlos, Atif (AISTATS’17)
● “Orthogonal Random Features”, Yu, Suresh, Choromanski, Holtmann-Rice, Kumar (NIPS’16)
● “Recycling Randomness with Structure for Sublinear Time Kernel Expansion”, Choromanski, Sindhwani (ICML’16)
● “Binary Embeddings with Structured Hashed Projections”, Choromanska, Choromanski, Bojarski, Jebara, Kumar, LeCun (ICML’16)

Page 13: Gradient sensing as Monte Carlo estimation

Theorem (Choromanski, Rowland, Sindhwani, Turner, Weller ’18). The orthogonal gradient estimator is unbiased and yields lower MSE than the unstructured estimator.
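The inequality itself is an image on the slide; schematically (a reconstruction of the statement's form, not the slide's exact expression), with MSE measured against \nabla F_\sigma(\theta):

    \mathrm{MSE}\big( \hat{\nabla}^{ort}_N F_\sigma(\theta) \big) \;\le\; \mathrm{MSE}\big( \hat{\nabla}^{iid}_N F_\sigma(\theta) \big),

where the orthogonal estimator uses exactly orthogonal directions with Gaussian marginals and the unstructured one uses i.i.d. Gaussian directions. The exact size of the gap stated on the slide is not recoverable from this transcript.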

Page 14: Efficiency of orthogonal space exploration for gradient sensing
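The slide content here is graphical. As a minimal sketch of how exactly orthogonal exploration directions with Gaussian marginals can be constructed (the QR-based construction is my assumption, consistent with the "Gaussian orthogonal matrices" named on page 23; the function name is mine):

    import numpy as np

    def gaussian_orthogonal_directions(N, d, rng=np.random):
        # Sample N <= d exploration directions that are exactly orthogonal
        # while each row is still marginally distributed as N(0, I_d):
        # QR-decompose a Gaussian matrix to get orthonormal rows, then
        # rescale each row by an independent chi-distributed norm.
        assert N <= d, "at most d mutually orthogonal directions exist"
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                        # orthonormal columns
        rows = Q.T[:N]                                # N orthonormal rows
        norms = np.sqrt(rng.chisquare(df=d, size=N))  # ||g|| for g ~ N(0, I_d)
        return rows * norms[:, None]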


Page 17: Efficiency of orthogonal space exploration for gradient sensing - discrete space sensing

Page 18: Efficiency of orthogonal space exploration for gradient sensing - structured neural networks
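The figures on these pages do not survive in the transcript. Since page 24 names Toeplitz matrices as the structured architecture, here is a minimal sketch of a Toeplitz-parameterized weight matrix (the helper name and example input size are mine):

    import numpy as np

    def toeplitz_matrix(params, d_out, d_in):
        # Build a d_out x d_in Toeplitz weight matrix from just
        # d_out + d_in - 1 free parameters: entry (i, j) depends only on
        # the offset i - j, so every diagonal is constant.
        assert params.shape[0] == d_out + d_in - 1
        offsets = np.arange(d_out)[:, None] - np.arange(d_in)[None, :]
        return params[offsets + d_in - 1]   # shift offsets to start at 0

    # e.g. a hidden layer of size h=41 over a 17-dimensional input needs
    # 41 + 17 - 1 = 57 parameters instead of 41 * 17 = 697 (the input
    # size here is illustrative).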


Page 21: Experimental results: orthogonal blackbox gradient sensing for low-dimensional problems

[Figures: performance under stochastic noise and under deterministic noise]

53 different optimization problems from the DFO benchmarking suite (Moré & Wild, 2009), four settings:
● stochastic noise
● deterministic noise
● smooth problems
● non-differentiable problems

● we compute the average ranking of the methods against each other in terms of the quality of the final objective value in each optimization task
● we then compare these average rankings using multiple hypothesis testing, as described in Demšar (2006)
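A hedged sketch of this ranking protocol (the array and method names below are illustrative stand-ins, not the paper's data):

    import numpy as np
    from scipy.stats import rankdata

    # final_values[t, m]: final objective value of method m on task t
    # (lower is better on these minimization benchmarks).
    final_values = np.random.rand(53, 4)
    methods = ["orthogonal", "iid Gaussian", "antithetic", "finite diff"]

    ranks = np.apply_along_axis(rankdata, 1, final_values)  # rank 1 = best
    avg_ranks = ranks.mean(axis=0)                          # average over tasks
    for name, r in sorted(zip(methods, avg_ranks), key=lambda x: x[1]):
        print(f"{name}: average rank {r:.2f}")

Demšar (2006) then compares such average ranks with a Friedman test followed by post-hoc analysis.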

Page 22: Experimental results: orthogonal blackbox gradient sensing for low-dimensional problems

[Figures: performance on smooth problems and on non-differentiable problems]

Same benchmark suite and ranking protocol as the previous page.

Page 23: Experimental results: learning compressed policies for RL tasks

Tested variants:
● structured architectures + structured exploration (STST)
● structured architectures + unstructured exploration (STUN)
● unstructured architectures + unstructured exploration (UNUN)

Tested environments (solved by STST):
● Swimmer
● Ant
● HalfCheetah
● Hopper
● Humanoid
● Walker2d
● Pusher
● Reacher
● Striker
● Thrower
● ContMountainCar
● Pendulum
● Minitaur walking

Structured space exploration strategies:
● Gaussian orthogonal matrices
● HD matrices (k=1), sketched below; 256- or 512-dimensional parameter vectors

Optimization setting:
● AdamOptimizer
● no heuristics used (e.g. no fitness shaping)
● base compressed setting: two hidden layers of size h=41

Distributed TensorFlow training on at most 400 machines.
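A minimal sketch of the HD (k=1) construction, assuming the usual Hadamard-times-random-diagonal form from the structured random embeddings papers cited on page 12 (the function name is mine; any row normalization used on the slides is not recoverable from this transcript):

    import numpy as np
    from scipy.linalg import hadamard

    def hd_directions(d, rng=np.random):
        # One HD block (k=1): a Hadamard matrix with randomly sign-flipped
        # columns. Rows are mutually orthogonal with norm sqrt(d), roughly
        # the length of a Gaussian d-dimensional vector. Requires d to be
        # a power of two (cf. the 256- or 512-dimensional settings above).
        assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
        H = hadamard(d).astype(float)         # entries +-1, H @ H.T = d * I
        D = rng.choice([-1.0, 1.0], size=d)   # random diagonal of signs
        return H * D[None, :]                 # equals H @ np.diag(D)

For k > 1 one typically multiplies k independent HD blocks.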

Page 24: Experimental results: learning compressed policies for RL tasks

Same variants, environments, and optimization setting as the previous page; in the base compressed setting the two hidden layers of size h=41 use Toeplitz matrices.

● a 279-dimensional neural network learns reward 4.83 for rollouts of length 500
● learns reward 1842

Page 26: Experimental results: learning curves - Ant

STST(GaussOrt): 509.2
STST(Hadamard): 1144.57
STUN: 566.42
UNUN: -12.78

Page 27: Experimental results: learning curves - Cont. Mountain Car

STST(GaussOrt): 92.05
STST(Hadamard): 93.06
STUN: 90.69
UNUN: -0.09

Page 28: Experimental results: learning curves - HalfCheetah

STST(GaussOrt): 3619.92
STST(Hadamard): 3273.94
STUN: 1948.26
UNUN: 2660.76

Page 29: Experimental results: learning curves - Hopper

STST(GaussOrt): 99883.29
STST(Hadamard): 99969.47
STUN: 99464.74
UNUN: 1540.10

Page 30: Experimental results: learning curves - Humanoid

STST(GaussOrt): 1842.70
STST(Hadamard): 84.13
STUN: 1425.85
UNUN: 509.56

Page 31: Experimental results: learning curves - Pendulum

STST(GaussOrt): -128.48
STST(Hadamard): -127.06
STUN: -3278.60
UNUN: -5026.27

Page 32: Experimental results: learning curves - Pusher

STST(GaussOrt): -48.07
STST(Hadamard): -43.04
STUN: -36.71
UNUN: -46.91

Page 33: Experimental results: learning curves - Reacher

STST(GaussOrt): -4.29
STST(Hadamard): -10.31
STUN: -73.04
UNUN: -145.39

Page 34: Experimental results: learning curves - Striker

STST(GaussOrt): -112.78
STST(Hadamard): -87.27
STUN: -52.56
UNUN: -63.35

Page 35: Experimental results: learning curves - Swimmer

STST(GaussOrt): 370.36
STST(Hadamard): 370.53
STUN: 368.38
UNUN: 150.90

Page 36: Experimental results: learning curves - Thrower

STST(GaussOrt): -349.72
STST(Hadamard): -243.88
STUN: -265.60
UNUN: -192.76

Page 37: Experimental results: learning curves - Walker2d

STST(GaussOrt): 9998.12
STST(Hadamard): 9974.60
STUN: 9756.91
UNUN: 456.51

Page 38: Success stories - teaching “Smoky” to walk (ICRA’18)

Joint work with Atil Iscen, Vikas Sindhwani, Jie Tan and Erwin Coumans.
