
Page 1:

Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search

○Youhei Akimoto (University of Tsukuba / RIKEN AIP), Shinichi Shirakawa (Yokohama National University), Nozomu Yoshinari (Yokohama National University), Kento Uchida (Yokohama National University), Shota Saito (Yokohama National University), Kouhei Nishida (Shinshu University)

Page 2:

Neural Architecture

Neural network architectures: VGGNet, ResNet, Inception, ... (often pre-trained on some datasets)

Given a task (dataset):
• Sometimes a known architecture works well on our task. Happy!
• Other times, we must find a good one, or design a brand-new architecture and train it. Trial and error!

Page 3:

One-Shot Neural Architecture Search: Joint Optimization of Architecture c and Weights w

$\max_{w,\,c} f(w, c)$

[Figure: one layer of the one-shot model. The input $x_t$ is fed to four candidate operations, Conv 3x3 (weights $W_1$), Conv 5x5 (weights $W_2$), max pooling, and avg pooling, whose outputs are summed into $x_{t+1}$; the weights are $w = (W_1, W_2)$ and the one-hot architecture vector $c = (0, 0, 1, 0)$ activates only max pooling.]

$\max_c f(w^*(c), c) \quad \text{subject to} \quad w^*(c) = \operatorname{arg\,max}_w f(w, c)$

NAS as hyper-parameter search: 1 evaluation of c = 1 full training

One-shot NAS: optimization of w and c within 1 training
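As a concrete illustration, here is a minimal numpy sketch of the one-hot mixed layer from the figure above; the toy 1-D operations and the names make_ops and forward are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

# Toy 1-D stand-ins for the four candidate operations between x_t and x_{t+1}:
# Conv 3x3 (weights W1), Conv 5x5 (weights W2), max pooling, avg pooling.
def make_ops(W1, W2):
    return [
        lambda x: np.convolve(x, W1, mode="same"),                        # "Conv 3x3"
        lambda x: np.convolve(x, W2, mode="same"),                        # "Conv 5x5"
        lambda x: np.maximum.reduce([np.roll(x, -1), x, np.roll(x, 1)]),  # "max pool"
        lambda x: (np.roll(x, -1) + x + np.roll(x, 1)) / 3.0,             # "avg pool"
    ]

def forward(x, ops, c):
    # c is one-hot, so x_{t+1} = sum_i c_i * op_i(x_t) activates exactly one op.
    return sum(ci * op(x) for ci, op in zip(c, ops))

x_t = np.random.randn(8)
ops = make_ops(W1=np.ones(3) / 3, W2=np.ones(5) / 5)
c = np.array([0, 0, 1, 0])  # selects max pooling, as in the figure
x_t1 = forward(x_t, ops, c)
```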

Page 4:

Difficulties for Practitioners: How to choose / tune the search strategy?

$w \leftarrow w + \epsilon_w \nabla_w f(w, c(\theta))$
$\theta \leftarrow \theta + \epsilon_\theta \nabla_\theta f(w, c(\theta))$

Search Space / Search Strategy

Gradient-based method (hyper-parameter: step-size); a toy sketch follows the list below.

Other choices:
• Evolutionary-computation based
• Reinforcement-learning based

- How to treat integer variables such as #filters?
- How to tune the hyper-parameters in such situations?
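As a toy instance of the gradient-based strategy above, the following numpy sketch alternates the two update rules, assuming a softmax relaxation c(theta) and a hypothetical quadratic objective; f, TARGETS, grad_w, grad_theta, eps_w, and eps_theta are all made up for illustration and are not the paper's method.

```python
import numpy as np

TARGETS = np.array([1.0, -1.0, 0.5, 0.0])  # hypothetical per-operation optima

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def f(w, c):
    # toy objective to be maximized; stands in for negative validation loss
    return -np.sum(c * (w - TARGETS) ** 2)

def grad_w(w, c):
    return -2.0 * c * (w - TARGETS)

def grad_theta(w, theta):
    # chain rule through the softmax relaxation c(theta)
    c = softmax(theta)
    df_dc = -((w - TARGETS) ** 2)
    jac = np.diag(c) - np.outer(c, c)  # Jacobian of the softmax
    return jac @ df_dc

w, theta = np.zeros(4), np.zeros(4)
eps_w, eps_theta = 0.1, 0.1  # the step-size hyper-parameters to be tuned
for _ in range(100):
    w = w + eps_w * grad_w(w, softmax(theta))
    theta = theta + eps_theta * grad_theta(w, theta)
```

Note that both step sizes must be tuned by hand here, which is exactly the burden the questions above point at.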

Page 5:

Contributions: Novel Search Strategy for One-Shot NAS

1. Arbitrary search space (categorical + ordinal)
2. Robust against its inputs (hyper-parameters and search space)

Our approach

$\max_{w,\,c} f(w, c) \;\Rightarrow\; \max_{w,\,\theta} J(w, \theta) := \int f(w, c)\, p(c \mid \theta)\, \mathrm{d}c$

1. Stochastic relaxation: take $p(c \mid \theta)$ from an exponential family, so that $J(w, \theta)$ is differentiable w.r.t. both $w$ and $\theta$.

2. Stochastic natural gradient + adaptive step-size:

$w^{t+1} = w^t + \epsilon_w^t \,\widehat{\nabla_w J}(w^t, \theta^t)$
$\theta^{t+1} = \theta^t + \epsilon_\theta^t \, F(\theta^t)^{-1} \widehat{\nabla_\theta J}(w^{t+1}, \theta^t)$

where $F(\theta)$ is the Fisher information matrix, so $F(\theta)^{-1} \widehat{\nabla_\theta J}$ is an estimate of the natural gradient.

$J(w^t, \theta^t) < J(w^{t+1}, \theta^t) < J(w^{t+1}, \theta^{t+1})$

Monotone improvement under an appropriate step-size.
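To make step 2 concrete, here is a minimal numpy sketch of the stochastic natural-gradient update for a single categorical variable with four choices, where theta holds the category probabilities (the expectation parameters of the categorical family, for which the natural gradient of $\log p(c \mid \theta)$ takes the closed form $\mathrm{onehot}(c) - \theta$). The paper's adaptive step-size is replaced by a fixed eps_theta, and eval_f and the ranking-based utilities are toy stand-ins for training and validating the one-shot model.

```python
import numpy as np

rng = np.random.default_rng(0)

def eval_f(k):
    # toy stand-in for the validation score with operation k active
    return [0.2, 0.9, 0.5, 0.1][k] + 0.05 * rng.standard_normal()

theta = np.full(4, 0.25)  # uniform categorical distribution over 4 operations
eps_theta = 0.1           # fixed here; adapted online in the paper
for _ in range(300):
    cs = rng.choice(4, size=2, p=theta)  # lambda = 2 samples c ~ p(c | theta)
    fs = [eval_f(k) for k in cs]
    u = [1.0, -1.0] if fs[0] >= fs[1] else [-1.0, 1.0]  # ranking-based utilities
    # estimated natural gradient: average of u_k * (onehot(c_k) - theta)
    g = sum(uk * (np.eye(4)[k] - theta) for uk, k in zip(u, cs)) / 2.0
    theta = np.clip(theta + eps_theta * g, 1e-6, None)
    theta /= theta.sum()  # keep theta a valid probability vector

print(theta)  # the mass concentrates on the best operation (index 1)
```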


Page 6:

Results and Details

• Faster search, with accuracy competitive with other one-shot NAS methods

The details will be explained at Poster #53.