
Working in OpenAI Environments & Designing Your Own

Mike Rudd

CS 885 Guest Lecture

May 18, 2018

OpenAI*

• Not-for-profit, funded by private and corporate donations

• Employs a small team of high-caliber researchers and advisors

• Promotes research towards safe AGI

*https://openai.com/

OpenAI Gym

• Standard set of environments for evaluating RL agents

• Provides a benchmark for most new algorithms

• Extended to more complex problems as solutions improve

*https://openai.com/

Recent Extensions

• Robotics
  • MuJoCo continuous control tasks now “easily solvable”
  • Harder set of continuous control tasks

• Retro Contest
  • Agents can overfit to their environment
  • Train an agent that can transfer skills to new environments

Interacting with the Environment

Standardized Code Applicable Across Tasks

Sample Code
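The sample code shown on this slide is not reproduced in the transcript. A minimal sketch of the standard Gym interaction loop, using the API as it existed at the time of this lecture (the CartPole environment and random policy are placeholder choices, not from the slides), looks roughly like:

```python
import gym

# Any registered environment works here; CartPole is only an illustrative choice
env = gym.make("CartPole-v1")

for episode in range(5):
    observation = env.reset()   # start a new episode and get the initial observation
    done = False
    total_reward = 0.0
    while not done:
        env.render()                        # visualize the current state
        action = env.action_space.sample()  # random policy stands in for an RL agent
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f"Episode {episode}: return = {total_reward}")

env.close()
```

The same loop works for every Gym task, which is the point of the standardized interface: only the environment name and the agent change.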

Building Your Own Environment

Practically more important than beating Gym benchmarks

Building Your Own Environment

• Not very difficult

• Just define a Python class with methods for (a skeleton is sketched after this list):
  • Initialization
  • Step
  • Reset
  • Render

• Existing packages (physics engines) do most of the heavy lifting:
  • Box2D
  • MuJoCo
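A hedged sketch of such a class, subclassing gym.Env with the four methods above (the state layout, action space, and placeholder dynamics are illustrative assumptions; a real environment would delegate the physics to Box2D or MuJoCo):

```python
import gym
import numpy as np
from gym import spaces


class ParkingEnv(gym.Env):
    """Skeleton of a custom environment; names and dynamics are illustrative only."""

    def __init__(self):
        # Action: steering and acceleration, each in [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # Observation: car x, y, heading, speed
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.state = None

    def reset(self):
        # Start each episode from a fixed (or randomized) initial state
        self.state = np.zeros(4, dtype=np.float32)
        return self.state

    def step(self, action):
        # A real environment would advance a physics engine (Box2D, MuJoCo) here;
        # this placeholder just perturbs the state.
        self.state = self.state + 0.01 * np.random.randn(4).astype(np.float32)
        reward = 0.0   # defining the reward is the hard part (see the next slides)
        done = False
        return self.state, reward, done, {}

    def render(self, mode="human"):
        print(f"state: {self.state}")
```

Once defined, the class can be instantiated directly (env = ParkingEnv()) and driven with the same reset/step loop as any built-in Gym task, or registered via gym.envs.registration.register so that gym.make can construct it by name.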

Example: Teaching a Car to Self-Park

Challenge of Reward Definition

• The major difficulty is in defining a good reward function

• Algorithms can learn to exploit gaps in our logic, resulting in undesirable behaviours

• See e.g. Ng et al. (1999) for examples and theoretical analysis

Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of ICML, pp. 278–287.

Reward Shaping

• Theoretically correct reward is 1 for success and 0 otherwise

• This reward is sparse, however, and in practice very difficult to learn from

• Reward shaping seeks to modify the reward function to speed up learning (with dense signal) but to leave the theoretically optimal policy unchanged

• Ng et al. (1999) show that only a shaping function 𝐹 of the following form guarantees that the optimal policy is preserved:

𝐹(𝑠, 𝑎, 𝑠′) = 𝛾Φ(𝑠′) − Φ(𝑠)   ∀𝑠 ∈ 𝑆 \ {𝑠₀}
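One potential-based choice for the self-parking example (purely illustrative; the potential, goal representation, and discount factor below are assumptions, not from the lecture) is Φ(𝑠) = −distance(car, parking spot), which gives a dense bonus for progress toward the spot while, by the result above, leaving the optimal policy unchanged:

```python
import numpy as np

GAMMA = 0.99  # discount factor; assumed to match the agent's


def potential(state, goal):
    # Illustrative potential: negative distance from the car's (x, y) position to the parking spot
    return -np.linalg.norm(state[:2] - goal)


def shaped_reward(base_reward, state, next_state, goal, gamma=GAMMA):
    # Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s)
    shaping = gamma * potential(next_state, goal) - potential(state, goal)
    return base_reward + shaping
```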
