Last week, I created a pull request (PR #103) improving documentation, which was merged thanks to Sergio Guadarrama and my mentor Oscar! This week, I focused on getting used to environment wrappers, and in this post I will recap my progress.
I also had the joy of having a conference call with Paige Bailey and Oscar. Paige is the Senior Developer Advocate of TensorFlow ("TensorFlow Mom") and the lead organizer for GSoC TensorFlow. I will have biweekly (once every two weeks) check-in meetings with them, separate from the weekly meetings with my mentors Oscar Ramirez and Mandar Deshpande. Having all these mentors that I can ask questions to is amazing: one of my goals for GSoC is to ask as many thoughtful questions as I can think of!
By the way, here are some interesting statistics about GSoC 2019 published by Google Open Source.
Creating an Environment Wrapper
As discussed in Issue #21, curiosity algorithms such as Random Network Distillation (RND) are best implemented in the form of an environment wrapper. As the name suggests, an environment wrapper wraps the environment, modifying the way it behaves in some way. OpenAI Gym contains a few examples of these wrappers, which allow reward clipping, reward binning, or setting max timesteps.
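To make the pattern concrete, here is a minimal sketch of a wrapper in plain Python (a hypothetical `TimeLimitWrapper`, independent of both Gym and TF-Agents): the wrapper stores the inner environment, delegates `reset` and `step` to it, and overrides only the behavior it wants to change.

```python
class TimeLimitWrapper:
    """Toy wrapper that forces episodes to end after max_steps.

    Illustrative sketch only; real Gym/TF-Agents wrappers inherit
    from richer base classes.
    """

    def __init__(self, env, max_steps):
        self.env = env           # the wrapped (inner) environment
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()  # delegate to the inner environment

    def step(self, action):
        self.t += 1
        observation, reward, done, info = self.env.step(action)
        if self.t >= self.max_steps:
            done = True          # override only the termination signal
        return observation, reward, done, info
```

Everything the wrapper does not override passes through to the inner environment unchanged, which is what makes wrappers composable.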
Following my mentor Mandar's suggestion, I decided to implement a simple `RewardClipWrapper` that clips all rewards to some bound $[min, max]$.
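Concretely, clipping maps any reward outside the bound back to the nearest endpoint and leaves in-bound rewards untouched; with NumPy this is a one-liner:

```python
import numpy as np

# Rewards outside [min, max] are mapped to the nearest bound;
# rewards already inside the bound pass through unchanged.
min_r, max_r = -0.5, 0.5
rewards = [-2.0, -0.3, 0.0, 0.4, 1.0]
clipped = [float(np.clip(r, min_r, max_r)) for r in rewards]
print(clipped)  # [-0.5, -0.3, 0.0, 0.4, 0.5]
```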
OpenAI Gym Interface
The agent-environment interface of OpenAI Gym can be summarized by the following code snippet:
```python
observation = env.reset()
action = agent.get_action(observation)
observation, reward, done, info = env.step(action)
```
There are two main functions: `env.reset()` and `env.step(action)`.

`env.reset()` resets the environment and returns the initial observation to the agent. The agent can use this observation to choose an action. The environment then receives the action through `env.step(action)` and returns the consequence of the action as four variables: the observation after the action `observation`, the reward from the timestep `reward`, whether the episode is over `done`, and any other extra information `info`.
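The full interaction loop then looks like the sketch below, using a toy stand-in environment and a random "agent" (both hypothetical, not real Gym classes):

```python
import random

class ToyEnv:
    """Minimal stand-in that mimics the Gym interface described above."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        observation = float(self.t)
        reward = 1.0                    # +1 per step, CartPole-style
        done = self.t >= self.max_steps
        info = {}
        return observation, reward, done, info

env = ToyEnv()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = random.choice([0, 1])  # a random "agent"
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)  # 5.0
```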
The agent-environment interface of TF-Agents is described in detail in their Environments tutorial.
```python
class PyEnvironment(object):

  def reset(self):
    """Return initial_time_step."""
    self._current_time_step = self._reset()
    return self._current_time_step

  def step(self, action):
    """Apply action and return new time_step."""
    self._current_time_step = self._step(action)
    return self._current_time_step

  ...

  @abc.abstractmethod
  def _step(self, action):
    """Implementation of step."""
```
TF-Agents also has `env.reset()` and `env.step()` methods. However, they have more consistent return types: both methods return a named tuple `TimeStep`, which consists of the step type `step_type`, reward signal `reward`, discount factor `discount`, and next observation `observation`.
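To illustrate with a simplified stand-in (a plain `namedtuple`, not the real TF-Agents `TimeStep` class): named tuples are immutable, so modifying a field means building a new tuple with `_replace`, which is exactly the mechanism used for reward clipping.

```python
import collections

# Simplified stand-in for TF-Agents' TimeStep (illustration only).
TimeStep = collections.namedtuple(
    'TimeStep', ['step_type', 'reward', 'discount', 'observation'])

ts = TimeStep(step_type=1, reward=3.0, discount=0.99, observation=[0.0, 0.1])

# _replace returns a new tuple with only the given field changed;
# the original time step is left untouched.
clipped = ts._replace(reward=min(max(ts.reward, -0.5), 0.5))
print(clipped.reward)  # 0.5
print(ts.reward)       # 3.0
```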
```python
# OpenAI Gym
import numpy as np
from gym import RewardWrapper


class ClipReward(RewardWrapper):
    r"""Clip reward to [min, max]."""

    def __init__(self, env, min_r, max_r):
        super(ClipReward, self).__init__(env)
        self.min_r = min_r
        self.max_r = max_r

    def reward(self, reward):
        return np.clip(reward, self.min_r, self.max_r)
```
Above is the `ClipReward` class defined in OpenAI Gym. As such, environment wrappers only modify the relevant parts of the environment. I used a similar approach to implement my `RewardClipWrapper`.
```python
# TF-Agents
class RewardClipWrapper(PyEnvironmentBaseWrapper):
    """Clip all reward signals to a specified [min, max]."""

    def __init__(self, env, min_r=-0.5, max_r=0.5):
        super(RewardClipWrapper, self).__init__(env)
        self.min_r = min_r
        self.max_r = max_r

    def _reset(self):
        return self._env.reset()

    def _step(self, action):
        time_step = self._env.step(action)
        reward = np.clip(time_step.reward, self.min_r, self.max_r)
        time_step = time_step._replace(reward=reward)
        return time_step
```
Above is my implementation. As you can see, the initialization `__init__` is identical, and both use the same NumPy clip function `np.clip` to modify the reward. The core differences are what these classes inherit and how they interact with the agent.
Using the Environment Wrapper
To verify that this environment wrapper is working, we need to test it with an actual RL environment. Let's modify tf_agents/agents/dqn/examples/v1/train_eval_gym.py, one of the example scripts, to use `RewardClipWrapper`.
In the script, the environment is initialized in the following line:
```python
tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load(env_name))
```
An OpenAI Gym environment is loaded, converted to a `PyEnvironment`, and wrapped into a `TFPyEnvironment`. While loading a Gym environment, it is possible to specify custom environment wrappers as a list or a tuple through the `env_wrappers` argument:
```python
tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load(
    env_name,
    env_wrappers=[wrappers.RewardClipWrapper],
))
eval_py_env = suite_gym.load(env_name)
```
(`tf_env` is the environment used to train the agent, and `eval_py_env` is the environment used to evaluate the agent.)
Now we can run the modified script.
```shell
# Ran with RewardClipWrapper
python tf_agents/agents/dqn/examples/v1/train_eval_gym.py \
  --root_dir=$HOME/tmp/dqn_v1/gym/CartPole-v0/clip/ \
  --alsologtostderr
```
I also ran the unmodified script, shown below, to compare the results of the two.
```shell
# Ran without RewardClipWrapper
python tf_agents/agents/dqn/examples/v1/train_eval_gym.py \
  --root_dir=$HOME/tmp/dqn_v1/gym/CartPole-v0/no-clip/ \
  --alsologtostderr
```
The results can be visualized via TensorBoard:
```shell
tensorboard --logdir $HOME/tmp/dqn_v1/gym/CartPole-v0/ --port 2223
```
In CartPole, the agent receives $+1$ reward on every step until the episode terminates, so the episode length corresponds to the total episode reward (a.k.a. return). With the reward clipped to $[-0.5, 0.5]$, the agent will receive $+0.5$ reward on every step, so the return will equal half of the episode length.
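As a quick sanity check of that arithmetic, assuming for illustration an episode that runs the full 200 steps (the CartPole-v0 cap):

```python
def episode_return(episode_length, step_reward):
    """Return (sum of rewards) for a constant per-step reward."""
    return episode_length * step_reward

length = 200  # CartPole-v0 caps episodes at 200 steps
unclipped = episode_return(length, 1.0)  # +1 reward per step
clipped = episode_return(length, 0.5)    # +0.5 per step after clipping

print(unclipped, clipped)  # 200.0 100.0
assert clipped == unclipped / 2
```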
Gin is a "lightweight configuration framework for Python" used in TF-Agents. My mentor Oscar suggested that I learn gin and referred me to its documentation. By adding a `@gin.configurable` decorator in front of a class, it is possible to modify its parameters in a separate configuration file (e.g., a `.gin` file).
```python
# TF-Agents
@gin.configurable
class RewardClipWrapper(PyEnvironmentBaseWrapper):
    """Clip all reward signals to a specified [min, max]."""

    def __init__(self, env, min_r=-0.5, max_r=0.5):
        super(RewardClipWrapper, self).__init__(env)
        self.min_r = min_r
        self.max_r = max_r

    def _reset(self):
        return self._env.reset()

    def _step(self, action):
        time_step = self._env.step(action)
        reward = np.clip(time_step.reward, self.min_r, self.max_r)
        time_step = time_step._replace(reward=reward)
        return time_step
```
We also need to create a new gin configuration file. Following the format of the other gin files in TF-Agents, I created `tf_agents/environments/configs/reward_clip.gin`:
```
#-*-Python-*-
import tf_agents.environments.suite_gym
import tf_agents.environments.wrappers

## Configure Environment
ENVIRONMENT = @suite_gym.load()
suite_gym.load.environment_name = %ENVIRONMENT_NAME
suite_gym.load.env_wrappers = [
    @wrappers.RewardClipWrapper,
]

# Note: The ENVIRONMENT_NAME can be overridden by passing the command line flag:
#   --params="ENVIRONMENT_NAME='CartPole-v1'"
ENVIRONMENT_NAME = 'CartPole-v0'
```
Finally, we need to tell the script to use the gin file we just created. There is already an import statement at the beginning of tf_agents/agents/dqn/examples/v1/train_eval_gym.py, so we just need to modify the part where the environment is defined.
```python
gin.parse_config_file('tf_agents/environments/configs/reward_clip.gin')
tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load())
eval_py_env = suite_gym.load()
```
We can now run the script again, and indeed, the reward is successfully clipped.
```shell
python tf_agents/agents/dqn/examples/v1/train_eval_gym.py \
  --root_dir=$HOME/tmp/dqn_v1/gym/CartPole-v0/clip-with-gin/ \
  --alsologtostderr
```
Now that the coding period has begun, I will start my main project: adding curiosity to TF-Agents. The first curiosity method I will add is Exploration by Random Network Distillation by Burda et al. I have already reviewed the paper, so I do not plan on reviewing it in detail again. Instead, I will dive straight into implementing Random Network Distillation (RND). My goal is to implement RND by the end of the first phase (June 30th).
The core concept of Random Network Distillation (RND) is fairly simple, but the paper has many additional "tricks", so it will take several weeks and iterations to implement it fully. My next post will cover implementing the core idea of RND, and in later posts I will gradually add more "tricks".