Since my last post, I have been implementing and testing Random Network Distillation (RND) by Burda et al. Now that the most important parts are in place, this post gives a brief summary of how I implemented RND.
Here is a quick overview of what I have finished implementing. Since the full code is too long for this post, I only include the segments most relevant to the implementation.
In short: the intrinsic reward computation, training of the RND predictor network, and a sanity check on LunarLander-v2 are done; the remaining tricks from the paper are listed under “To Be Implemented” below.
Computing Intrinsic Reward
The intrinsic reward in RND is defined as the MSE between two vectors: the predictor output and the target output. Both are produced by networks with the same architecture: the RND predictor network and the RND target network.
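In the paper's notation, with a fixed target network $f$ and a trained predictor $\hat{f}$, the intrinsic reward for an observation $x$ is the prediction error $r_i(x) = \lVert \hat{f}(x) - f(x) \rVert^2$.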
The RND predictor network is defined in `train_eval_rnd.py`, alongside the other networks.
```python
rnd_net = encoding_network.EncodingNetwork(
    tf_env.observation_spec(),
    fc_layer_params=value_fc_layers,
    name='PredictorRNDNetwork')
```
The RND target network is then created by copying the RND predictor network during the initialization of the RNDPPO agent:
```python
self._rnd_network = rnd_network
self._target_rnd_network = self._rnd_network.copy(name='TargetRNDNetwork')
self._rnd_optimizer = rnd_optimizer
self._rnd_loss_fn = rnd_loss_fn or mean_squared_loss
```
The RND loss function (MSE) is also specified during initialization and is called during training:
```python
# Prediction and Target have shape (# Env, # Timesteps, Final FC Layer)
rnd_prediction, _ = self._rnd_network(time_steps.observation)
rnd_target, _ = self._target_rnd_network(time_steps.observation)
# rnd_losses have shape (# Env, # Timesteps)
rnd_losses = self._rnd_loss_fn(rnd_prediction, rnd_target, axis=2)
avg_rnd_loss = tf.reduce_mean(rnd_losses)
```
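For reference, here is a minimal sketch of what the default `mean_squared_loss` used above could look like (my own illustration, not the exact code from my implementation): the squared difference between the two outputs is averaged over the given feature axis.

```python
import tensorflow as tf

def mean_squared_loss(prediction, target, axis=-1):
  # Squared difference between predictor and target outputs,
  # averaged over the given (feature) axis.
  return tf.reduce_mean(tf.math.squared_difference(prediction, target), axis=axis)
```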
The averaged RND loss (`avg_rnd_loss`) is then passed as the intrinsic reward when computing returns and advantages:
```python
returns, normalized_advantages = self.compute_return_and_advantage(
    next_time_steps, value_preds, intrinsic_rewards=avg_rnd_loss)
```
Note that the intrinsic rewards are normalized before the returns and advantages are computed:
```python
if self._reward_normalizer:
  rewards = self._reward_normalizer.normalize(
      rewards, center_mean=False, clip_value=self._reward_norm_clipping)
```
Training the RND Predictor Network
The purpose of the intrinsic reward is to numerically represent the novelty of the input observation. By training the RND predictor network to minimize the MSE between its output and that of the fixed RND target network, the predictor's error (and thus the intrinsic reward) becomes small for frequently seen observations, while rarely seen observations keep a high prediction error.
The training code is very simple thanks to TensorFlow, and it runs after the policy network has been trained:
```python
rnd_variables_to_train = self._rnd_network.trainable_weights
rnd_grads = tape.gradient(loss_info.extra.avg_rnd_loss, rnd_variables_to_train)
# Tuple is used for py3, where zip is a generator producing values once.
rnd_grads_and_vars = tuple(zip(rnd_grads, rnd_variables_to_train))

self._rnd_optimizer.apply_gradients(
    rnd_grads_and_vars, global_step=self.train_step_counter)
```
Verifying RND in LunarLander-v2
To check whether RND was working correctly before testing it on a benchmark environment, I ran it on LunarLander-v2 and compared it against a PPO baseline.
The dark blue line denotes RND, and the light blue line denotes PPO. In this comparison, PPO was stuck in a local minimum around zero, whereas RND was able to escape this local minimum and reach a higher score.
Of course, a single-seed comparison cannot be used to argue the correctness of my RND implementation. However, it gave me peace of mind!
To Be Implemented
Although the basic idea of RND is simple, the authors mention a number of “tricks” that can further improve the performance of the agent. Some of them have already been implemented; here are the ones that have not.
RNN Policy
Section 3.5 discusses using an RNN policy to mitigate the problems of partially observable environments. However, the authors show in Figure 4 that using an RNN did not improve performance, so I plan to implement this later. Implementing recurrence for RND+PPO should be simple, since TF-Agents already supports recurrence for PPO.
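For reference, TF-Agents exposes recurrent variants of the actor and value networks, so a recurrent agent could roughly be set up as below (this is from memory of the TF-Agents API, so treat the parameter values as approximate):

```python
from tf_agents.networks import actor_distribution_rnn_network
from tf_agents.networks import value_rnn_network

# Recurrent actor and value networks; lstm_size sets the LSTM layer size(s).
actor_net = actor_distribution_rnn_network.ActorDistributionRnnNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), lstm_size=(128,))
value_net = value_rnn_network.ValueRnnNetwork(
    tf_env.observation_spec(), lstm_size=(128,))
```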
Dual Value Heads
Section 2.3 discusses Combining Intrinsic and Extrinsic Returns. The authors use separate value heads $V_E$ and $V_I$, which allows different training settings for extrinsic and intrinsic rewards. Notably, when calculating $V_I$, they suggest treating the environment as non-episodic. They also report better results when using a higher discount factor of $\gamma_E = 0.999$ for computing $V_E$.
Currently, a single value head is used for both rewards. Figure 4 (above) shows that making the intrinsic reward non-episodic does not improve performance, so this will be implemented later. However, Figure 5 (below) shows that using a different discount factor does improve performance, so I will implement that first.
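To make the idea concrete, here is a small NumPy sketch (my own illustration, not part of my implementation yet) of how the two reward streams could be combined once each has its own return estimate, using the discount factors and reward coefficients suggested in the paper:

```python
import numpy as np

def discounted_returns(rewards, gamma):
  # Simple backward discounted sum; a full implementation would also
  # bootstrap from the corresponding value head at the final step.
  returns = np.zeros(len(rewards))
  acc = 0.0
  for t in reversed(range(len(rewards))):
    acc = rewards[t] + gamma * acc
    returns[t] = acc
  return returns

# Discount factors and reward coefficients from the paper.
gamma_ext, gamma_int = 0.999, 0.99
coef_ext, coef_int = 2.0, 1.0

# Toy reward sequences for a single environment.
extrinsic_rewards = np.array([0.0, 0.0, 1.0])
intrinsic_rewards = np.array([0.4, 0.1, 0.2])

# With dual heads, each stream gets its own return (and value estimate);
# the combined return drives the policy update.
combined_returns = (coef_ext * discounted_returns(extrinsic_rewards, gamma_ext) +
                    coef_int * discounted_returns(intrinsic_rewards, gamma_int))
```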
Reward and Observation Normalization
Section 2.4 discusses Reward and Observation Normalization. The paper specifies that the intrinsic reward is normalized by “dividing it by a running estimate of the standard deviations of the intrinsic returns,” and the same is done for the observations fed to the RND target and predictor networks. The normalization parameters are initialized with a random agent before training, and the normalized observations are clipped to [-5, 5].
Currently, extrinsic rewards, intrinsic rewards, and observations are all normalized with `tensor_normalizer.StreamingTensorNormalizer`. The observations are not clipped.
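For comparison, here is a rough NumPy-only sketch of the normalization scheme described in the paper (my own illustration, not the TF-Agents `StreamingTensorNormalizer`): the intrinsic reward is divided by a running estimate of the standard deviation of the intrinsic returns, without mean-centering.

```python
import numpy as np

class RunningIntrinsicRewardNormalizer:
  """Sketch: divide intrinsic rewards by a running std of intrinsic returns."""

  def __init__(self, gamma=0.99, epsilon=1e-8):
    self.gamma = gamma            # discount used to form the running return
    self.epsilon = epsilon
    self.running_return = 0.0
    self.count = 0
    self.mean = 0.0
    self.var = 1.0

  def update(self, intrinsic_reward):
    # Accumulate a discounted return and update its running variance
    # (Welford-style streaming update).
    self.running_return = self.gamma * self.running_return + intrinsic_reward
    self.count += 1
    delta = self.running_return - self.mean
    self.mean += delta / self.count
    self.var += (delta * (self.running_return - self.mean) - self.var) / self.count

  def normalize(self, intrinsic_reward):
    # Divide by the running std; no mean-centering, matching the paper.
    return intrinsic_reward / (np.sqrt(self.var) + self.epsilon)
```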
Hyperparameters
As in other fields of deep learning, hyperparameter tuning is crucial for performance in deep reinforcement learning. To reproduce the results in the paper, I listed all the hyperparameters specified in the paper and compared them with the defaults in TF-Agents.
| Hyperparameter | RND paper | TF-Agents default |
| --- | --- | --- |
| Observation downsampling | (84, 84) | |
| Extrinsic reward clipping | [-1, 1] | [-10, 10] |
| Intrinsic reward clipping | False | False |
| Max frames per episode | 18K | |
| Terminal on loss of life | False | |
| Sticky action probability | 0.25 | 0.25 |
| Number of parallel environments | | |
| Proportion of experience used for training predictor | 0.25 | 1 |
| Total number of rollouts per env | 30K | |
| Number of minibatches | 4 | 1 |
| Number of optimization epochs | | |
| Coefficient for extrinsic reward | 2 | 1 |
| Coefficient for intrinsic reward | 1 | 1 |
| Adaptive KL penalty coefficient | | |
Using Venture instead of Montezuma’s Revenge
Initially, I was planning to use Montezuma’s Revenge as my benchmark because of its popularity for evaluating exploration-based algorithms. However, while re-reading the original RND paper, I noticed that the environment where RND and PPO differed most strikingly was not Montezuma’s Revenge but Venture: PPO fails to get any reward in Venture, while both bonus-based exploration methods (RND and ICM) get nonzero reward.
In the paper, RND and PPO also show a large difference in final performance on Private Eye, but RND takes a long time to get a nonzero reward there and the variance seemed too high, so I will use Venture for my first benchmark.
Paige, the GSoC TensorFlow organizer, offered me Google Cloud Platform credit to use for my project! As a result, my RND agent is currently training on a VM instance of GCP. After some hyperparameter tuning, I hope to share benchmarks for Atari games, notably on Venture and Montezuma’s Revenge!