RL Weekly 21: The interplay between Experience Replay and Model-based RL

Published Jun 17, 2019 by Seungjae Ryan Lee

Search on the Replay Buffer

Benjamin Eysenbach¹², Ruslan Salakhutdinov¹, Sergey Levine²³

¹CMU, ²Google Brain, ³UC Berkeley

What it says

Goal-conditioned RL studies tasks where there exists a “goal”, and the epsiode ends when the agent is sufficiently close to it (Section 2.2). Many attempts have used reward shaping or demonstrations to guide the agent. Instead, the authors propose reducing this goal-reaching problem into several easier goal-reaching tasks.

To decompose the problem, the authors build a directed, weighted graph, where each node is an observation from the replay buffer, and the edges are the predicted “distance” between two observations (Section 2.3). Then, by using Dijkstra’s algorithm (Appendix A),it is possible to find the shortest path to the goal, allowing the agent to plan its trajectory.

The authors discuss different distance estimates, using distributional RL (Section 3.1) or value function ensembles (Section 3.2). Experiments show that these enhancements are crucial for the performance of SoRB (Section 5.4).

In a visual navigation task, SoRB outperforms Hindsight Experience Replay (HER), C51, Semi-parametric Topological Memory (SPTM), and Value Iteration Networks (VIN) (Section 5.2, 5.3). It is also shown to generalize well to new navigation environments (Section 5.5).

Read more

External Resources

Dijkstra’s algorithm (Wikipedia Entry)

Learning Powerful Policies by Using Consistent Dynamics Model

Shagun Sodhani¹, Anirudh Goyal¹, Tristan Deleu¹, Yoshua Bengio¹², Sergey Levine³, Jian Tang¹

¹Mila, ²CIFAR, ²UC Berkeley

What it says

Model-based RL relies on having a good model that sufficiently represents the real environment, since the error of the model is compounded as the model is unrolled for multiple steps. Thus, a multi-step dynamics model could deviate greatly from the real environment. To fix this problem, the authors propose an auxilary loss to match the imagined state from the model with the observed state from the real environment (Section 3.1). Compared to the baseline algorithms Mb-Mf and A2C, the Consistent Dynamics model shows superior performance in various Atari and MuJoCo tasks (Section 6).

Read more

Learning Powerful Policies by Using Consistent Dynamics Model (ArXiv Preprint)

External Resources

Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning (ArXiv Preprint): Introduces Mb-Mf

When to use Parametric Models?

Hado van Hasselt¹, Matteo Hessel¹, John Aslanides¹

¹DeepMind

What it says

Model-based and replay-based agents have different computational properties. Parametric models typically require more computations than sampling from a replay buffer, but parametric models can achieve good accuracy with a finite memory requirement, whereas for replay memory past experiences are forgotten (Section 2.1). However, they also share similarities. Notably, with a perfect model, the experience generated from it will be indistinguishable from that from the real environment, so it has the same effect as experience replay. (Section 2.2).

Parametric models can be helpful when the agent must plan forward into the future to improve behavior, or when the agent plans backward to solve the credit assignment problem. (Section 2.3). However, these models can lead to catastrophic learning updates that are commonplace for algorithms that have the “deadly triads”: function approximation, bootstrapping, and off-policy learning (Section 3).

To compare the two approaches, the authors chose SimPLe for the model-based agent and data-efficient Rainbow DQN (Section 4.1) for the replay-based agent. Their results show that Rainbow was superior to SimPLe in 17 out of 26 Atari games (Appendix E), hinting that the hypothesized instability indeed exists in model-based agents.

Why it matters

Although replay-based agents such as Rainbow are generally categorized as model-free, the experience replay mechanism has many similarities with model-based RL. Just like how model-based agents use their models to improve the agent in between interactions with the real environment, replay-based agents also extensively use the replay buffer to train and improve the agent.

Many RL papers introduce new models for the sake of sample efficiency, but model-based RL is not a panacea that works in every environment and situation. It is important to understand both the power and the shortcoming of model-based approaches.

Read more

When to use parametric models in reinforcement learning? (ArXiv Preprint)

External Resources

More exciting news in RL:

Researchers at DeepMind show that agents with structured representations and structured policies outperform less-structured ones for physical construction tasks.
Researchers at Google Brain, X, and UC Berkeley proposed Watch, Try, Learn (WTL), a meta-learning agent that can learn from both demonstrations and experience from interactions.
Researchers at UC Berkeley and Intel proposed goal-GAIL, an algorithm that uses demonstrations to speed up agents on goal-conditioned tasks.
Marlos Machado’s thesis that discuss exploration through time-based representation learning has been published online.

endtoend.ai

RL Weekly 21: The interplay between Experience Replay and Model-based RL

Subscribe to RL Weekly

Search on the Replay Buffer

Learning Powerful Policies by Using Consistent Dynamics Model

When to use Parametric Models?

Related Posts

Explore →

endtoend.ai

RL Weekly 21: The interplay between Experience Replay and Model-based RL

Subscribe to RL Weekly

Search on the Replay Buffer

Learning Powerful Policies by Using Consistent Dynamics Model

When to use Parametric Models?

Related Posts

RL Weekly 20: Minecraft Competition, Off-policy Policy Evaluation via Classification, and Soft-attention Agent for Interpretability

RL Weekly 22: Unsupervised Learning for Atari, Model-based Policy Optimization, and Adaptive-TD

Explore →