RL Weekly 1: Soft Actor-Critic Code Release; Text-based RL Competition; Learning with Training Wheels
In this inaugural issue of the RL Weekly newsletter, we discuss Soft Actor-Critic (SAC) from BAIR, the new TextWorld competition by Microsoft Research, and AsDDPG from University of Oxford and Heriot-Watt University.
Slow Papers: Exploration by Random Network Distillation (Burda et al., 2018)
We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access the underlying state of the game, and occasionally completes the first level. This suggests that relatively simple methods that scale well can be sufficient to tackle challenging exploration problems.
Slow Papers: A Deeper Look at Experience Replay (Zhang and Sutton, 2017)
Recently experience replay is widely used in various deep reinforcement learning (RL) algorithms, in this paper we rethink the utility of experience replay. It introduces a new hyper-parameter, the memory buffer size, which needs carefully tuning. However unfortunately the importance of this new hyper-parameter has been underestimated in the community for a long time. In this paper we did a systematic empirical study of experience replay under various function representations. We showcase that a large replay buffer can significantly hurt the performance. Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer. We showcase its utility in both simple grid world and challenging domains like Atari games.
Slow Papers: Neural Fitted Q Iteration (Riedmiller, 2005)
This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based RL algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically, that reasonably few interactions with the plant are neeed to generate control policies of high quality.
AI for Prosthetics Week 9 - 10: Unorthodox Approaches
We end the series by exploring possible unorthodox approaches for the competition. These are approaches that deviate from the popular policy gradient methods such as DDPG or PPO.
Notes from the ai.x 2018 Conference: Faster Reinforcement Learning via Transfer
SK T-Brain hosted the ai.x Conference on September 6th at Seoul, South Korea. At this conference, John Schulman (OpenAI) spoke about faster reinforcement learning via transfer.
Pommerman 1: Understanding the Competition
Pommerman is one of NIPS 2018 Competition tracks, where the participants seek to build agents to compete against other agents in a game of Bomberman. In this post, we simply explain the basics of Pommerman, leaving reinforcement learning to later posts.
AI for Prosthetics Week 6: General Techniques of RL
This week, we take a step back from the competition and study common techniques used in Reinforcement Learning.
AI for Prosthetics Week 5: Understanding the Reward
The goal of reinforcement learning is defined by the reward signal - to maximize the cumulative reward throughout an episode. In some ways, the reward is the most important aspect of the environment for the agent: even if it does not know about values of states or actions (like Evolutionary Strategies), if it can consistently get high return (cumulative reward), it is a great agent.
AI for Prosthetics Week 3-4: Understanding the Observation Space
The observation can be roughly divided into five components: the body parts, the joints, the muscles, the forces, and the center of mass. For each body part component, the agent observes its position, velocity, acceleration, rotation, rotational velocity, and rotational acceleration.
AI for Prosthetics Week 2: Understanding the Action Space
Last week, we saw how a valid action has 19 numbers, each between 0 and 1. The 19 numbers represented the amount of force to put to each muscle. I know barely anything about muscles, so I decided to manually go through all the muscles to understand the effects of each muscle...
AI for Prosthetics Week 1: Understanding the Challenge
The AI for Prosthetics challenge is one of NIPS 2018 Competition tracks. In this challenge, the participants seek to build an agent that can make a 3D model of human with prosthetics run. This challenge is a continuation of the Learning to Run challenge (shown below) that was part of NIPS 2017 Competition Track. The challenge was enhanced in three ways...
Jupyter Notebook extensions to enhance your efficiency
Jupyter Notebook is a great tool that allows you to integrate live code, equations, visualizations and narrative text into a document. It is used extensively in data science. However, for developers who have used IDEs with abundant features, the simplicity of Jupyter Notebook might be problematic.
Bias-variance Tradeoff in Reinforcement Learning
Bias-variance tradeoff is a familiar term to most people who learned machine learning. In the context of Machine Learning, bias and variance refers to the model: a model that underfits the data has high bias, whereas a model that overfits the data has high variance. In Reinforcement Learning, we consider another bias-variance tradeoff.
I learned DQNs with OpenAI competition
On April, OpenAI held a two-month-long competition called the Retro Contest where participants had to develop an agent that can achieve perform well on unseen custom-made stages of Sonic the Hedgehog. The agents were limited to 100 million steps per stage and 12 hours of time on a VM with 6 E5-2690v3 cores, 56GB of RAM, and a single K80 GPU.
Effective Data: Partition
To train a good model, you need lots of data. Luckily, over the last few decades, collecting data has become much easier. However, there is little value to data if you use it incorrectly. Even if you double or triple the dataset manually or through data augmentation, without proper partition of data, you will be left clueless on how helpful adding more data was.