# Reinforcement Learning Papers Accepted to NeurIPS 2019

Here is a list of all papers that use reinforcement learning accepted to NeurIPS 2019.

**Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement**

## Abstract

This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations. In contrast to Learning from Demonstration (LfD) that involves both action and state supervisions, LfO is more practical in leveraging previously inapplicable resources (e.g., videos), yet more challenging due to the incomplete expert guidance. In this paper, we investigate LfO and its difference with LfD in both theoretical and practical perspectives. We first prove that the gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and expert, if following the modeling approach of GAIL. More importantly, the upper bound of this gap is revealed by a negative causal entropy which can be minimized in a model-free way. We term our method as Inverse-Dynamics-Disagreement-Minimization (IDDM) which enhances the conventional LfO method through further bridging the gap to LfD. Considerable empirical results on challenging benchmarks indicate that our method attains consistent improvements over other LfO counterparts.

**Experience Replay for Continual Learning**

## Abstract

Interacting with a complex world involves continual learning, in which tasks and data distributions change over time. A continual learning system should demonstrate both plasticity (acquisition of new knowledge) and stability (preservation of old knowledge). Catastrophic forgetting is the failure of stability, in which new experience overwrites previous experience. In the brain, replay of past experience is widely believed to reduce forgetting, yet it has been largely overlooked as a solution to forgetting in deep reinforcement learning. Here, we introduce CLEAR, a replay-based method that greatly reduces catastrophic forgetting in multi-task reinforcement learning. CLEAR leverages off-policy learning and behavioral cloning from replay to enhance stability, as well as on-policy learning to preserve plasticity. We show that CLEAR performs better than state-of-the-art deep learning techniques for mitigating forgetting, despite being significantly less complicated and not requiring any knowledge of the individual tasks being learned.

**Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling**

## Abstract

From traditional Web search engines to virtual assistants and Web accelerators, services that rely on online information need to continually keep track of remote content changes by explicitly requesting content updates from remote sources (e.g., web pages). We propose a novel optimization objective for this setting that has several practically desirable properties, and efficient algorithms for it with optimality guarantees even in the face of mixed content change observability and initially unknown change model parameters. Experiments on 18.5M URLs crawled daily for 14 weeks show significant advantages of this approach over prior art.

**Trust Region-Guided Proximal Policy Optimization**

## Abstract

Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis on the exploration behavior of PPO, and show that PPO is prone to suffer from the risk of lack of exploration especially under the case of bad initialization, which may lead to the failure of training or being trapped in bad local optima. To address these issues, we proposed a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but enjoys a better performance bound compared to the original PPO as well. Extensive experiments verify the advantage of the proposed method.

**Regret Minimization for Reinforcement Learning on Multi-Objective Online Markov Decision Processes**

## Abstract

We consider an agent who is involved in an online Markov decision process, and receives a vector of outcomes every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes for achieving near-optimality. In particular, contrary to the single objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), as well as a crucial and novel gradient threshold procedure. The procedure involves carefully delaying gradient updates, and returns a non-stationary policy that diversifies the outcomes for optimizing the objectives.

**Reconciling λ-Returns with Experience Replay**

## Abstract

Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the λ-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of λ-returns. Yet replay-based methods are often the most sample-efficient, and incorporating λ-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of λ-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent λ-returns can be efficiently precomputed by sharing Q-values. Computation is not wasted on experiences that are never sampled, and stored λ-returns behave as stable temporal-difference (TD) targets that replace the target network. Additionally, our method grants the unique ability to observe TD errors prior to sampling; for the first time, transitions can be prioritized by their true significance rather than by a proxy to it. Furthermore, we propose the novel use of the TD error to dynamically select λ-values that facilitate faster learning. We show that these innovations greatly enhance the performance of DQN when playing Atari 2600 games, even under partial observability. While our work specifically focuses on λ-returns, these ideas are applicable to any multi-step return estimator.

**Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning**

## Abstract

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency, we derive a primal-dual decentralized optimization method and obtain a principled and data-efficient iterative algorithm named {\em value propagation}. We prove a non-asymptotic convergence rate of $\mathcal{O}(1/T)$ with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, non-linear function approximation, fully decentralized setting.

**Finding Friend and Foe in Multi-Agent Games**

## Abstract

Recent breakthroughs in AI for multi-agent games like Go, Poker, and Dota, have seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on "The Resistance: Avalon", the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.

**Distributional Policy Optimization: An Alternative Approach for Continuous Control**

## Abstract

We identify a fundamental problem in policy gradient-based methods in continuous control. As policy gradient methods require the agent's underlying probability distribution, they limit policy representation to parametric distribution classes. We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions. We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable and often surpasses current state-of-the-art baselines in continuous domains.

**Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards**

## Abstract

Hierarchical Reinforcement Learning (HRL) is a promising approach to solving long-horizon problems with sparse and delayed rewards. Many existing HRL algorithms either use pre-trained low-level skills that are unadaptable, or require domain-specific information to define low-level rewards. In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy. This auxiliary reward enables efficient, simultaneous learning of the high-level policy and low-level skills without using task-specific knowledge. In addition, we also theoretically prove that optimizing low-level skills with this auxiliary reward will increase the task return for the joint policy. Experimental results show that our algorithm dramatically outperforms other state-of-the-art HRL methods in Mujoco domains. We also find both low-level and high-level policies trained by our algorithm transferable.

**Multi-View Reinforcement Learning**

## Abstract

This paper is concerned with multi-view reinforcement learning (MVRL), which allows for decision making when agents share common dynamics but adhere to different observation models. We define the MVRL framework by extending partially observable Markov decision processes (POMDPs) to support more than one observation model and propose a solution method through cross-view policy transfer. We empirically evaluate our method and demonstrate the effectiveness in a various set of environments. Specifically, we show reductions in sample complexity and computational time for acquiring policies that handle multi-view environments.

**Better Exploration with Optimistic Actor Critic**

## Abstract

Actor-critic methods, a type of model-free Reinforcement Learning, have been successfully applied to challenging tasks in continuous control, often achieving state-of-the art performance. However, wide-scale adoption of these methods in real-world domains is made difficult by their poor sample efficiency. We address this problem both theoretically and empirically. On the theoretical side, we identify two phenomena preventing efficient exploration in existing state-of-the-art algorithms such as Soft Actor Critic. First, combining a greedy actor update with a pessimistic estimate of the critic leads to the avoidance of actions that the agent does not know about, a phenomenon we call pessimistic underexploration. Second, current algorithms are directionally uninformed, sampling actions with equal probability in opposite directions from the current mean. This is wasteful, since we typically need actions taken along certain directions much more than others. To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation. We evaluate OAC in several challenging continuous control tasks, achieving state-of the art sample efficiency.

**Importance Resampling for Off-policy Prediction**

## Abstract

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is consistent and unbiased, it can result in high variance updates to the weights for the value function. In this work, we explore a resampling strategy as an alternative to reweighting. We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower variance updates, as compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace which clips IS ratios. We also provide a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.

**Generalized Off-Policy Actor-Critic**

## Abstract

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

**DAC: The Double Actor-Critic Architecture for Learning Options**

## Abstract

We reformulate the option framework as two parallel augmented MDPs. Under this novel formulation, all policy optimization algorithms can be used off the shelf to learn intra-option policies, option termination conditions, and a master policy over options. We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. Furthermore, we show that, when state-value functions are used as critics, one critic can be expressed in terms of the other, and hence only one critic is necessary. Our experiments on challenging robot simulation tasks demonstrate that DAC outperforms previous gradient-based option learning algorithms by a large margin and significantly outperforms its hierarchy-free counterparts in a transfer learning setting.

**Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update**

## Abstract

We propose Episodic Backward Update (EBU) – a novel deep reinforcement learning algorithm with a direct value propagation. In contrast to the conventional use of the experience replay with uniform random sampling, our agent samples a whole episode and successively propagates the value of a state to its previous states. Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode. We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of Atari 2600 domain, EBU achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples, respectively.

**Large Scale Markov Decision Processes with Changing Rewards**

## Abstract

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves state-of-the-art regret bound of $O( \sqrt{\tau (\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which to the best of our knowledge this is the first $\tilde{O}(\sqrt{T})$ regret bound for large scale MDPs with changing rewards.

**Information-Theoretic Confidence Bounds for Reinforcement Learning**

## Abstract

We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.

**Third-Person Visual Imitation Learning via Decoupled Hierarchical Control**

## Abstract

We study the general setup of learning from demonstration with of goal of building an agent that is capable of imitating a single video of human demonstration to perform the task with novel objects in new scenarios. In order to accomplish this goal, our agent not only should be able to understand the intent of the demonstrated third-person video in its own context, but also be able to perform the intended task with its own environment configuration. Our main insight is to instill the structure6in the learning progress by decoupling what to achieve (intended task) from how to perform it (controller). We learn a hierarchical setup comprising of a high-level module to generate the series of first-person sub-goals conditioned on a third-person video demonstration, and a low-level controller to output actions to achieve those sub-goals. We show results on a real robotic platform using Baxter for manipulation task of pouring and placing objects in a box. The robot videos and demos are available on the project website https://sites.google.com/view/htpi.

**Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function**

## Abstract

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored. } for MDP with S states and A actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$ is known. This result outperforms the best previous regret bounds $\tilde{O}(HS\sqrt{AT})$\cite{bartlett2009regal} by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SATH})$\cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{DSAT})$\cite{jaksch2010near}.

**Epsilon-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits**

## Abstract

We study epsilon-best-arm identification, in a setting where during the exploration phase, the cost of each arm pull is proportional to the expected future reward of that arm. We term this setting Pay-Per-Reward. We provide an algorithm for this setting, that with a high probability returns an epsilon-best arm, while incurring a cost that depends only linearly on the total expected reward of all arms, and does not depend at all on the number of arms. Under mild assumptions, the algorithm can be applied also to problems with infinitely many arms.

**Safe Exploration for Interactive Machine Learning**

## Abstract

In Interactive machine learning (IML), we iteratively make decisions and obtain noisy observations of an unknown function. While IML methods, e.g., Bayesian optimization and active learning, have been successful in applications, on real-world systems they must provably avoid unsafe decisions. To this end, safe IML algorithms must carefully learn about a priori unknown constraints without making unsafe decisions. Existing algorithms for this problem learn about the safety of all decisions to ensure convergence. This is sample-inefficient, as it explores decisions that are not relevant for the original IML objective. In this paper, we introduce a novel framework that renders any existing unsafe IML algorithm safe. Our method works as an add-on module that takes suggested decisions as input and exploits regularity assumptions in terms of a Gaussian process prior in order to efficiently learn about their safety. As a result, we only explore the safe set when necessary for the IML problem. We apply our framework to safe Bayesian optimization and to safe exploration in deterministic Markov Decision Processes (MDP), which have been analyzed separately before, and show that our method outperforms other algorithms empirically.

**Real-Time Reinforcement Learning**

## Abstract

Markov Decision Processes (MDPs) -- the mathematical framework underlying most algorithms in Reinforcement Learning (RL), are often used in a way that wrongfully assumes that the state of an agent's environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world safety critical situations this mismatch between the assumptions underling classical MDPs and the reality of real-time computation may lead to undesirable outcomes. In this paper we introduce a new framework, in which states and actions evolve simultaneously, we show how it is related to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why the might be suboptimal when used in real-time. We then use those insights to create an new algorithm Real-Time Actor Critic (RTAC) that outperforms the existing state-of-the-art continuous control algorithm Soft Actor Critic both in real-time and non-real-time settings.

**Robust Multi-agent Counterfactual Prediction**

## Abstract

We consider the problem of using logged data to make predictions about what would happen if we changed the `rules of the game' in a multi-agent system. This task is difficult because in many cases we observe actions individuals take but not their private information or their full reward functions. In addition, agents are strategic, so when the rules change, they will also change their actions. Existing methods (e.g. structural estimation, inverse reinforcement learning) assume that agents' behavior comes from optimizing some utility or that the system is in equilibrium. They make counterfactual predictions by using observed actions to learn the underlying utility function (a.k.a. type) and then solving for the equilibrium of the counterfactual environment. This approach imposes heavy assumptions such as the rationality of the agents being observed and a correct model of the environment and agents' utility functions. We propose a method for analyzing the sensitivity of counterfactual conclusions to violations of these assumptions, which we call robust multi-agent counterfactual prediction (RMAC). We provide a first-order method for computing RMAC bounds. We apply RMAC to classic environments in market design: auctions, school choice, and social choice.

**Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits**

## Abstract

We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret on a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of \[ \widetilde{O}\left(\sqrt{\left(1+\frac{K}{\left|\mathcal{N}\left(v\right)\right|}\right)T}\right), \] where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}\left(v\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the communication graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the models differ only by a logarithmic factor. Our work resolves an open problem from [Cesa-Bianchi et al., 2019b].

**Convergent Policy Optimization for Safe Reinforcement Learning**

## Abstract

We study the safe reinforcement learning problem with nonlinear function approximation, where policy optimization is formulated as a constrained optimization problem with both the objective and the constraint being nonconvex functions. For such a problem, we construct a sequence of surrogate convex constrained optimization problems by replacing the nonconvex functions locally with convex quadratic functions obtained from policy gradient estimators. We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints.

**Thompson Sampling for Multinomial Logit Contextual Bandits**

## Abstract

We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue, or alternatively, minimize the expected regret. The feedback here is the item that the user picks from the assortment. The distinguishing feature in this work is that this feedback has a multinomial logistic distribution. The utility of each item is a dynamic function of contextual information of both the item and the user. We propose two Thompson sampling algorithms for this multinomial logit contextual bandit. Our first algorithm maintains a posterior distribution of the true parameter and establishes $\tilde{O}(d\sqrt{T})$ Bayesian regret over $T$ rounds with $d$ dimensional context vector. The worst-case computational complexity of this algorithm could be high when the prior distribution is not a conjugate. The second algorithm approximates the posterior by a Gaussian distribution, and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis. This algorithm achieves $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case (frequentist) regret bound. The numerical experiments show that the practical performance of both methods is in line with the theoretical guarantees.

**Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control**

## Abstract

Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. By limiting the variance of the exchanged messages between agents during the training phase, the noisy component in the messages can be eliminated effectively, while the useful part can be preserved and utilized by the agents for better performance. Our evaluation using a challenging set of StarCraft II benchmarks indicates that our method achieves $2-10\times$ lower in communication overhead than state-of-the-art MARL algorithms, while allowing agents to better collaborate by developing sophisticated strategies.

**Neural Lyapunov Control**

## Abstract

We propose new methods for learning control policies and neural network Lyapunov functions for nonlinear control problems, with provable guarantee of stability. The framework consists of a learner that attempts to find the control and Lyapunov functions, and a falsifier that finds counterexamples to quickly guide the learner towards solutions. The procedure terminates when no counterexample is found by the falsifier, in which case the controlled nonlinear system is provably stable. The approach significantly simplifies the process of Lyapunov control design, provides end-to-end correctness guarantee, and can obtain much larger regions of attraction than existing methods such as LQR and SOS/SDP. We show experiments on how the new methods obtain high-quality solutions for challenging robot control problems such as humanoid robot balancing and wheeled vehicle path-following.

**Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning**

## Abstract

Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if Q-functions are well-specified, but if they are not they can be worse than both IS and SNIS. It also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. On the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.

**MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies**

## Abstract

Humans are able to perform a myriad of sophisticated tasks by drawing upon skills acquired through prior experience. For autonomous agents to have this capability, they must be able to extract reusable skills from past experience that can be recombined in new ways for subsequent tasks. Furthermore, when controlling complex high-dimensional morphologies, such as humanoid bodies, tasks often require coordination of multiple skills simultaneously. Learning discrete primitives for every combination of skills quickly becomes prohibitive. Composable primitives that can be recombined to create a large variety of behaviors can be more suitable for modeling this combinatorial explosion. In this work, we propose multiplicative compositional policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. Our method factorizes an agent's skills into a collection of primitives, where multiple primitives can be activated simultaneously via multiplicative composition. This flexibility allows the primitives to be transferred and recombined to elicit new behaviors as necessary for novel tasks. We demonstrate that MCP is able to extract composable skills for highly complex simulated characters from pre-training tasks, such as motion imitation, and then reuse these skills to solve challenging continuous control tasks, such as dribbling a soccer ball to a goal, and picking up an object and transporting it to a target location.

**Learning Latent Process from High-Dimensional Event Sequences via Efficient Sampling**

## Abstract

This paper targets modeling the temporal dynamics in high-dimension marked event sequences without any given causal network of markers. Such problem has been rarely studied by previous works which would have fundamental difficulty to handle the arisen challenges: 1) the high-dimensional markers and unknown causal network among them pose intractable obstacles for modeling the latent dynamic process; 2) one observed event sequence may concurrently contain several different causal chains; 3) it is hard to well define the distance between two high-dimension event sequences. To these ends, in this paper, we propose a seminal adversarial imitation learning framework for high-dimension event sequence generation which could be decomposed into: 1) a latent structural intensity model that estimates the adjacent nodes without explicit networks and learns to capture the temporal dynamics in the latent space of markers over observed sequence; 2) an efficient random walk based generation model that aims at imitating the generation process of high-dimension event sequences from a bottom-up view; 3) a discriminator specified as a seq2seq network optimizing the rewards to help the generator output event sequences as real as possible. Experimental results on both synthetic and real-world datasets demonstrate that the proposed method could effectively detect the hidden causal network among markers and make decent prediction for future marked events, even when the number of markers scales to million level.

**Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints**

## Abstract

Inverse reinforcement learning (IRL) enables an agent to learn complex behavior by observing demonstrations from a (near-)optimal policy. The typical assumption is that the learner's goal is to match the teacher’s demonstrated behavior. In this paper, we consider the setting where the learner has her own preferences that she additionally takes into consideration. These preferences can for example capture behavioral biases, mismatched worldviews, or physical constraints. We study two teaching approaches: learner-agnostic teaching, where the teacher provides demonstrations from an optimal policy ignoring the learner's preferences, and learner-aware teaching, where the teacher accounts for the learner’s preferences. We design learner-aware teaching algorithms and show that significant performance improvements can be achieved over learner-agnostic teaching.

**Propagating Uncertainty in Reinforcement Learning via Wasserstein Barycenters**

## Abstract

How does the uncertainty of the value function propagate when performing temporal difference learning? In this paper, we address this question by proposing a Bayesian framework in which we employ approximate posterior distributions to model the uncertainty of the value function and Wasserstein barycenters to propagate it across state-action pairs. Leveraging on these tools, we present an algorithm, Wasserstein Q-Learning (WQL), starting in the tabular case and then, we show how it can be extended to deal with continuous domains. Furthermore, we prove that, under mild assumptions, a slight variation of WQL enjoys desirable theoretical properties in the tabular setting. Finally, we present an experimental campaign to show the effectiveness of WQL on finite problems, compared to several RL algorithms, some of which are specifically designed for exploration, along with some preliminary results on Atari games.

**A Geometric Perspective on Optimal Representations for Reinforcement Learning**

## Abstract

We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. From there, we provide formal evidence regarding the usefulness of value functions as auxiliary tasks in reinforcement learning. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.

**LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning**

## Abstract

A great challenge in cooperative decentralized multi-agent reinforcement learning (MARL) is generating diversified behaviors for each individual agent when receiving only a team reward. Prior studies have paid much effort on reward shaping or designing a centralized critic that can discriminatively credit the agents. In this paper, we propose to merge the two directions and learn each agent an intrinsic reward function which diversely stimulates the agents at each time step. Specifically, the intrinsic reward for a specific agent will be involved in computing a distinct proxy critic for the agent to direct the updating of its individual policy. Meanwhile, the parameterized intrinsic reward function will be updated towards maximizing the expected accumulated team reward from the environment so that the objective is consistent with the original MARL problem. The proposed method is referred to as learning individual intrinsic reward (LIIR) in MARL. We compare LIIR with a number of state-of-the-art MARL methods on battle games in StarCraft II. The results demonstrate the effectiveness of LIIR, and we show LIIR can assign each individual agent an insightful intrinsic reward per time step.

**No-Press Diplomacy: Modeling Multi-Agent Gameplay**

## Abstract

Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal. Reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players. We present DipNet, a neural-network-based policy model for No Press Diplomacy. The model was trained on a new dataset of more than 150,000 human games. Our model is trained by supervised learning (SL) from expert trajectories, which is then used to initialize a reinforcement learning (RL) agent trained through self-play. Both the SL and the RL agent demonstrate state-of-the-art No Press performance by beating popular rule-based bots.

**State Aggregation Learning from Markov Transition Data**

## Abstract

State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system’s states into a small number of meta-states. The choice of aggregation map often depends on the data analysts’s knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system’s trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two- step algorithm: The first step is spectral decomposition of empirical transition matrix, and the second step conducts a linear transformation of singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit forms, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.

**Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning**

## Abstract

Posterior sampling for reinforcement learning (PSRL) is an effective method of balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation fail to satisfy the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.

**Decentralized Cooperative Stochastic Bandits**

## Abstract

We study a decentralized cooperative stochastic multi-armed bandit problem with K arms on a network of N agents. In our model, the reward distribution of each arm is the same for each agent and the actual rewards are drawn independently across the agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regret of the entire network. We design a fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates. We analyze the regret of our algorithm and also provide a lower bound. Our algorithm is simpler to analyze than those proposed in prior work and it achieves better regret bounds, while requiring less information about the underlying network. It also performs better empirically.

**Symmetry-Based Disentangled Representation Learning requires Interaction with Environments**

## Abstract

Finding a generally accepted formal definition of a disentangled representation in the context of an agent behaving in an environment is an important challenge towards the construction of data-efficient autonomous agents. Higgins et al. recently proposed Symmetry-Based Disentangled Representation Learning, a definition based on a characterization of symmetries in the environment using group theory. We build on their work and make observations, theoretical and empirical, that lead us to argue that Symmetry-Based Disentangled Representation Learning cannot only be based on fixed data samples. Agents should interact with the environment to discover its symmetries. Our experiments can be reproduced on Colab: http://bit.do/eKpqv

**Fast Efficient Hyperparameter Tuning for Policy Gradient Methods**

## Abstract

The performance of policy gradient methods is sensitive to hyperparameter settings that must be tuned for any new application. Widely used grid search methods for tuning hyperparameters are sample inefficient and computationally expensive. More advanced methods like Population Based Training that learn optimal schedules for hyperparameters instead of fixed settings can yield better results, but are also sample inefficient and computationally expensive. In this paper, we propose Hyperparameter Optimisation on the Fly (HOOF), a gradient-free meta-learning algorithm that requires no more than one training run to automatically adapt the hyperparameter that affect the policy update directly through the gradient. The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample and computationally efficient algorithm that is easy to implement. Our experimental results across multiple domains and algorithms show that using HOOF to learn these hyperparameter schedules leads to faster learning with improved performance.

**Making the Cut: A Bandit-based Approach to Tiered Interviewing**

## Abstract

Given a huge set of applicants, how should a firm allocate sequential resume screenings, phone interviews, and in-person site visits? In a tiered interview process, later stages (e.g., in-person visits) are more informative, but also more expensive than earlier stages (e.g., resume screenings). Using accepted hiring models and the concept of structured interviews, a best practice in human resources, we cast tiered hiring as a combinatorial pure exploration (CPE) problem in the stochastic multi-armed bandit setting. The goal is to select a subset of arms (in our case, applicants) with some combinatorial structure. We present new algorithms in both the probably approximately correct (PAC) and fixed-budget settings that select a near-optimal cohort with provable guarantees. We show on real data from one of the largest US-based computer science graduate programs that our algorithms make better hiring decisions or use less budget than the status quo.

**Game Design for Eliciting Distinguishable Behavior**

## Abstract

Inferring latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually constructed experiments or games. However, these traditional games are limited as they are typically designed based on heuristics. We formulate this task of designing \emph{behavior diagnostic} games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using Prospect Theory to model varying player traits, and Markov Decision Processes to parameterize the games. We demonstrate our approach empirically, showing that our designed games successfully distinguish players with different traits, outperforming manually-designed ones by a large margin.

**Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning**

## Abstract

We study two time-scale linear stochastic approximation algorithms, which can be used to model well-known reinforcement learning algorithms such as GTD, GTD2, and TDC. We present finite-time performance bounds for the case where the learning rate is fixed. The key idea in obtaining these bounds is to use a Lyapunov function motivated by singular perturbation theory for linear differential equations. We use the bound to design an adaptive learning rate scheme which significantly improves the convergence rate over the known optimal polynomial decay rule in our experiments, and can be used to potentially improve the performance of any other schedule where the learning rate is changed at pre-determined time instants.

**Online Learning for Auxiliary Task Weighting for Reinforcement Learning**

## Abstract

Reinforcement learning is known to be sample inefficient, preventing its application to many real-world problems, especially with high dimensional observations like images. Learning auxiliary tasks along with the reinforcement learning objective could be a powerful tool to improve the learning efficiency. However, the usage of auxiliary tasks has been limited so far due to the difficulty in selecting and combining different auxiliary tasks. In this work, we propose a principled online learning algorithm that dynamically combines different auxiliary tasks to speed up training for reinforcement learning. We argue that good auxiliary tasks should provide gradient directions that, in the long term, help to decrease the loss of the main task. We show that our algorithm can effectively combine a variety of different auxiliary tasks and achieves about a 3x speedup compared to using no auxiliary tasks in a set of robotic manipulation environments.

**Blocking Bandits**

## Abstract

We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' \log T+ \omega(\log T)$.

**Policy Evaluation with Latent Confounders via Optimal Balance**

## Abstract

Evaluating novel contextual bandit policies using logged data is crucial in applications where exploration is costly, such as medicine. But it usually relies on the assumption of no unobserved confounders, which is bound to fail in practice. We study the question of policy evaluation when we instead have proxies for the latent confounders and develop an importance weighting method that avoids fitting a latent outcome regression model. Surprisingly, we show that there exist no single set of weights that give unbiased evaluation regardless of outcome model, unlike the case with no unobserved confounders where density ratios are sufficient. Instead, we propose an adversarial objective and weights that minimize it, ensuring sufficient balance in the latent confounders regardless of outcome model. We develop theory characterizing the consistency of our method and tractable algorithms for it. Empirical results validate the power of our method when confounders are latent.

**Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs**

## Abstract

The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs). While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analysing the exploration bonus in the more challenging infinite-horizon undiscounted setting. We first introduce SCAL+, a variant of SCAL (Fruit et al. 2018), that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound $c$ on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm for this setting.

**Learning Mean-Field Games**

## Abstract

This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash Equilibrium to this GMFG, and explains that naively combining Q-learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes a Q-learning algorithm with Boltzmann policy (GMF-Q), with analysis of convergence property and computational complexity. The experiments on repeated Ad auction problems demonstrate that this GMF-Q algorithm is efficient and robust in terms of convergence and learning accuracy. Moreover, its performance is superior in convergence, stability, and learning ability, when compared with existing algorithms for multi-agent reinforcement learning.

**Deep imitation learning for molecular inverse problems**

## Abstract

Many measurement modalities arise from well-understood physical processes and result in information-rich but difficult-to-interpret data. Much of this data still requires laborious human interpretation. This is the case in nuclear magnetic resonance (NMR) spectroscopy, where the observed spectrum of a molecule provides a distinguishing fingerprint of its bond structure. Here we solve the resulting inverse problem: given a molecular formula and a spectrum, can we infer the chemical structure? We show for a wide variety of molecules we can quickly compute the correct molecular structure, and can detect with reasonable certainty when our method cannot. We treat this as a problem of graph-structured prediction, where armed with per-vertex information on a subset of the vertices, we infer the edges and edge types. We frame the problem as a Markov decision process (MDP) and incrementally construct molecules one bond at a time, training a deep neural network via imitation learning, where we learn to imitate a subisomorphic oracle which knows which remaining bonds are correct. Our method is fast, accurate, and is the first among recent chemical-graph generation approaches to exploit per-vertex information and generate graphs with vertex constraints. Our method points the way towards automation of molecular structure identification and potentially active learning for spectroscopy.

**Learning in Generalized Linear Contextual Bandits with Stochastic Delays**

## Abstract

In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards are available to the decision maker only after some delay, which is unknown and stochastic. We study the performance of two well-known algorithms adapted to this delayed setting: one based on upper confidence bounds, and the other based on Thompson sampling. We describe modifications on how these two algorithms should be adapted to handle delays and give regret characterizations for both algorithms. Our results contribute to the broad landscape of contextual bandits literature by establishing that both algorithms are robust to delays, thereby helping clarify and reaffirm the empirical success of these two algorithms, which are widely deployed in modern recommendation engines.

**Weight Agnostic Neural Networks**

## Abstract

Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On supervised learning domain, we find network architectures that can achieve much higher than chance accuracy on MNIST using random weights. Supplementary visualizations of results at https://anonwann.github.io/

**Learning to Predict Without Looking Ahead: World Models Without Forward Prediction**

## Abstract

Much of model-based reinforcement learning involves learning a model of an agent's world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware---e.g., a brain---arose as the byproduct of competing evolutionary pressures for survival, not minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can arise as a side-effect of optimization under the right circumstances. Crucially, this optimization process need not explicitly be a forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agents ability to observe the real environment at each timestep. In doing so, we can coerce an agent into learning a world model to fill in the observation gaps during reinforcement learning. We show that the emerged world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment. Dynamics of the emerged world models can be visualized at https://worldanon.github.io/

**Policy Learning for Fairness in Ranking**

## Abstract

Conventional Learning-to-Rank (LTR) methods optimize the utility of the rankings to the users, but they are oblivious to their impact on the ranked items. However, there has been a growing understanding that the latter is important to consider for a wide range of ranking applications (e.g. online marketplaces, job placement, admissions). To address this need, we propose a general LTR framework that can optimize a wide range of utility metrics (e.g. NDCG) while satisfying fairness of exposure constraints with respect to the items. This framework expands the class of learnable ranking functions to stochastic ranking policies, which provides a language for rigorously expressing fairness specifications. Furthermore, we provide a new LTR algorithm called Fair-PG-Rank for directly searching the space of fair ranking policies via a policy-gradient approach. Beyond the theoretical evidence in deriving the framework and the algorithm, we provide empirical results on simulated and real-world datasets verifying the effectiveness of the approach in individual and group-fairness settings.

**Off-Policy Evaluation of Generalization for Deep Q-Learning in Binary Reward Tasks**

## Abstract

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE of value-based methods, which are of particular interest in deep RL with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Furthermore, existing OPE metrics either rely on a model of the environment, or the use of importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real-world of policies trained in simulation for an image-based robotic manipulation task.

**Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model**

## Abstract

This paper focuses on the problem of computing an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model. We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.

**A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning**

## Abstract

In this paper we consider the problem of how a reinforcement learning agent that is tasked with solving a sequence of reinforcement learning problems (a sequence of Markov decision processes) can use knowledge acquired early in its lifetime to improve its ability to solve new problems. We argue that previous experience with similar problems can provide an agent with information about how it should explore when facing a new but related problem. We show that the search for an optimal exploration strategy can be formulated as a reinforcement learning problem itself and demonstrate that such strategy can leverage patterns found in the structure of related problems. We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework.

**Variance Reduced Policy Evaluation with Smooth Function Approximation**

## Abstract

Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approxi- mation, it allows for using a richer class of approximation functions such as the neural networks. Traditional algorithms are based on two timescales stochastic approximation whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of $m$ state-action pairs are observed. We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an $\epsilon$-stationary point using $O(m/\epsilon)$ calls (in expectation) to a gradient oracle.

**Addressing Sample Complexity in Visual Tasks Using HER and Hallucinatory GANs**

## Abstract

Reinforcement Learning (RL) algorithms typically require millions of environment interactions to learn successful policies in sparse reward settings. Hindsight Experience Replay (HER) was introduced as a technique to increase sample efficiency by re-imagining unsuccessful trajectories as successful ones by changing the originally intended goals. However, this approach cannot be directly applied to visual environments where goal states are characterized by the presence of distinct visual features. In this work, we show how visual trajectories can be hallucinated to appear successful by altering agent observations using a generative model trained on relatively few snapshots of the goal. We then use this model in combination with HER to train RL agents in visual settings. We validate our approach on 3D navigation tasks and a simulated robotics application and show marked improvement over standard RL algorithms and baselines derived from previous work.

**Doubly-Robust Lasso Bandit**

## Abstract

Contextual multi-armed bandit algorithms are widely used in sequential decision tasks such as news article recommendation systems, web page ad placement algorithms, and mobile health. Most of the existing algorithms have regret proportional to a polynomial function of the context dimension, $d$. In many applications however, it is often the case that contexts are high-dimensional with only a sparse subset of size $s_0 (\ll d)$ being correlated with the reward. We propose a novel algorithm, namely the Doubly-Robust Lasso Bandit algorithm, which exploits the sparse structure as in Lasso, while blending the doubly-robust technique used in missing data literature. The high-probability upper bound of the regret incurred by the proposed algorithm does not depend on the number of arms, has better dependency on $s_0$ than previous works, and scales with $\mathrm{log}(d)$ instead of a polynomial function of $d$. The proposed algorithm shows good performance when contexts of different arms are correlated and requires less tuning parameters than existing methods.

**Equipping Experts/Bandits with Long-term Memory**

## Abstract

We propose the first black-box approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002, by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with $K$ actions and $T$ rounds, using our general framework we develop various algorithms with a regret bound of order $\order(\sqrt{T(S\ln T + n \ln K)})$ compared to any sequence of experts with $S-1$ switches among $n \leq \min\{S, K\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework we also achieve the best of both stochastic and adversarial environments simultaneously, which resolves an open problem of Warmuth and Koolen 2014. Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandit, a sharp contrast with the non-contextual case.

**A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning**

## Abstract

We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. The extant entropy-regularized MDPs can be cast into our framework. Moreover, under our framework, many regularization terms can bring multi-modality and sparsity, which are potentially useful in reinforcement learning. In particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error, and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor critic algorithms in complex environment settings. We empirically analyze the numerical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.

**Divergence-Augmented Policy Optimization**

## Abstract

In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy to control the degree of off-policyness. Empirical experiments on Atari games show that in the data scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms.

**Fully Parameterized Quantile Function for Distributional Reinforcement Learning**

## Abstract

Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions and has achieved state-of-the-art performance on Atari Games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function or quantile function, leaving the other side uniformly fixed as in C51, QR-DQN or randomly sampled as in IQN. In this paper, we propose fully parameterized quantile function that parameterizes both the probability side and the value side, for distributional RL. Our algorithm contains a probability proposal network that generates a discrete set of probabilities and a quantile network that gives corresponding quantile values. The two networks are jointly trained to better approximate the true distribution. Experiments on 55 Atari Games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment.

**Distributional Reward Decomposition for Reinforcement Learning**

## Abstract

Many reinforcement learning (RL) tasks have specific properties that can be leveraged to modify existing RL algorithms to adapt to those tasks and further improve performance, and a general class of such properties is the multiple reward channel. In those environments the full reward can be decomposed into sub-rewards obtained from different channels. Existing work on reward decomposition either requires prior knowledge of the environment to decompose the full reward, or decomposes reward without prior knowledge but with degraded performance. In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels.

**Nonstochastic Multiarmed Bandits with Unrestricted Delays**

## Abstract

We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that the "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016], in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes this requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_\beta (|S_\beta|+\beta \ln K + (KT + D_\beta)\/\beta)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT + D)\ln K})$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.

**Efficient Pure Exploration in Adaptive Round model**

## Abstract

In the adaptive setting, many multi-armed bandit applications allow the learner to adaptively draw samples and adjust sampling strategy in rounds. In many real applications, not only the query complexity but also the round complexity need to be optimized. In this paper, we study both PAC and exact top-$k$ arm identification problems and design efficient algorithms considering both round complexity and query complexity. For PAC problem, we achieve optimal query complexity and use only $O(\log_{\frac{k}{\delta}}^*(n))$ rounds, which matches the lower bound of round complexity, while most of existing works need $\Theta(\log \frac{n}{k})$ rounds. For exact top-$k$ arm identification, we improve the round complexity factor from $\log n$ to $\log_{\frac{1}{\delta}}^*(n)$, and achieve near optimal query complexity. In experiments, our algorithms conduct far fewer rounds, and outperform state of the art by orders of magnitude with respect to query cost.

**Interval timing in deep reinforcement learning agents**

## Abstract

The measurement of time is central to intelligent behavior. We know that both animals and artificial agents can successfully use temporal dependencies to select actions. In artificial agents, little work has directly addressed (1) which architectural components are necessary for successful development of this ability, (2) how this timing ability comes to be represented in the units and actions of the agent, and (3) whether the resulting behavior of the system converges on solutions similar to those of biology. Here we studied interval timing abilities in deep reinforcement learning agents trained end-to-end on an interval reproduction paradigm inspired by experimental literature on mechanisms of timing. We characterize the strategies developed by recurrent and feedforward agents, which both succeed at temporal reproduction using distinct mechanisms, some of which bear specific and intriguing similarities to biological systems. These findings advance our understanding of how agents come to represent time, and they highlight the value of experimentally inspired approaches to characterizing agent abilities.

**Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs**

## Abstract

Robust MDPs (RMDPs) can be used to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by the ambiguity set---the set of plausible transition probabilities---which is usually constructed as a multi-dimensional confidence region. Existing methods construct ambiguity sets as confidence regions using concentration inequalities which leads to overly conservative solutions. This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. To incorporate prior knowledge, our algorithms optimize the size and position of ambiguity sets using Bayesian inference. Our theoretical analysis shows the safety of the proposed method, and the empirical results demonstrate its practical promise.

**The bias of the sample mean in multi-armed bandits can be positive or negative**

## Abstract

It is well known that in stochastic multi-armed bandits (MAB), the sample mean of an arm is typically not an unbiased estimator of its true mean. In this paper, we decouple three different sources of this selection bias: adaptive \emph{sampling} of arms, adaptive \emph{stopping} of the experiment, and adaptively \emph{choosing} which arm to study. Through a new notion called ``optimism'' that captures certain natural monotonic behaviors of algorithms, we provide a clean and unified analysis of how optimistic rules affect the sign of the bias. The main takeaway message is that optimistic sampling induces a negative bias, but optimistic stopping and optimistic choosing both induce a positive bias. These results are derived in a general stochastic MAB setup that is entirely agnostic to the final aim of the experiment (regret minimization or best-arm identification or anything else). We provide examples of optimistic rules of each type, demonstrate that simulations confirm our theoretical predictions, and pose some natural but hard open problems.

**On the Correctness and Sample Complexity of Inverse Reinforcement Learning**

## Abstract

Inverse reinforcement learning (IRL) is the problem of finding a reward function that generates a given optimal policy for a given Markov Decision Process. This paper looks at an algorithmic-independent geometric analysis of the IRL problem with finite states and actions. A L1-regularized Support Vector Machine formulation of the IRL problem motivated by the geometric analysis is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. The paper further analyzes the proposed formulation of inverse reinforcement learning with $n$ states and $k$ actions, and shows a sample complexity of $O(n^2 \log (nk))$ for recovering a reward function that generates a policy that satisfies Bellman's optimality condition with respect to the true transition probabilities.

**VIREL: A Variational Inference Framework for Reinforcement Learning**

## Abstract

Applying probabilistic models to reinforcement learning (RL) enables the uses of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the lack of mode capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal polices naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. In applying variational expectation-maximisation to VIREL, we thus show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We then derive a family of actor-critic methods fromVIREL, including a scheme for adaptive exploration. Finally, we demonstrate that actor-critic algorithms from this family outperform state-of-the-art methods based on soft value functions in several domains.

**Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning**

## Abstract

This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution can be presented in four points. 1) we define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making an hypothesis of Lipschitz-Continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot Model-Based method similar to Minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms.

**Explicit Planning for Efficient Exploration in Reinforcement Learning**

## Abstract

Efficient exploration is crucial to achieving good performance in reinforcement learning. Existing systematic exploration strategies (R-MAX, MBIE, UCRL, etc.), despite being promising theoretically, are essentially greedy strategies that follow some predefined heuristics. When the heuristics do not match the dynamics of Markov decision processes (MDPs) well, an excessive amount of time can be wasted in travelling through already-explored states, lowering the overall efficiency. We argue that explicit planning for exploration can help alleviate such problem, and propose Value Iteration for Exploration Cost (VIEC) algorithm which computes the optimal exploration scheme by solving an augmented MDP. We then present a detailed analysis of the exploration behaviour of some popular strategies, showing how these strategies can fail and spend O(n^2 md) or O(n^2 m + nmd) steps to collect sufficient data in some tower-shaped MDPs, while the optimal exploration scheme, which can be found out by VIEC, only needs O(nmd), where n, m are the numbers of states and actions and d is the data demand. The analysis not only points out the weakness of existing heuristic-based strategies, but also suggests a remarkable potential in explicit planning for exploration.

**Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints**

## Abstract

We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

**Constrained Reinforcement Learning: A Dual Approach**

## Abstract

Autonomous agents must often deal with conflicting requirements, such as completing tasks using the least amount of time/energy, learning multiple tasks, or dealing with multiple opponents. In the context of reinforcement learning~(RL), these problems are addressed by (i)~designing a reward function that simultaneously describes all requirements or (ii)~combining modular value functions that encode them individually. Though effective, these methods have critical downsides. Designing good reward functions that balance different objectives is challenging, especially as the number of objectives grows. Moreover, implicit interference between goals may lead to performance plateaus as they compete for resources, particularly when training on-policy. Similarly, selecting parameters to combine value functions is at least as hard as designing an all-encompassing reward, given that the effect of their values on the overall policy is not straightforward. This work addresses this issue by formulating the conflicting requirement as a constrained RL problem. Despite its non-convexity, we prove that this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. Finally, we show this result basically holds if the policy is described by a good parametrization~(e.g., neural networks) and we connect this result with primal-dual algorithms present in the literature and we establish the convergence to the optimal solution.

**MAVEN: Multi-Agent Variational Exploration**

## Abstract

Centralised training with decentralised execution is an important setting for cooperative deep multi-agent reinforcement learning due to communication constraints during execution and computational tractability in training. In this paper, we analyse value-based methods that are known to have superior performance in complex environments. We specifically focus on QMIX, the current state-of-the-art in this domain. We show that the representation constraints on the joint action-values introduced by QMIX and similar methods lead to provably poor exploration and suboptimality. Furthermore, we propose a novel approach called MAVEN that hybridises value and policy-based methods by introducing a latent space for hierarchical control. The value-based agents condition their behaviour on the shared latent variable controlled by a hierarchical policy. This allows MAVEN to achieve committed, temporally extended exploration, which is key to solving complex multi-agent tasks. Our experimental results show that MAVEN achieves significant performance improvements on the challenging SMAC domain.

**Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards**

## Abstract

We propose a new complexity measure for Markov decision processes (MDP), the maximum expected hitting cost (MEHC). This measure tightens the closely related notion of diameter [JOA10] by accounting for the reward structure. We show that this parameter replaces diameter in the upper bound on the optimal value span of an extended MDP, thus refining the associated upper bounds on the regret of several UCRL2-like algorithms. Furthermore, we show that potential-based reward shaping [NHR99] can induce equivalent reward functions with varying informativeness, as measured by MEHC. By analyzing the change in the maximum expected hitting cost, this work presents a formal understanding of the effect of potential-based reward shaping on regret (and sample complexity) in the undiscounted average reward setting. We further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards.

**A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment**

## Abstract

Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent's control over the environment by encouraging visiting states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states. We propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman’s optimality principle as well as recent information-theoretical extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms for high-dimensional continuous domains. We experimentally validate our methods in robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.

**SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies**

## Abstract

Imitation Learning (IL) has been successfully applied to complex sequential decision-making problems where standard Reinforcement Learning (RL) algorithms fail. A number of recent methods extend IL to few-shot learning scenarios, where a meta-trained policy learns to quickly master new tasks using limited demonstrations. However, although Inverse Reinforcement Learning (IRL) often outperforms Behavioral Cloning (BC) in terms of imitation quality, most of these approaches build on BC due to its simple optimization objective. In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILE significantly outperforms Meta-BC. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the intractable function approximator setting.

**Provably Efficient Q-Learning with Low Switching Cost**

## Abstract

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of \emph{local switching cost}. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for $H$-step episodic MDP that achieves sublinear regret whose local switching cost in $K$ episodes is $O(H^3SA\log K)$, and we provide a lower bound of $\Omega(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting \citep{guo2015concurrent}, which yields nontrivial results that improve upon prior work in certain aspects.

**Difference Maximization Q-learning: Provably Efficient Q-learning with Function Approximation**

## Abstract

$Q$-learning with function approximation is one of the most popular methods in reinforcement learning. Though the idea of using function approximation was proposed at least 60 years ago [Samuel,1959], even in the simplest setup, i.e, approximating $Q$-functions with linear functions, it is still an open problem how to design a provably efficient algorithm that learns a near-optimal policy. The key challenges are how to efficiently explore the state space and how to decide when to stop exploring \emph{in conjunction with} the function approximation scheme. The current paper presents the first provably efficient algorithm for $Q$-learning with linear function approximation. Our algorithm, Difference Maximization $Q$-learning (DMQ), combined with linear function approximation, returns an $\epsilon$-suboptimal policy using $\poly(H,K,d,1/\gamma,1/\epsilon)$ number of trajectories, where $H$ is the planning horizon, $K$ is the number of actions, $d$ is the feature dimension and $\gamma$ is the smallest gap between mean reward-to-go of the optimal action and the rest. Our algorithm introduces a new notion, the Distribution Shift Error Checking (DSEC) oracle. This oracle tests whether there exists a function in the function class that predicts well on a distribution $\dist_1$, but predicts poorly on another distribution $\dist_2$, where $\dist_1$ and $\dist_2$ are distributions over states induced by two different exploration policies. For the linear function class, this oracle is equivalent to solving a top eigenvalue problem. We believe our algorithmic insights, especially the DSEC oracle, are also useful in designing and analyzing reinforcement learning algorithms with general function approximation.

**A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning**

## Abstract

Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.

**Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning**

## Abstract

Gradient-based methods for optimisation of objectives in stochastic settings with unknown or intractable dynamics require estimators of derivatives. We derive an objective that, under automatic differentiation, produces low-variance unbiased estimators of derivatives at any order. Our objective is compatible with arbitrary advantage estimators, which allows the control of the bias and variance of any-order derivatives when using function approximation. Furthermore, we propose a method to trade off bias and variance of higher order derivatives by discounting the impact of more distant causal dependencies. We demonstrate the correctness and utility of our estimator in analytically tractable MDPs and in meta-reinforcement-learning for continuous control.

**Learning-In-The-Loop Optimization: End-To-End Control And Co-Design Of Soft Robots Through Learned Deep Latent Representations**

## Abstract

Soft robots have continuum solid bodies that can deform in an infinite number of ways. Controlling soft robots is very challenging as there are no closed form solutions. We present a learning-in-the-loop co-optimization algorithm in which a latent state representation is learned as the robot figures out how to solve the task. Our solution marries hybrid particle-grid-based simulation with deep, variational convolutional autoencoder architectures that can capture salient features of robot dynamics with high efficacy. We demonstrate our dynamics-aware feature learning algorithm on both 2D and 3D soft robots, and show that it is more robust and faster converging than the dynamics-oblivious baseline. We validate the behavior of our algorithm with visualizations of the learned representation.

**Provably Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost**

## Abstract

Despite the empirical success of the actor-critic algorithm, its theoretical understanding lags behind. In a broader context, actor-critic can be viewed as an online alternating update algorithm for bilevel optimization, whose convergence is known to be fragile. To understand the instability of actor-critic, we focus on its application to linear quadratic regulators, a simple yet fundamental setting of reinforcement learning. We establish a nonasymptotic convergence analysis of actor- critic in this setting. In particular, we prove that actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence. Our analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.

**Learning from Trajectories via Subgoal Discovery**

## Abstract

Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples. In such cases, using a set of expert trajectories could help to learn faster. However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks, which other RL, IL or their combinations in literature are not able to solve.

**Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory**

## Abstract

In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first and second order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of some augmented versions of the mean and covariance matrix of TD learning exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying the well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of TD learning at any time step. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of TD learning. We perform a perturbation analysis to show the dependence of the behaviors of TD learning on learning rate. In addition, for the IID case, we provide an exact formula characterizing how the mean and covariance matrix of TD learning converge to the steady state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by learning rate and various properties of the underlying Markov chain.

**Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator**

## Abstract

We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\log(1/\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$. We furthermore build on our analysis and construct a simple adaptive procedure based on $\varepsilon$-greedy exploration which relies on approximate PI as a sub-routine and obtains $T^{2/3}$ regret, improving upon a recent result of Abbasi-Yadkori et al. 2019.

**Finite-Sample Analysis for SARSA with Linear Function Approximation**

## Abstract

SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under the non-i.i.d.\ setting, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically. However, its non-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d. samples, and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithms, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we further provide the finite-sample analysis on the mean square error of the SARSA algorithm. In the end, we present a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant as special cases. This fitted SARSA algorithm provides a framework for \textit{iterative} on-policy fitted policy iteration, which is more memory and computationally efficient. For this fitted SARSA algorithm, we also present its finite-sample analysis.

**Unsupervised State Representation Learning in Atari**

## Abstract

State representation learning, or the ability to capture latent generative factors of an environment is crucial for building intelligent agents that can perform a wide variety of tasks. Learning such representations in an unsupervised manner without supervision from rewards is an open problem. We introduce a method that tries to learn better state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state. We believe this new framework for evaluating representation learning models will be crucial for future representation learning research. Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods.

**Surrogate Objectives for Batch Policy Optimization in One-step Decision Making**

## Abstract

We investigate batch policy optimization for cost-sensitive classification and contextual bandits---two related tasks that obviate exploration but require generalizing from observed rewards to action selections in unseen contexts. When rewards are fully observed, we show that the expected reward objective exhibits suboptimal plateaus and exponentially many local optima in the worst case. To overcome the poor landscape, we develop a convex surrogate that is calibrated with respect to entropy regularized expected reward. We then consider the partially observed case, where rewards are recorded for only a subset of actions. Here we generalize the surrogate to partially observed data, and uncover novel objectives for batch contextual bandit training. We find that surrogate objectives remain provably sound in this setting and empirically demonstrate state-of-the-art performance.

**Regret Bounds for Thompson Sampling in Restless Bandit Problems**

## Abstract

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been studied well from the optimization perspective, where we aim to efficiently find a near-optimal policy when system parameters are known. However, very few papers adopt a learning perspective, where the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an $\tilde{O}(\sqrt{T})$ Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks including the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy. We also present empirical results that support our theoretical findings.

**Better Transfer Learning Through Inferred Successor Maps**

## Abstract

Humans and animals show remarkable flexibility in adjusting their behaviour when their goals, or rewards in the environment change. While such flexibility is a hallmark of intelligent behaviour, these multi-task scenarios remain an important challenge for machine learning algorithms and neurobiological models alike. Factored representations can enable flexible behaviour by abstracting away general aspects of a task from those prone to change, while nonparametric methods provide a principled way of using similarity to past experiences to guide current behaviour. Here we combine the successor representation (SR), that factors the value of actions into expected outcomes and corresponding rewards, with nonparametric inference and clustering of the space of rewards. We propose an algorithm that improves SR's transfer capabilities, while explaining important signatures of place cell representations in the hippocampus . Our method dynamically samples from a flexible number of distinct SR maps using inference about the current reward context, and outperforms competing algorithms in settings with both known and unsignalled rewards changes. It reproduces the "flickering" behaviour of hippocampal maps seen when rodents navigate to changing reward locations, and gives a quantitative account of trajectory-dependent hippocampal representations (so-called splitter cells). We thus provide a novel algorithmic approach for multi-task learning, as well as a common normative framework that links together these different characteristics of the brain's spatial representation.

**Online Continuous Submodular Maximization: From Full-Information to Bandit Feedback**

## Abstract

In this paper, we propose three online algorithms for submodular maximisation. The first one, Mono-Frank-Wolfe, reduces the number of per-function gradient evaluations from $T^{1/2}$ [Chen2018Online] and $T^{3/2}$ [chen2018projection] to 1, and achieves a $(1-1/e)$-regret bound of $O(T^{4/5})$. The second one, Bandit-Frank-Wolfe, is the first bandit algorithm for continnuous DR-submodular maximization, which achieves a $(1-1/e)$-regret bound of $O(T^{8/9})$. Finally, we extend Bandit-Frank-Wolfe to a bandit algorithm for discrete submodular maximization, Responsive-Frank-Wolfe, which attains a $(1-1/e)$-regret bound of $O(T^{8/9})$ in the responsive bandit setting.

**Sampling Networks and Aggregate Simulation for Online POMDP Planning**

## Abstract

The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDP) based on the idea of aggregate simulation. The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate action-value function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, giving a symbolic representation of the value function. The algorithm supports large observation spaces using sampling networks, a representation of the process of sampling values of observations, which is integrated into the graph representation. Following previous work in MDPs this approach enables action selection in POMDPs through gradient optimization over the graph representation. This approach complements recent algorithms for POMDPs which are based on particle representations of belief states and an explicit search for action selection. Our approach enables scaling to large factored action spaces in addition to large state spaces and observation spaces. An experimental evaluation demonstrates that the algorithm provides excellent performance relative to state of the art in large POMDP problems.

**Linear Stochastic Bandits Under Safety Constraints**

## Abstract

Bandit algorithms have various application in safety-critical systems, where it is important to respect the system constraints that rely on the bandit's unknown parameters at every round. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that her actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose a new UCB-based algorithm called Safe-LUCB, which includes necessary modifications to respect safety constraints. The algorithm has two phases. During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set. Once this goal is achieved, the algorithm begins a safe exploration-exploitation phase where the learner gradually expands their estimate of the set of safe actions while controlling the growth of regret. We provide a general regret bound for the algorithm, as well as a problem dependent bound that is connected to the location of the optimal action within the safe set. We then propose a modified heuristic that exploits our problem dependent analysis to improve the regret.

**Budgeted Reinforcement Learning in Continuous State Space**

## Abstract

A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints. It relies on a notion of risk implemented in the shape of an upper bound on a constrains violation signal that -- importantly -- can be modified in real-time. So far, BMDPs could only be solved in the case of finite state spaces with known dynamics. This work extends the state-of-the-art to continuous spaces environments and unknown dynamics. We show that the solution to a BMDP is the fixed point of a novel Budgeted Bellman Optimality operator. This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. We validate our approach on two simulated applications: spoken dialogue and autonomous driving.

**Explicit Explore-Exploit Algorithms in Continuous State Spaces**

## Abstract

We present a new model-based algorithm for reinforcement learning (RL) which consists of explicit exploration and exploitation phases, and is applicable in large or infinite state spaces. The algorithm maintains a set of dynamics models consis- tent with current experience and explores by finding policies which induce high disagreement between their state predictions. It then exploits using the refined set of models or experience gathered during exploration. We show that under realizability and optimal planning assumptions, our algorithm provably finds a near-optimal policy with a number of samples which is polynomial in terms of a structural complexity measure which we show to be low in several natural settings. We then give a practical approximation using neural networks and gradient-based optimization, and demonstrate its performance and sample efficiency in practice.

**Language as an Abstraction for Hierarchical Deep Reinforcement Learning**

## Abstract

Solving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL). We hypothesize that one critical element of solving such problems is the notion of compositionality. With the ability to learn sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors. However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging. In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language. To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine. We find that, using our approach, agents can learn to solve to diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations. Our analysis find that the compositional nature of language is critical for learning and systematically generalizing sub-skills in comparison to non-compositional abstractions that use the same supervision.

**Non-Cooperative Inverse Reinforcement Learning**

## Abstract

Making decisions in the presence of a strategic opponent requires one to take into account the opponent’s ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL problem consists of two agents with completely misaligned objectives, where only one of the agents knows the true reward function. Formally, we model the N-CIRL problem as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to infer the true reward function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games. The theoretical results serve as a basis for the design of efficient algorithms for computing equilibrium strategies. The N-CIRL formalism has natural applications in cyber-security where a defender attempts to defend a system without perfect knowledge of the attacker’s intent.

**Maximum Entropy Monte-Carlo Planning**

## Abstract

We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, \emph{Maximum Entropy for Tree Search} (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT. Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.

**Guided Meta-Policy Search**

## Abstract

Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks so as to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the \emph{meta-training} process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.

**Marginalized Off-Policy Evaluation for Reinforcement Learning**

## Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) --- the problem of evaluating a new policy using the historical data obtained by different behavior policies --- under the model of nonstationary episodic Markov Decision Processes with a long horizon and large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $O(H^2R_{\max}^2\sum_{t=1}^H\E_\mu[(w_{\pi,\mu}(s_t,a_t))^2]/n)$ for large $n$, where $w_{\pi,\mu}(s_t,a_t)$ is the ratio of the marginal distribution of $t$th step under $\pi$ and $\mu$, $H$ is the horizon, $R_{\max}$ is the maximal rewards, and $n$ is the sample size. The result nearly matches the Cramer-Rao lower bounds for DAG MDP in \citet{jiang2016doubly} for most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence in $H$ and the second moments of the importance weight. Besides theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.

**Contextual Bandits with Cross-Learning**

## Abstract

In the classical contextual bandits problem, in each round $t$, a learner observes some context $c$, chooses some action $a$ to perform, and receives some reward $r_{a,t}(c)$. We consider the variant of this problem where in addition to receiving the reward $r_{a,t}(c)$, the learner also learns the values of $r_{a,t}(c')$ for all other contexts $c'$; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $\tilde{O}(\sqrt{KT})$ (when contexts are stochastic with known distribution), $\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic with unknown distribution), and $\tilde{O}(\sqrt{KT})$ (when contexts are adversarial but rewards are stochastic). We simulate our algorithms on real auction data from an ad exchange running first-price auctions (showing that they outperform traditional contextual bandit algorithms).

**A Bayesian Theory of Conformity in Collective Decision Making**

## Abstract

In collective decision making, members of a group need to coordinate their actions in order to achieve a desirable outcome. When there is no direct communication between group members, one should decide based on inferring others' intentions from their actions. The inference of others' intentions is called "theory of mind" and can involve different levels of reasoning, from a single inference on a hidden variable to considering others partially or fully optimal and reasoning about their actions conditioned on one's own actions (levels of “theory of mind”). In this paper, we present a new Bayesian theory of collective decision making based on a simple yet most commonly observed behavior: conformity. We show that such a Bayesian framework allows one to achieve any level of theory of mind in collective decision making. The viability of our framework is demonstrated on two different experiments, a consensus task with 120 subjects and a volunteer's dilemma task with 29 subjects, each with multiple conditions.

**Multi-Agent Common Knowledge Reinforcement Learning**

## Abstract

Cooperative multi-agent reinforcement learning often requires decentralised policies, which severely limit the agents' ability to coordinate their behaviour. In this paper, we show that common knowledge between agents allows for complex decentralised coordination. Common knowledge arises naturally in a large number of decentralised cooperative multi-agent tasks, for example, when agents can reconstruct parts of each others' observations. Since agents can independently agree on their common knowledge, they can execute complex coordinated policies that condition on this knowledge in a fully decentralised fashion. We propose multi-agent common knowledge reinforcement learning (MACKRL), a novel stochastic actor-critic algorithm that learns a hierarchical policy tree. Higher levels in the hierarchy coordinate groups of agents by conditioning on their common knowledge, or delegate to lower levels with smaller subgroups but potentially richer common knowledge. The entire policy tree can be executed in a fully decentralised fashion. As the lowest policy tree level consists of independent policies for each agent, MACKRL reduces to independently learnt decentralised policies as a special case. We demonstrate that our method can exploit common knowledge for superior performance on complex decentralised coordination tasks, including a stochastic matrix game and challenging problems in StarCraft II unit micromanagement.

**Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning**

## Abstract

Model-free deep reinforcement learning (RL) algorithms have been widely used for a range of complex control tasks. However, slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous and high-dimensional state spaces. To tackle this problem, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing the idea underlying regularized Anderson acceleration (RAA), which is an effective approach to accelerating the solving of fixed point problems with perturbations. Specifically, we first explain how policy iteration can be applied directly with Anderson acceleration. Then we extend RAA to the case of deep RL by introducing a regularization term to control the impact of perturbation induced by function approximation errors. We further propose two strategies, i.e., progressive update and adaptive restart, to enhance the performance. The effectiveness of our method is evaluated on a variety of benchmark tasks, including Atari 2600 and MuJoCo. Experimental results show that our approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms.

**Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards**

## Abstract

While using shaped rewards can be beneficial when solving sparse reward tasks, their successful application often requires careful engineering and is problem specific. We introduce a simple and effective model-free method to learn from shaped distance-to-goal rewards on tasks where success depends on reaching a goal state. Our method introduces an auxiliary distance-based reward based on pairs of rollouts to encourage diverse exploration. This approach effectively destabilizes local optima induced by the naive distance-to-goal reward shaping while enabling policies to efficiently solve the sparse reward task. Our augmented objective does not require any additional reward engineering or domain expertise to implement. We demonstrate that our method successfully solves a variety of hard-exploration tasks [including maze navigation and 3D construction in a Minecraft environment], where naive distance-based reward shaping otherwise fails, and curiosity and reward relabeling strategies exhibit poor performance.

**Bandits with Feedback Graphs and Switching Costs**

## Abstract

We study the adversarial multi-armed bandit problem where partial observations are available and where, in addition to the loss incurred for each action, a switching cost is incurred for shifting to a new action. All previously known results incur a factor proportional to the independence number of the feedback graph. We give a new algorithm whose regret guarantee depends only on the domination number of the graph. We further supplement that result with a lower bound. Finally, we also give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

**Unsupervised Curricula for Visual Meta-Reinforcement Learning**

## Abstract

Meta-reinforcement learning algorithms leverage experience across many tasks to learn fast and effective reinforcement learning (RL) algorithms. However, current meta-RL methods depend critically on a manually-defined distribution of meta-training tasks, and hand-crafting these task distributions is challenging and time-consuming. We develop an unsupervised algorithm for inducing an adaptive meta-training task distribution, i.e. an automatic curriculum, by modeling unsupervised interaction in a visual environment. Crucially, the task distribution is scaffolded by the meta-learner's behavior, with density-based exploration driving the evolution of the task distribution. We formulate unsupervised meta-RL with an information-theoretic objective optimized via expectation-maximization over trajectory-level latent variables. Repeating this procedure leads to iterative reorganization of behavior, allowing the task distribution to adapt as the meta-learner becomes more competent. In our experiments on vision-based navigation and manipulation domains, we show that our algorithm allows for unsupervised meta-learning of skills that transfer to downstream tasks specified by human-provided reward functions, as well as pre-training for more efficient meta-learning on user-defined task distributions. To understand the nature of the curricula, we provide visualizations and analysis of the task distributions discovered throughout the learning process, finding that the emergent tasks span a range of environment-specific exploratory and exploitative behavior.

**Neural Proximal Policy Optimization Attains Optimal Policy**

## Abstract

Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are realized by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.

**Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback**

## Abstract

We propose computationally efficient algorithms for \textit{online linear optimization with bandit feedback}, in which a player chooses an \textit{action vector} from a given (possibly infinite) set $\mathcal{A} \subseteq \mathbb{R}^d$, and then suffers a loss that can be expressed as a linear function in action vectors. Though there exist algorithms that achieve an optimal regret bound of $\tilde{O}(\sqrt{T})$ for $T$ rounds (ignoring factors of $\mathrm{poly} (d, \log T)$), computationally efficient ways of implementing them have not yet been made clear, in particular when $|\mathcal{A}|$ is not bounded by a polynomial size in $d$. One standard way to pursue computational efficiency is to assume that we have an efficient algorithm referred to as \textit{oracle} that solves (offline) linear optimization problems over $\mathcal{A}$. Under this assumption, the computational efficiency of a bandit algorithm can then be measured in terms of \textit{oracle complexity}, i.e., the number of oracle calls. Our contribution is to propose algorithms that offer optimal regret bounds of $\tilde{O}(\sqrt{T})$ as well as low oracle complexity for both \textit{non-stochastic settings} and \textit{stochastic settings}. Our algorithm for non-stochastic settings has an oracle complexity of $\tilde{O}( T )$ and is the first algorithm that achieves both a regret bound of $\tilde{O}( \sqrt{T} )$ and an oracle complexity of $\tilde{O} ( \mathrm{poly} ( T ) )$, given only linear optimization oracles. Our algorithm for stochastic settings calls the oracle only $O( \mathrm{poly} (d, \log T))$ times, which is smaller than the current best oracle complexity of $O( T )$ if $T$ is sufficiently large.

**Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples**

## Abstract

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d.\ Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t/t^(2/3)) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

**Unsupervised Learning of Object Keypoints for Perception and Control**

## Abstract

The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our methods learns in a fully unsupervised manner from raw video frames by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent works on unsupervised learning of object keypoints. Further, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards.

**InteractiveRecGAN: a Model Based Reinforcement Learning Method with Adversarial Training for Online Recommendation**

## Abstract

Reinforcement learning is effective in getting policies for recommender systems. However, current works focus on model-free approaches. These approaches require frequent interactions with real environments which are expensive especially for recommender systems with a large number of users. Efficient offline evaluation methods like importance sampling can alleviate such problems, but they usually request a large amount of online log data with initial actions taken which is also hard to get. In this work, we propose a model-based reinforcement learning method that models user-agent interactions. At each step, users' behaviors are decided by interactions among the agent and environment model. During the training, we simulate data by the interactive model. Both given and simulated data are used for model updates. Moreover, to reduce biases we utilize a discriminator to distinguish the quality of simulated sequences and rescale rewards. We do theoretical analysis and conduct experiments on real data to show that our method can effectively catch patterns from given data and evaluate policies based on the data.

**Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards**

## Abstract

Classical multi-armed bandit problems use the expected value of an arm as a metric to evaluate its goodness. However, the expected value is a risk-neutral metric. In many applications like finance, one is interested in balancing the expected return of an arm (or portfolio) with the risk associated with that return. In this paper, we consider the problem of selecting the arm that optimizes a linear combination of the expected reward and the associated Conditional Value at Risk (CVaR) in a fixed budget best-arm identification framework. We allow the reward distributions to be unbounded or even heavy-tailed. For this problem, our goal is to devise algorithms that are entirely distribution oblivious, i.e., the algorithm is not aware of any information on the reward distributions, including bounds on the moments/tails, or the suboptimality gaps across arms. In this paper, we provide a class of such algorithms with provable upper bounds on the probability of incorrect identification. In the process, we develop a novel estimator for the CVaR of unbounded (including heavy-tailed) random variables and prove a concentration inequality for the same, which could be of independent interest. We also compare the error bounds for our distribution oblivious algorithms with those corresponding to standard non-oblivious algorithms. Finally, numerical experiments reveal that our algorithms perform competitively when compared with non-oblivious algorithms, suggesting that distribution obliviousness can be realised in practice without incurring a significant loss of performance.

**Neural Temporal-Difference Learning Converges to Global Optima**

## Abstract

Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.

**Privacy-Preserving Q-Learning with Functional Noise in Continuous Spaces**

## Abstract

We consider differentially private algorithms for reinforcement learning in continuous state spaces, such that neighboring reward functions are indistinguishable. Existing studies that guarantee differential privacy are not extendable to infinite state spaces, since the noise level to ensure privacy will scale accordingly to infinity. Our aim is to protect the privacy for the value function approximator, without regard to the number of states queried to the function. We add functional noise to the value function iteratively in the training. We show rigorous privacy guarantees by a series of analyses on the kernel of the noise space, the probabilistic bound of such noise samples, and the composition of the noise. We gain insight into the utility analysis by proving the algorithm's approximate optimality, under the discrete state space setting. Experiments corroborate our theoretical findings and show improvement over existing methods.

**Online EXP3 Learning in Adversarial Bandits with Delayed Feedback**

## Abstract

Consider a player that in each of T rounds needs to choose one of K arms. An adversary chooses the cost of each arm in a bounded interval. After picking arm a_{t} at round t, the player receives the cost of playing this arm d_{t} rounds later. In case t+d_{t}>T, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the feedback upon their arrival) achieves a regret of O\left(\sqrt{\ln K\left(KT+\sum_{t=1}^{T}d_{t}\right)}\right). We then consider a two player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the “no-regret property”, (e.g., where d_{t}=O\left(t\log t\right)) the ergodic average of the strategy profile still converges to the set of Nash equilibrium of the game. The result is made possible by choosing an adaptive learning rate \eta_{t} that is not summable but is square summable, and proving a “weighted regret bound” for this general case.

**Many-Armed Bandits with High-Dimensional Contexts under a Low-Rank Structure**

## Abstract

We consider the k-armed stochastic contextual bandit problem with d dimensional features, when both k and d can be large. To the best of our knowledge, all existing algorithm for this problem have a regret bound that scale as polynomials of degree at least two in k and d. The main contribution of this paper is to introduce and theoretically analyze a new algorithm (REAL Bandit) with a regret that scales by r^2(k+d) when r is rank of the k by d matrix of unknown parameters. REAL Bandit relies on ideas from low-rank matrix estimation literature and a new row-enhancement subroutine that yields sharper bounds for estimating each row of the parameter matrix that may be of independent interest.

**Policy Optimization Provably Converges to Nash Equilibria in Zero-Sum Linear Quadratic Games**

## Abstract

In this paper, we study the global convergence of policy optimization for solving zero-sum linear quadratic (LQ) games. In particular, we first investigate the landscape of LQ games, viewing it as a nonconvex-nonconcave saddle-point problem in the policy space. We have shown that despite its nonconvexity and nonconcavity, zero-sum LQ games have the property that the stationary point of the objective with respect to the feedback control policies constitutes the Nash equilibrium (NE) of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the games, with global sublinear rate, and local linear rate. Simulation results are then provided to validate the proposed algorithms. To the best of our knowledge, our work appears to be the first that investigates the optimization landscape of LQ games, and provably shows the convergence of policy optimization methods to the Nash equilibria. We believe the results set theoretical foundations for developing model-free policy-based reinforcement learning algorithms for zero-sum LQ games.

**Thresholding Bandit with Optimal Aggregate Regret**

## Abstract

We consider the thresholding bandit problem, whose goal is to find arms of mean rewards above a given threshold $\theta$, with a fixed budget of $T$ trials. We introduce LSA, a new, simple and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios.

**Causal Misidentification in Imitation Learning**

## Abstract

Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive "causal misidentification" phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions---either environment interaction or expert queries---to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.

**Meta-Inverse Reinforcement Learning with Probabilistic Context Variables**

## Abstract

Reinforcement learning demands a reward function, which is often difficult to provide or design in real world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, and more subtly, existing methods typically assume demonstrations for one, isolated behavior or task, while in practice, it is significantly more natural and scalable to provide datasets of heterogeneous behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from unstructured, multi-task demonstration data, and critically, use this experience to infer robust rewards for new, structurally-similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.

**Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction**

## Abstract

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify \emph{bootstrapping error} as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random data and suboptimal demonstrations, on a range of continuous control tasks.

**Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates**

## Abstract

We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.

**Weighted Linear Bandits for Non-Stationary Environments**

## Abstract

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d B_T^{1/3}T^{2/3}, where B_T is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.

**Improved Regret Bounds for Bandit Combinatorial Optimization**

## Abstract

\textit{Bandit combinatorial optimization} is a bandit framework in which a player chooses an action in a given finite set $\mathcal{A} \subseteq \{ 0, 1 \}^d$ and suffers a loss that is the inner product of the chosen action and an unobservable loss vector in $\mathbb{R} ^ d$ in each round. This paper aims to reveal what property makes the bandit combinatorial optimization hard. Recently, Cohen et al.~\citep{cohen2017tight} showed a lower bound $\Omega(\sqrt{d k^3 T / \log T})$ of the regret, where $k$ is the maximum $\ell_1$-norm of action vectors, and $T$ is the number of rounds. Their lower bound was constructed via continuous strongly-correlated distribution of losses. Our main result is to improve their bound to $\Omega( \sqrt{d k ^3 T} )$ by a factor of $\sqrt{\log T}$, which can be done by means of strongly-correlated losses with \textit{binary} values. The bound derives better regret bounds for three specific examples of bandit combinatorial optimization: the multitask bandit, the bandit ranking and the multiple-play bandit. In particular, our bound for the bandit ranking answers to an open problem posed in [Cohen et al. COLT2017]. In addition, we show that the problem becomes easier without correlations among entries of loss vectors. In fact, if each entry of loss vectors is an independent random variable, then one can achieve a regret of $\tilde{O}(\sqrt{d k^2 T})$, which is $\sqrt{k}$ times smaller than the lower bound shown above. Our results indicate that correlation among losses is essential to having a large regret.

**SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits**

## Abstract

Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage. We present a decentralized algorithm that achieves the same performance as a centralized one, contradicting the existing lower bounds for that problem. This is possible by ``hacking'' the standard model by constructing a communication protocol between players that deliberately enforces collisions, allowing them to share their information at a negligible cost. This motivates the introduction of a more appropriate dynamic setting without sensing, where similar communication protocols are no longer possible. However, we show that the logarithmic growth of the regret is still achievable for this model with a new algorithm.

**Modelling the Dynamics of Multiagent Q-Learning in Repeated Symmetric Games: a Mean Field Theoretic Approach**

## Abstract

The development of models to describe the dynamics of multiagent Q-learning has attracted much attention. However, the previous models generally focus on the two-agent setting. In this paper, we consider a population of n agents, where n tends to infinity. At each time step, agents are randomly paired up with some other agents to play symmetric games. Using mean field theory, we approximate the effects of other agents on a single agent by an averaged effect, so that a differential equation universally governing the Q-learning process of each agent can be derived. We also derive the Fokker-Planck equation that describes how the distribution of Q-values in an agent population evolves over time. We verify our model through comparisons with agent-based simulations on typical symmetric games.

**Bootstrapping Upper Confidence Bound**

## Abstract

Upper Confidence Bound (UCB) method is arguably the most celebrated one used in online decision making with partial information feedback. Existing techniques for constructing confidence bounds are typically built upon various concentration inequalities, which thus lead to over-exploration. In this paper, we propose a non-parametric and data-dependent UCB algorithm based on the multiplier bootstrap. To improve its finite sample performance, we further incorporate second-order correction into the above construction. In theory, we derive both problem-dependent and problem-independent regret bounds for multi-armed bandits under a much weaker tail assumption than the standard sub-Gaussianity. Numerical results demonstrate significant regret reductions by our method, in comparison with several baselines in a range of multi-armed and linear bandit problems.

**Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies**

## Abstract

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies -- act by 1-step planning -- can achieve tight minimax performance in terms of regret, O(\sqrt{HSAT}). Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to such that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

**Multiagent Evaluation under Incomplete Information**

## Abstract

This paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents. Traditionally, researchers have relied on Elo ratings for this purpose, with recent works also using methods based on Nash equilibria. Unfortunately, Elo is unable to handle intransitive agent interactions, and other techniques are restricted to zero-sum, two-player settings or are limited by the fact that the Nash equilibrium is intractable to compute. Recently, a ranking method called $\alpha$-Rank, relying on a new graph-based game-theoretic solution concept, was shown to tractably apply to general games. However, evaluations based on Elo or $\alpha$-Rank typically assume noise-free game outcomes, despite the data often being collected from noisy simulations, making this assumption unrealistic in practice. This paper investigates multiagent evaluation in the incomplete information regime, involving general-sum many-player games with noisy outcomes. We derive sample complexity guarantees required to confidently rank agents in this setting. We propose adaptive algorithms for accurate ranking, provide correctness and sample complexity guarantees, then introduce a means of connecting uncertainties in noisy match outcomes to uncertainties in rankings. We evaluate the performance of these approaches in several domains, including Bernoulli games, a soccer meta-game, and Kuhn poker.

**Towards Interpretable Reinforcement Learning Using Attention Augmented Agents**

## Abstract

Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. This model bottlenecks the view of an agent by a soft, top-down attention mechanism, forcing the agent to focus on task-relevant information by sequentially querying its view of the environment. The output of the attention mechanism allows direct observation of the information used by the agent to select its actions, enabling easier interpretation of this model than of traditional models. We analyze the different strategies the agents learn and show that a handful of strategies arise repeatedly across different games. We also show that the model learns to query separately about space and content (``where'' vs. ``what''). We demonstrate that an agent using this mechanism can achieve performance competitive with state-of-the-art models on ATARI tasks while still being interpretable.

**Planning in Entropy-Regularized Markov Decision Processes and Games**

## Abstract

We propose a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. We make use of the smoothness of the Bellman operator introduced by the regularization to provide an algorithm which has a problem-independent sample complexity of order $\mathcal{O}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas non-regularized problems may not enjoy polynomial sample complexity in a worst-case sense.

**Generalization of Reinforcement Learners with Working and Episodic Memory**

## Abstract

Memory is an important aspect of intelligence and plays a role in many deep reinforcement learning models. However, little progress has been made in understanding when specific memory systems help more than others and how well they generalize. The field also has yet to see a prevalent consistent and rigorous approach for evaluating agent performance on holdout data. In this paper, we aim to develop a comprehensive methodology to test different kinds of memory in an agent and assess how well the agent can apply what it learns in training to a holdout set that differs from the training set along dimensions that we suggest are relevant for evaluating memory-specific generalization. To that end, we first construct a diverse set of memory tasks that allow us to evaluate test-time generalization across multiple dimensions. Second, we develop and perform multiple ablations on an agent architecture that combines multiple memory systems, observe its baseline models, and investigate its performance against the task suite.

**Hindsight Credit Assignment**

## Abstract

We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.

**When to Trust Your Model: Model-Based Policy Optimization**

## Abstract

Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.

**Control What You Can: Intrinsically Motivated Task-Planning Agent**

## Abstract

We present a novel intrinsically motivated agent that learns how to control the environment in the fastest possible manner by optimizing learning progress. It learns what can be controlled, how to allocate time and attention, and the relations between objects using surprise based motivation. The effectiveness of our method is demonstrated in a synthetic as well as a robotic manipulation environment yielding considerably improved performance and smaller sample complexity. In a nutshell, our work combines several task-level planning agent structures (backtracking search on task graph, probabilistic road-maps, allocation of search efforts) with intrinsic motivation to achieve learning from scratch.

**Curriculum-guided Hindsight Experience Replay**

## Abstract

In off-policy deep reinforcement learning, it is usually hard to collect sufficient successful experiences with positive reward to learn from. Hindsight experience replay (HER) enables an agent to also learn from failures by treating the achieved state of a failed experience as a pseudo goal. However, not all the failed experiences are equally useful in different learning stages, and it is not efficient to replay all of them or subsample them uniformly in HER. In this paper, we propose to 1) adaptively select the failed experiences for replay according to its proximity to the true goal and the curiosity of exploration over diverse pseudo goals, and 2) smoothly vary the proportion of the proximity and the curiosity/diversity from earlier to later learning episodes. We use a strategy imitating human learning that enforces more curiosity in earlier stages changes to more proximity later. This ``Goal-and-Curiosity-driven Curriculum (GCC) Learning'' leads to ``Curriculum-guided HER (CHER)'', which adaptively and dynamically controls the exploration-exploitation trade-off during the learning process. In experiments of manipulation tasks on robots, we show that CHER is significantly more efficient than HER in practice.

**Regret Bounds for Learning State Representations in Reinforcement Learning**

## Abstract

We consider the problem of online learning in reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with O(sqrt(T)) regret in any communicating Markov decision process. The regret bound shows that UCB-MS automatically adapts to the Markov model. This improves over the currently known best results in the literature that gave regret bounds of order O(T^(2/3)).

**A Composable Specification Language for Reinforcement Learning Tasks**

## Abstract

Reinforcement learning is a promising approach for learning control policies for robot tasks. However, specifying complex tasks (e.g., with multiple objectives and safety constraints) can be challenging, since the user must design a reward function that encodes the entire task. Furthermore, the user often needs to manually shape the reward to ensure convergence of the learning algorithm. We propose a language for specifying complex control tasks, along with an algorithm that compiles specifications in our language into a reward function and automatically performs reward shaping. We implement our approach in a tool called SPECTRL, and show that it outperforms several state-of-the-art baselines.

**The Option Keyboard: Combining Skills in Reinforcement Learning**

## Abstract

The ability to combine known skills to create new ones may be crucial in the solution of complex reinforcement learning problems that unfold over extended periods. We argue that a robust way of combining skills is to define and manipulate them in the space of pseudo-rewards (or "cumulants"). Based on this premise, we propose a framework for combining skills using the formalism of options. We show that every deterministic option can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once we have learned options associated with a set of cumulants, we can instantaneously synthesise options induced by any linear combination of them, without any learning involved. We describe how this framework provides a hierarchical interface to the environment whose abstract actions correspond to combinations of basic skills. We demonstrate the practical benefits of our approach in a resource management problem and a navigation task involving a quadrupedal simulated robot.

**Biases for Emergent Communication in Multi-agent Reinforcement Learning**

## Abstract

We study the problem of emergent communication, in which language arises because speakers and listeners must communicate information in order to solve tasks. In temporally extended reinforcement learning domains, it has proved hard to learn such communication without centralized training of agents, due in part to a difficult joint exploration problem. We introduce inductive biases for positive signalling and positive listening, which ease this problem. In a simple one-step environment, we demonstrate how these biases ease the learning problem. We also apply our methods to a more extended environment, showing that agents with these inductive biases achieve better performance, and analyse the resulting communications protocols.

**Modeling Conceptual Understanding in Image Reference Games**

## Abstract

An AI system interacting with a wide population of other agents needs to be aware that there may be variations in the understanding that other agents have of the environment. Furthermore, not only can there be variation in agents' understanding, the machinery which they use to perceive the world may be inherently different, as is the case between humans and machines. In this work, we propose an image reference game played between a speaker and a population of listeners as an example of a setting where reasoning about which concepts other agents can comprehend is necessary. Our experiments on three benchmark image/attribute datasets indeed suggest that our learner encodes information directly pertaining to the understanding of other agents.

**Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning**

## Abstract

Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning (Deep RL) by stabilizing learning and allowing for higher training throughputs. In this work, we propose Gossip-based Actor-Learner Architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an epsilon-ball of one-another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable compared to A2C, its fully-synchronous counterpart. GALA also outperforms A2C, being more robust and sample efficient. We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame-rates than vanilla A2C at comparable power draws.

**Metalearned Neural Memory**

## Abstract

We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. We conceptualize this memory as a rapidly adaptable function that we parameterize as a deep neural network. Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information. We leverage training and algorithmic techniques from metalearning to update the neural memory function in one shot. The proposed memory-augmented model achieves strong performance on a variety of learning problems, from supervised question answering to reinforcement learning.

**Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes**

## Abstract

A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that dictates how to personalize treatments to patients, based on evolving treatments and covariates' history. These regimes are particularly effective for managing chronic disorders and fit well into the larger theme of personalized decision-making (e.g., precision medicine). In this paper, we investigate the online reinforcement learning (RL) problem for selecting optimal DTRs provided that observational data is available. First, we present a RL algorithm that achieves near-optimal regret in DTRs in online settings, i.e., without any access to historical data. We then derive informative bounds on the system dynamics of the underlying DTR from confounded observational data. Finally, we combine these approaches and develop a novel RL algorithm that efficiently learns the optimal DTR while leveraging the abundant, yet imperfect confounded observations.

**Exploration via Hindsight Goal Generation**

## Abstract

Goal-oriented reinforcement learning has recently been a practical framework for robotic manipulation tasks, in which an agent is required to reach a certain goal defined by a function on the state space. However, the sparsity of such reward definition makes traditional reinforcement learning algorithms very inefficient. Hindsight Experience Replay (HER), a recent advance, has greatly improved sample efficiency and practical applicability for such problems. It exploits previous replays by constructing imaginary goals in a simple heuristic way, acting like an implicit curriculum to alleviate the challenge of sparse reward signal. In this paper, we introduce Hindsight Goal Generation (HGG), a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and are also potential for guiding the agent to reach the actual goal in the long term. We have extensively evaluated our goal generation algorithm on a number of robotic manipulation tasks and demonstrated substantially improvement over the original HER in terms of sample efficiency.

**Shaping Belief States with Generative Environment Models for RL**

## Abstract

When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment. We propose a way to efficiently train expressive generative models in complex environments. We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments. More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent. Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents. We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge. In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.

**RUDDER: Return Decomposition for Delayed Rewards**

## Abstract

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards.

**No-Regret Learning in Unknown Games with Correlated Payoffs**

## Abstract

We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single player online learning algorithms attain strong regret bounds when provided with full information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance. In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose a novel confidence-bound based bandit algorithm GP-MW, which utilizes the GP model for the reward function and runs a multiplicative weight (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full information setting, while substantially improving upon the existing bandit results. We experimentally demonstrate the effectiveness of GP-MW in random matrix games, as well as real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full information feedback.

**A neurally plausible model learns successor representations in partially observable environments**

## Abstract

Animals need to devise strategies to maximize returns while interacting with their environment based on incoming noisy sensory observations. Task-relevant states, such as the agent's location within an environment or the presence of a predator, are often not directly observable but must be inferred using available sensory information. Successor representations (SR) have been proposed as a middle-ground between model-based and model-free reinforcement learning strategies, allowing for fast value computation and rapid adaptation to changes in the reward function or goal locations. Indeed, recent studies suggest that features of neural responses are consistent with the SR framework. However, it is not clear how such representations might be learned and computed in partially observed, noisy environments. Here, we introduce a neurally plausible model using \emph{distributional successor features}, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. We show that distributional successor features can support reinforcement learning in noisy environments in which direct learning of successful policies is infeasible.

**Learning Fairness in Multi-Agent Systems**

## Abstract

Fairness is essential for human society, contributing to stability and productivity. Similarly, fairness is also the key for many multi-agent systems. Taking fairness into multi-agent learning could help multi-agent systems become both efficient and stable. However, learning efficiency and fairness simultaneously is a complex, multi-objective, joint-policy optimization. To tackle these difficulties, we propose FEN, a novel hierarchical reinforcement learning model. We first decompose fairness for each agent and propose fair-efficient reward that each agent learns its own policy to optimize. To avoid multi-objective conflict, we design a hierarchy consisting of a controller and several sub-policies, where the controller maximizes the fair-efficient reward by switching among the sub-policies that provides diverse behaviors to interact with the environment. FEN can be trained in a fully decentralized way, making it easy to be deployed in real-world applications. Empirically, we show that FEN easily learns both fairness and efficiency and significantly outperforms baselines in a variety of multi-agent scenarios.

**Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck**

## Abstract

The ability for policies to generalize to new environments is key to the broad application of RL agents. A promising approach to prevent an agent's policy from overfitting to a limited set of training environments is to apply regularization techniques originally developed for supervised learning. However, there are stark differences between supervised learning and RL. We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL. In particular, we focus on regularization techniques relying on the injection of noise into the learned function, a family that includes some of the most widely used approaches such as Dropout and Batch Normalization. To adapt them to RL, we propose Selective Noise Injection (SNI), which maintains the regularizing effect the injected noise has, while mitigating the adverse effects it has on the gradient quality. Furthermore, we demonstrate that the Information Bottleneck (IB) is a particularly well suited regularization technique for RL as it is effective in the low-data regime encountered early on in training RL agents. Combining the IB with SNI, we show state of the art results, including on the recently proposed generalization benchmark Coinrun.

**Reinforcement Learning with Convex Constraints **

## Abstract

In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of a desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks: specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. This captures previously studied constraints (such as safety and proximity to an expert), but also enables new classes of constraints (such as diversity). Our approach comes with rigorous theoretical guarantees and only relies on the ability to approximately solve standard RL tasks. As a result, it can be easily adapted to work with any model-free or model-based RL. In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms do not incorporate, such as diversity.

**Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning**

## Abstract

In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that this is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis, which states that the size-difference of the action-gap across the state-space is the primary cause. We then introduce a new method that enables more homogeneous action-gaps by mapping value estimates to a logarithmic space. We prove convergence for this method under standard assumptions and demonstrate empirically that it indeed enables lower discount factors for approximate reinforcement-learning methods. This allows tackling a class of reinforcement-learning problems that are challenging to solve with traditional methods.

**Recovering Bandits**

## Abstract

We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While being a natural extension of the classical bandit problem that arises in many real-world settings, this variation is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planing problem. We also discuss different regret definitions that let us quantify the performance of the methods. To improve computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.

**Correlation Priors for Reinforcement Learning**

## Abstract

Many decision-making problems naturally exhibit pronounced structures inherited from the underlying characteristics of the environment. In a Markov decision process model, for example, two distinct states can have inherently related semantics or encode resembling physical state configurations, often implying locally correlated transition dynamics among the states. In order to complete a certain task, an agent acting in such environments needs to execute a series of temporally and spatially correlated actions. Though there exists a variety of approaches to account for correlations in continuous state-action domains, a principled solution for discrete environments is missing. In this work, we present a Bayesian learning framework based on Pólya-Gamma augmentation that enables an analogous reasoning in such cases. We demonstrate the framework on a number of common decision-making related tasks, such as reinforcement learning, imitation learning and system identification. By explicitly modeling the underlying correlation structures, the proposed approach yields superior predictive performance compared to correlation-agnostic models, even when trained on data sets that are up to an order of magnitude smaller in size.

**When to use parametric models in reinforcement learning?**

## Abstract

We examine the question of when and how parametric models are useful in reinforcement learning (RL). In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, as well as recent work in this context. We hypothesize that, under suitable conditions, replay-based algorithms should be competitive to or better than model-based algorithms if the model is used only to generate fictional transitions from observed states for an update rule that is otherwise model-free. We validated this hypothesis on Atari 2600 video games. Our replay-based algorithm attained state-of-the-art data efficiency, improving prior results with parametric models.

**Categorized Bandits**

## Abstract

We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories. The motivating example comes from e-commerce, where a customer typically has a greater appetence for items of a specific well-identified but unknown category than any other one. We introduce three concepts of ordering between categories, inspired by stochastic dominance between random variables, which are gradually weaker so that more and more bandit scenarios satisfy at least one of them. We first prove instance-dependent lower bounds on the cumulative regret for each of these models, indicating how the complexity of the bandit problems increases with the generality of the ordering concept considered. We also provide algorithms that fully leverage the structure of our model with their associated theoretical guarantees. Finally, we have conducted an analysis on real data to highlight that this ordered categories actually exist in practice.

**Non-Asymptotic Pure Exploration by Solving Games**

## Abstract

Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.

**Censored Semi-Bandits: A Framework for Resource Allocation with Censored Feedback**

## Abstract

In this paper, we study Censored Semi-Bandits, a novel variant of the semi-bandits problem. The learner is assumed to have a fixed amount of resources, which it allocates to the arms at each time step. The loss observed from an arm is random and depends on the amount of resource allocated to it. More specifically, the loss equals zero if the allocation for the arm exceeds a constant (but unknown) threshold that can be dependent on the arm. Our goal is to learn a feasible allocation that minimizes the expected loss. The problem is challenging because the loss distribution and threshold value of each arm are unknown. We study this novel setting by establishing their `equivalence' to multiple-play multi-armed bandits (MP-MAB) and combinatorial semi-bandits. Exploiting these equivalences, we derive optimal algorithms for our setting using existing algorithms for MP-MAB and combinatorial semi-bandits. Experiments on synthetically generated data validate performance guarantees of the proposed algorithms.

**Policy Poisoning in Batch Reinforcement Learning and Control**

## Abstract

We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics from a batch data set, and then solves for the optimal policy with respect to the estimated dynamics. The attacker can modify the data set slightly before learning happens, and wants to force the learner into a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. We provide analysis on attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.

**Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees**

## Abstract

We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which allows to automatically learn the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require to process all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.

**Compiler Auto-Vectorization using Imitation Learning**

## Abstract

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit fine-grained data level parallelism. To exploit this parallelism, compilers must make decisions on which instructions should be ``packed'' to run in parallel while compiling a high-level language code. Current compilers employ auto-vectorization techniques that use heuristics to discover vectorization opportunities. These heuristics are local and typically only present one vectorization strategy. Recently, goSLP formulated the instruction packing problem by leveraging an integer linear programming (ILP) solver, achieving superior performances at the expense of compilation time. In this work, we explore whether one can use imitation learning to fit a graph neural network policy that imitates the optimal decisions made by the ILP solver. Our main finding is that while our neural-network agent is not able to match the optimal results given by the ILP solver, it nonetheless significantly outperforms the compiler heuristics while running at a fraction of the time spent by the ILP solver.

**A Generalized Algorithm for Multi-Objective RL and Policy Adaptation**

## Abstract

We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After this initial learning phase, our agent can quickly adapt to any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

**Learning Compositional Neural Programs with Recursive Tree Search and Planning**

## Abstract

We propose a novel reinforcement learning algorithm, AlphaNPI, that incorpo- rates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaNPI extends the guided tree search of AlphaZero to enable recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. As a result, AlphaNPI effectively eliminates the need for strong supervision when training NPI models. Indeed, the experiments show that AlphaNPI can sort as well as previous NPI variants, which required strong supervision in the form of full program execution traces. The AlphaNPI agent is also trained on the Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disks, both empirically and theoretically.

**Nonparametric Contextual Bandits in Metric Spaces with Unknown Metric**

## Abstract

Consider a nonparametric contextual multi-arm bandit problem where each arm $a \in [K]$ is associated to a nonparametric reward function $f_a: [0,1] \to \mathbb{R}$ mapping from contexts to the expected reward. Suppose that there is a large set of arms, yet there is a simple but unknown structure amongst the arm reward functions, e.g. finite types or smooth with respect to an unknown metric space. We present a novel algorithm which learns data-driven similarities amongst the arms, in order to implement adaptive partitioning of the context-arm space for more efficient learning. We provide regret bounds along with simulations that highlight the algorithm's dependence on the local geometry of the reward functions.

**Model Selection for Contextual Bandits**

## Abstract

We introduce the general problem of model selection for contextual bandits, wherein a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. We work in the stochastic realizable setting with a sequence of linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(d^{1/3}_{m^\star}T^{2/3})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$. The algorithm also achieves regret $\tilde{O}(T^{3/4} + \sqrt{d_{m^\star}T})$, which is optimal for $d_{m^{\star}}\geq{}\sqrt{T}$. These are the first model selection results that give non-vacuous regret for all values of $d_{m^\star}$ and, to the best of our knowledge, are the first such guarantees in any contextual bandit setting. The core of the algorithm is a new estimator for the gap between the best loss achievable by two classes, which we show admits convergence rates faster than would be required to actually learn these classes.

**Planning with Goal-Conditioned Policies**

## Abstract

Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors. However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, such that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, such that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based tasks that require non-greedy, multi-staged behavior.

**Online Optimal Control with Linear Dynamics and Predictions: Algorithms and Regret Analysis**

## Abstract

This paper studies the online optimal control problem with time-varying convex stage costs for a time-invariant linear dynamical system, where a finite look-ahead window with accurate predictions of the stage costs is available at each time. We design online algorithms, Receding Horizon Gradient-based Control (RHGC), that utilizes the predictions through finite steps of gradient computations. We study the algorithm performance measured by \textit{dynamic regret}: the online performance minus the optimal performance in hindsight. It is shown that the dynamic regret of RHGC decays exponentially with the size of the look-ahead window. In addition, we provide a fundamental limit of the dynamic regret for any online algorithms by considering linear quadratic tracking problems. The regret upper bound of one RHGC method almost reaches the fundamental limit, demonstrating the effectiveness of the algorithm. Finally, we numerically test our algorithms for both linear and nonlinear systems to show the effectiveness and generality of our RHGC.

**Semi-Parametric Efficient Policy Learning with Continuous Actions**

## Abstract

We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data where the data collection policy is unknown and needs to be estimated from data. We take a semi-parametric approach where the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts. We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this doubly robust estimate is robust to estimation errors of the policy function or the regression model. We also show that the variance of our off-policy estimate achieves the semi-parametric efficiency bound. Our results also apply if the model does not satisfy our semi-parametric form but rather we measure regret in terms of the best projection of the true value function to this functional space. Our work extends prior approaches of policy optimization from observational data that only considered discrete actions. We provide an experimental evaluation of our method in a synthetic data example motivated by optimal personalized pricing.

**Fast Agent Resetting in Training**

## Abstract

We study reinforcement learning with access to state observations from a demonstrator in addition to a reward signal. In this setting the demonstrator only supplies sequences of observations, and we leverage these samples to improve the learning efficiency of the agent. Our key insight is that in most environments expert policies only visit a tiny fraction of the total available states. We develop a simple technique, e-stops, to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that trades off the rate of convergence with a small asymptotic suboptimality gap. We analyze the regret behavior of e-stops and present empirical results demonstrating that our reset mechanism provides order-of-magnitude speedups over classic reinforcement learning methods.

**Constraint Augmented Reinforcement Learning for Text-based Recommendation and Generation**

## Abstract

Text-based interactive recommendation provides richer user preferences and has demonstrated its advantage over traditional interactive recommender systems. However, recommendations can easily violate the preferences of users from their past natural-language feedbacks, since the recommender needs to explore new items for further improvement. To alleviate this issue, we propose a novel constraint augmented reinforcement learning framework to efficiently incorporate user preferences over time. Specifically, we leverage a discriminator to detect recommendations violating user historical preferences. Our key idea is to express the expected return objective as a weighted sum of two terms: an expectation over the constraint violation penalty and a separate expectation over user rewards. Besides, our proposed framework is general and can be further extended to the constrained text generation. Empirical results show that our proposed method leads to consistent improvement compared with standard reinforcement learning.

**Search on the Replay Buffer: Bridging Planning and Reinforcement Learning**

## Abstract

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and relative values of states, but fails to plan over long horizons. Despite the successes of each method on various tasks, long horizon, sparse reward tasks with high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only need to provide an example goal state) and avoid injecting bias through reward shaping. We introduce a general-purpose control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our main idea is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a particular subgoal. We use goal-conditioned RL to learn a policy to reach each waypoint and to learn a distance metric for search. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over hundreds of steps, and generalizes substantially better than standard RL algorithms.

**Goal-conditioned Imitation Learning**

## Abstract

Designing rewards for Reinforcement Learning (RL) is challenging because it needs to convey the desired task, be efficient to optimize, and be easy to compute. The latter is particularly problematic when applying RL to robotics, where detecting whether the desired configuration is reached might require considerable supervision and instrumentation. Furthermore, we are often interested in being able to reach a wide range of configurations, hence setting up a different reward every time might be unpractical. Methods like Hindsight Experience Replay (HER) have recently shown promise to learn policies able to reach many goals, without the need of a reward. Unfortunately, without tricks like resetting to points along the trajectory, HER might take a very long time to discover how to reach certain areas of the state-space. In this work we investigate different approaches to incorporate demonstrations to drastically speed up the convergence to a policy able to reach any goal, also surpassing the performance of an agent trained with other Imitation Learning algorithms.

**Robust exploration in linear quadratic reinforcement learning **

## Abstract

Learning to make decisions in an uncertain and dynamic environment is a task of fundamental performance in a number of domains. This paper concerns the problem of learning control policies for an unknown linear dynamical system so as to minimize a quadratic cost function. We present a method, based on convex optimization, that accomplishes this task ‘robustly’, i.e., the worst-case cost, accounting for system uncertainty given the observed data, is minimized. The method balances exploitation and exploration, exciting the system in such a way so as to reduce uncertainty in the model parameters to which the worst-case cost is most sensitive. Numerical simulations and application to a hardware-in-the-loop servo-mechanism are used to demonstrate the approach, with appreciable performance and robustness gains over alternative methods observed in both.

**A Kernel Loss for Solving the Bellman Equation**

## Abstract

Value function learning plays a central role in many state-of-the-art reinforcement learning algorithms. However, many standard algorithms like Q-learning lose convergence guarantees when function approximation is used, as is often observed in practice. In this paper, we propose a novel loss function, the minimization of which results in the true value function. The key advantage of this new loss is that its gradient can be easily approximated by using sampled transitions, avoiding the double-sample issue faced by prior algorithms like residual gradient. In practice, our approach may be combined with general (differentiable) function classes such as neural networks, and is shown to work reliably and effectively in several benchmarks.

**Learning Reward Machines for Partially Observable Reinforcement Learning**

## Abstract

Reward Machines (RMs), originally proposed for specifying problems in Reinforcement Learning (RL), provide a structured, automata-based representation of a reward function that allows an agent to decompose problems into subproblems that can be efficiently learned using off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.

**A Family of Robust Stochastic Operators for Reinforcement Learning**

## Abstract

We consider a new family of stochastic operators for reinforcement learning with the goal of alleviating negative effects and becoming more robust to approximation or estimation errors. Various theoretical results are established, which include showing that our family of operators preserve optimality and increase the action gap in a stochastic sense. Our empirical results illustrate the strong benefits of our robust stochastic operators, significantly outperforming the classical Bellman operator and recently proposed operators.

**Imitation-Projected Policy Gradient for Programmatic Reinforcement Learning**

## Abstract

We present Imitation-Projected Policy Gradient (IPPG), an algorithmic framework for learning policies that are parsimoniously represented in a structured programming language. Such programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for programmatic policies remains a challenge. IPPG, our response to this challenge, is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a "lift-and-project" perspective that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space. Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing state-of-the-art deep policy gradient approaches. Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for IPPG, as well as an empirical evaluation in three continuous control domains. The experiments show that IPPG can significantly outperform the state-of-the-art approach for learning programmatic policies.