
Foundations of RL

We develop state-of-the-art methods that provide robust, sample-efficient and adaptive solutions, focusing on the Q-learning framework. Check out our work on Constrained Q-learning, on constrained imitation via Inverse Q-Learning with Constraints, or on long-term value decomposition from short-term predictions via consecutive bootstrapping in Composite Q-learning. We further developed an uncertainty-driven version of Dyna, Model-assisted Bootstrapped DDPG.

Gabriel Kalweit*, Maria Huegle*, Moritz Werling and Joschka Boedecker

NeurIPS 2020: Deep Inverse Q-Learning with Constraints

Paper (arxiv) Cite

Introduction

Popular Maximum Entropy Inverse Reinforcement Learning approaches require the computation of expected state visitation frequencies for the optimal policy under an estimate of the reward function. This usually requires intermediate value estimation in the inner loop of the algorithm, slowing down convergence considerably. In this work, we introduce a novel class of algorithms that only needs to solve the MDP underlying the demonstrated behavior once to recover the expert policy. This is possible through a formulation that exploits a probabilistic behavior assumption for the demonstrations within the structure of Q-learning. We propose Inverse Action-value Iteration which is able to fully recover an underlying reward of an external agent in closed-form analytically. We further provide an accompanying class of sampling-based variants which do not depend on a model of the environment. We show how to extend this class of algorithms to continuous state-spaces via function approximation and how to estimate a corresponding action-value function, leading to a policy as close as possible to the policy of the external agent, while optionally satisfying a list of predefined hard constraints. We evaluate the resulting algorithms called Inverse Action-value Iteration, Inverse Q-learning and Deep Inverse Q-learning on the Objectworld benchmark, showing a speedup of up to several orders of magnitude compared to (Deep) Max-Entropy algorithms. We further apply Deep Constrained Inverse Q-learning on the task of learning autonomous lane-changes in the open-source simulator SUMO achieving competent driving after training on data corresponding to 30 minutes of demonstrations. The scheme of Deep Constrained Inverse Q-learning can be seen in Fig. 1.


Fig. 1: Scheme of Deep Constrained Inverse Q-learning for the constrained transfer of unconstrained driving demonstrations, leading to optimal constrained imitation.

Technical Approach

Inverse Action-value Iteration

We start with the derivation of model-based Inverse Action-value Iteration (IAVI), where we first establish a relationship between the Q-values of a state-action pair $(s,a)$ and the Q-values of all other actions in this state. We assume that the trajectories are collected by an agent following a stochastic policy with an underlying Boltzmann distribution according to its unknown optimal value function. For a given optimal action-value function $Q^*$, the corresponding distribution is:
$$\pi^E(a|s) := \frac{\exp(Q^*(s,a))}{\sum_{A\in\mathcal{A}}\exp(Q^*(s,A))},$$
for all actions $a\in\mathcal{A}$. Rearranging gives:
$$\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)) = \frac{\exp(Q^*(s,a))}{\pi^E(a|s)} \quad\text{and}\quad \exp(Q^*(s,a)) = \pi^E(a|s)\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)).$$
Denoting the set of actions excluding action $a$ as $\mathcal{A}_{\bar{a}}$, we can express the Q-value of an action $a$ in terms of the Q-value of any action $b\in\mathcal{A}_{\bar{a}}$ in state $s$ and the probabilities of taking these actions:
$$\exp(Q^*(s,a)) = \pi^E(a|s)\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)) = \frac{\pi^E(a|s)}{\pi^E(b|s)}\exp(Q^*(s,b)).$$
Taking the log then leads to $Q^*(s,a) = Q^*(s,b) + \log(\pi^E(a|s)) - \log(\pi^E(b|s))$. Thus, we can relate the optimal value of action $a$ to the optimal values of all other actions in the same state by including the respective log-probabilities:
$$(n-1)\,Q^*(s,a) = (n-1)\log(\pi^E(a|s)) + \sum_{b\in\mathcal{A}_{\bar{a}}}\left(Q^*(s,b) - \log(\pi^E(b|s))\right).$$
The optimal action-value for state $s$ and action $a$ is composed of the immediate reward $r(s,a)$ and the optimal action-value of the next state as given by the transition model, i.e. $Q^*(s,a) = r(s,a) + \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')]$. Using this definition to replace the Q-values and defining the difference between the log-probability and the discounted value of the next state as
$$\eta_s^a := \log(\pi^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')],$$
we can then solve for the immediate reward:
$$r(s,a) = \eta_s^a + \frac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}}}\left(r(s,b) - \eta_s^b\right).$$
Formulating this equation for all actions $a_i\in\mathcal{A}$ results in a system of linear equations $X_\mathcal{A}(s)\,R_\mathcal{A}(s) = Y_\mathcal{A}(s)$, with reward vector $R_\mathcal{A}(s)$, coefficient matrix $X_\mathcal{A}(s)$ and target vector $Y_\mathcal{A}(s)$:
$$\begin{bmatrix} 1 & -\tfrac{1}{n-1} & \dots & -\tfrac{1}{n-1} \\ -\tfrac{1}{n-1} & 1 & \dots & -\tfrac{1}{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\tfrac{1}{n-1} & -\tfrac{1}{n-1} & \dots & 1 \end{bmatrix} \begin{bmatrix} r(s,a_1) \\ r(s,a_2) \\ \vdots \\ r(s,a_n) \end{bmatrix} = \begin{bmatrix} \eta_s^{a_1} - \tfrac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}_1}}\eta_s^b \\ \eta_s^{a_2} - \tfrac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}_2}}\eta_s^b \\ \vdots \\ \eta_s^{a_n} - \tfrac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}_n}}\eta_s^b \end{bmatrix}.$$
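
To make the closed-form step concrete, the following NumPy sketch solves the per-state system $X_\mathcal{A}(s)\,R_\mathcal{A}(s) = Y_\mathcal{A}(s)$; the function name and array layout are our own illustration, not the authors' reference code. Since every row of $X_\mathcal{A}(s)$ sums to zero, the system is rank-deficient (rewards are only identified up to a per-state constant), so the sketch takes the least-squares solution.

import numpy as np

def iavi_reward_for_state(log_pi_expert, next_state_values, gamma):
    """Solve X_A(s) R_A(s) = Y_A(s) for the immediate rewards of one state.

    log_pi_expert:     (n,) log expert action probabilities log pi^E(a|s).
    next_state_values: (n,) max_a' E_{s'}[Q*(s', a')] for each action a in s.
    Returns the (n,) rewards r(s, a_1), ..., r(s, a_n).
    """
    n = log_pi_expert.shape[0]
    eta = log_pi_expert - gamma * next_state_values      # eta_s^a
    X = np.full((n, n), -1.0 / (n - 1))
    np.fill_diagonal(X, 1.0)                             # 1 on the diagonal, -1/(n-1) elsewhere
    Y = eta - (eta.sum() - eta) / (n - 1)                # eta_s^a minus mean of the other etas
    return np.linalg.lstsq(X, Y, rcond=None)[0]          # X has rank n-1, take least-squares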

Tabular Inverse Q-learning

To relax the assumption of an existing transition model and known action probabilities, we extend Inverse Action-value Iteration to a sampling-based algorithm. For every transition $(s,a,s')$, we update the reward function using stochastic approximation:
$$r(s,a) \leftarrow (1-\alpha_r)\,r(s,a) + \alpha_r\left(\eta_s^a + \frac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}}}\left(r(s,b) - \eta_s^b\right)\right),$$
with learning rate $\alpha_r$. Additionally, we need a state-action visitation counter $\rho(s,a)$ to calculate the respective log-probabilities: $\tilde\pi^E(a|s) := \rho(s,a)/\sum_{A\in\mathcal{A}}\rho(s,A)$. Therefore, we approximate $\eta_s^a$ by $\tilde\eta_s^a := \log(\tilde\pi^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')]$. In order to avoid the need for a model $\mathcal{M}$, we evaluate all other actions via Shifted Q-functions as in Kalweit et al.:
$$Q^{\mathrm{Sh}}(s,a) := \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')],$$
i.e. $Q^{\mathrm{Sh}}(s,a)$ skips the immediate reward for taking $a$ in $s$ and only considers the discounted Q-value of the next state $s'$ for the maximizing action $a'$. We then formalize $\tilde\eta_s^a$ as:
$$\tilde\eta_s^a := \log(\tilde\pi^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')] = \log(\tilde\pi^E(a|s)) - Q^{\mathrm{Sh}}(s,a).$$
Combining this with the count-based approximation $\tilde\pi^E(a|s)$ and updating $Q^{\mathrm{Sh}}$, $r$ and $Q$ via stochastic approximation yields the model-free Tabular Inverse Q-learning algorithm (IQL).
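
A minimal tabular sketch of one IQL step for a single transition $(s,a,s')$ is given below; the array layout (rho, r, q, q_sh indexed by [state, action]) and the hyperparameters are illustrative assumptions, and the visitation counts of state $s$ are assumed positive for all actions so that the logarithm stays finite.

import numpy as np

def iql_step(rho, r, q, q_sh, s, a, s_next, gamma=0.99, lr_r=0.1, lr_q=0.1, lr_sh=0.1):
    """One Tabular Inverse Q-learning update for transition (s, a, s')."""
    n = r.shape[1]
    rho[s, a] += 1
    log_pi = np.log(rho[s] / rho[s].sum())                   # count-based expert policy estimate
    # Shifted Q-function: skip the immediate reward, bootstrap from s'.
    q_sh[s, a] += lr_sh * (gamma * q[s_next].max() - q_sh[s, a])
    eta = log_pi - q_sh[s]                                   # eta_s^a for all actions
    others = [b for b in range(n) if b != a]
    target_r = eta[a] + (r[s, others] - eta[others]).sum() / (n - 1)
    r[s, a] += lr_r * (target_r - r[s, a])                   # reward update
    q[s, a] += lr_q * (r[s, a] + gamma * q[s_next].max() - q[s, a])  # Q-update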

Deep Inverse Q-learning

To cope with continuous state-spaces, we now introduce a variant of IQL with function approximation. We estimate the reward function $r$ with function approximator $r(\cdot,\cdot|\theta^r)$, parameterized by $\theta^r$. The same holds for $Q$ and $Q^{\mathrm{Sh}}$, represented by $Q(\cdot,\cdot|\theta^Q)$ and $Q^{\mathrm{Sh}}(\cdot,\cdot|\theta^{\mathrm{Sh}})$ with parameters $\theta^Q$ and $\theta^{\mathrm{Sh}}$. To alleviate the problem of moving targets, we further introduce target networks for $r(\cdot,\cdot|\theta^r)$, $Q(\cdot,\cdot|\theta^Q)$ and $Q^{\mathrm{Sh}}(\cdot,\cdot|\theta^{\mathrm{Sh}})$, denoted by $r'(\cdot,\cdot|\theta^{r'})$, $Q'(\cdot,\cdot|\theta^{Q'})$ and $Q^{\mathrm{Sh}'}(\cdot,\cdot|\theta^{\mathrm{Sh}'})$ and parameterized by $\theta^{r'}$, $\theta^{Q'}$ and $\theta^{\mathrm{Sh}'}$, respectively. Each collected transition $(s_t,a_t,s_{t+1})$, either online or from a fixed batch, is stored in replay buffer $\mathcal{D}$. We then sample minibatches $(s_i,a_i,s_{i+1})_{1\le i\le m}$ from $\mathcal{D}$ to update the parameters of our function approximators. First, we calculate the target for the Shifted Q-function:
$$y^{\mathrm{Sh}}_i := \gamma\max_{a'} Q'(s_{i+1},a'|\theta^{Q'}),$$
and then apply one step of gradient descent on the mean squared error of the respective predictions, i.e. $\mathcal{L}(\theta^{\mathrm{Sh}}) = \frac{1}{m}\sum_i\left(Q^{\mathrm{Sh}}(s_i,a_i|\theta^{\mathrm{Sh}}) - y^{\mathrm{Sh}}_i\right)^2$. We approximate the state-action visitation by classifier $\rho(\cdot,\cdot|\theta^\rho)$, parameterized by $\theta^\rho$ and with linear output. Applying the softmax to the outputs of $\rho(\cdot,\cdot|\theta^\rho)$ then maps each state $s$ to a probability distribution over actions $a_j$, $1\le j\le n$. Classifier $\rho(\cdot,\cdot|\theta^\rho)$ is trained to minimize the cross-entropy between its induced probability distribution and the corresponding targets, i.e.:
$$\mathcal{L}(\theta^\rho) = \frac{1}{m}\sum_i\left(-\rho(s_i,a_i) + \log\sum_j\exp\rho(s_i,a_j)\right).$$
Given the predictions of $\rho(\cdot,\cdot|\theta^\rho)$, we can calculate targets $y^r_i$ for the reward estimate $r(\cdot,\cdot|\theta^r)$ by:
$$y^r_i := \tilde\eta^{a_i}_{s_i} + \frac{1}{n-1}\sum_{b\in\mathcal{A}_{\bar{a}_i}}\left(r'(s_i,b|\theta^{r'}) - \tilde\eta^b_{s_i}\right),$$
and apply gradient descent on the mean squared error $\mathcal{L}(\theta^r) = \frac{1}{m}\sum_i\left(r(s_i,a_i|\theta^r) - y^r_i\right)^2$. Lastly, we can perform a gradient step on loss $\mathcal{L}(\theta^Q) = \frac{1}{m}\sum_i\left(Q(s_i,a_i|\theta^Q) - y^Q_i\right)^2$, with targets:
$$y^Q_i := r'(s_i,a_i|\theta^{r'}) + \gamma\max_{a'} Q'(s_{i+1},a'|\theta^{Q'}),$$
to update the parameters $\theta^Q$ of $Q(\cdot,\cdot|\theta^Q)$. We update the target networks by Polyak averaging, i.e. $\theta^{\mathrm{Sh}'} \leftarrow (1-\tau)\theta^{\mathrm{Sh}'} + \tau\theta^{\mathrm{Sh}}$, $\theta^{r'} \leftarrow (1-\tau)\theta^{r'} + \tau\theta^r$ and $\theta^{Q'} \leftarrow (1-\tau)\theta^{Q'} + \tau\theta^Q$.
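
For the deep variant, one update step could be sketched in PyTorch as follows. The dictionaries nets and targets (online and target networks for 'q', 'q_sh', 'r' and the visitation classifier 'rho') are a hypothetical interface used only for illustration; this is a sketch of the losses described above, not the authors' implementation.

import torch
import torch.nn.functional as F

def diql_losses(nets, targets, batch, gamma, n_actions):
    """Losses of one Deep Inverse Q-learning step (sketch with assumed interfaces).

    Every network maps a state batch to per-action outputs of shape (m, n_actions);
    batch = (s, a, s_next) with integer actions a of dtype long.
    """
    s, a, s_next = batch
    idx = a.unsqueeze(1)
    with torch.no_grad():
        q_next = targets["q"](s_next).max(dim=1, keepdim=True).values
        y_sh = gamma * q_next                                  # shifted Q-target
        log_pi = F.log_softmax(targets["rho"](s), dim=1)       # estimated expert policy
        eta = log_pi - targets["q_sh"](s)                      # eta = log pi~ - Q_sh
        other = 1.0 - F.one_hot(a, n_actions).float()          # mask of the other actions
        y_r = eta.gather(1, idx) + ((targets["r"](s) - eta) * other).sum(1, keepdim=True) / (n_actions - 1)
        y_q = targets["r"](s).gather(1, idx) + gamma * q_next  # full Q-target
    loss_sh = F.mse_loss(nets["q_sh"](s).gather(1, idx), y_sh)
    loss_rho = F.cross_entropy(nets["rho"](s), a)              # visitation classifier
    loss_r = F.mse_loss(nets["r"](s).gather(1, idx), y_r)
    loss_q = F.mse_loss(nets["q"](s).gather(1, idx), y_q)
    return loss_sh, loss_rho, loss_r, loss_q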

Deep Constrained Inverse Q-learning

Following the definition of Constrained Q-learning in Kalweit & Huegle et al., we extend IQL to incorporate a set of constraints $\mathcal{C} = \{c_i: S\times A \to \mathbb{R}\,|\,1\le i\le C\}$ shaping the space of safe actions in each state. We define the safe set for constraint $c_i$ as $S_{c_i}(s) = \{a\in A\,|\,c_i(s,a) \le \beta_{c_i}\}$, where $\beta_{c_i}$ is a constraint-specific threshold, and $S_\mathcal{C}(s)$ as the intersection of all safe sets. In addition to the Q-function in IQL, we estimate a constrained Q-function $Q^C$ by:
$$Q^C(s,a) \leftarrow (1-\alpha_{Q^C})\,Q^C(s,a) + \alpha_{Q^C}\left(r(s,a) + \gamma\max_{a'\in S_\mathcal{C}(s')} Q^C(s',a')\right).$$
For policy extraction from $Q^C$ after Q-learning, only the action-values of the constraint-satisfying actions must be considered. As shown in Kalweit & Huegle et al., this formulation of the Q-function optimizes constraint satisfaction in the long term and yields the optimal action-values for the induced constrained MDP. Put differently, including constraints directly in IQL leads to optimal constrained imitation from unconstrained demonstrations. Analogously to Deep Inverse Q-learning, we can approximate $Q^C$ with function approximator $Q^C(\cdot,\cdot|\theta^C)$ and an associated target network $Q^{C'}(\cdot,\cdot|\theta^{C'})$. We call this algorithm Deep Constrained Inverse Q-learning (DCIQL).
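
The only change with respect to the unconstrained Q-update is that the maximization in the target runs over the safe set $S_\mathcal{C}(s')$. A tabular sketch, with safe_actions assumed non-empty and r_sa standing for the learned reward $r(s,a|\theta^r)$ in the DCIQL setting:

def constrained_q_update(q_c, s, a, r_sa, s_next, safe_actions, alpha=0.1, gamma=0.99):
    """Update Q^C(s, a); the target maximizes only over the safe actions of s'."""
    target = r_sa + gamma * max(q_c[s_next, b] for b in safe_actions)
    q_c[s, a] += alpha * (target - q_c[s, a])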

Results

In Fig. 2 and Fig. 3, we evaluate the performance of IAVI, IQL and DIQL on the common IRL Objectworld benchmark and compare to MaxEnt IRL (closest to IAVI of all entropy-based IRL approaches) and Deep MaxEnt IRL (which is able to recover non-linear reward functions):


Fig. 2: Results for the Objectworld environment, given a data set with an action
distribution equivalent to the true optimal Boltzmann distribution.
Visualization of the true and learned state-value functions. Resulting expected
value difference and time needed until convergence, mean and standard
deviation over 5 training runs on a 3.00 GHz CPU.


Fig. 3: Mean EVD and SD over 5 runs for different numbers of demonstrations
in Objectworld.

In Fig. 4 we show the potential of constrained imitation by DCIQL on a more complex highway scenario in the open-source traffic simulator SUMO:


Fig. 4: Results for DCIQL in the autonomous driving task.

Demonstration

Video explaining Deep Inverse Q-learning with Constraints.

BibTeX

			
	@inproceedings{DBLP:conf/nips/KalweitHWB20,
	  author    = {Gabriel Kalweit and
	               Maria H{\"{u}}gle and
	               Moritz Werling and
	               Joschka Boedecker},
	  title     = {Deep Inverse Q-learning with Constraints},
	  booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
	               on Neural Information Processing Systems 2020, NeurIPS 2020, December
	               6-12, 2020, virtual},
	  year      = {2020},
	  url       = {https://proceedings.neurips.cc/paper/2020/hash/a4c42bfd5f5130ddf96e34a036c75e0a-Abstract.html},
	  timestamp = {Tue, 19 Jan 2021 15:57:00 +0100},
	  biburl    = {https://dblp.org/rec/conf/nips/KalweitHWB20.bib},
	  bibsource = {dblp computer science bibliography, https://dblp.org}
	}
							

Gabriel Kalweit, Maria Huegle and Joschka Boedecker

Composite Q-learning: Multi-scale Q-function Decomposition and
Separable Optimization

Paper (arxiv) Cite

Introduction

In the past few years, off-policy reinforcement learning methods have shown promising results in their application to robot control. Deep Q-learning, however, still suffers from poor data-efficiency and is susceptible to stochasticity in the environment or in the reward function, which is limiting with regard to real-world applications. We alleviate these problems by proposing two novel off-policy Temporal-Difference formulations: (1) Truncated Q-functions, which represent the return for the first $n$ steps of a target-policy rollout w.r.t. the full action-value, and (2) Shifted Q-functions, acting as the farsighted return after this truncated rollout. This decomposition allows us to optimize both parts with their individual learning rates, achieving significant learning speedup. We show that Truncated and Shifted Q-functions can be combined into the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare Deep Composite Q-learning with TD3 and TD3(Δ), which we introduce as an off-policy variant of TD(Δ). Moreover, we show that Composite TD3 significantly outperforms TD3 as well as state-of-the-art compositional Q-learning approaches in terms of data-efficiency in multiple simulated robot tasks, and that Composite Q-learning is robust to stochastic environments and reward functions. For an overview, see Fig. 1 below.


Fig. 1: Structure of Composite Q-learning. $Q_i$ denote the Truncated and $Q_{i:}$ the Shifted Q-functions at step $i$. $Q$ is the complete Composite Q-function. Directed incoming edges yield the targets for the corresponding value function. Edges denoted by $\gamma$ are discounted.

Technical Approach

Truncated Q-functions

In order to decompose the action-value into multiple truncated returns, assume that $n \le (T-1)$ and that $(T-1-t) \bmod n = 0$ for task horizon $T$. We make use of the following observation:
$$Q^\pi(s_t,a_t) = \mathbb{E}_{\pi,\mathcal{M}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots + \gamma^{T-1} r_{T-1}\right] = \mathbb{E}_{\pi,\mathcal{M}}\left[\left(\sum_{j=t}^{t+n-1}\gamma^{j-t}r_j\right) + \gamma^n\left(\sum_{j=t+n}^{t+2n-1}\gamma^{j-(t+n)}r_j\right) + \dots + \gamma^{T-n}\left(\sum_{j=T-n}^{T-1}\gamma^{j-(T-n)}r_j\right)\right].$$
That is, we can define the action-value as a combination of partial sums of length $n$. We can then define the Truncated Q-function as $Q^\pi_n(s_t,a_t) = \mathbb{E}_{\pi,\mathcal{M}}\left[\sum_{j=t}^{t+n-1}\gamma^{j-t}r_j\right]$, which we plug into the equation above:
$$Q^\pi(s_t,a_t) = \mathbb{E}_{\pi,\mathcal{M}}\left[Q^\pi_n(s_t,a_t) + \gamma^n Q^\pi_n(s_{t+n},a_{t+n}) + \dots + \gamma^{T-n} Q^\pi_n(s_{T-n},a_{T-n})\right].$$
We approximate $Q_n(s_t,a_t)$ off-policy via consecutive bootstrapping. In order to limit the prediction to horizon $n$, we estimate $n$ different truncated value functions belonging to increasing horizons, with the first one estimating the immediate reward function. All consecutive value functions then bootstrap from the prediction of the preceding value function, evaluating the target-policy of the full return $Q$ for a fixed horizon. Analogously to vanilla Q-learning, the update procedure is given by:
$$Q_1(s_t,a_t) \leftarrow (1-\alpha_{\mathrm{Tr}})\,Q_1(s_t,a_t) + \alpha_{\mathrm{Tr}}\,r_t$$
and
$$Q_{i>1}(s_t,a_t) \leftarrow (1-\alpha_{\mathrm{Tr}})\,Q_i(s_t,a_t) + \alpha_{\mathrm{Tr}}\left(r_t + \gamma Q_{i-1}(s_{t+1},\arg\max_a Q(s_{t+1},a))\right),$$
with learning rate $\alpha_{\mathrm{Tr}}$. Please note that in order to estimate the full return based on Truncated Q-functions alone, the dynamics model would be needed to obtain $s_{t+cn}$ of a rollout starting in $s_t$. In the following, we describe an approach to estimate the full return model-free.
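
A tabular sketch of this consecutive bootstrapping is given below; the array layout (q_tr holding $Q_1,\dots,Q_n$ and q_full holding the full Q-function used for the greedy action) is our own illustration, not the authors' code.

import numpy as np

def truncated_q_update(q_tr, q_full, s, a, r, s_next, alpha_tr=0.1, gamma=0.99):
    """q_tr: array [n, states, actions] of the truncated Q-functions Q_1..Q_n."""
    a_greedy = int(np.argmax(q_full[s_next]))           # greedy w.r.t. the full return Q
    q_tr[0, s, a] += alpha_tr * (r - q_tr[0, s, a])     # Q_1 estimates the immediate reward
    for i in range(1, q_tr.shape[0]):
        target = r + gamma * q_tr[i - 1, s_next, a_greedy]
        q_tr[i, s, a] += alpha_tr * (target - q_tr[i, s, a])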

Shifted Q-functions

To get an estimate of the remainder of the rollout after $n$ steps, $Q^\pi_{n:} := \mathbb{E}_{\pi,\mathcal{M}}\left[\gamma^n Q(s_{t+n},a_{t+n})\right]$, we use a consecutive bootstrapping formulation of the Q-prediction as a means to skip the first $n$ rewards of a target-policy rollout:
$$Q_{1:}(s_t,a_t) \leftarrow (1-\alpha_{\mathrm{Sh}})\,Q_{1:}(s_t,a_t) + \alpha_{\mathrm{Sh}}\left(\gamma\max_a Q(s_{t+1},a)\right)$$
and
$$Q_{(i>1):}(s_t,a_t) \leftarrow (1-\alpha_{\mathrm{Sh}})\,Q_{i:}(s_t,a_t) + \alpha_{\mathrm{Sh}}\left(\gamma Q_{(i-1):}(s_{t+1},\arg\max_a Q(s_{t+1},a))\right),$$
with learning rate $\alpha_{\mathrm{Sh}}$. We hypothesize that Shifted Q-functions allow for higher learning rates, since they only have to account for the variance of the underlying dynamics and not of the immediate reward function, as opposed to conventional value functions.
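
The Shifted Q-functions follow the same consecutive-bootstrapping pattern, except that the immediate reward never enters the target; again a sketch with the same assumed array layout.

import numpy as np

def shifted_q_update(q_sh, q_full, s, a, s_next, alpha_sh=0.1, gamma=0.99):
    """q_sh: array [n, states, actions] of the shifted Q-functions Q_{1:}..Q_{n:}."""
    a_greedy = int(np.argmax(q_full[s_next]))
    # Q_{1:} skips r_t and bootstraps directly from the full Q-value of s_{t+1}.
    q_sh[0, s, a] += alpha_sh * (gamma * q_full[s_next, a_greedy] - q_sh[0, s, a])
    for i in range(1, q_sh.shape[0]):
        target = gamma * q_sh[i - 1, s_next, a_greedy]
        q_sh[i, s, a] += alpha_sh * (target - q_sh[i, s, a])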

Composition

Following the definitions of Truncated and Shifted Q-functions, we can compose the full return:
$$Q(s_t,a_t) \leftarrow (1-\alpha_Q)\,Q(s_t,a_t) + \alpha_Q\left(r_t + \gamma\left(Q_n(s_{t+1},\arg\max_a Q(s_{t+1},a)) + Q_{n:}(s_{t+1},\arg\max_a Q(s_{t+1},a))\right)\right).$$
The incorporation of truncated returns breaks down the time scale of the long-term prediction made by the Shifted Q-function. We call this algorithm Composite Q-learning. Please note that Composite Q-learning is equivalent to Q-learning when setting $\alpha_{\mathrm{Tr}} = \alpha_{\mathrm{Sh}} = \alpha_Q$. A notable advantage, however, is that Composite Q-learning allows the learning rates corresponding to the different temporal horizons to be optimized independently.
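
Putting the pieces together, a single composite update reads the $n$-th truncated and shifted heads at the action that is greedy w.r.t. the full Q-function; this sketch reuses the hypothetical arrays from the two sketches above.

import numpy as np

def composite_q_update(q_full, q_tr, q_sh, s, a, r, s_next, alpha_q=0.05, gamma=0.99):
    """Compose the full return from the n-th truncated and shifted heads."""
    a_greedy = int(np.argmax(q_full[s_next]))
    target = r + gamma * (q_tr[-1, s_next, a_greedy] + q_sh[-1, s_next, a_greedy])
    q_full[s, a] += alpha_q * (target - q_full[s, a])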

Deep Composite Q-learning

We apply Composite Q-learning within TD3. Let $Q^{\mathrm{Tr}}(\cdot,\cdot|\theta^{\mathrm{Tr}})$ denote a function approximator with parameters $\theta^{\mathrm{Tr}}$ and $n$ outputs, subsequently called heads, estimating the Truncated Q-functions $Q_i$. Each output $Q^{\mathrm{Tr}}_i$ bootstraps from the prediction of the preceding head, with the first head approximating the immediate reward function. The targets are therefore given by:
$$y^{\mathrm{Tr}}_{j,1} = r_j \quad\text{and}\quad y^{\mathrm{Tr}}_{j,i>1} = r_j + \gamma Q^{\mathrm{Tr}'}_{i-1}(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})|\theta^{\mathrm{Tr}'}),$$
where $\mu'$ corresponds to the target-actor maximizing the full Q-value and $Q^{\mathrm{Tr}'}_{i-1}$ to the output of the respective target network. That is, $Q^{\mathrm{Tr}}$ represents evaluations of $\mu$ at different stages of truncation, and the $y^{\mathrm{Tr}}_{j,i<n}$ serve as intermediate predictions to obtain $y^{\mathrm{Tr}}_{j,n}$. We then only use $Q^{\mathrm{Tr}}_n$, which implements the full $n$-step return, as the first part of the composition of the Q-target, but not the $Q^{\mathrm{Tr}}_{i<n}$ values. The second part of the composition is represented by the Shifted Q-function. Let $Q^{\mathrm{Sh}}(\cdot,\cdot|\theta^{\mathrm{Sh}})$ denote the function approximator estimating the Shifted Q-functions $Q^\pi_{i:}$ with $n$ different outputs, parameterized by $\theta^{\mathrm{Sh}}$. We can shift the Q-prediction by bootstrapping without taking the immediate reward into account, so as to skip the first $n$ rewards of a target-policy rollout. The Shifted Q-targets for heads $Q^{\mathrm{Sh}}_i$ therefore become:
$$y^{\mathrm{Sh}}_{j,1} = \gamma Q'(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})|\theta^{Q'}) \quad\text{and}\quad y^{\mathrm{Sh}}_{j,i>1} = \gamma Q^{\mathrm{Sh}'}_{i-1}(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})|\theta^{\mathrm{Sh}'}).$$
In the function approximation setting, we can thus define the Composite Q-target as:
$$y^Q_j = r_j + \gamma\left(Q^{\mathrm{Tr}'}_n(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})|\theta^{\mathrm{Tr}'}) + Q^{\mathrm{Sh}'}_n(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})|\theta^{\mathrm{Sh}'})\right),$$
approximated by $Q(\cdot,\cdot|\theta^Q)$ with parameters $\theta^Q$.
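
The target computation for a minibatch could be sketched in PyTorch as follows; the container target_nets and the terminal mask done are assumptions for illustration, and TD3 details such as twin critics and target-policy smoothing are omitted.

import torch

def composite_targets(r, s_next, done, target_nets, gamma):
    """Composite Q-learning targets for one minibatch (sketch, assumed interfaces).

    target_nets['q_tr'] and ['q_sh'] map (state, action) batches to (m, n) head
    outputs, ['q'] returns the full Q-value of shape (m, 1), ['actor'] is the
    target policy; r and done have shape (m, 1).
    """
    with torch.no_grad():
        a_next = target_nets["actor"](s_next)
        q_tr = target_nets["q_tr"](s_next, a_next)        # (m, n) truncated heads
        q_sh = target_nets["q_sh"](s_next, a_next)        # (m, n) shifted heads
        q_full = target_nets["q"](s_next, a_next)         # (m, 1) full Q-value
        mask = 1.0 - done
        # Truncated targets: head 1 regresses the reward, head i bootstraps head i-1.
        y_tr = torch.cat([r, r + gamma * mask * q_tr[:, :-1]], dim=1)
        # Shifted targets: head 1 skips the reward, head i bootstraps head i-1.
        y_sh = torch.cat([gamma * mask * q_full, gamma * mask * q_sh[:, :-1]], dim=1)
        # Composite target: reward plus the n-th truncated and shifted heads.
        y_q = r + gamma * mask * (q_tr[:, -1:] + q_sh[:, -1:])
    return y_tr, y_sh, y_q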

Results

Vanilla Reward

In Fig. 2 and 3, we show the data-efficiency of Tabular Composite Q-learning and Shifted and Truncated Q-functions for a deterministic chain MDP of arbitrary horizon.


Fig. 2: (a) Deterministic chain MDP.
(b) Results of Composite Q-learning in the MDP shown in (a) with a horizon of 20.


Fig. 3: (a) Results of Composite and Shifted Q-learning with different learning rates for the
Shifted Q-function. (b) Results of Composite Q-learning with different learning rates
for the Truncated Q-function.

Noisy Reward

Fig. 4 shows the data-efficiency of Tabular Composite Q-learning for a stochastic chain MDP of arbitrary horizon.


Fig. 4: (a) Stochastic chain MDP.
(b) Results for Composite Q-learning for the MDP shown in (a).

Lastly, we compare in Fig. 6 Composite TD3, TD3(Δ) and TD3 at the default setting and with a high learning rate (HL) on multiple MuJoCo benchmarks, shown in Fig. 5.


Fig. 5: MuJoCo environments.


Fig. 6: Mean performance and half a standard deviation over 8 runs for (left) Walker2d-v2,
(middle) Hopper-v2 and (right) Humanoid-v2 with uniform noise
on the reward function as in Romoff et al., 2019.

BibTeX

			
@misc{kalweit2019composite,
	title={Composite Q-learning: Multi-scale Q-function Decomposition
	and Separable Optimization},
	author={Gabriel Kalweit and Maria Huegle and Joschka Boedecker},
	year={2019},
	eprint={1909.13518},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
}
							

Gabriel Kalweit*, Maria Huegle*, Moritz Werling and Joschka Boedecker

Constrained Q-learning

Paper (arxiv) Cite

Introduction

In many real-world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a list of constraints. Classical methods based on reward shaping, i.e. a weighted combination of different objectives in the reward signal, or Lagrangian methods, which include the constraints in the loss function, provide no guarantee that the agent satisfies the constraints at all points in time and lack interpretability. When a discrete policy is extracted from an action-value function, safe actions can be ensured by restricting the action space at maximization, but this can lead to sub-optimal solutions among the feasible alternatives. In this work, we propose Multi Time-scale Constrained DQN, a novel algorithm that restricts the action space directly in the Q-update to learn the optimal Q-function for the constrained MDP and the corresponding safe policy. In addition to single-step constraints referring only to the next action, we introduce a formulation for approximate multi-step constraints under the current target policy based on truncated value-functions to enhance interpretability. We compare our algorithm to reward shaping and Lagrangian methods in the application of high-level decision making in autonomous driving, considering constraints for safety, keeping right and comfort. We train our agent in the open-source simulator SUMO and on the real HighD data set. For an intuition, see Fig. 1 below.


Fig. 1: In this MDP, state s6 is marked as unsafe. Assume a safety check with a horizon of two time steps and that the initial state is s0. Unconstrained Q-learning (red) chooses the unsafe path leading to s9 with a return of +2 and Safe Policy Extraction (blue) after Q-learning leads to a safe path to state s10 with a return of +0.5. Multi Time-scale Constrained Q-learning (green) chooses the safe path to s11 with a return of +1.

Theoretical Background

Reinforcement Learning

In a reinforcement learning setting, an agent interacts with an environment, which is typically modeled as an MDP $\langle S, A, P, r, \gamma\rangle$. The agent follows policy $\pi: S \to A$ in some state $s_t$, applying a discrete action $a_t \sim \pi$ to reach a successor state $s_{t+1} \sim P$ according to a transition model $P$. In every discrete time step $t$, the agent receives reward $r_t$ for selecting action $a_t$ in state $s_t$. The goal of the agent is to maximize the expectation of the discounted long-term return $\mathbb{E}_{a_i\sim\pi,\,s_{i>t}\sim P}\left[R(s_t)\right] = \mathbb{E}_{a_i\sim\pi,\,s_{i>t}\sim P}\left[\sum_{i\ge t}\gamma^{i-t}r_i\right]$, where $\gamma\in[0,1]$ is the discount factor. The action-value function $Q^\pi(s_t,a_t) = \mathbb{E}_{a_{i>t}\sim\pi,\,s_{i>t}\sim P}\left[R(s_t)\,|\,a_t\right]$ represents the value of following policy $\pi$ after applying action $a_t$. The optimal policy can be inferred from the optimal action-value function $Q^*(s_t,a_t) = \max_\pi Q^\pi(s_t,a_t)$ by maximization.

Constrained Markov Decision Processes (CMDP)

We consider a CMDP $\langle S, A, P, r, \gamma, \mathcal{C}\rangle$ with constraint set $\mathcal{C} = \mathcal{C}^a \cup \mathcal{C}^\pi$, where the set of signals $\mathcal{C}^a = \{c_i: S\times A \to \mathbb{R}\,|\,1\le i\le N\}$ for single-step constraints depends only on the current state and action. We define the set of safe actions for a single-step constraint $c_i\in\mathcal{C}^a$ as $S_{c_i}(s_t) = \{a\in A\,|\,c_i(s_t,a) \le \beta_{c_i}\}$. The set $\mathcal{C}^\pi = \{J^\pi_{H_i,i}: S\times A \to \mathbb{R}\,|\,1\le i\le M\}$ consists of multi-step constraint signals $J^\pi_{H_i,i}$ with horizon $H_i$, which depend on policy $\pi$. The set of expected safe actions of a multi-step constraint $J^\pi_{H_i,i}\in\mathcal{C}^\pi$ is defined as $S^\pi_{J^\pi_{H_i,i}}(s_t) = \{a\in A\,|\,J^\pi_{H_i,i}(s_t,a) \le \beta_{J^\pi_{H_i,i}}\}$. We define $S_\mathcal{C}(s_t)$ as the intersection of all safe sets.
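
As a small sketch, assuming each constraint can be queried as a callable together with its threshold (a simplification of the learned multi-step estimates described below), the safe set $S_\mathcal{C}(s_t)$ is simply the intersection of the per-constraint safe sets:

def safe_set(s, actions, single_step, multi_step):
    """Intersect all single-step and multi-step safe sets in state s.

    single_step: list of (c, beta) with c(s, a) -> immediate constraint signal.
    multi_step:  list of (J, beta) with J(s, a) -> expected H-step constraint value
                 under the current target policy.
    """
    return [a for a in actions
            if all(c(s, a) <= beta for c, beta in single_step)
            and all(J(s, a) <= beta for J, beta in multi_step)]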

Safe Policy Extraction

Given an action-value function $Q^*$ and a set of constraints $\mathcal{C}$, we can extract the optimal safe policy $\pi$ w.r.t. $Q^*$ by $\pi(s_t) = \arg\max_{a\in S_\mathcal{C}(s_t)} Q^*(s_t,a)$. We call this method Safe Policy Extraction (SPE).
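
Safe Policy Extraction then reduces to a restricted argmax; a one-line sketch with a tabular Q and the safe_set helper from above:

def safe_policy_extraction(q, s, safe_actions):
    """Greedy action among the constraint-satisfying actions S_C(s)."""
    return max(safe_actions, key=lambda a: q[s, a])

Here, safe_actions would be obtained as safe_set(s, actions, single_step, multi_step).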

Technical Approach

While common constraints depend only on the current decision step, it can be crucial to represent the effect of the agent's current policy over a longer time scale. Typically, long-term dependencies on constraints are represented by the expected sum of discounted or average constraint signals $j_i$, i.e. $J^\pi(s_t,a_t) = \mathbb{E}_{a_{i>t}\sim\pi,\,s_{i>t}\sim P}\left[J(s_t)\,|\,a_t\right] = \mathbb{E}_{a_{i>t}\sim\pi,\,s_{i>t}\sim P}\left[\sum_{i\ge t}\gamma^{i-t}j_i\right]$. Instead, we only consider the next $H$ steps: $J^\pi_H(s_t,a_t) = \mathbb{E}_{a_{i>t}\sim\pi,\,s_{i>t}\sim P}\left[J_H(s_t)\,|\,a_t\right] = \mathbb{E}_{a_{i>t}\sim\pi,\,s_{i>t}\sim P}\left[\sum_{t\le i\le t+H}j_i\right]$. Due to the fixed horizon $H$, discounting is not needed, which leads to more interpretable constraints. We apply the formulation of truncated value-functions to predict the truncated constraint-values. We first estimate the immediate constraint-value and then follow a consecutive bootstrapping scheme to obtain the estimate for the full horizon $H$. The update rules for constraint-values $J^\pi_h$ are:
$$J^\pi_1(s_t,a_t) \leftarrow (1-\alpha_J)\,J^\pi_1(s_t,a_t) + \alpha_J\,j_t$$
and
$$J^\pi_{h>1}(s_t,a_t) \leftarrow (1-\alpha_J)\,J^\pi_h(s_t,a_t) + \alpha_J\left(j_t + J^\pi_{h-1}(s_{t+1},\arg\max_{a\in S_\mathcal{C}(s_{t+1})} Q(s_{t+1},a))\right),$$
with constraint-specific learning rate $\alpha_J$. To cope with infinite state-spaces, we jointly estimate $J^\pi_h$, $1\le h\le H$, with function approximator $J(\cdot,\cdot|\theta^J)$, parameterized by $\theta^J$. We extend the Q-learning update to use a set of constraints $\mathcal{C} = \mathcal{C}^a \cup \mathcal{C}^\pi$ with corresponding safe action set $S_\mathcal{C}(s_{t+1})$:
$$Q^C(s_t,a_t) \leftarrow (1-\alpha)\,Q^C(s_t,a_t) + \alpha\left(r_t + \gamma\max_{a\in S_\mathcal{C}(s_{t+1})} Q^C(s_{t+1},a)\right).$$
The optimal deterministic constrained policy $\pi\in\Pi_{S_\mathcal{C}}$ is given by $\pi(s_t) = \arg\max_{a\in S_\mathcal{C}(s_t)} Q^C(s_t,a)$. We refer to this algorithm as Multi Time-scale Constrained DQN (MTS-CDQN).
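
A tabular sketch of one MTS-CDQN step is shown below; the data layout, the safe_set helper and the use of $Q^C$ itself for the constrained-greedy next action are illustrative assumptions, not the authors' reference code.

import numpy as np

def mts_cdqn_step(q_c, j_heads, s, a, r, j, s_next, safe_set_fn,
                  alpha=0.1, alpha_j=0.1, gamma=0.99):
    """One tabular Multi Time-scale Constrained Q-learning update.

    q_c:     [states, actions] constrained Q-values.
    j_heads: [H, states, actions] truncated constraint-values J_1..J_H.
    j:       immediate constraint signal j_t observed for (s, a).
    safe_set_fn(s) returns the indices of actions in S_C(s), assumed non-empty.
    """
    safe_next = safe_set_fn(s_next)
    a_greedy = safe_next[int(np.argmax(q_c[s_next, safe_next]))]
    # Consecutive bootstrapping over the constraint horizon: J_1 regresses the
    # immediate signal, J_h bootstraps J_{h-1} at the constrained-greedy action.
    j_heads[0, s, a] += alpha_j * (j - j_heads[0, s, a])
    for h in range(1, j_heads.shape[0]):
        target = j + j_heads[h - 1, s_next, a_greedy]
        j_heads[h, s, a] += alpha_j * (target - j_heads[h, s, a])
    # Constrained Q-update: maximize only over the safe set of the next state.
    q_c[s, a] += alpha * (r + gamma * q_c[s_next, safe_next].max() - q_c[s, a])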

Results

In Fig. 2, we show the performance and constraint violations of MTS-CDQN, DQN + SPE, DQN + Reward Shaping and DQN + Constraint Violation Loss. In Fig. 3, we then show the performance of MTS-CDQN trained on the real HighD data set. As can be seen, the approach is very robust when trained on real data, which we visualize in Fig. 4.



Fig. 2: Results for MTS-CDQN in the autonomous driving task.
Fig. 3: Results for MTS-CDQN trained on the real HighD data set.


Fig. 4: Visualization of the HighD data set.

Demonstration

Video presenting the effect of Constrained Q-learning.

BibTeX

			
@misc{kalweit2020interpretable,
  title={Interpretable Multi Time-scale Constraints in Model-free 
  Deep Reinforcement Learning for Autonomous Driving},
  author={Gabriel Kalweit and Maria Huegle and Moritz Werling and Joschka Boedecker},
  year={2020},
  eprint={2003.09398},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
							

Gabriel Kalweit and Joschka Boedecker

Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning

Paper Cite

Introduction

Continuous control of high-dimensional systems can be achieved by current reinforcement learning methods such as the Deep Deterministic Policy Gradient algorithm (DDPG) and its extensions, but it requires a significant number of data samples. For real-world systems, this can be an obstacle, since excessive data collection can be expensive, tedious or lead to physical damage. The main incentive of this work is to keep the advantages of model-free Q-learning while minimizing real-world interaction through a dynamics model learned in parallel. To counteract the adverse effects of imaginary rollouts with an inaccurate model, a notion of uncertainty is introduced, so that artificial data is only used in cases of high uncertainty. We evaluate our approach on three simulated robot tasks and achieve faster learning by up to a factor of 15 in comparison to vanilla DDPG with multiple updates. For an overview, see Fig. 1 below.


Fig. 1: Structure of MA-BDDPG.

Technical Approach

Model-assisted DDPG

We approach the problem with a deterministic model and learn to predict the state change: $M_\Delta: S\times A \to \Delta S$. In each time step, the agent collects a real sample $s_t$, which is the starting point for $B$ imaginary rollouts of specified depth $D$. In each step, the action $a^b_d$ is selected based on the current real policy and random noise. For training, target values, loss and target update are equivalent to DDPG. Since actor and critic are updated on imaginary samples, those updates come down to $\hat{\mathcal{L}}(\theta^Q) = \frac{1}{n}\sum_i\left(\hat y_i - Q(\hat s_i,a_i|\theta^Q)\right)^2$, where the target values are defined as $\hat y_i = r(\hat s_i,a_i) + \gamma Q'(\hat s_{i+1},\mu'(\hat s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ and the occurring states as $\hat s_i = M_\Delta(\hat s_{i-1},a_{i-1}) + \hat s_{i-1}$. The critic is updated on $U_r$ real batches and $U_i$ imaginary batches and thus receives gradients from both loss functions in each cycle. We call this algorithm Model-assisted DDPG (MA-DDPG).
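
The imaginary data generation could be sketched as follows; policy, model_delta, reward_fn and buffer_imag are assumed interfaces, not the authors' code.

import numpy as np

def imagine_rollouts(s_t, policy, model_delta, reward_fn, buffer_imag,
                     n_rollouts=5, depth=3, noise_std=0.1):
    """Generate imaginary transitions starting from the real state s_t.

    model_delta predicts the state change, so s_{i+1} = s_i + model_delta(s_i, a_i).
    """
    for _ in range(n_rollouts):
        s = s_t
        for _ in range(depth):
            a = policy(s)
            a = a + np.random.normal(0.0, noise_std, size=a.shape)  # exploration noise
            s_next = s + model_delta(s, a)
            buffer_imag.append((s, a, reward_fn(s, a), s_next))
            s = s_next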

Including Uncertainty

In order to incorporate the uncertainty estimate and still be able to handle continuous action domains, the Bootstrapped DQN architecture is extended to follow the actor-critic scheme of DDPG. The Q-function approximator $Q_{1:K}(\cdot,\cdot|\theta^Q)$ is defined to be a deep neural network with parameters $\theta^Q$, composed of an optional single shared body and $K$ independent network heads $Q_k$. Analogously to DDPG and MA-DDPG, the target values are computed via slowly updated target networks $Q'_{1:K}$ and $\mu'$. When the agent receives a new transition, it generates a random mask $m_t = (m_1, m_2, \dots, m_K)_t$ with $m_i\sim P$, for some probability distribution $P$. In the experiments, each element of the mask is drawn according to a Bernoulli distribution with sharing probability $p$. This ensures that a sample is only seen by some subset of the heads and, hence, $K$ bootstrapped samples are created. The $K$ heads are then trained on their respective samples. First, the different target values for the $K$ heads are calculated,
$$y^{Q_{1:K}}_t \leftarrow r_t + \gamma Q'_{1:K}(s_{t+1},\mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'_{1:K}}),$$
and then the critic is updated on the combined loss,
$$\mathcal{L}(\theta^{Q_{1:K}}) = \sum_k\frac{1}{n_k}\sum_i m^k_i\left(y^k_i - Q_k(s_i,a_i|\theta^{Q_k})\right)^2,$$
where $n_k = \sum_i\delta_{m^k_i\neq 0}$ and $\delta$ is the indicator function. Note that the masks modify the loss so as to redirect the gradients only to the corresponding heads. The parameters $\theta^\mu$ of the single-network actor $\mu$ are then updated according to the mean of the respective updates of all heads,
$$\nabla_{\theta^\mu} R \approx \frac{1}{K}\sum_k\frac{1}{n}\sum_i\nabla_a Q_k(s_i,a|\theta^{Q_k})\big|_{a=\mu(s_i)}\nabla_{\theta^\mu}\mu(s_i|\theta^\mu).$$
After updating critic and actor, the weights of the target networks $Q'_{1:K}$ and $\mu'$ are shifted towards the current weights by $\tau\in(0,1]$, analogously to DDPG. Based on the distribution over Q-functions, the agent can then estimate its uncertainty via their variance.
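
The masked critic loss over the $K$ heads can be written compactly; the tensor shapes below are assumptions for illustration.

import torch

def bootstrapped_critic_loss(q_heads, y_heads, masks):
    """Masked loss over K bootstrapped critic heads (sketch).

    q_heads, y_heads, masks: tensors of shape (batch, K); masks[i, k] is 1 if
    sample i was assigned to head k (Bernoulli with sharing probability p).
    """
    per_head_err = masks * (y_heads - q_heads) ** 2      # zero out unseen samples
    n_k = masks.sum(dim=0).clamp(min=1.0)                # number of samples seen per head
    return (per_head_err.sum(dim=0) / n_k).sum()         # sum of per-head mean errors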

Limiting Model Usage

A simple way to limit the use of imagination in proportion to the current uncertainty is to transform the uncertainty measure into a probability and to induce a biased coin flip. The result of the coin flip then determines whether to train on an imaginary batch or not. The variance of an agent can be task-specific; on the other hand, the measure should not depend on prior knowledge about the variance. In order to circumvent this necessity, we put the uncertainty into the context of the maximum variance seen so far. Formally, we define the probability of learning on a minibatch $b$ from the imaginary replay buffer $\mathcal{I}$ in episode $e+1$ as $P[b\sim\mathcal{I}] = \hat\sigma^2_{e+1} = \bar\sigma^2_e / \max_{E\le e}\bar\sigma^2_E$, where $\bar\sigma^2_e$ denotes the mean variance in episode $e$. This restricts the mean variance ratio to $[0,1]$ and can be interpreted as follows: if it is unknown whether a value of the variance is high or not, the maximum value seen so far is a rational reference point. Hence, this underlying probability yields a dimensionless measure of uncertainty and should decrease over time as the different heads of the critic converge.
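
A sketch of this gating mechanism, with a hypothetical helper class that tracks the running maximum of the mean ensemble variance:

import numpy as np

class ImaginationGate:
    """Decide per update whether to train on an imaginary batch (sketch)."""
    def __init__(self):
        self.max_mean_var = 1e-8     # running maximum of the mean ensemble variance
        self.p_imagine = 1.0         # assumption: start with full use of imagination

    def end_of_episode(self, head_variances):
        """head_variances: per-step variances of the Q-ensemble during the episode."""
        mean_var = float(np.mean(head_variances))
        self.max_mean_var = max(self.max_mean_var, mean_var)
        self.p_imagine = mean_var / self.max_mean_var    # ratio in [0, 1]

    def use_imaginary_batch(self):
        return np.random.random() < self.p_imagine       # biased coin flip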

Results

Fig. 2 depicts the final results of MA-BDDPG compared to DDPG with the default setting and with multiple updates per collected sample. As can be seen, MA-BDDPG improves data-efficiency by up to a factor of 15. The amount of artificial data, however, decreases over time along with the variance within the Q-ensemble, as shown in Fig. 3.


Fig. 2: Comparison of MA-BDDPG and DDPG with default setting
and multiple updates per time step. MA-BDDPG
is up to 15 times more data-efficient.

Fig. 3: Ratio of real and artificial transitions during training. The use of virtual samples
decreases over time.

Demonstration

Video presenting the data-efficiency of MA-BDDPG.

BibTeX

			
@InProceedings{pmlr-v78-kalweit17a,
title = {Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning},
author = {Gabriel Kalweit and Joschka Boedecker},
booktitle =  {Proceedings of the 1st Annual Conference on Robot Learning},
pages = {195--206},
year = {2017},
editor = {Sergey Levine and Vincent Vanhoucke and Ken Goldberg},
volume = {78},
series = {Proceedings of Machine Learning Research},
month = {13--15 Nov},
publisher =  {PMLR},
pdf =  {http://proceedings.mlr.press/v78/kalweit17a/kalweit17a.pdf},
url =  {http://proceedings.mlr.press/v78/kalweit17a.html},
}

							

People

Maria Hügle

Website

Gabriel Kalweit

Website

Joschka Bödecker

Website