Gabriel Kalweit*, Maria Huegle*, Moritz Werling and Joschka Boedecker
NeurIPS 2020: Deep Inverse Q-Learning with Constraints
Introduction
Popular Maximum Entropy Inverse Reinforcement Learning approaches require the computation of expected state visitation frequencies for the optimal policy under an estimate of the reward function. This usually requires intermediate value estimation in the inner loop of the algorithm, slowing down convergence considerably. In this work, we introduce a novel class of algorithms that only needs to solve the MDP underlying the demonstrated behavior once to recover the expert policy. This is possible through a formulation that exploits a probabilistic behavior assumption for the demonstrations within the structure of Q-learning. We propose Inverse Action-value Iteration, which is able to fully recover an underlying reward of an external agent in closed form analytically. We further provide an accompanying class of sampling-based variants which do not depend on a model of the environment. We show how to extend this class of algorithms to continuous state-spaces via function approximation and how to estimate a corresponding action-value function, leading to a policy as close as possible to the policy of the external agent, while optionally satisfying a list of predefined hard constraints. We evaluate the resulting algorithms, called Inverse Action-value Iteration, Inverse Q-learning and Deep Inverse Q-learning, on the Objectworld benchmark, showing a speedup of up to several orders of magnitude compared to (Deep) Max-Entropy algorithms. We further apply Deep Constrained Inverse Q-learning to the task of learning autonomous lane-changes in the open-source simulator SUMO, achieving competent driving after training on data corresponding to 30 minutes of demonstrations. The scheme of Deep Constrained Inverse Q-learning can be seen in Fig. 1.
Fig. 1: Scheme of Deep Constrained Inverse Q-learning for the constrained transfer of unconstrained driving demonstrations, leading to optimal constrained imitation.
Technical Approach
Inverse Action-value Iteration
We start with the derivation of model-based Inverse Action-value Iteration (IAVI), where we first establish a relationship between the Q-values of a state-action pair $(s,a)$ and the Q-values of all other actions in this state. We assume that the trajectories are collected by an agent following a stochastic policy with an underlying Boltzmann distribution according to its unknown optimal value function. For a given optimal action-value function $Q^*$, the corresponding distribution is:
$$\pi^E(a|s) := \frac{\exp(Q^*(s,a))}{\sum_{A\in\mathcal{A}}\exp(Q^*(s,A))},$$
for all actions $a\in\mathcal{A}$. Rearranging gives:
$$\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)) = \frac{\exp(Q^*(s,a))}{\pi^E(a|s)} \quad\text{and}\quad \exp(Q^*(s,a)) = \pi^E(a|s)\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)).$$
Denoting the set of actions excluding action $a$ as $\bar{\mathcal{A}}_a$, we can express the Q-value for an action $a$ in terms of the Q-value for any action $b\in\bar{\mathcal{A}}_a$ in state $s$ and the probabilities for taking these actions:
$$\exp(Q^*(s,a)) = \pi^E(a|s)\sum_{A\in\mathcal{A}}\exp(Q^*(s,A)) = \frac{\pi^E(a|s)}{\pi^E(b|s)}\exp(Q^*(s,b)).$$
Taking the log then leads to $Q^*(s,a) = Q^*(s,b) + \log(\pi^E(a|s)) - \log(\pi^E(b|s))$. Thus, we can relate the optimal value of action $a$ to the optimal values of all other actions in the same state by including the respective log-probabilities:
$$(n-1)\,Q^*(s,a) = (n-1)\log(\pi^E(a|s)) + \sum_{b\in\bar{\mathcal{A}}_a}\left(Q^*(s,b) - \log(\pi^E(b|s))\right).$$
The optimal action-value for state $s$ and action $a$ is composed of the immediate reward $r(s,a)$ and the optimal action-value of the next state as given by the transition model, i.e. $Q^*(s,a) = r(s,a) + \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')]$. Using this definition to replace the Q-values and defining the difference between the log-probability and the discounted value of the next state as
$$\eta_s^a := \log(\pi^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')],$$
we can then solve for the immediate reward:
$$r(s,a) = \eta_s^a + \frac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_a}\left(r(s,b) - \eta_s^b\right).$$
Formulating this equation for all actions $a_i\in\mathcal{A}$ results in a system of linear equations $\mathbf{X}_\mathcal{A}(s)\,\mathbf{R}_\mathcal{A}(s) = \mathbf{Y}_\mathcal{A}(s)$, with reward vector $\mathbf{R}_\mathcal{A}(s)$, coefficient matrix $\mathbf{X}_\mathcal{A}(s)$ and target vector $\mathbf{Y}_\mathcal{A}(s)$:
$$\begin{bmatrix} 1 & -\tfrac{1}{n-1} & \dots & -\tfrac{1}{n-1}\\ -\tfrac{1}{n-1} & 1 & \dots & -\tfrac{1}{n-1}\\ \vdots & \vdots & \ddots & \vdots\\ -\tfrac{1}{n-1} & -\tfrac{1}{n-1} & \dots & 1 \end{bmatrix} \begin{bmatrix} r(s,a_1)\\ r(s,a_2)\\ \vdots\\ r(s,a_n) \end{bmatrix} = \begin{bmatrix} \eta_s^{a_1} - \tfrac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_{a_1}}\eta_s^b\\ \eta_s^{a_2} - \tfrac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_{a_2}}\eta_s^b\\ \vdots\\ \eta_s^{a_n} - \tfrac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_{a_n}}\eta_s^b \end{bmatrix}.$$
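To make the closed-form step concrete, below is a minimal NumPy sketch (not code from the paper) of the per-state linear system above; the function name and the least-squares handling of the per-state shift invariance are our own illustrative choices, and the $\eta_s^a$ values are assumed to have already been computed from the expert policy and the transition model.

```python
import numpy as np

def iavi_reward_for_state(eta_s):
    """Recover the immediate rewards r(s, .) for a single state s.

    eta_s: array of shape (n,), with eta_s[a] = log pi_E(a|s)
           - gamma * E_{s'}[max_{a'} Q*(s', a')].
    """
    n = len(eta_s)
    # Coefficient matrix X_A(s): 1 on the diagonal, -1/(n-1) elsewhere.
    X = np.full((n, n), -1.0 / (n - 1))
    np.fill_diagonal(X, 1.0)
    # Target vector Y_A(s): eta_s^a - 1/(n-1) * sum_{b != a} eta_s^b.
    Y = eta_s - (eta_s.sum() - eta_s) / (n - 1)
    # X has rank n-1 (rewards within a state are only identified up to a
    # common shift), so take the minimum-norm least-squares solution.
    r_s, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return r_s
```

Since $\eta_s^a$ itself depends on the optimal values of successor states, IAVI interleaves this per-state solve with updating the action-values, analogous to value iteration.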
Tabular Inverse Q-learning
To relax the assumption of an existing transition model and given action probabilities, we extend Inverse Action-value Iteration to a sampling-based algorithm. For every transition $(s,a,s')$, we update the reward function using stochastic approximation:
$$r(s,a) \leftarrow (1-\alpha_r)\,r(s,a) + \alpha_r\left(\eta_s^a + \frac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_a}\left(r(s,b) - \eta_s^b\right)\right),$$
with learning rate $\alpha_r$. Additionally, we need a state-action visitation counter $\rho(s,a)$ to calculate the respective log-probabilities: $\tilde{\pi}^E(a|s) := \rho(s,a)/\sum_{A\in\mathcal{A}}\rho(s,A)$. We therefore approximate $\eta_s^a$ by $\tilde{\eta}_s^a := \log(\tilde{\pi}^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')]$. To avoid the need for a model $\mathcal{M}$, we evaluate all other actions via Shifted Q-functions as in Kalweit et al.:
$$Q^\mathrm{Sh}(s,a) := \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')],$$
i.e. $Q^\mathrm{Sh}(s,a)$ skips the immediate reward for taking $a$ in $s$ and only considers the discounted Q-value of the next state $s'$ for the maximizing action $a'$. We then formalize $\tilde{\eta}_s^a$ as:
$$\tilde{\eta}_s^a := \log(\tilde{\pi}^E(a|s)) - \gamma\max_{a'}\mathbb{E}_{s'\sim\mathcal{M}(s,a,s')}[Q^*(s',a')] = \log(\tilde{\pi}^E(a|s)) - Q^\mathrm{Sh}(s,a).$$
Combining this with the count-based approximation $\tilde{\pi}^E(a|s)$ and updating $Q^\mathrm{Sh}$, $r$, and $Q$ via stochastic approximation yields the model-free Tabular Inverse Q-learning algorithm (IQL).
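As an illustration of the resulting update rule (not the authors' reference implementation), a tabular sketch of one IQL update for a single transition could look as follows; the array layout, the single set of learning rates and the small constant added before the logarithm are illustrative assumptions.

```python
import numpy as np

def iql_update(s, a, s_next, rho, q_sh, r, q,
               gamma=0.99, alpha_r=0.1, alpha_q=0.1, alpha_sh=0.1):
    """One Tabular Inverse Q-learning update for a transition (s, a, s').

    rho, q_sh, r, q are (n_states, n_actions) arrays, updated in place.
    """
    n_actions = q.shape[1]
    # Count-based estimate of the expert's action distribution in s.
    rho[s, a] += 1
    pi_tilde = rho[s] / rho[s].sum()
    # Shifted Q-function: only the discounted value of the next state.
    q_sh[s, a] = (1 - alpha_sh) * q_sh[s, a] + alpha_sh * gamma * q[s_next].max()
    # eta estimates for all actions in s: log-probability minus shifted value.
    eta = np.log(pi_tilde + 1e-12) - q_sh[s]
    # Reward update from the IAVI fixed-point equation.
    others = [b for b in range(n_actions) if b != a]
    target_r = eta[a] + (r[s, others] - eta[others]).sum() / (n_actions - 1)
    r[s, a] = (1 - alpha_r) * r[s, a] + alpha_r * target_r
    # Standard Q-learning update with the current reward estimate.
    q[s, a] = (1 - alpha_q) * q[s, a] + alpha_q * (r[s, a] + gamma * q[s_next].max())
```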
Deep Inverse Q-learning
To cope with continuous state-spaces, we now introduce a variant of IQL with function approximation. We estimate the reward function $r$ with function approximator $r(\cdot,\cdot|\theta^r)$, parameterized by $\theta^r$. The same holds for $Q$ and $Q^\mathrm{Sh}$, represented by $Q(\cdot,\cdot|\theta^Q)$ and $Q^\mathrm{Sh}(\cdot,\cdot|\theta^\mathrm{Sh})$ with parameters $\theta^Q$ and $\theta^\mathrm{Sh}$. To alleviate the problem of moving targets, we further introduce target networks for $r(\cdot,\cdot|\theta^r)$, $Q(\cdot,\cdot|\theta^Q)$ and $Q^\mathrm{Sh}(\cdot,\cdot|\theta^\mathrm{Sh})$, denoted by $r'(\cdot,\cdot|\theta^{r'})$, $Q'(\cdot,\cdot|\theta^{Q'})$ and $Q^{\mathrm{Sh}'}(\cdot,\cdot|\theta^{\mathrm{Sh}'})$ and parameterized by $\theta^{r'}$, $\theta^{Q'}$ and $\theta^{\mathrm{Sh}'}$, respectively. Each collected transition $(s_t,a_t,s_{t+1})$, either online or from a fixed batch, is stored in replay buffer $\mathcal{D}$. We then sample minibatches $(s_i,a_i,s_{i+1})_{1\le i\le m}$ from $\mathcal{D}$ to update the parameters of our function approximators. First, we calculate the target for the Shifted Q-function:
$$y_i^\mathrm{Sh} := \gamma\max_a Q'(s_{i+1},a|\theta^{Q'}),$$
and then apply one step of gradient descent on the mean squared error to the respective predictions, i.e. $\mathcal{L}(\theta^\mathrm{Sh}) = \frac{1}{m}\sum_i \left(Q^\mathrm{Sh}(s_i,a_i|\theta^\mathrm{Sh}) - y_i^\mathrm{Sh}\right)^2$. We approximate the state-action visitation by classifier $\rho(\cdot,\cdot|\theta^\rho)$, parameterized by $\theta^\rho$ and with linear output. Applying the softmax on the outputs of $\rho(\cdot,\cdot|\theta^\rho)$ then maps each state $s$ to a probability distribution over actions $a_j$, $1\le j\le n$. Classifier $\rho(\cdot,\cdot|\theta^\rho)$ is trained to minimize the cross entropy between its induced probability distribution and the corresponding targets, i.e.:
$$\mathcal{L}(\theta^\rho) = \frac{1}{m}\sum_i \Big(-\rho(s_i,a_i|\theta^\rho) + \log\sum_j \exp\rho(s_i,a_j|\theta^\rho)\Big).$$
Given the predictions of $\rho(\cdot,\cdot|\theta^\rho)$, we can calculate targets $y_i^r$ for the reward estimate $r(\cdot,\cdot|\theta^r)$ by:
$$y_i^r := \tilde{\eta}_{s_i}^{a_i} + \frac{1}{n-1}\sum_{b\in\bar{\mathcal{A}}_{a_i}}\left(r'(s_i,b|\theta^{r'}) - \tilde{\eta}_{s_i}^b\right),$$
and apply gradient descent on the mean squared error $\mathcal{L}(\theta^r) = \frac{1}{m}\sum_i \left(r(s_i,a_i|\theta^r) - y_i^r\right)^2$. Lastly, we can perform a gradient step on loss $\mathcal{L}(\theta^Q) = \frac{1}{m}\sum_i \left(Q(s_i,a_i|\theta^Q) - y_i^Q\right)^2$, with targets:
$$y_i^Q := r'(s_i,a_i|\theta^{r'}) + \gamma\max_a Q'(s_{i+1},a|\theta^{Q'}),$$
to update the parameters $\theta^Q$ of $Q(\cdot,\cdot|\theta^Q)$. We update the target networks by Polyak averaging, i.e. $\theta^{\mathrm{Sh}'} \leftarrow (1-\tau)\theta^{\mathrm{Sh}'} + \tau\theta^\mathrm{Sh}$, $\theta^{r'} \leftarrow (1-\tau)\theta^{r'} + \tau\theta^r$ and $\theta^{Q'} \leftarrow (1-\tau)\theta^{Q'} + \tau\theta^Q$.
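A hedged PyTorch sketch of one such minibatch step is given below; the network container names, the per-network optimizers and the choice of which networks (online vs. target) enter the $\tilde{\eta}$ computation are our own assumptions and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def diql_step(batch, nets, targets, opts, gamma=0.99, n_actions=5, tau=0.005):
    """One Deep Inverse Q-learning minibatch step (illustrative sketch).

    nets:    dict of online networks 'rho', 'q_sh', 'r', 'q', each mapping
             a batch of states to one output per action.
    targets: dict of target networks 'q_sh', 'r', 'q'.
    opts:    dict of optimizers, one per online network.
    """
    s, a, s_next = batch                                # states, actions, next states
    a_col = a.unsqueeze(1)                              # (m, 1) index for gather
    not_a = 1.0 - F.one_hot(a, n_actions).float()       # mask of the "other" actions

    # 1) Shifted Q-function: y_i^Sh = gamma * max_a Q'(s_{i+1}, a).
    with torch.no_grad():
        y_sh = gamma * targets['q'](s_next).max(dim=1).values
    q_sh_pred = nets['q_sh'](s).gather(1, a_col).squeeze(1)
    loss_sh = F.mse_loss(q_sh_pred, y_sh)

    # 2) Action classifier rho: cross entropy against the demonstrated actions.
    loss_rho = F.cross_entropy(nets['rho'](s), a)

    # 3) Reward targets from the IAVI relation, with eta = log pi - Q^Sh.
    with torch.no_grad():
        log_pi = F.log_softmax(nets['rho'](s), dim=1)   # estimated expert log-policy
        eta = log_pi - targets['q_sh'](s)               # (m, n_actions)
        diff = targets['r'](s) - eta                    # r'(s, b) - eta_b for all b
        y_r = eta.gather(1, a_col).squeeze(1) \
              + (diff * not_a).sum(dim=1) / (n_actions - 1)
    loss_r = F.mse_loss(nets['r'](s).gather(1, a_col).squeeze(1), y_r)

    # 4) Q-function: y_i^Q = r'(s_i, a_i) + gamma * max_a Q'(s_{i+1}, a).
    with torch.no_grad():
        y_q = targets['r'](s).gather(1, a_col).squeeze(1) \
              + gamma * targets['q'](s_next).max(dim=1).values
    loss_q = F.mse_loss(nets['q'](s).gather(1, a_col).squeeze(1), y_q)

    # Gradient steps (each loss only touches its own network's parameters).
    for name, loss in (('q_sh', loss_sh), ('rho', loss_rho), ('r', loss_r), ('q', loss_q)):
        opts[name].zero_grad()
        loss.backward()
        opts[name].step()

    # Polyak averaging of the target networks.
    for name in ('q_sh', 'r', 'q'):
        for p, p_t in zip(nets[name].parameters(), targets[name].parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```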
Deep Constrained Inverse Q-learning
Following the definition of Constrained Q-learning in Kalweit & Huegle et al., we extend IQL to incorporate a set of constraints $\mathcal{C} = \{c_i : \mathcal{S}\times\mathcal{A} \to \mathbb{R}\ |\ 1\le i\le C\}$ shaping the space of safe actions in each state. We define the safe set for constraint $c_i$ as $S_{c_i}(s) = \{a\in\mathcal{A}\ |\ c_i(s,a) \le \beta_{c_i}\}$, where $\beta_{c_i}$ is a constraint-specific threshold, and $S_\mathcal{C}(s)$ as the intersection of all safe sets. In addition to the Q-function in IQL, we estimate a constrained Q-function $Q^\mathcal{C}$ by:
$$Q^\mathcal{C}(s,a) \leftarrow (1-\alpha_{Q^\mathcal{C}})\,Q^\mathcal{C}(s,a) + \alpha_{Q^\mathcal{C}}\left(r(s,a) + \gamma\max_{a'\in S_\mathcal{C}(s')} Q^\mathcal{C}(s',a')\right).$$
For policy extraction from $Q^\mathcal{C}$ after Q-learning, only the action-values of the constraint-satisfying actions are considered. As shown in Kalweit & Huegle et al., this formulation of the Q-function optimizes long-term constraint satisfaction and yields the optimal action-values of the induced constrained MDP. Put differently, including the constraints directly in IQL leads to optimal constrained imitation from unconstrained demonstrations. Analogously to Deep Inverse Q-learning, we can approximate $Q^\mathcal{C}$ with function approximator $Q^\mathcal{C}(\cdot,\cdot|\theta^\mathcal{C})$ and an associated target network $Q^{\mathcal{C}'}(\cdot,\cdot|\theta^{\mathcal{C}'})$. We call this algorithm Deep Constrained Inverse Q-learning (DCIQL).
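To illustrate how the safe set enters the bootstrap, here is a small PyTorch sketch of the constrained Q-target; how the boolean safe-action mask is obtained from the constraint signals $c_i$ and thresholds $\beta_{c_i}$ is task-specific and left as an assumption.

```python
import torch

def constrained_q_target(r_sa, q_c_next, safe_mask, gamma=0.99):
    """Bootstrap target for the constrained Q-function Q^C (sketch).

    r_sa:      (m,) estimated rewards r(s_i, a_i).
    q_c_next:  (m, n_actions) target-network values Q^C'(s_{i+1}, .).
    safe_mask: (m, n_actions) boolean, True where all constraints hold;
               assumes at least one safe action per next state.
    """
    # Restrict the max to the safe set S_C(s') by masking unsafe actions.
    masked = q_c_next.masked_fill(~safe_mask, float('-inf'))
    return r_sa + gamma * masked.max(dim=1).values
```

At execution time, the greedy policy extracted from $Q^\mathcal{C}$ is restricted to the safe actions in the same way.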
Results
In Fig. 2 and Fig. 3, we evaluate the performance of IAVI, IQL and DIQL on the common Objectworld IRL benchmark and compare to MaxEnt IRL (the entropy-based IRL approach closest to IAVI) and Deep MaxEnt IRL (which is able to recover non-linear reward functions):
Fig. 2: Results for the Objectworld environment, given a data set with an action distribution equivalent to the true optimal Boltzmann distribution. Visualization of the true and learned state-value functions, together with the resulting expected value difference and the time needed until convergence, shown as mean and standard deviation over 5 training runs on a 3.00 GHz CPU.
Fig. 3: Mean EVD and standard deviation over 5 runs for different numbers of demonstrations in Objectworld.
In Fig. 4 we show the potential of constrained imitation by DCIQL on a more complex highway scenario in the open-source traffic simulator SUMO:
Fig. 4: Results for DCIQL in the autonomous driving task.
Demonstration
Video explaining Deep Inverse Q-learning with Constraints.
BibTeX
@inproceedings{DBLP:conf/nips/KalweitHWB20,
author = {Gabriel Kalweit and
Maria H{\"{u}}gle and
Moritz Werling and
Joschka Boedecker},
title = {Deep Inverse Q-learning with Constraints},
booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
on Neural Information Processing Systems 2020, NeurIPS 2020, December
6-12, 2020, virtual},
year = {2020},
url = {https://proceedings.neurips.cc/paper/2020/hash/a4c42bfd5f5130ddf96e34a036c75e0a-Abstract.html},
timestamp = {Tue, 19 Jan 2021 15:57:00 +0100},
biburl = {https://dblp.org/rec/conf/nips/KalweitHWB20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}