SPECIAL ISSUE PAPER

Increasing sample efficiency in deep reinforcement learning using generative environment modelling

Per-Arne Andersen | Morten Goodwin | Ole-Christoffer Granmo
Department of ICT, University of Agder, Grimstad, Norway

Correspondence: Per-Arne Andersen, Jon Lilletunsvei 9, 4879 Grimstad, Norway. Email: per.andersen@uia.no

DOI: 10.1111/exsy.12537 | Received: 16 May 2019 | Revised: 10 October 2019 | Accepted: 24 January 2020
This article extends prior study found in Andersen, Goodwin and Granmo (2017, 2018).
This is an open access article under the terms of the Creative Commons Attribution License. © 2020 The Authors. Expert Systems published by John Wiley & Sons Ltd.

Abstract

Reinforcement learning is a broad family of learning algorithms that, in recent times, has shown astonishing performance in controlling agents in environments presented as Markov decision processes. There are several unsolved problems in the current state of the art that cause algorithms to learn suboptimal policies, or even to diverge and collapse completely. Part of the solution to these issues may be related to short- and long-term planning, memory management and exploration for reinforcement learning algorithms. Games are frequently used to benchmark reinforcement learning algorithms as they provide flexible, reproducible and easy-to-control environments. Regardless, few games make it possible to observe how an algorithm performs exploration, memorization and planning. This article presents The Dreaming Variational Autoencoder with Stochastic Weight Averaging and Generative Adversarial Networks (DVAE-SWAGAN), a neural network-based generative modelling architecture for exploration in environments with sparse feedback. We present deep maze, a novel and flexible maze game engine that challenges DVAE-SWAGAN with partially and fully observable state-spaces, long-horizon tasks, and deterministic and stochastic problems. We present results for the different variants of the algorithm and encourage future study of reinforcement learning driven by generative exploration.

KEYWORDS
artificial experience-replay, deep reinforcement learning, environment modelling, exploration, generative adversarial networks, generative modelling, Markov decision processes, model-based RL, neural networks, variational autoencoder

1 | INTRODUCTION

Reinforcement learning (RL) is a field of research that has quickly become one of the most promising branches of machine learning algorithms for solving artificial general intelligence (Arulkumaran, Deisenroth, Brundage and Bharath 2017; Kaelbling, Littman and Moore 1996; Li 2017; Mousavi, Schukat and Howley 2018).
There have been several breakthroughs in reinforcement learning research in recent years for environments such as the Atari Arcade (Mnih et al. 2013, 2015), AlphaZero (Silver et al. 2017), OpenAI Five and AlphaStar (Arulkumaran, Cully and Togelius 2019), but no algorithm is capable of human performance without extensive hardware requirements that are accessible to only a few institutions, such as DeepMind and OpenAI. Due to this, reinforcement learning still has several open research questions to address before the general public can deploy these algorithms successfully.

It is possible that many of the issues in the current state of the art can be resolved with algorithms that adequately account for planning, exploration and memory efficiency at different time-horizons. In current state-of-the-art RL algorithms, long-horizon RL tasks are difficult to master because there is as of yet no optimal exploration algorithm capable of proper state-space pruning. Exploration strategies such as ϵ-greedy are widely used in RL, but cannot find an adequate exploration/exploitation balance without exhaustive hyperparameter tuning.

Environment modelling is a promising exploration method where the goal is to imitate the behaviour of a target environment. By constructing an artificial model of the environment, the need to interact with the ground-truth environment is reduced significantly. This enables the RL algorithm to explore large parts of the state-space without the cost of exhausting the ground-truth environment. A balance between exploration and exploitation must be maintained to learn the underlying dynamics of the environment. The algorithm must, therefore, select observations carefully so that it learns only essential features during the exploration phase.

By combining generative modelling with deep reinforcement learning, we find that it is possible for agents to learn optimal policies using only generated training data samples. The approach that we present is the dreaming variational autoencoder with two extensions, based on generative adversarial networks and stochastic weight averaging. We also present a new learning environment, deep maze, which aims to bring a vast set of challenges to reinforcement learning algorithms and is the environment used for testing the presented algorithms.

This article is organized as follows. Section 2 surveys recent studies in the field. Section 3 briefly introduces the reader to preliminaries. Section 4 proposes The Dreaming Variational Autoencoder for environment modelling to improve exploration in RL. Section 5 introduces the deep maze learning environment for exploration, planning and memory management research in reinforcement learning. Section 6 presents results in the deep line wars environment and shows that RL agents can be trained to navigate the deep maze environment using only artificial training data. Finally, we summarize our findings and outline future study in Section 7.

2 | LITERATURE REVIEW

In machine learning, the goal is to create an algorithm that is capable of accurately representing a target environment. There is, however, little literature on generative modelling for game environments on the scale we propose in this article. The primary focus of recent RL research has been improvements in on- and off-policy algorithms, while less attention has been paid to generative exploration. This section presents a thorough literature review of reinforcement-based generative modelling.

Bangaru, Suhas and Ravindran (2016) proposed a method of deducing the Markov Decision Process (MDP) by introducing an adaptive exploration signal (pseudo-reward), which was obtained using a deep generative model.
Their approach was to compute the Jacobian of each state and use it as the pseudo-reward when using deep neural networks to learn the state generalization.

Xiao and Kesineni (2016) proposed the use of generative adversarial networks (GAN) for model-based reinforcement learning. The goal was to utilize GAN to learn the dynamics of the environment over a short horizon and combine this with the strength of far-horizon value-iteration RL algorithms. The proposed GAN architecture produced near-authentic generated images, giving results comparable to Mnih et al. (2013).

Higgins et al. (2017) proposed DARLA, an architecture for modelling the environment using β-VAE (Higgins et al. 2016). The trained model was used to learn the optimal policy of the environment using algorithms such as Deep Q-Networks (DQN) (Mnih et al. 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. 2016) and Episodic Control (Blundell et al. 2016). DARLA is, to the best of our knowledge, the first algorithm to introduce learning without access to the ground-truth environment during training.

Buesing et al. (2018) recently compared several methods of environment modelling, showing that it is far better to model the state-space than to utilize Monte-Carlo rollouts (RAR). The proposed architecture, state-space models (SSM), was significantly faster and produced acceptable results compared to auto-regressive (AR) methods.

The algorithm VMAV-C is a combination of a VAE, an attention-based value function (AVF) and the mixture density network recurrent neural network (MDN-RNN) from Ha and Schmidhuber (2018). This modification of the original World Models algorithm improved performance in the Cart Pole environment. The authors used the on-policy algorithm PPO to learn the optimal policy from the latent representation of the state-space (Liang, Wang, Feng, Liu and Huang 2018).

Deep Planning Network (PlaNet) is a model-based agent that interprets the pixels of a state to learn the dynamics of an environment. The environment dynamics are stored in a latent-space, from which the agent samples actions based on the learned representation. The proposed algorithm showed significantly better sample efficiency compared to algorithms such as A3C (Hafner et al. 2018).

Chua, Calandra, McAllister and Levine (2018) recently proposed Probabilistic Ensembles with Trajectory Sampling (PETS). The algorithm uses an ensemble of bootstrap neural networks to learn a dynamics model of the environment over future states. The algorithm then uses this model to predict the best action for future states. The authors show that the algorithm significantly lowers sampling requirements for environments such as half-cheetah compared to SAC and PPO.

Stochastic optimal control with latent representations (SOLAR) is an algorithm that learns the dynamics of the environment by exploiting the knowledge from a reinforcement learning policy. This enables the algorithm to learn local models that are used in policy learning for complex systems. SOLAR is built around a probabilistic graphical model (PGM) structure that allows efficient learning of the environment model. By exploiting the locality of the model, the authors show that the gradients give a good direction for policy improvements during training. The algorithm was compared to model-free methods and showed significantly better performance and data efficiency than algorithms such as LQR-FLM (Zhang, Patras and Haddadi 2018).
Ha and Schmidhuber (2018) proposed, in Recurrent World Models Facilitate Policy Evolution, a novel architecture for training RL algorithms using variational autoencoders. The article showed that agents could successfully learn the environment dynamics and use this as an exploration technique requiring no interaction with the target domain. The architecture consists of three main components: vision, controller and model. The vision component is a variational autoencoder that outputs a latent-space variable of an observation. The latent-space variable is processed by the model and fed into the controller for action decisions. Their algorithm shows state-of-the-art performance in self-supervised generative modelling for reinforcement learning agents.

One of the most recent advancements in generative modelling for reinforcement learning is the Neural Differential Information Gain Optimisation (NDIGO) algorithm by Azar et al. (2019), a self-supervised exploration model that learns a world-model representation from noisy data. The primary feature of NDIGO is its robustness to noise, achieved by cancelling out negative loss and giving positive learning more weight. The authors show in their maze environment that the model successfully converges towards an optimal world model even when noise is introduced. The authors claim that the algorithm outperforms the previous state of the art, the Recurrent World Model from Ha and Schmidhuber (2018).

3 | BACKGROUND

We base our work on the well-established theory of reinforcement learning originally formulated in Sutton, Precup and Singh (1999), defining the problem as an MDP given by the tuple 〈S, A, ℛ, T〉. The state-space S represents all possible states, while the action-space A represents all actions the agent can perform in the environment. ℛ is the reward function, and T denotes the transition function T: S × A → S, which maps a state st ∈ S and an action at ∈ A to the future state st+1. After each performed action, the environment dispatches a reward signal, ℛ: S → r. We call a sequence of states and actions a trajectory, denoted as τ = (s0, a0, …, st, at), and the sequence is sampled through the use of a stochastic policy that predicts the optimal action in any state: πθ(at | st), where π is the policy and θ are its parameters. The primary goal of reinforcement learning is to reinforce good behaviour: the algorithm should try to learn the policy that maximizes the total expected discounted reward, given by J(π) = 𝔼(st,at)∼p(π)[∑_{i=0}^{T} γ^i ℛ(si)] (Mnih et al. 2015).

Several algorithms try to address the reinforcement learning problem. They are commonly divided into three categories: value iteration, policy iteration and actor-critic algorithms, with variations of on-policy and off-policy learning.

Autoencoders are commonly used in supervised learning to encode arbitrary input to a compact representation and to use a decoder to reconstruct the original data from the encoding. The purpose of autoencoders is to store redundant data in a densely packed vector form. In its simplest form, an autoencoder consists of a feed-forward neural network where the input and output layers have equal neuron capacity and the hidden layer is smaller, used to compress the data. Such an autoencoder can be defined as ϕ: X → Z, ψ: Z → X, where ϕ, ψ = argmin_{ϕ,ψ} ‖X − (ψ ∘ ϕ)X‖². In this notation, X defines the input, Z the encoding, and X̂ = (ψ ∘ ϕ)X the reconstructed data. There are, however, several issues with autoencoders, as they are not generative in the sense that they cannot transition between features in the input data. To remedy this, Kingma and Welling (2013) proposed the Variational Autoencoder (VAE) algorithm, which enables interpolation between features in the latent space. The interpolation is achieved by introducing a vector of means μ and a vector of standard deviations σ. These vectors, along with an additional KL-loss term, give the algorithm the ability to learn variations in the input data, enabling the output to be vastly more diverse.
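To make the variational autoencoder concrete, the following is a minimal PyTorch sketch of the reparameterization z = μ + σ ⊙ ε and the combined reconstruction/KL loss described above. The layer sizes and variable names are illustrative assumptions, not the architecture used later in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniVAE(nn.Module):
    """Minimal VAE: encoder q(z|X) -> (mu, log sigma^2), decoder p(X|z)."""
    def __init__(self, x_dim=64, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim, 32)
        self.mu = nn.Linear(32, z_dim)        # vector of means
        self.logvar = nn.Linear(32, z_dim)    # log-variances, for numerical stability
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization: z = mu + sigma * eps
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                   # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z|X) || N(0, I))
    return recon + kl
```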
Our background for choosing this branch of algorithms is described in Table 1. Algorithms based on reinforcement learning learn by hands-on interaction. Experience is attained by exploring the environment, but in some cases the environment may be exhausted before the agent can learn to behave optimally. To remedy this, we introduce a model-based exploration approach that uses a VAE to model the environment with sequential data as input. This increases the sampling efficiency of the RL algorithm by a significant amount. To increase the quality of the environment model, a GAN is used to strengthen it, where the VAE and GAN continually reinforce each other's performance.

TABLE 1 A comparison of features in general reinforcement learning, variational autoencoders and generative adversarial networks

Algorithm branch                | Sample efficient | Sequential data | Generative modelling | Labelled data
Reinforcement learning          | No               | Yes             | No                   | No
Variational autoencoders        | Yes              | Yes             | Yes                  | Yes
Generative adversarial networks | Yes              | No              | Yes                  | Yes
DVAE-GAN                        | Yes              | Yes             | Yes                  | Yes

Note: The purpose of this illustration is to show that the proposed algorithm uses generative modelling to inherit sample-efficient generative modelling for sequential data for use in reinforcement learning algorithms.

4 | THE DREAMING VARIATIONAL AUTOENCODER

The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for generating probable future states ŝt+n from an arbitrary state-space S using state-action pairs explored prior to st+n and at+n.

Figure 1 illustrates the DVAE model, and Algorithm 1 works as follows. First, the agent collects experiences for the experience-replay in the Run-Agent function. At this stage, the agent explores the state-space guided by a Gaussian-distributed policy. The agent acts, observes, and stores the observations in the experience-replay buffer D. After the agent reaches a terminal state, the DVAE algorithm encodes state-action pairs from the replay buffer D into probable future states. These are stored in the replay buffer for artificial future states, D̂.

FIGURE 1 Illustration of the DVAE model. The model consumes state-action pairs, yielding the input encoded in latent-space. The latent-space can then be decoded to a probable future state. Q(z|X) is the encoder, zt is the latent-space and P(X|z) is the decoder. DVAE can also use an LSTM to better learn longer sequences in continuous state-spaces
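Algorithm 1 below maintains two buffers: D for ground-truth experiences and D̂ for dreamed ones. As a concrete picture of their structure, a minimal Python sketch follows; the field names and capacity are illustrative only and not part of the algorithm itself.

```python
from collections import deque, namedtuple

# One experience tuple, as stored in both the real buffer D and the artificial buffer D-hat
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "terminal"])

N = 100_000                  # buffer capacity
D = deque(maxlen=N)          # real experiences gathered by the exploring agent
D_hat = deque(maxlen=N)      # artificial experiences dreamed by the DVAE model

def store(buffer, state, action, reward, next_state, terminal):
    buffer.append(Transition(state, action, reward, next_state, terminal))
```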
Algorithm 1 The Dreaming Variational Autoencoder
1: Initialize replay memories D and D̂ with capacity N
2: Initialize policy πθ
3: function Run-Agent(T, D)
4:   for i = 0 to N_EPOCHS do
5:     Observe starting state, s0 ∼ N(0, 1)
6:     while st not TERMINAL do
7:       at ← πθ(st = s)
8:       st+1, rt, terminalt ← T(st, at)
9:       Store experience (st, at, rt, st+1, terminalt) in replay buffer D
10:      st ← st+1
11:    end while
12:  end for
13: end function
14: Initialize encoder Q(z|X)
15: Initialize decoder P(X|z)
16: Initialize DVAE model Tθ = P(X|Q(z|X))
17: function DVAE(D)
18:   for di in D do
19:     st, at, rt, st+1 ← di        ⊳ Expand replay buffer pair
20:     Xt ← (st, at)
21:     zt ← Q(Xt)                   ⊳ Encode Xt into latent-space
22:     ŝt+1 ← P(zt)                 ⊳ Decode zt into probable future state
23:     Store experience (ŝt, at, rt, ŝt+1, terminalt) in artificial replay buffer D̂
24:     ŝt ← ŝt+1
25:   end for
26:   return D̂
27: end function

FIGURE 2 DVAE algorithm for generating states using Tθ versus the real transition function T. First, a real state is collected from the replay memory. DVAE can then produce new states from the current trajectory τ using the state-action pairs. θ represents the trainable model parameters

Figure 2 illustrates how the algorithm can generate sequences of artificial trajectories using Tθ = P(X|Q(z|X)), where Q(z|X) is the encoder and P(X|z) is the decoder. With state s0 and action Aright as input, the algorithm generates state ŝ1, which in the figure can be observed to be similar to the real state s1. With the next input, Adown, the DVAE algorithm generates the next state ŝ2, which again can be observed to be equal to s2. Note that this is without ever observing the real state s1. Hence, the DVAE algorithm needs to be initiated with a state, for example s0, after which actions follow; it then generates (dreams) the next states. The requirement is that the environment must be partially discovered so that the algorithm can learn to behave similarly to the target environment. To predict a trajectory of three timesteps, the algorithm nests the model to generate the whole sequence: τ = (ŝ1, a1, ŝ2, a2, ŝ3, a3) = Tθ(Tθ(Tθ(s0, Arnd), Arnd), Arnd). The algorithm does this well early on, but has difficulty with longer state sequences in continuous state-spaces.
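Read as code, the nesting above simply feeds the model its own prediction. A minimal sketch follows under the interface of Algorithm 1; `model.encode` and `model.decode` are placeholder names standing in for Q(z|X) and P(X|z), not an actual API of our implementation.

```python
def dream_trajectory(model, s0, actions):
    """Roll out a trajectory using only the learned transition model T_theta.

    model.encode(s, a) -> z   corresponds to Q(z|X) with X = (s, a)
    model.decode(z)    -> s'  corresponds to P(X|z)
    """
    trajectory = [s0]
    s = s0
    for a in actions:              # e.g. actions drawn from a random policy
        z = model.encode(s, a)     # encode the state-action pair into latent space
        s = model.decode(z)        # dream the probable next state
        trajectory.append(s)       # the ground-truth environment is never queried
    return trajectory
```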
4.1 | Generative adversarial network approach (DVAE-GAN)

To combat the divergence behaviour in continuous state-spaces, we extend the model with generative adversarial networks. The most common cases of divergence are found where there are (a) complex state-transitions, (b) many state-transitions and (c) sparse state-transitions. In general, we find these properties in continuous and stochastic environments. By using an adversarial approach, we see that DVAE-GAN generalizes better to such problems and is far more stable due to increased diversity compared to DVAE (see Table 2 for a detailed comparison).

Figure 3 illustrates the proposed DVAE-GAN extension to the original DVAE architecture. In this model, two new components are introduced: the generator G and the discriminator D. This is an adversarial approach previously proposed by Makhzani, Shlens, Jaitly, Goodfellow and Frey (2015). The generator G(n|St, At) samples from a Gaussian distribution, conditioned on the current state and action, to predict the latent-space distribution zgan. The discriminator D(zvae, zgan) is a neural network that tries to predict the validity of the input, in this case whether the latent-space variable comes from the ground-truth distribution. A min-max game between the generator and the discriminator drives learning, where the generator minimizes its error towards the real latent-space and the discriminator learns to distinguish between the real and fake latent distributions. In the VAE model, the latent-space distribution is sampled as zvae = μt + σt ⊙ N(0, 1), as proposed by Kingma and Welling (2013). The discriminator should then evaluate zvae as genuine, and its parameters are updated according to the confidence of the prediction. To increase the stability of the VAE, we add the loss of the discriminator to the VAE loss, where we use the original loss function first proposed by Kingma and Welling (2013).

FIGURE 3 The proposed DVAE-GAN architecture. While the original VAE architecture persists, as described in Algorithm 1, a new generative adversarial network component is added for increased generalization across the latent-space

4.2 | Stochastic weight averaging approach (DVAE-SWAGAN)

We introduced DVAE-GAN to combat divergence in complex environments. The GAN architecture increases the diversity of the latent-space, but mode collapse and instability become a problem. Table 2 gives a quick overview of the benefits of each of the introduced generative models. VAE is known for its stability but has limited capacity in its latent-space. The quality of VAE output is good but suffers from severe blurring of the decoded latent-space. Compared to other algorithms, the VAE architecture does not generalize well to a diverse data set. Generative adversarial networks have received increased attention due to their diverse latent-space. The images produced by GANs are known for their sharpness, but they suffer from artefacts in the output. These networks are state-of-the-art in the sense that they generalize well to the data set. DVAE-GAN, as proposed by Makhzani et al. (2015), has few artefacts, which is an improvement over the regular GAN architecture. There is, however, high variance in the model, which makes it unstable beyond what is seen in vanilla VAE and GAN.

TABLE 2 Feature overview of VAE, GAN, DVAE-GAN and DVAE-SWAGAN

Model       | Features                             | Image quality         | Data generalization
VAE         | Converges to local minimum; stable   | Few artefacts; blurry | Low
GAN         | Converges to saddle points; unstable | Artefacts; sharp      | Medium
DVAE-GAN    | Converges to saddle points; unstable | Few artefacts; sharp  | Medium
DVAE-SWAGAN | Converges to saddle points; stable   | Few artefacts; sharp  | High

Note: The table outlines the benefits and drawbacks of each model.

To improve the model instability, we introduce stochastic weight averaging (SWA), first proposed by Izmailov, Podoprikhin, Garipov, Vetrov and Wilson (2018). DVAE-SWAGAN is a combination of VAE, SWA and GAN that significantly reduces the variance of the predictions. The algorithm is in principle an averaged ensemble of DVAE-GAN models along the trajectory of SGD. We use a cyclical learning rate (Smith 2015) and average the weights at each training iteration, creating the DVAE-SWAGAN model, seen in Figure 4.

FIGURE 4 By using ensembles of DVAE-GAN models, the training is significantly more stable and predictions across sparse states yield better results for sequential input
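A minimal sketch of the weight-averaging step described above follows; the running average shown here is the textbook SWA update, and the training step, data loader and cyclical learning-rate schedule are placeholders rather than the exact setup used in our experiments.

```python
import torch

def update_swa(swa_model, model, n_averaged):
    """Running average of model parameters along the SGD trajectory (SWA)."""
    with torch.no_grad():
        for p_swa, p in zip(swa_model.parameters(), model.parameters()):
            # p_swa <- (n * p_swa + p) / (n + 1)
            p_swa.mul_(n_averaged / (n_averaged + 1.0)).add_(p / (n_averaged + 1.0))
    return n_averaged + 1

# Usage inside a training loop (sketch):
# swa_model = copy.deepcopy(dvae_gan)            # starts as a copy of the current weights
# n = 1
# for batch in loader:
#     train_step(dvae_gan, batch)                # ordinary DVAE-GAN update with cyclical LR
#     n = update_swa(swa_model, dvae_gan, n)     # average weights at each iteration
```

Recent PyTorch versions also ship a built-in `torch.optim.swa_utils.AveragedModel` that performs the same running average, which may be preferable in practice.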
5 | ENVIRONMENTS

The DVAE algorithm was tested in two environments. The first environment is deep line wars, introduced by Andersen et al. (2017), a simplified real-time strategy game. We also present deep maze, a flexible environment with a wide range of challenges suited for reinforcement learning research. The deep line wars environment features a continuous environment where complex strategies must be planned. The deep maze environment provides simpler rules and a non-continuous action- and state-space that is, in comparison, far simpler than deep line wars.

5.1 | The deep maze environment

The deep maze is a flexible learning environment for controlled research in exploration, planning and memory for reinforcement learning algorithms. Maze solving is a well-known problem and is used heavily throughout the RL literature (Sutton et al. 1999), but is often limited to small and fully observable scenarios. The deep maze environment extends the maze problem to over 540 unique scenarios, including Partially Observable Markov Decision Processes (POMDP). Figure 5 illustrates a small subset of the available environments in deep maze, ranging from small-scale MDPs to large-scale POMDPs. The deep maze further features custom game mechanics such as relocated exits and dynamically changing mazes.

FIGURE 5 Overview of four distinct MDP scenarios using deep maze. (a) A small, fully observable MDP. (b) A large, fully observable MDP. (c) Partially observable MDP with a vision distance of three tiles. (d) Partially observable MDP with ray-traced vision

RL agents depend on sensory input to evaluate and predict the best action at the current timestep. Preprocessing of data is essential so that agents can extract features from the input data. For this reason, deep maze has built-in state representations for images and raw state matrices. The game engine is modularized and has an SDK that enables the development of third-party scenarios. This extends the capabilities of deep maze to support nearly all possible scenario combinations in the realm of maze solving.¹

The deep maze learning environment presents the following scenarios: (a) Normal, (b) POMDP, (c) Limited POMDP and (d) Timed Limited POMDP. The first mode exposes a seed-based, randomly generated maze where the state-space is an MDP. The second mode narrows the state-space observation to a configurable area around the player; a sketch of this masking is given below. In addition to radius-based vision, the POMDP mode also features ray-tracing vision that better mimics the sight of a physical agent. The third and fourth modes are intended for memory research, where the agent must find the goal in a limited number of time-steps. In addition, the agent is presented with the solution, but it fades after a few initial time steps. The objective is for the agent to remember the solution in order to find the goal. All scenario setups have a variable map-size ranging between 2 × 2 and 56 × 56 tiles.
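To illustrate the radius-limited POMDP mode, the following numpy sketch masks everything outside a square vision window around the agent. The tile encoding, fill value and window shape are illustrative assumptions and do not reproduce the actual deep maze SDK.

```python
import numpy as np

def partial_observation(state, agent_pos, radius=3, fill=-1):
    """Return a copy of the maze state where tiles outside the agent's
    vision radius are hidden (set to `fill`)."""
    h, w = state.shape
    y, x = agent_pos
    obs = np.full_like(state, fill)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    obs[y0:y1, x0:x1] = state[y0:y1, x0:x1]
    return obs

# Example: an 8 x 8 maze state with the agent at (4, 4) and a vision distance of three tiles
maze = np.random.randint(0, 2, size=(8, 8))
print(partial_observation(maze, (4, 4), radius=3))
```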
5.2 | The deep line wars environment

The deep line wars environment was originally introduced in Andersen et al. (2017). Deep line wars is a real-time strategy environment that makes an extensive state-space reduction to enable swift research in reinforcement learning for RTS games. The game objective of deep line wars is to invade the enemy player with mercenary units until all health points are depleted; see Figure 6. For every friendly unit that enters the far edge of the enemy base, the enemy health pool is reduced by one. When a player purchases a mercenary unit, it spawns at a random location inside the edge area of the buyer's base. Mercenary units automatically move towards the enemy base. To protect the base, players can construct towers that shoot projectiles at the opponent's mercenaries. When a mercenary dies, a fair percentage of its gold value is awarded to the opponent. When a player sends a unit, the player's income is increased by a portion of the unit's gold value. As part of the income system, players gain gold at fixed intervals.

FIGURE 6 The graphical user interface of the deep line wars environment

6 | EXPERIMENTS

We centre our experiments around DVAE, DVAE-GAN, DVAE-SWA and DVAE-SWAGAN. The goal of the proposed extensions is to improve model stability so that the DVAE algorithm can produce better-quality output for continuous and sparse state-spaces. In this section, we show results of model-based reinforcement learning using DVAE in the deep maze and deep line wars environments. We show the performance of encoding raw pixel input to a compact representation and decoding this representation to probable future states.

6.1 | The dreaming variational autoencoder

6.1.1 | Deep maze

For the deep maze environment, the algorithm must be able to generalize over many similar states to model a large state-space. DVAE aims to learn the transition function T(st, at), bringing the state from st to st+1. We use the deep maze environment because it provides simple rules with a controllable state-space complexity. Also, we can omit the importance of reward for some scenarios.

We trained the DVAE model on two No-Wall deep maze scenarios of size 2 × 2 and 8 × 8. For the encoder and decoder, we used the same convolutional architecture as proposed by Pu et al. (2016) and trained for 5000 epochs² in the 8 × 8 scenario and 1000 epochs in the 2 × 2 scenario, respectively. The reasoning behind the different epoch lengths is that we expect simpler environments to converge faster. For the encoding of actions and states, we concatenate the flattened state-space and action-space and apply a fully connected layer with ELU activation before calculating the latent-space. We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 3e-05 to update the parameters. To calculate the loss, we used the same loss function as proposed by Kingma and Welling (2013).
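A minimal sketch of this state-action input encoding follows, assuming a flattened image state and a one-hot action. The dimensions, layer width and log-variance head are illustrative assumptions; the convolutional architecture of Pu et al. (2016) used in our experiments is not reproduced here.

```python
import torch
import torch.nn as nn

class StateActionEncoder(nn.Module):
    """Concatenate flattened state and one-hot action, then map to (mu, log sigma^2)."""
    def __init__(self, state_dim, n_actions, z_dim=32, hidden=256):
        super().__init__()
        self.fc = nn.Linear(state_dim + n_actions, hidden)
        self.elu = nn.ELU()
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, state, action_onehot):
        x = torch.cat([state.flatten(1), action_onehot], dim=1)  # X = (s_t, a_t)
        h = self.elu(self.fc(x))
        return self.mu(h), self.logvar(h)

# e.g. an 8 x 8 single-channel maze with four actions:
enc = StateActionEncoder(state_dim=8 * 8, n_actions=4)
mu, logvar = enc(torch.randn(1, 1, 8, 8), torch.eye(4)[[0]])
```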
Figure 7 illustrates the loss during the training phase for the DVAE algorithm in the No-Wall deep maze scenarios.

FIGURE 7 The training loss for DVAE in the 2 × 2 No-Wall and 8 × 8 deep maze scenarios. The experiment is run for a total of 1000 (5000 for 8 × 8) epochs. The algorithm only trains on 50% of the state-space for the 2 × 2 environment, while the whole state-space is trainable in the 8 × 8 environment

In the 2 × 2 scenario, DVAE is trained on only 50% of the state-space, which results in noticeable graphical artefacts in the prediction of future states, seen in Figure 8. In the 8 × 8 environment, the algorithm is allowed to train on all possible states, and we observe in Figure 9 that the image quality is significantly better throughout the sampling process.

FIGURE 8 For the 2 × 2 scenario, only 50% of the environment is explored, leaving artefacts on states where the model is uncertain of the transition function. In more extensive examples, the player disappears, teleports or gets stuck in unexplored areas

FIGURE 9 Results of 8 × 8 deep maze modelling using the DVAE algorithm. To simplify the environment, no reward signal is received per iteration. The left caption describes the current state, st, while the right caption is the action performed to compute st+1 = T(st, at)

6.2 | Training RL-agents using the D̂ replay-buffer

The goal of this experiment is to observe the performance of RL agents using the generated experience-replay D̂ from Algorithm 1 in deep maze environments of size 11 × 11 and 21 × 21. In Table 3, we compare the performance of DQN (Mnih et al. 2013), TRPO (Schulman, Levine, Abbeel, Jordan and Moritz 2015) and PPO (Schulman, Wolski, Dhariwal, Radford and Klimov 2017) using the DVAE-generated D̂ to tune the parameters. Because TRPO and PPO are on-policy algorithms, the states must be generated on-the-fly so that the algorithms remain on-policy.

TABLE 3 Results in the deep maze 11 × 11 and 21 × 21 environments, comparing DQN (Mnih et al. 2015), TRPO (Schulman et al. 2015) and PPO (Schulman et al. 2017)

Algorithm | 11 × 11 average performance (%) | Converged epoch | 21 × 21 average performance (%) | Converged epoch
DQN-D̂    | 94.56 | 9314 | 64.36 | N/A
TRPO-D̂   | 96.32 | 5320 | 78.91 | 7401
PPO-D̂    | 98.71 | 3151 | 89.33 | 7195
DQN-D     | 98.26 | 4314 | 84.63 | 8241
TRPO-D    | 99.32 | 3320 | 92.11 | 4120
PPO-D     | 99.35 | 2453 | 96.41 | 2904

Note: The optimal path yields a performance of 100%, while no solution yields 0%. Each algorithm ran 10000 epochs for both map sizes. Converged epoch represents the epoch at which the algorithm converged during training.

Figure 10 illustrates three maze variations of size 11 × 11, where the agent has learned the optimal path. We see that the best-performing algorithm, PPO (Schulman et al. 2017), beats DQN and TRPO using either D̂ or D. The DQN-D̂ agent did not converge in the 21 × 21 environment; it is likely that value-based algorithms struggle to map inaccurate states with the graphical artefacts generated by the DVAE algorithm. These artefacts significantly increase the state-space, but the empirical data suggest that on-policy algorithms perform better on noisy state-spaces.

FIGURE 10 A typical deep maze of size 11 × 11. The lower-right square indicates the goal state, the dotted line is a retrace of the predicted optimal path for that maze, and the final square represents the player's current position in the state-space. The controller agents are DQN, TRPO and PPO (from left to right)

6.3 | Deep line wars

The DVAE algorithm works well in more complex environments, such as the deep line wars game environment (Andersen et al. 2017). Here, we expand the DVAE algorithm with an LSTM to improve its capability of generating time-bound data, such as the animations seen in Figure 1; a sketch of this extension follows after this section.

Figure 11 illustrates the state quality during training of DVAE over a total of 6000 epochs. Both players draw actions from a Gaussian-distributed policy. The algorithm understands that the player units can be located in any tile after only 50 epochs, and at 1000 epochs we observe that the algorithm makes significantly better predictions of the probability of unit locations (i.e., some units show more densely in the output state). At the end of the training, the DVAE algorithm is to some degree capable of determining both tower and unit locations at any given time-step during the game epoch.

FIGURE 11 The DVAE algorithm applied to the deep line wars environment. Each epoch illustrates the quality of generated states in the game, where the left image is the real state s and the right image is the generated state ŝ
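A minimal sketch of the recurrent extension used in this section follows: an LSTM over sequences of latent codes and actions. The layer sizes, and the choice of predicting the next latent code directly, are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class LatentLSTM(nn.Module):
    """Predict the next latent code from a sequence of previous (z, action) pairs."""
    def __init__(self, z_dim=32, n_actions=4, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(z_dim + n_actions, hidden, batch_first=True)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, z_seq, a_seq):
        # z_seq: (batch, time, z_dim), a_seq: (batch, time, n_actions)
        out, _ = self.lstm(torch.cat([z_seq, a_seq], dim=-1))
        return self.head(out)          # predicted latent code for each next step

# Usage sketch: the predicted latents are decoded with the DVAE decoder to obtain future frames
model = LatentLSTM()
z_next = model(torch.randn(2, 10, 32), torch.randn(2, 10, 4))
```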
6.4 | Extending the dreaming variational autoencoder

The goal of DVAE-SWA, DVAE-GAN and DVAE-SWAGAN is to perform better in continuous state-spaces, such as the deep line wars environment. The experiments were performed using a map size of 11 × 11, sampling actions from a PPO policy.

Figure 12 shows the training loss of the algorithms DVAE, DVAE-SWA, DVAE-GAN and DVAE-SWAGAN for 1000 epochs (x-axis), where the y-axis describes the loss value. The new architectures perform significantly better than DVAE across all loss components of the architecture. The consequence of lower loss is better image quality (autoencoder loss) and better transitions (variational loss).

FIGURE 12 Training of DVAE compared to the DVAE-SWA, DVAE-GAN and DVAE-SWAGAN extensions. (a) Total loss for the tested algorithms. (b) Autoencoder loss (L2) for the algorithms. (c) Variational autoencoder loss (KL) for the tested algorithms

For these experiments, we tried to model the deep line wars environment using RGB input. Figure 13 illustrates the resulting images for each of the algorithms. Here, we see that DVAE-SWAGAN performs best, in terms of both quality and accuracy.

FIGURE 13 The first row represents the ground-truth future state, while the second row is the predicted future state. The DVAE-SWAGAN algorithm sampled states after training for 1000 epochs, see Figure 12. Notice that the quality is notably better compared to Figure 11. We found that states were generalized too much, producing some inaccurate predictions. Despite this inaccuracy, the RL agents learned a policy capable of beating random agents. We believe this is because similar states often represent similar value functions

7 | CONCLUSION AND FUTURE STUDY

This article introduces The Dreaming Variational Autoencoder, along with its extensions DVAE-SWA, DVAE-GAN and DVAE-SWAGAN, as a neural network-based generative modelling architecture that enables exploration in environments with sparse rewards. The DVAE algorithm successfully generates authentic world models in non-continuous state-spaces where the dynamics of the environment are simple. It works well for small environments but is limited when the state sequences become too long. The algorithm performs marginally better when using an LSTM for sequence prediction, but there is a significant performance drop due to the increased model complexity. For most environments, such as deep line wars and deep maze, it is sufficient to run DVAE using only fully connected layers. DVAE-SWAGAN improves the original model significantly and enables the algorithm to better imitate environment models with continuous state-spaces. DVAE-SWAGAN performs better in all environments, including deep line wars and most environments found in the GYM reinforcement learning suite.

There are, however, several fundamental issues that limit DVAE and DVAE-SWAGAN from fully modelling environments. In some situations, exploration may be a costly act that makes it impossible to explore all parts of the environment in its entirety. The algorithms cannot accurately predict the outcome of unexplored areas of the state-space, making the predictions blurry or incorrect. To combat this, the model should be improved further to include some sense of logic and understanding of the environment dynamics. In the current state of the art, this is frequently introduced as domain knowledge that is manually crafted by the programmer, but the hope is that future research will find a method for self-supervised domain knowledge modelling.

Reinforcement learning has many unresolved problems, and the hope is that the deep maze and deep line wars learning environments can be useful tools for future research. For future study, we plan to introduce an inverse reinforcement learning component to learn the reward function ℛ̂. We also plan to explore non-parametric variants.
DVAE and environment modelling are an ongoing research question, and the goal is that reinforcement learning algorithms can utilize this form of dreaming to become far more sample efficient.

ORCID
Per-Arne Andersen https://orcid.org/0000-0002-7742-4907
Morten Goodwin https://orcid.org/0000-0001-6331-702X
Ole-Christoffer Granmo https://orcid.org/0000-0002-7287-030X

ENDNOTES
1 The deep maze is open-source and publicly available at https://github.com/CAIR/deep-maze.
2 We regard the term epoch as an episode, as is frequent in the literature.

REFERENCES
Andersen, P.-A., Goodwin, M., & Granmo, O.-C. (2017). Towards a deep reinforcement learning approach for tower line wars. In M. Bramer & M. Petridis (Eds.), Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) (Vol. 10630 LNAI, pp. 101–114). NY, USA: Springer, Cham. https://doi.org/10.1007/978-3-319-71078-5_8
Andersen, P.-A., Goodwin, M., & Granmo, O.-C. (2018, Dec). The dreaming variational autoencoder for reinforcement learning environments. In M. Bramer & M. Petridis (Eds.), Artificial intelligence XXXV (Vol. 11311, pp. 143–155). Springer, Cham. Retrieved from http://link.springer.com/10.1007/978-3-030-04191-5_11 doi: https://doi.org/10.1007/978-3-030-04191-5_11
Arulkumaran, K., Cully, A., & Togelius, J. (2019). AlphaStar: An evolutionary computation perspective (Tech. Rep.). Retrieved from https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/
Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38. Retrieved from https://ieeexplore.ieee.org/document/8103164/ doi: https://doi.org/10.1109/MSP.2017.2743240
Azar, M. G., Piot, B., Pires, B. A., Grill, J.-B., Altché, F., & Munos, R. (2019, feb). World discovery models. arXiv preprint arXiv:1902.07685. Retrieved from http://arxiv.org/abs/1902.07685
Bangaru, S. P., Suhas, J., & Ravindran, B. (2016, nov). Exploration for multi-task reinforcement learning with deep generative models. arXiv preprint arXiv:1611.09894. Retrieved from http://arxiv.org/abs/1611.09894
Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J. Z., … Hassabis, D. (2016, jun). Model-free episodic control. arXiv preprint arXiv:1606.04460. Retrieved from http://arxiv.org/abs/1606.04460
Buesing, L., Weber, T., Racaniere, S., Eslami, S. M. A., Rezende, D., Reichert, D. P., … Wierstra, D. (2018, feb). Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006.
Retrieved from http://arxiv.org/abs/1802.03006
Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018, may). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems 31, 4759–4770. Retrieved from http://arxiv.org/abs/1805.12114
Ha, D., & Schmidhuber, J. (2018, sep). Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems 31, 2455–2467. Retrieved from http://arxiv.org/abs/1809.01999
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2018, nov). Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning. Retrieved from http://arxiv.org/abs/1811.04551
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., … Lerchner, A. (2016, nov). beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=Sy2fzU9gl
Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., … Lerchner, A. (2017). DARLA: Improving zero-shot transfer in reinforcement learning. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 1480–1490). International Convention Centre, Sydney, Australia: PMLR. Retrieved from http://proceedings.mlr.press/v70/higgins17a.html
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018, mar). Averaging weights leads to wider optima and better generalization. Retrieved from http://arxiv.org/abs/1803.05407
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996, apr). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285. Retrieved from http://arxiv.org/abs/cs/9605103 doi: 10.1.1.68.466
Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In Proceedings, International Conference on Learning Representations 2015. doi: https://doi.org/10.1145/1830483.1830503
Kingma, D. P., & Welling, M. (2013, dec). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Retrieved from http://arxiv.org/abs/1312.6114 doi: https://doi.org/10.1051/0004-6361/201527329
Li, Y. (2017, jan). Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274. Retrieved from http://arxiv.org/abs/1701.07274
Liang, X., Wang, Q., Feng, Y., Liu, Z., & Huang, J. (2018, dec). VMAV-C: A deep attention-based reinforcement learning algorithm for model-based control. arXiv preprint arXiv:1812.09968. Retrieved from http://arxiv.org/abs/1812.09968
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015, nov). Adversarial autoencoders. Retrieved from http://arxiv.org/abs/1511.05644
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd International Conference on Machine Learning (Vol. 48, pp. 1928–1937). New York, USA: PMLR. Retrieved from http://proceedings.mlr.press/v48/mniha16.html
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013, dec). Playing Atari with deep reinforcement learning. Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1312.5602
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M.
G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015, feb). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. Retrieved from https://doi.org/10.1038/nature14236
Mousavi, S. S., Schukat, M., & Howley, E. (2018). Deep reinforcement learning: An overview. In Y. Bi, S. Kapoor, & R. Bhatia (Eds.), Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016 (pp. 426–440). Cham: Springer International Publishing. Retrieved from http://link.springer.com/10.1007/978-3-319-56991-8_32
Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., & Carin, L. (2016). Variational autoencoder for deep learning of images, labels and captions. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 2352–2360). NY, USA: Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/6528-variational-autoencoder-for-deep-learning-of-images-labels-and-captions.pdf
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (Vol. 37, pp. 1889–1897). Lille, France: PMLR. Retrieved from http://proceedings.mlr.press/v37/schulman15.html
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017, jul). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Retrieved from http://arxiv.org/abs/1707.06347
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354–359. https://doi.org/10.1038/nature24270
Smith, L. N. (2015, jun). Cyclical learning rates for training neural networks. Retrieved from http://arxiv.org/abs/1506.01186
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (Vol. 112; Tech. Rep.).
Xiao, T., & Kesineni, G. (2016). Generative adversarial networks for model based reinforcement learning with tree search (Tech. Rep.). Berkeley: University of California. Retrieved from http://tedxiao.me/pdf/gans_drl.pdf
Zhang, C., Patras, P., & Haddadi, H. (2018, mar). Deep learning in mobile and wireless networking: A survey. IEEE Communications Surveys and Tutorials, 21, 2224–2287. Retrieved from http://arxiv.org/abs/1803.04311

AUTHOR BIOGRAPHIES

Per-Arne Andersen is a Ph.D. candidate in Artificial Intelligence at the University of Agder, Norway, where he also completed his B.Sc. and M.Sc. in 2015 and 2017, respectively. He has expertise in several domains of computer science, including security, software development and machine learning. His primary field of research is reinforcement learning in environments with incomplete information, including real-time games and industry-near systems. Towards the future, Per-Arne hopes to have an active role in the scientific community for reinforcement learning and to contribute towards the understanding of artificial general intelligence.
Morten Goodwin received the B.Sc. and M.Sc. degrees from the University of Agder, Norway, in 2003 and 2005, respectively, and the Ph.D. degree from Aalborg University, Department of Computer Science, Denmark, in 2011, on applying machine learning algorithms to eGovernment indicators that are difficult to measure automatically. He is an Associate Professor with the Department of ICT, University of Agder, deputy director of the Centre for Artificial Intelligence Research, coordinator of the International Master's Programme in Artificial Intelligence, a public speaker and an active researcher. His main interests include swarm intelligence, deep learning and adaptive learning in the fields of medicine, games and chatbots. He has more than 70 peer-reviewed scientific publications and has supervised more than 110 student projects, including Master's and Ph.D. theses.

Prof. Ole-Christoffer Granmo is director and founder of the Centre for Artificial Intelligence Research (CAIR) at the University of Agder, Norway. He obtained his master's degree in 1999 and the Ph.D. degree in 2004, both from the University of Oslo, Norway. Granmo develops theory and algorithms for systems that explore, experiment and learn in complex real-world environments. His research interests include artificial intelligence, machine learning, learning automata, bandit algorithms, deep reinforcement learning, Bayesian reasoning and computational linguistics. Within these areas of research, Dr. Granmo has written more than 125 refereed journal and conference publications. He is co-founder of the Norwegian Artificial Intelligence Consortium (NORA). Apart from his academic endeavours, Granmo is also co-founder of the company Anzyz Technologies AS.

How to cite this article: Andersen P-A, Goodwin M, Granmo O-C. Increasing sample efficiency in deep reinforcement learning using generative environment modelling. Expert Systems. 2020;e12537. https://doi.org/10.1111/exsy.12537