title: Off-Policy Recommendation System Without Exploration
authors: Wang, Chengwei; Zhou, Tengfei; Chen, Chen; Hu, Tianlei; Chen, Gang
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47426-3_2

A Recommendation System (RS) can be treated as an intelligent agent that aims to generate a policy maximizing customers' long-term satisfaction. Off-policy reinforcement learning methods based on Q-learning and actor-critic methods are commonly used to train an RS. Although these methods can leverage a previously collected dataset for sample-efficient training, they are sensitive to the distribution of the off-policy data and make limited progress unless more on-policy data are collected. However, allowing a badly trained RS to interact with customers can result in unpredictable loss. It is therefore highly desirable that an off-policy method can stably train an RS when the off-policy data is fixed and there is no further interaction with the environment. To fulfill these requirements, we devise a novel method named Generator Constrained Q-learning (GCQ). GCQ additionally trains an action generator via supervised learning. The generator is used to mimic the data distribution and stabilize the performance of the recommendation policy. Empirical studies show that the proposed method outperforms state-of-the-art techniques in both offline and simulated online environments.

The Recommender System (RS) is one of the most important applications of artificial intelligence [15, 20]. An intelligent RS can significantly reduce users' searching time, greatly enhance their shopping experience, and bring considerable profits to vendors. From the Reinforcement Learning (RL) perspective, an RS is an autonomous agent that learns the optimal recommendation behavior over time to maximize each user's long-term satisfaction through interacting with its environment. This offers us the opportunity to solve the recommendation task on top of recent RL advances. Considering that a previously collected dataset of customers' feedback is often available for recommendation tasks, many researchers adopt off-policy RL methods to extract patterns from the data [4, 21, 23].

Off-policy RL algorithms are often expected to fully exploit off-policy datasets. Nevertheless, these methods can break down when the datasets are not collected by the learning agents themselves. Theoretically, [2] points out that Bellman updates can diverge with off-policy data, and the divergence issue invalidates the performance of DQN agents. [12, 16] find that in off-policy learning, the fixed point of the Bellman update may have poor quality even when the update converges. Empirically, [9] shows that off-policy agents perform dramatically worse than the behavioral agent when trained by the same numerical algorithm on the same dataset. Moreover, many researchers observe that these methods can still fail to learn the optimal strategy even when the training data are deliberately selected by effective experts. All these observations suggest that off-policy methods are unstable on static datasets.

The instability of off-policy methods is highly undesirable when training an RS. One would hope that the RS has learned a sound policy before being deployed into a production environment. If its performance turns out to be unpredictable, deploying the RS would be risky.
To stabilize off-policy methods, one can compensate for the performance of the RS with online feedback: allow the off-policy agent to interact with customers and use the customers' feedback to stabilize its performance. In practice, however, collecting user feedback is time-consuming, and deploying an unstable RS to interact with customers would greatly reduce their satisfaction. As a result, designing a stable off-policy RL method for RS that achieves reasonable performance on any static training set, without further exploration, is a fundamental problem.

As indicated in [9, 14], the instability of off-policy methods results from exploration error, which is a fundamental problem of off-policy reinforcement learning. Exploration error typically manifests as the value function being erroneously estimated on unseen state-action pairs. The exploration error can be unboundedly large, even if the value function can be perfectly approximated [9]. Moreover, it can accumulate over the training iterations [14]. It may misguide the training agent and make it take over-optimistic or over-pessimistic decisions. As a result, the training process becomes unstable and potentially divergent unless new data is collected to remedy those errors.

In this paper, we propose a novel off-policy RL method for RS that diminishes the exploration error. Our method can learn a recommendation policy from large static datasets without further interacting with the environment. Exploration error results from a mismatch between the distribution of data induced by the recommendation policy and the distribution of customers' feedback contained in the training data [9]. The proposed Generator Constrained deep Q-learning (GCQ) utilizes a neural generator to simulate customers' possible feedback. This generative model is combined with a Q-network, which selects the highest-valued action to form the recommendation policy. Furthermore, to reduce the decision time, we design the generator's architecture based on a Huffman tree. We show that with the generator pruning unlikely actions, the decision complexity can be reduced to O(log |A|), where |A| is the number of actions, namely the number of items.

A typical recommendation process can be formulated as a Markov Decision Process (MDP) (S, A, r, P, γ), defined as follows.
- State space S: The state $s_t^u = \{u, i_1, \dots, i_{c_t}\}$ contains the active user u and his/her chronologically clicked items.
- Action space A: The action space is the item set.
- Reward $r(s^u, a^u)$: The reward is the immediate gain of the RS after taking action $a^u$.
- Transition probability $P(s_{t+1}^u \mid s_t^u, a_t^u)$: The next state $s_{t+1}^u$ is obtained from $s_t^u$ by appending the item the user clicks to his/her click sequence.
- Discount rate γ: γ ∈ [0, 1] is a hyperparameter that trades off the immediate reward against long-term benefits.

The off-policy recommendation problem can be formulated as follows. Let $B = \{(s_t^u, a_t^u, s_{t+1}^u, r_t^u)\}$ be a dataset collected by an unknown behavior policy. Construct a recommendation policy π : S → A such that the accumulated reward is maximized. For notational simplicity, we may omit the superscript of $s^u$, $r^u$, $a^u$ in the following sections.

Q-learning learns the state-action Q-function Q(s, a), which is the optimal expected cumulative reward when the RS starts in state s and takes action a. The optimal policy π can be recovered from the Q-function by choosing the maximizing action, that is, $\pi(s) = \arg\max_{a \in A} Q(s, a)$. The Q-function is a fixed point of the following Bellman iteration:
$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a \in A} Q(s_{t+1}, a),$$
with $(s_t, a_t, s_{t+1}, r_t)$ sampled from B.
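To make the update concrete, the following is a minimal Python sketch of one Bellman update on a tuple sampled from the replay buffer, assuming a small tabular Q stored as a NumPy array; the indices, learning rate, and helper name are illustrative rather than the paper's implementation (the paper works with a neural Q-net).

```python
import numpy as np

# Minimal sketch of the Bellman update above on one replay tuple (s, a, s_next, r).
# Q is a |S| x |A| array; states and actions are integer indices (illustrative only).
def q_learning_update(Q, s, a, s_next, r, gamma=0.9, lr=0.1):
    target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += lr * (target - Q[s, a])       # move Q(s_t, a_t) toward the target
    return Q

# Example usage on a toy buffer of (state, action, next_state, reward) tuples.
Q = np.zeros((10, 5))
buffer = [(0, 1, 2, 1.0), (2, 3, 4, 0.0)]
for s, a, s_next, r in buffer:
    Q = q_learning_update(Q, s, a, s_next, r)
```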
The update above is known as Q-learning in the reinforcement learning literature. According to [9, 14], Q-learning may assign unrealistic values to unobserved state-action pairs, which results in a large exploration error and makes the performance of an RS unstable. To cope with the exploration error, [9] proposes the Batch Constrained Q-learning (BCQ) method. BCQ avoids exploration error by explicitly constraining the agent's candidate actions to those observed in the training set. Specifically, BCQ estimates the Q-function by the following batch-constrained Bellman update:
$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a:\,(s_{t+1}, a) \in B} Q(s_{t+1}, a),$$
where "$(s_{t+1}, a) \in B$" means that there exist a state $s'$ and a reward $r'$ such that $(s_{t+1}, a, s', r') \in B$. Due to the sparsity of recommendation datasets, for most observed states s there exists at most one action a such that (s, a) ∈ B. Thus, for most state-action pairs, the BCQ update simplifies to the iteration
$$Q(s_t, a_t) \leftarrow r_t + \gamma\, Q(s_{t+1}, a_{t+1}).$$
This iteration implicitly assumes that the observed action $a_{t+1}$ is optimal for state $s_{t+1}$, which is unrealistic because users' feedback is noisy.

To prevent BCQ from overfitting to noisy data, we propose a new off-policy RL algorithm named Generator Constrained Q-learning (GCQ). GCQ utilizes a neural generator to recover the distribution of the observed dataset. The Q-function is then updated on a candidate set sampled from the generator. Specifically, the main iteration of GCQ can be formulated as
$$Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a \in A_{t+1}} Q(s_{t+1}, a), \qquad A_{t+1} = \{a_i \sim g_\theta(\cdot \mid s_{t+1})\}_{i=1}^{c},$$
where $(s_t, a_t, s_{t+1}, r_t)$ is a randomly sampled tuple from B and $g_\theta(\cdot \mid s)$ is a neural generator that gives the conditional probability of actions. The size c of the candidate set is a hyperparameter of GCQ. When c is fixed to n, the number of items, GCQ reduces to the standard Q-learning method.

Since the state space of an RS is large, it is impossible to compute the Q-function for every state-action pair. To handle this difficulty, we approximate the unknown Q-function by a deep neural network $Q_\theta(s, a)$, a.k.a. a deep Q-net, where θ denotes its parameters. Both the Q-net and the generator need an encoder to extract features from a state $s = \{u, i_1, \dots, i_T\}$. According to [3], a shared encoder generalizes better than multiple task-specific encoders. Therefore, we use the same encoder for the Q-net $Q_\theta(s, \cdot)$ and the generator $g_\theta(\cdot \mid s)$. The structure of the encoder is depicted in Fig. 1(a).

Embedding Layer. The embedding layer maps a user or an item to the corresponding semantic vector. Formally, let $U \in \mathbb{R}^{m \times d}$ and $V \in \mathbb{R}^{n \times d}$ be the embedding matrices of users and items, respectively. The embedding vectors of user u and item i are
$$p_u = U[u], \qquad q_i = V[i],$$
where X[k] denotes the k-th row of matrix X.

Residual Recurrent Layer. This layer transforms the sequence $s^u = \{u, i_1, \dots, i_T\}$ into hidden states. In the field of sequence modeling, GRU [6] and LSTM [10] are arguably the most powerful tools. However, both recurrent structures suffer from gradient vanishing/exploding issues on long sequences. Inspired by the fact that residual networks have stable gradients [19], we propose a variant of the GRU cell with a residual structure. Concretely, the state s is mapped to hidden states $\{h_t\}_{t=0}^{T}$ by a GRU recurrence over the item embeddings $q_{i_t}$, in which the initial hidden state $h_0$ is derived from the user embedding $p_u$, each step augments the GRU update with a residual (skip) connection, and an alignment matrix W matches the dimensions of the skipped signal; one possible realization is sketched below.
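Since the exact residual recurrence is not reproduced above, the following PyTorch sketch shows one plausible realization under stated assumptions: the initial hidden state is taken from the user embedding, and the alignment matrix is applied to the item embedding to form the skip connection; neither placement is confirmed by the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a GRU cell with a residual (skip) connection, one plausible
# reading of the residual recurrent layer described above.
class ResidualGRUEncoder(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.cell = nn.GRUCell(input_size=d, hidden_size=d)
        self.align = nn.Linear(d, d, bias=False)   # alignment matrix W (assumed placement)

    def forward(self, p_u, item_embs):
        # p_u: (batch, d) user embedding; item_embs: (batch, T, d) clicked-item embeddings
        h = p_u                                     # h_0 initialized from the user embedding
        hidden_states = [h]
        for t in range(item_embs.size(1)):
            q_t = item_embs[:, t, :]
            h = self.cell(q_t, h) + self.align(q_t)   # GRU update plus residual skip connection
            hidden_states.append(h)
        return torch.stack(hidden_states, dim=1)    # (batch, T+1, d) hidden states {h_t}
```

Any GRU variant that adds a dimension-matched skip connection at each step would serve the same purpose of keeping gradients stable on long sequences.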
Fast Attention Layer. This layer utilizes an attention mechanism to aggregate the hidden states $\{h_t\}_{t=0}^{T}$ into a feature vector e. For efficiency, we adopt a faster linear attention mechanism instead of the common tanh-based ones [7]. The linear attention has two stages. In stage one, a signal matrix $C_t$ is computed via a gated recurrence over the hidden states, where $\alpha_t = \sigma(W_\alpha h_t)$ acts as a forget gate with parameter $W_\alpha$. In stage two, the encoded feature is output as $e = C_T h_T$. The output vector e is the encoded feature vector of s.

Considering that actions with high cumulative rewards should have close correlations with the current state, we model the Q-function by the inner product of the two objects' feature vectors, that is,
$$Q_\theta(s, a) = e^{\top} q_a,$$
where $q_a = V[a]$ is the embedding vector of action a.

Since a Huffman tree uses shorter codes for more frequent items, it results in a faster sampling process and is widely used in NLP tasks [17, 18]. To reduce training time, we build a novel neural structure based on the Huffman tree; the proposed structure is depicted in Fig. 1(b). The Huffman tree is built according to the popularity of items, defined as $f_i = \#\mathrm{occur}_i / \sum_{j} \#\mathrm{occur}_j$, where $\#\mathrm{occur}_i$ is the number of occurrences of item i. We assign a Huffman code to each node of the tree by the following rules: (a) encode the root node by $b_0 = 0$; (b) for a node with code $b_0 b_1 \dots b_j$, encode its left child by $b_0 b_1 \dots b_j 0$ and its right child by $b_0 b_1 \dots b_j 1$. Let $z_{b_{0:k}} \in \mathbb{R}^d$ be the embedding vector of the tree node with code $b_{0:k}$. For an item a with code $b_{0:j}$, its generating probability is computed as
$$g_\theta(a \mid s) = \prod_{k=1}^{j} \sigma\bigl(z_{b_{0:k-1}}^{\top} e\bigr)^{1 - b_k}\,\bigl(1 - \sigma\bigl(z_{b_{0:k-1}}^{\top} e\bigr)\bigr)^{b_k},$$
i.e., at each internal node on the path from the root to item a, the model branches left or right with a probability given by a sigmoid of the inner product between the node embedding and the encoded state feature e, following the standard hierarchical softmax factorization [17, 18]. The recommendation policy can then be executed in O(d log |A|) flops.

Loss Function of the Generator. We use the negative log-likelihood of the generator as its training loss,
$$\mathrm{nll}(\theta) = -\,\mathbb{E}_{(s, a) \sim B}\bigl[\log g_\theta(a \mid s)\bigr].$$

Loss Function of the Q-Net. According to the framework of fitted Q-iteration [1], the loss function of the Q-net is the mean squared error between the Q-net and its Bellman update, namely
$$\mathrm{qloss}(\theta) = \mathbb{E}_{(s, a, s', r) \sim B}\Bigl[\bigl(Q_\theta(s, a) - r - \gamma \max_{a' \in A} Q_\theta(s', a')\bigr)^2\Bigr],$$
where $A = \{a_i \sim g_\theta(\cdot \mid s')\}_{i=1}^{c}$ is the candidate set and $(s, a, s', r) \in B$.

Algorithm 1 (GCQ training). Input: replay buffer B, size of the candidate set c, regularizer λ, number of iterations K, discount rate γ, learning rate η.

Joint Inference. Since the Q-net and the generator share the same encoder, we jointly train them by iteratively minimizing the following loss:
$$\min_\theta\; \mathrm{qloss}(\theta) + \lambda\, \mathrm{nll}(\theta),$$
where λ > 0 is a tuning parameter controlling the balance between the mean squared loss and the log-likelihood. The joint loss can be optimized via stochastic gradient descent, as shown in Algorithm 1.

In this section, we compare the performance of the proposed GCQ method with state-of-the-art recommendation methods. We assess the performance of the considered methods on both real-world offline datasets and simulated online environments. In addition, empirical studies on hyperparameter sensitivity and computing time are conducted on several datasets.

The baseline methods are as follows.
- MF [13]: it utilizes the latent factor model to predict unknown ratings.
- W&D [5]: W&D uses a wide & deep neural architecture to learn nonlinear latent factors.
- GRU4Rec [11]: it applies a GRU to model click sequences.
- DQN: it recommends items with a deep Q-net. For fairness, we use the same Q-net as in the proposed method.
- DDPG [8]: DDPG utilizes deterministic policy gradients to update its parameters.
- DEERS [22]: it incorporates a user's negative feedback by sampling from the unclicked items.

We use three publicly available datasets: MovieLens 1M (M1M), MovieLens 10M (M10M), and Amazon 5-core grocery and gourmet food (AMZ). These datasets contain historical ratings of items with scoring timestamps. According to the timestamps, we transform the datasets into replay buffers of the form $\{(s_t^u, a_t^u, s_{t+1}^u, r_t^u)\}$, as sketched below.
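As an illustration, the following Python sketch builds such replay tuples from a timestamped rating log; the column names, history length, and the use of the raw rating as the reward are assumptions for the sketch, not the paper's exact preprocessing.

```python
import pandas as pd

# Hypothetical sketch: turn a timestamped rating log into replay tuples
# (s_t, a_t, s_{t+1}, r_t). A state is the user plus his/her recent clicked items.
def build_replay_buffer(ratings: pd.DataFrame, history_len: int = 20):
    buffer = []
    for user, group in ratings.sort_values("timestamp").groupby("userId"):
        items = group["itemId"].tolist()       # chronologically ordered items
        rewards = group["rating"].tolist()     # reward taken as the observed rating (assumption)
        for t in range(1, len(items)):
            state = (user, tuple(items[max(0, t - history_len):t]))              # s_t
            action = items[t]                                                    # a_t: next clicked item
            next_state = (user, tuple(items[max(0, t + 1 - history_len):t + 1])) # s_{t+1}
            buffer.append((state, action, next_state, rewards[t]))               # r_t
    return buffer
```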
For simplicity, we set the dimension of the user embeddings, the dimension of the item embeddings, and the dimension of the hidden states of the proposed neural architecture to the same value d, which we call the model dimension. By default, we set the model dimension d = 150, the discount factor γ = 0.9, the candidate set size c = 50, and the regularizer λ = 0.1. All these hyperparameters are chosen by cross-validation. The hyperparameters of the baseline methods are set to their default values. Following the temporal order, we use the first 70% of the tuples in the derived replay buffers for training and hold out the remaining 30% for testing.

In an offline environment, we cannot obtain the immediate reward of the recommendation policy, so we cannot use the cumulative reward to evaluate the compared learning agents. Considering that a Q-net Q_θ(s, a) with a high cumulative reward should assign large values to clicked items and small values to ignored ones, Q_θ(s, a) can be viewed as a scoring function that ranks the clicked items ahead of the ignored ones. We can therefore use ranking metrics such as Recall@k and Precision@k to evaluate the compared methods. To reduce randomness, we run each model five times and report the average performance in Table 1 and Table 2.

From the tables, we can see that GCQ consistently outperforms DQN. Since the two methods share the same Q-net, this result shows that GCQ incurs a lower exploration error during the learning process. GCQ also achieves higher accuracy than DEERS, because the proposed encoder is more expressive than DEERS's GRU-based one. Compared with DDPG, GCQ consistently attains better accuracy, since the policy-gradient-based DDPG has higher variance during the learning process. Both Table 1 and Table 2 show that GCQ outperforms the non-RL methods, namely MF, W&D, and GRU4Rec. These results demonstrate that taking the long-term reward into consideration can improve recommendation accuracy.

The computational time of the compared RL methods is recorded in Table 3. The table shows that GCQ takes significantly less computational time on the benchmark datasets. This is because GCQ only needs O(log |A|) flops to make a recommendation decision, while the decision complexity of the other baseline methods is O(|A|).

To simulate the online environment, we train a GRU to model users' sequential decision processes. The GRU takes a user, the user's last 20 clicked items, and a candidate item as input, and outputs the click probability of the candidate item. Such a simulation GRU is widely used for evaluating the online performance of RL-based recommender agents [22]. We split the datasets by temporal order into the front 10%, the middle 80%, and the tail 10%. The front sub-dataset is used to initialize the learning agents, the middle sub-dataset is used to train the simulation GRU, and the simulator is validated on the tail sub-dataset. After training, the simulator has a classification accuracy greater than 75%, so it models a user's click decision quite precisely. Once the simulator is trained, we collect the simulated responses of users and then compute the cumulative reward.
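For concreteness, the following is a hedged PyTorch sketch of such a GRU click simulator; the embedding dimension, the way the candidate item is combined with the user and history, and the output head are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch of a GRU-based click simulator: it scores a candidate item
# given a user and the last 20 clicked items, returning a click probability.
class ClickSimulator(nn.Module):
    def __init__(self, num_users, num_items, d=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d)
        self.item_emb = nn.Embedding(num_items, d)
        self.gru = nn.GRU(input_size=d, hidden_size=d, batch_first=True)
        self.out = nn.Linear(3 * d, 1)   # user, history summary, candidate -> click logit

    def forward(self, user, history, candidate):
        # user: (batch,), history: (batch, 20) item ids, candidate: (batch,) item id
        u = self.user_emb(user)                       # (batch, d)
        h_seq, _ = self.gru(self.item_emb(history))   # (batch, 20, d)
        h = h_seq[:, -1, :]                           # summary of the click history
        c = self.item_emb(candidate)                  # (batch, d)
        logit = self.out(torch.cat([u, h, c], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)       # click probability in [0, 1]
```

One natural way to fit such a simulator is to minimize a binary cross-entropy loss on observed clicked versus unclicked candidates before using its outputs to generate rewards for the learning agents.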
The cumulative reward curves are reported in Fig. 2. From the figure, we find that GCQ yields much higher cumulative rewards than the baseline methods. Its superior performance results from the smaller exploration error and the better encoder structure. The curves also show that GCQ is more stable than the baseline methods, which confirms that GCQ has a lower exploration error during the learning process.

We find that the most important hyperparameters are the model dimension d, which controls the model complexity of GCQ, and the candidate set size c, which controls the exploration error. We report the Precision@10 of GCQ under different settings of d in Fig. 3(a).

We have proposed a novel Generator Constrained Q-learning (GCQ) technique for recommendation tasks. GCQ reduces the decision complexity of the Q-net from O(|A|) to O(log |A|). In addition, GCQ enjoys a lower exploration error through a better characterization of the observed data. Furthermore, GCQ employs a new multi-layer encoder that handles long sequences through an attention mechanism and skip connections. Empirical studies demonstrate that GCQ outperforms state-of-the-art methods in both efficiency and accuracy.

References
1. Fitted Q-iteration in continuous action-space MDPs
2. Residual algorithms: reinforcement learning with function approximation
3. Effective shared representations with multitask learning for community question answering
4. Top-k off-policy correction for a REINFORCE recommender system
5. Wide & deep learning for recommender systems
6. Empirical evaluation of gated recurrent neural networks on sequence modeling
7. A cheap linear attention mechanism with fast lookups and fixed-size representations
8. Deep reinforcement learning in large discrete action spaces
9. Off-policy deep reinforcement learning without exploration
10. Learning to forget: continual prediction with LSTM
11. Session-based recommendations with recurrent neural networks
12. The fixed points of off-policy TD
13. Matrix factorization techniques for recommender systems
14. Stabilizing off-policy Q-learning via bootstrapping error reduction
15. Amazon.com recommendations: item-to-item collaborative filtering
16. Error bounds for approximate policy iteration
17. Incrementally learning the hierarchical softmax function for neural language models
18. word2vec parameter learning explained
19. Norm-preservation: why residual networks can become extremely deep?
20. Deep learning based recommender system: a survey and new perspectives
21. Deep reinforcement learning for list-wise recommendations
22. Recommendations with negative feedback via pairwise deep reinforcement learning
23. DRN: a deep reinforcement learning framework for news recommendation