1 Introduction

Dialogue systems aim to interact with humans through conversations in natural language. They can be broadly divided into three categories: socialbots, question & answering systems, and task-oriented systems. Socialbots aim to hold an entertaining conversation that keeps the user engaged, with no goal beyond being friendly and keeping company. Question & answering systems aim to provide a concise and straightforward answer to the user's question, possibly using information stored in knowledge bases. Finally, task-oriented systems help users complete a specific task [6]. These tasks range from simple ones, such as setting an alarm or making an appointment, to more complex ones, such as finding a tourist attraction, booking a restaurant or taking a taxi.

Due to their wide range of applications, task-oriented systems have gained great relevance in recent years, drawing attention from both academia and industry. There are roughly two types of architectures for modeling a task-oriented agent: end-to-end and pipeline (or modular) [24]. End-to-end approaches treat the system as a single component that maps user input in natural language directly to the system output, also in natural language (see Fig. 1). Pipeline architectures, on the other hand, comprise three components: Natural Language Understanding (NLU), Dialogue Management (DM) and Natural Language Generation (NLG) (see Fig. 2). NLU extracts the key information from the user utterance and transforms it into a structured representation known as a dialogue act. The dialogue act encodes the information relevant to understanding the dialogue and is defined by the tuple [domain, intent, slot, value]: the domain of this particular act; the intent, i.e., its broad objective (inform, request, or thanks, for instance); and the slot and value, representing specific pieces of information in this domain. In the example illustrated in Fig. 2, the utterance I want a restaurant located in the centre of town yields the following dialogue act: [restaurant, inform, area, centre]. DM contains two sub-modules: Dialogue State Tracking (DST), which keeps track of the dialogue state, and the Policy (POL), which decides the best response to give the user. Finally, the NLG component transforms the DM output, which is also a dialogue act, into natural language to present to the user.

Fig. 1.

Illustration of end-to-end architecture.

Fig. 2.

Illustration of pipeline architecture.

Early works on task-oriented dialogue systems [9, 12, 13] focus on problems with a single domain. However, real systems such as the popular virtual assistants commonly span multiple domains. This means that a single system must be able to act and complete the user's task in more than one domain; for example, the user first asks to find a tourist attraction for sightseeing and then, in the same conversation, requests a restaurant for lunch. Multi-domain dialogue systems pose a much harder problem, since the complexity of user goals and conversations increases considerably. Beyond this already challenging complexity, multi-domain systems also face the redundancy of information given by the user during a conversation [15]. For instance, suppose the user first reserves a restaurant table for two people and then requests a hotel room reservation for that night. A single-domain dialogue system would typically ask for the number of people again, even though this kind of information—which can be shared among the domains—was already given by the user. These redundant turns make the conversation longer and degrade the user experience [14].

In this paper, we use the pipeline architecture with a focus on the DM component and adopt a divide-and-conquer approach: several agents are trained independently, each in a specific domain, and then aggregated to act in the multi-domain scenario. We show that this approach leads to better results than the same algorithm trained on all domains at once. Furthermore, we propose a mechanism that enables the system to reuse shareable slots among domains, preventing the agent from asking for redundant information.

The remainder of this paper is organized as follows: Sect. 2 reviews related work, Sect. 3 describes our methods and proposal, Sect. 4 presents our experiments and results, and finally Sect. 5 highlights our conclusions and directions for future work.

2 Related Work

There have been previous efforts focused on multi-domain task-oriented dialogue systems. Komatani et al. [11] proposed a distributed architecture that integrates expert dialogue systems in different domains using a domain selector trained with a decision tree classifier. Later works employed traditional reinforcement learning to learn the domain selector [23]. However, building these systems requires manual feature engineering. Finally, [4] proposed using deep reinforcement learning, which allows training the system from raw data without manual feature engineering.

More recent works pay great attention to centralized systems, i.e., a single system capable of handling multiple domains instead of multiple agents, each specialized in one domain. One reason for this is the increase in processing power of modern computers. Traditional works using end-to-end architectures rely on recurrent neural networks (RNN) with a sequence-to-sequence approach [2, 19]. The Recurrent Embedding Dialogue Policy (REDP) [21] learns vector embeddings for dialogue acts and states, showing it can adapt to new domains better than the usual RNNs. Vlasov et al. [22] proposed using a transformer architecture [20] with the self-attention mechanism operating over the sequence of dialogue turns, outperforming previous RNN-based models.

Another line of research focuses on the DM module using the pipeline architecture. There have been attempts to learn a policy for the DM with supervised learning, but since dialogue management can be cast as a sequential decision-making problem, RL is more commonly used [5]. However, RL algorithms are generally too slow when trained from scratch. Many works attempt to include expert knowledge, either by supervised pre-training or by warm-up, i.e., pre-filling the replay buffer with experience from rule-based agents in DQN algorithms. DQfD (Deep Q-learning from Demonstrations) uses expert demonstrations to guide learning and encourages the agent to explore high-reward areas, avoiding random exploration [7, 8]. Redundancy with respect to overlapping slots between domains is another issue in dialogue systems. Chen et al. [3] address this problem by implementing the policy with a graph neural network whose nodes can communicate with each other to share information, but they assume the adjacency matrix for this communication is known.

Although recent work on multi-domain settings does not consider distributed architectures, they remain relevant for two reasons: they can reuse well-established single-domain algorithms and, when a new domain must be added to the system – which is common for real applications such as virtual assistants – they make it unnecessary to retrain the entire system. Therefore, we adopted a distributed architecture using the divide-and-conquer approach. Furthermore, to the best of our knowledge, ours is the first work that focuses on learning a slot-sharing mechanism.

3 Proposal

In this work we propose the Divide-and-Conquer Distributed Architecture with Slot Sharing Mechanism (DCDA-S2M), shown in Fig. 3, which adopts the pipeline architecture (Fig. 2) with a primary focus on the DM component.

The dialogue state must encode all useful information collected during interactions. The annotated states in MultiWOZ essentially comprise the slots the user has informed and those required to complete the task in each domain. For example, the hotel domain requires, among others, the area of the city where the hotel is located, the number of stars it has been rated, and its price range.

Our proposal to train the POL is divided into two steps. The first is to use a distributed architecture in a divide-and-conquer approach to build a system capable of interacting in a multi-domain environment. The second is a mechanism that shares slots between domains. In the following sections we detail each component of DCDA-S2M: the Divide-and-Conquer Distributed Architecture (DCDA) and the Slot Sharing Mechanism (S2M).

Fig. 3.

Illustrative figure of our proposed architecture.

3.1 Divide-and-Conquer Distributed Architecture

We implemented seven agents, each one specifically trained for a domain of the MultiWOZ dataset (attraction, hospital, hotel, police, restaurant, taxi, and train), as illustrated in Fig. 3. The idea is that, using a simple reinforcement learning algorithm, we can have a multi-domain system with better performance than if we had a single agent trained in all domains at once.

However, simply having one agent per domain is not enough. We need a controller capable of perceiving when a domain change occurs and selecting the right agent to produce the response. The controller keeps track of all state features from all agents, allowing it to know the past and current domains.

As mentioned before, in multi-domain systems the dialogue act is defined by the tuple [domain, intent, slot, value], i.e., it already includes the domain of the conversation [10]. Therefore, the controller simply observes the domain element of the dialogue act. If it is a new domain (different from the current one), the controller checks with the S2M (detailed in Sect. 3.2) which slots from past domains can share values with the new domain and copies the values of these shareable slots. It then sends the state features to the agent corresponding to the current domain. Each agent is trained with reinforcement learning to learn an optimal policy (the POL component of the DM) in its domain.
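The routing logic described above can be sketched as follows. This is an illustrative sketch only: class and method names (Controller, step, shareable, respond) are our own, not taken from the actual implementation, and the S2M and agents are replaced by toy stand-ins.

```python
# Sketch of the controller: record informed slots, trigger slot sharing
# on a domain change, and route the state to the per-domain agent.
class Controller:
    def __init__(self, agents, s2m):
        self.agents = agents            # dict: domain -> trained POL agent
        self.s2m = s2m                  # slot sharing mechanism (Sect. 3.2)
        self.current_domain = None
        self.informed = {}              # (domain, slot) -> value seen so far

    def step(self, dialogue_act, state):
        domain, intent, slot, value = dialogue_act
        if slot is not None:
            self.informed[(domain, slot)] = value
        if domain != self.current_domain:
            # on a domain change, copy values of shareable slots over
            for (old_dom, old_slot), val in self.informed.items():
                if old_dom == domain:
                    continue
                for new_slot in self.s2m.shareable(old_dom, old_slot, domain):
                    state.setdefault(domain, {})[new_slot] = val
            self.current_domain = domain
        return self.agents[domain].respond(state)

# Toy stand-ins, only to exercise the routing logic:
class FixedS2M:
    def shareable(self, old_dom, old_slot, new_dom):
        return [old_slot] if old_slot == "people" else []

class EchoAgent:
    def respond(self, state):
        return state

ctrl = Controller({"hotel": EchoAgent(), "restaurant": EchoAgent()}, FixedS2M())
ctrl.step(("hotel", "inform", "people", "2"), {})
out = ctrl.step(("restaurant", "inform", "food", "italian"), {})
```

After the domain shift to restaurant, the value informed for hotel-people is copied into the restaurant state, so the agent does not need to ask for it again.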

Reinforcement Learning. In reinforcement learning an agent learns by interacting with an environment modeled as a Markov Decision Process (MDP), defined as a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {R}, \mathcal {T}, \gamma )\), where \(\mathcal {S}\) is the set of possible states, \(\mathcal {A}\) is the set of actions, \(\mathcal {R}\) is the reward function, \(\mathcal {T}\) is the state transition function, and \(\gamma \in [0,1]\) is the discount factor balancing the trade-off between immediate and future rewards. In the context of dialogue systems, the state represents the dialogue state, the actions are the set of dialogue acts of the agent, and the reward is \(+R\) if the dialogue succeeds, i.e., the agent achieves the user goal, \(-R\) if the dialogue fails, and \(-1\) for each turn, to encourage the agent to finish the dialogue in as few turns as possible.
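The reward scheme above can be sketched as a small function (using the symmetric \(\pm R\) form given here; the function name is our own):

```python
def turn_reward(done, success, R=40):
    """Per-turn reward as described above: -1 each turn,
    +R on success, -R on failure (R is a hyperparameter)."""
    if not done:
        return -1   # per-turn penalty encourages short dialogues
    return R if success else -R
```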

The agent’s objective is to maximize the cumulative discounted rewards, \(\sum _{t = 0}^{T}\gamma ^tr_t\), \(r_t \in \mathcal {R}\), received during interactions in order to find its optimal policy that maps state \(s_t \in \mathcal {S}\) to the best action \(\pi (s_t) = a_t\), \(a_t \in \mathcal {A}\).

In this work we employed the Proximal Policy Optimization (PPO) algorithm [18], a policy gradient method that directly optimizes the parameterized policy to maximize the expected reward. The intuition behind PPO is to make the greatest possible improvement to the policy without stepping so far from the current policy that performance collapses. Formally, it optimizes the loss function \( \mathcal {L}(\theta ) = \min \left( \rho (\theta )\hat{A},\ clip(\rho (\theta ), 1 - \epsilon , 1 + \epsilon )\hat{A}\right) , \) where \(\rho (\theta ) = \frac{\pi _\theta (a | s)}{\pi _{\theta _{old}}(a | s)}\) denotes the probability ratio, \(\hat{A}\) is the estimated advantage function and \(\epsilon \) is a hyperparameter that controls how far the new policy may move from the old one.
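The clipped objective can be sketched numerically (NumPy, illustrative only; in practice a batch mean of this quantity is optimized with respect to the policy parameters):

```python
import numpy as np

def ppo_objective(ratio, advantage, eps=0.2):
    """Element-wise clipped surrogate objective from the loss above."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio above 1 + eps gains nothing:
vals = ppo_objective(np.array([0.5, 1.0, 1.5]), np.array([1.0, 1.0, 1.0]))
# -> [0.5, 1.0, 1.2]
```

The last entry shows the clipping at work: the surrogate value for the ratio 1.5 is capped at \(1 + \epsilon = 1.2\), removing the incentive to move far from the old policy.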

For the advantage function estimation, we use Generalized Advantage Estimation (GAE) [17], \( \hat{A}(s_t, a_t) = \delta (s_t) + \gamma \lambda \hat{A}(s_{t+1}, a_{t+1}), \) with \( \delta (s_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t), \) where \(\gamma \) is the discount factor, \(\lambda \) is a hyperparameter that adjusts the bias-variance trade-off, and \(\hat{V}(s)\) is the estimate of the value function, i.e., the expected return the agent receives from state s.
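The recursion above is computed backwards over a finished episode; a minimal sketch (function name is our own):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE recursion: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}. `values` holds V for s_0..s_T
    (with V(s_T) = 0 for a terminal state)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

As a sanity check, with \(\gamma = \lambda = 1\) and a zero value function, the advantage at each step reduces to the sum of future rewards.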

However, reinforcement learning algorithms often struggle in sparse-reward environments. One option to deal with this issue is to use a pre-training method such as imitation learning, where the agent tries to clone expert behavior. We use the Vanilla Maximum Likelihood Estimation (VMLE) algorithm [25] to pre-train the agents. VMLE casts the problem as multiclass classification with data extracted from the MultiWOZ dataset, optimizing its policy to mimic the behavior of the agent recorded in the dataset.

User Simulator. Training the agent with real users is impracticable, since it requires a great number of interactions; therefore a user simulator is needed. Our user simulator follows an agenda-based approach [16]. First, it generates a user goal comprising all information needed to complete the task. Then, it generates an agenda in a stack-like structure with all actions it needs to take (informing its constraints and/or requesting information). During the conversation, as the agent requests or informs something, the user can reschedule the agenda accordingly. For example, if the agent requests the type of restaurant, then the user can move the action “inform the type of restaurant” to the top of the stack. The conversation lasts until the stack is empty or the maximum number of turns is reached.
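The agenda's stack behaviour can be sketched as follows (illustrative names only; the actual simulator is the agenda-based one available in ConvLab-2 [16]):

```python
class AgendaUser:
    def __init__(self, goal_actions):
        self.agenda = list(goal_actions)   # top of the stack = end of list

    def reschedule(self, requested_slot):
        # when the agent requests a slot, promote the matching inform action
        for i, (intent, slot, _value) in enumerate(self.agenda):
            if intent == "inform" and slot == requested_slot:
                self.agenda.append(self.agenda.pop(i))
                break

    def next_action(self):
        return self.agenda.pop() if self.agenda else None

user = AgendaUser([("request", "phone", None),
                   ("inform", "food", "italian"),
                   ("inform", "area", "centre")])   # 'area' starts on top
user.reschedule("food")   # agent asked for the type of restaurant
```

After rescheduling, the "inform food" action is answered first, even though "inform area" was originally on top.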

We made a small modification to the user simulator to handle the confirmation actions produced by the agent when using the slot sharing mechanism. The simulator checks whether the values in the confirmation act are correct; if they are wrong, it informs the correct value; otherwise, it removes the action regarding that slot-value pair from the agenda and continues with its policy.

Finally, after all seven agents are trained with reinforcement learning, we plug them into the controller, resulting in the DCDA system.

3.2 Slot Sharing Mechanism

Some domains contain overlapping slots, i.e., slots that can share their values in a conversation. For instance, if the user is looking for a restaurant and a hotel, it is likely that both are for the same day and in the same price range. More complex relationships can also be represented, such as the reservation time of a restaurant and the time a taxi must arrive at its destination (which would be the restaurant). However, not all slots are shareable between two domains, so we need to know which slots' values can be transferred from one domain to another.

In the following subsections we show how we learned these relationships; for now, suppose we already know which slots they are. During the conversation, when the controller notices a domain change, the slot sharing mechanism first gets all slots informed in previous domains and then checks whether any of them can share their values with slots in the current domain. If so, the controller copies the value from that slot to the slot of the new domain. For example, suppose the system has already interacted with the user in the hotel domain and knows that the demand is for two people. When the conversation changes to the restaurant domain, the slot sharing mechanism will see that the hotel-people slot can be shared with the restaurant-people slot, and the controller will transfer this information (two people) to the restaurant domain. The agent then asks for confirmation and acts considering this transferred slot. This can speed up the dialogue and improve the user experience by avoiding requests for redundant information. In the worst case, if the transferred value is wrong, the user informs the correct value and the interaction continues normally.

Learning Shareable Slots. Our proposal to learn which slots can be shared, named Node Embedding for Slot Sharing Mechanism, uses the node embedding technique, in which each node represents a domain-slot pair. The similarity of two nodes indicates whether they can share the same value in a conversation and is defined by a simple scalar product: given nodes \(u = [u_1, u_2, \ldots , u_d]\) and \(v = [v_1, v_2, \ldots , v_d]\), we have \( similarity(u, v) = \langle u, v\rangle = \sum _{i=1}^d u_i\cdot v_i, \) where d is the embedding dimension. For instance, the nodes restaurant-day and hotel-day must be similar, i.e., have a high scalar product, while restaurant-name and hotel-name must have a low scalar product.

Before learning the node embedding, we need to build a similarity matrix \(A \in \mathbb {R}^{n\times n}\), where n is the number of nodes, i.e., the number of domain-slot pairs, and cell \(A_{uv}\) holds the similarity between nodes u and v, normalized so that \(similarity(u, v) \in [0, 1]\). This is done using the dialogues from dataset \(\mathcal {D}\), as shown in Algorithm 1. At the end of each conversation, we observe the final state (the state of the dialogue at the last interaction) and check whether each node pair presents the same value (recall that a node represents a domain-slot pair). If they do, the weight between these two nodes is increased by one. In the end, all weights are normalized by the number of times each pair of nodes appeared in the dialogues.

Algorithm 1.
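The counting-and-normalizing procedure of Algorithm 1 can be sketched as follows (an illustrative helper, assuming each final state is given as a dict mapping (domain, slot) nodes to their values):

```python
from itertools import combinations

def build_similarity_matrix(final_states):
    """Normalized co-occurrence of equal values between node pairs,
    where a node is a (domain, slot) pair."""
    same, seen = {}, {}
    for state in final_states:
        for u, v in combinations(sorted(state), 2):
            seen[(u, v)] = seen.get((u, v), 0) + 1
            if state[u] == state[v]:
                same[(u, v)] = same.get((u, v), 0) + 1
    # normalize each weight by how often the pair appeared together
    return {pair: same.get(pair, 0) / n for pair, n in seen.items()}

# Two toy dialogues: the day slots coincide in one of the two dialogues
A = build_similarity_matrix([
    {("hotel", "day"): "friday", ("restaurant", "day"): "friday"},
    {("hotel", "day"): "monday", ("restaurant", "day"): "tuesday"},
])
```

Here hotel-day and restaurant-day carried the same value in one of the two dialogues in which both appeared, so their normalized weight is 0.5.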

However, keeping a matrix with \(\mathcal {O}(n^2)\) space complexity does not scale with the number of domains and slots. For this reason, we trained a node embedding representation for each domain-slot pair. The learning of the node embedding uses the similarity matrix \(A \in \mathbb {R}^{n\times n}\) and follows Algorithm 2, proposed by Ahmed et al. [1]. In each step, for each node pair \((u, v) \in E\), where E is the set containing all node pairs, it performs an update to minimize the error \( L(A, Z, \lambda ) = \frac{1}{2}\sum _{(u,v)\in E}(A_{uv} -\langle Z_u, Z_v \rangle )^2 + \frac{\lambda }{2}\sum _u||Z_u||^2, \) where Z represents the embedding space and \(Z_u\) is the vector for node u. The t in Algorithm 2 can be thought of as a learning rate for each update.

Algorithm 2.
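A minimal SGD sketch of this factorization loss is shown below. It is illustrative only: the function name is our own, the defaults mirror the hyperparameters used in Sect. 4.1 (d = 50, λ = 0.3, 1000 epochs), the demo call uses toy values for speed, and for simplicity the regularizer is applied once per edge per epoch rather than exactly as in Algorithm 2.

```python
import numpy as np

def learn_embeddings(A, n_nodes, d=50, lam=0.3, t=0.05, epochs=1000, seed=0):
    """For each pair (u, v) in A (a dict mapping index pairs to
    similarities in [0, 1]), nudge Z_u and Z_v so that <Z_u, Z_v>
    approaches A_uv, with L2 shrinkage controlled by lam."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=0.1, size=(n_nodes, d))
    for _ in range(epochs):
        for (u, v), a_uv in A.items():
            err = a_uv - Z[u] @ Z[v]          # residual A_uv - <Z_u, Z_v>
            Z[u] += t * (err * Z[v] - lam * Z[u])
            Z[v] += t * (err * Z[u] - lam * Z[v])
    return Z

# Nodes 0 and 1 often share values; nodes 0 and 2 never do:
Z = learn_embeddings({(0, 1): 0.9, (0, 2): 0.0},
                     n_nodes=3, d=8, lam=0.01, t=0.05, epochs=500)
```

After training, the scalar product of the embeddings of nodes 0 and 1 is high while that of nodes 0 and 2 stays near zero, which is exactly the behaviour the similarity test relies on.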

The S2M is independent of the trained agents, so it can be plugged into any agent we want. Combining DCDA and S2M yields the DCDA-S2M system (Fig. 3), in which the controller queries the S2M for slots that can share their values whenever there is a domain shift and then sends the state features to the respective agent to obtain the response to the user.

4 Experiments and Results

To evaluate our proposal we performed three experiments: the first was to train the node embeddings and carry out a qualitative analysis of the learned embedding. We then evaluated the divide-and-conquer approach with the slot sharing mechanism, and finally we evaluated the same approach without the mechanism.

4.1 Experimental Setup

We used the ConvLab-2 platform to run our experiments. It provides a user simulator and implementations of all dialogue system components (NLU, DST, POL and NLG), making it easy to assess new algorithms for each component. We trained the agents for each domain using the PPO algorithm with the standard parameters of the ConvLab-2 platform: discount factor \(\gamma = 0.99\), clipping factor \(\epsilon = 0.2\) and \(\lambda = 0.95\) (for advantage estimation). Training lasted 200 epochs; in each epoch we collected around 100 turns and sampled a batch of size 32 for optimization. The agent contains two separate networks, one for policy estimation with two hidden layers of size 100 and another for value estimation with two hidden layers of size 50. The optimizers used for the policy and value networks are RMSProp and Adam, with learning rates \(lr_p = 10^{-4}\) and \(lr_v = 5\cdot 10^{-5}\), respectively. The reward function is \(-1\) for each turn (to encourage the agent to complete the task more quickly), 40 for a successful dialogue and \(-20\) for a failed one. For pre-training we employed the VMLE algorithm using the RMSProp optimizer with \(lr_{vmle} = 10^{-3}\) and binary cross-entropy with logits as the loss function. We also used the available PPO model trained on all domains at once to compare with our results.

For our node embedding, we built the similarity matrix using the dialogue corpus available on the ConvLab-2 platform. For hyperparameters, we used an embedding dimension \(d = 50\), regularization factor \(\lambda = 0.3\), and 1000 epochs of training.

4.2 Node Embedding

To visualize the learned node embedding, we used a t-distributed stochastic neighbor embedding (t-SNE) model with perplexity 5, using the scalar product as the similarity function. Figure 4 shows this visualization.

Fig. 4.

Visual representation of the learned node embedding.

Figure 4 clearly shows some groups of related nodes. For example, hotel-area, attraction-area, and restaurant-area form a group, indicating that users generally request places in the same area. The same happens for the price range (restaurant-pricerange and hotel-pricerange), day (restaurant-day, hotel-day, and train-day) and people (restaurant-people, hotel-people, and train-people) slots. Although hotel-stay and hotel-stars look close to the group with the “people” slots, computing their similarity with restaurant-people gives 0.118 and 0.088, respectively; thus they are not similar and should not share values. On the other hand, the similarity between restaurant-area and attraction-area is 0.91, showing that they are similar and must share their values within a conversation. Here we used a similarity of 0.8 as the threshold for sharing slot values.

An interesting observation is that attraction-name and hotel-name are quite close to taxi-departure, with similarities 0.57 and 0.668, respectively, but they are not close to each other: the similarity between them is 0.011. This is expected, since it is uncommon for an attraction to have the same name as a hotel.
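With the learned embeddings, the sharing decision reduces to a dot-product threshold (0.8, as above). The sketch below uses toy two-dimensional vectors chosen only to mirror the reported ordering, not the actual learned embeddings:

```python
import numpy as np

def can_share(Z, u, v, threshold=0.8):
    """Decide whether two domain-slot nodes may share a value."""
    return float(Z[u] @ Z[v]) >= threshold

# Toy vectors: area slots nearly aligned, name slot orthogonal to them
Z = {"restaurant-area": np.array([1.0, 0.0]),
     "attraction-area": np.array([0.91, 0.1]),
     "hotel-name":      np.array([0.0, 1.0])}
```

Under these toy vectors, restaurant-area and attraction-area clear the 0.8 threshold and may share values, while restaurant-area and hotel-name do not.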

4.3 DCDA Evaluation

To evaluate our proposal we assessed four models: the baseline Rule-based policy available in ConvLab-2, the VMLE policy (obtained from the VMLE algorithm), and the PPO algorithm trained in both approaches: a centralized system with a single agent trained to handle all domains at once (PPO\(_{all}\)) and our proposal (DCDA). We also evaluated the effect of using S2M or not with the Rule and DCDA agents. The metrics are automatically computed by the evaluator provided in ConvLab-2 and encompass the complete rate, success rate, book rate, precision, recall, and F1-score for the informed slots, and the average number of turns over both the successful dialogues and the total set of dialogues. The complete rate indicates the rate of dialogues that finish (either with success or failure) before reaching the maximum number of turns. The precision, recall and F1-score indicate the ability of the agent to fulfill the slots of the user goal, i.e., to inform the correct slot values. Tests were performed over 2000 dialogues.

Table 1 shows the evaluation results for the four models trained in the pipeline setting, i.e., without the NLU and NLG modules. As expected, the Rule policy performs almost “perfectly”, succeeding in 98.45% of the dialogues, and serves as a baseline. Among the trainable agents, DCDA performs better in almost all aspects, achieving 88.14% success rate, 94.01% complete rate, and 88.01% book rate. It beats PPO\(_{all}\) by almost 21.51 percentage points in success rate, showing much better performance and efficiency, as it can solve user tasks in fewer turns. Its average number of turns over all dialogues, 14.92, is very close to that of the baseline Rule policy (13.48), showing it learned a very good task-solving policy. The increase relative to the baseline can be explained by analysing the failed dialogues during testing: in many of them, the conversation entered a loop, with the agent and the simulated user repeating the same dialogue act consecutively. The reason why this phenomenon occurs is not clear to us. The worse performance of VMLE is expected, as it serves only as pre-training for the other agents.

Table 1. Results of the four agents: Rule, VLME, PPO\(_{all}\) and DCDA tested in a pipeline setting. Best results among trainable agents are in bold.
Table 2. Evaluation of the use of S2M in the Rule policy and DCDA with the goal generator generating random goals. Best results are in bold.

The results of the second experiment, regarding the use of the slot sharing mechanism, are presented in Table 2. We evaluated both the Rule policy and our proposed model DCDA. The results show that the sharing mechanism gave the Rule policy a slightly better performance. Although the success rate of DCDA did not change much, the sharing mechanism helped it achieve better complete and book rates. Another enhancement was in the average number of turns: for successful dialogues it decreased from 13.40 to 13.20 for the Rule policy and from 13.84 to 13.43 for DCDA when the sharing mechanism was incorporated. Thus the sharing mechanism makes both agents complete dialogues faster than without it.

Interestingly, despite the slightly better performance with the sharing mechanism, precision, recall and F1-score did not follow the same trend: they were slightly worse than, or very close to (within 0.05%), the results without the sharing mechanism. This is not very surprising: since the agent with the sharing mechanism tries to “guess” the slots of new domains within the conversation, it ends up informing more slots that are wrong with respect to the user goal, hurting precision, recall, and F1-score.

All these experiments were run with the user simulator generating random goals based on a distribution of the goal model extracted from the dataset. These can include simple goals within a single domain and/or goals that span more than one domain but do not have any slots with the same value. Indeed, among the 2000 goals generated during testing, only about 400 contained common values between slots. With that in mind, we ran another test of the sharing mechanism in which the user simulator is restricted to generating only goals that contain common slots. The generated goals therefore tend to be more complex than those in the first test.

Table 3 shows the results. There is an expected, significant decrease in overall performance due to the increased complexity of the user goals. However, here we can clearly observe the great advantage of the sharing mechanism in this setting.

Table 3. Evaluation of the use of S2M in the Rule policy and DCDA with the goal generator generating slots with common values. Best results are in bold.

There is a success rate difference of 12.25 and 9.99 percentage points for the Rule and DCDA policies, respectively. We also see a bigger impact on the average number of turns. It mostly affects the successful dialogues, because the number of turns is reduced only when the transferred slot values are correct – otherwise the user would still need to inform these slots – and the chances of a successful dialogue increase when this happens. Finally, we also see better precision, recall and F1-score for the agent with the sharing mechanism: since all goals in these tests have at least one common value among the slots, the agent's “guesses” are more likely to be correct.

Fig. 5.

Example of a dialogue using the slot sharing mechanism, resulting in a dialogue of length 8.

Fig. 6.

Example of a dialogue that does not use the slot sharing mechanism, resulting in a dialogue of length 11.

Figures 5 and 6 show examples of system-generated dialogues with and without the slot sharing mechanism, respectively. Observe that when the conversation switched to the hotel domain, the agent in Fig. 5 asked for confirmation that the price is moderate and the area is north, and recommended a hotel with these constraints. In natural language, this dialogue act could be read as: “You want a hotel in the north with a moderate price, right? There is the hotel Limehouse”. This way, the user did not need to inform these slots again, saving some turns until task completion. In Fig. 6, by contrast, the agent needed to ask the user again for the area and price, resulting in a redundant dialogue that takes more turns to complete (11 turns against 8).

One drawback of DCDA-S2M is the time required to train all the agents. Table 4 shows the average training time for each agent. The total time required to train all seven agents is 291.29 minutes, approximately 16% more than training the centralized system. However, it is worth noting that the agents could be trained in parallel, at the cost of greater computational power.

Table 4. Training time in minutes for each agent.

5 Conclusions

In this work we showed that a distributed architecture, with multiple agents trained separately for each domain, can improve system performance compared to a single agent trained with the same algorithm on all domains at once. This is because each agent can specialize in solving its own problem well, which is much simpler than solving tasks well in all domains, as in the centralized single-agent approach. Furthermore, distributed systems can add new domains without retraining the entire system.

The slot sharing mechanism also proved to enhance system performance, especially for tasks whose goals have slots in common across domains. Besides improving the system's success rate, it also decreases the average number of turns, showing that the system avoids asking for redundant information.

A major disadvantage of DCDA-S2M is the need to train several agents separately, which can be time- and energy-consuming. For future work, we intend to explore transfer learning techniques in reinforcement learning to accelerate the training of new agents.