Polite Dialogue Generation Without Parallel Data

Tong Niu and Mohit Bansal
UNC Chapel Hill
{tongn, mbansal}@cs.unc.edu

Abstract

Stylistic dialogue response generation, with valuable applications in personality-based conversational agents, is a challenging task because the response needs to be fluent, contextually-relevant, as well as paralinguistically accurate. Moreover, parallel datasets for regular-to-stylistic pairs are usually unavailable. We present three weakly-supervised models that can generate diverse, polite (or rude) dialogue responses without parallel data. Our late fusion model (Fusion) merges the decoder of an encoder-attention-decoder dialogue model with a language model trained on stand-alone polite utterances. Our label-fine-tuning (LFT) model prepends to each source sequence a politeness-score scaled label (predicted by our state-of-the-art politeness classifier) during training, and at test time is able to generate polite, neutral, and rude responses by simply scaling the label embedding by the corresponding score. Our reinforcement learning model (Polite-RL) encourages politeness generation by assigning rewards proportional to the politeness classifier score of the sampled response. We also present two retrieval-based, polite dialogue model baselines. Human evaluation validates that while the Fusion and the retrieval-based models achieve politeness with poorer context-relevance, the LFT and Polite-RL models can produce significantly more polite responses without sacrificing dialogue quality.

1 Introduction

Generating stylistic, personality-based language is crucial to developing engaging, convincing, and trustworthy conversational agents, for their effective application in intelligent tutoring, home assistance, online reservations/purchasing, health care, etc.
Most current chatbots and conversational models lack any such style, which can be a social issue because human users might learn biased styles from such interactions, e.g., kids learning to be rude because the dialogue system encourages short, curt responses, and also does not itself use politeness to set an example.1 In this work, we focus on the important and diverse paralinguistic style axis of politeness vs. rudeness (Brown and Levinson, 1987).

Generating stylistic dialogue responses is a substantially challenging task because the generated response needs to be syntactically and semantically fluent, contextually-relevant to the conversation, as well as convey accurate paralinguistic features. This is further complicated by the fact that content and style are only available in separate unpaired datasets, as opposed to translation-type parallel datasets containing regular-to-stylistic text pairs. Hence, we need indirectly-supervised models that can incorporate style into the generated response in the absence of parallel data (i.e., where the training data for the conversation, versus style components, comes from two different datasets or domains), while still maintaining conversation relevance.

1 https://qz.com/701521/parents-are-worried-the-amazon-echo-is-conditioning-their-kids-to-be-rude/
2 The first version of this paper with the three Fusion, Discrete-LFT, and Polite-RL models was submitted on Oct 1, 2017. The two retrieval baselines and the continuous version

Transactions of the Association for Computational Linguistics, vol. 6, pp. 373–389, 2018. Action Editor: Colin Cherry. Submission batch: 10/2017; Revision batch: 2/2018; Published 6/2018. ©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

In this work, we present three such weakly-supervised models2 that can generate diverse, natural, and contextually-relevant polite (and rude) dialogue responses, using data from separate style and dialogue domains: the Stanford Politeness Corpus (Danescu-Niculescu-Mizil et al., 2013) with Wikipedia and Stack Exchange requests, and the MovieTriples Dialogue Corpus (Serban et al., 2016) with IMSDB movie scripts, respectively. Each of our three models is based on a state-of-the-art politeness classifier and a sequence-to-sequence dialogue model. The first model (Fusion) employs a late fusion technique to merge the response generation decoder of the dialogue model with a language model trained on polite utterances chosen by the politeness classifier. The second label-fine-tuning (LFT) model prepends to the input utterance a single politeness label whose embedding is continuously scaled by the politeness score of the target sequence during training. This score is determined by feeding the corresponding ground-truth target sequence to our politeness classifier. During test time, we show that the LFT model is able to control the politeness level of generated responses by simply scaling the label's embedding by the continuous target politeness score of our choice. Our third reinforcement-based model (Polite-RL) encourages politeness generation by using the continuous-scale politeness score of the decoder-sampled sentence as a reward (via mixed-objective policy gradient methods), i.e., polite utterances are encouraged with positive reward, and rude ones discouraged with negative reward. Hence, our models only need a style classifier (without parallel data) to automatically influence and encourage continuous-scale stylistic language generation in a complex dialogue setup, which also requires maintaining relevance to conversational context.
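The label-scaling mechanism at the heart of the LFT model can be illustrated with a short pure-Python sketch. All names, dimensions, and values below are illustrative assumptions for the example, not the authors' actual implementation: a single politeness-label embedding is scaled by the continuous politeness score (predicted by the classifier at training time, or chosen freely at test time) and prepended to the source-sequence embeddings.

```python
# Illustrative sketch of the LFT input construction: one politeness-label
# embedding, scaled by a continuous politeness score in [0, 1], is
# prepended to the source-sequence embeddings. Toy values throughout.

def lft_inputs(source_embeddings, label_embedding, politeness_score):
    """Prepend the score-scaled label embedding to the source embeddings."""
    scaled_label = [politeness_score * x for x in label_embedding]
    return [scaled_label] + source_embeddings

# Toy 2-dim embeddings for a 3-token source utterance.
label = [0.5, -1.0]
source = [[0.1, 0.2], [0.3, 0.4], [0.0, -0.2]]

polite_in = lft_inputs(source, label, politeness_score=0.9)  # high score
rude_in = lft_inputs(source, label, politeness_score=0.1)    # low score
```

At test time the same trained label embedding is simply rescaled (a high score toward the polite end of the spectrum, a low score toward the rude end) to control the style of the generated response.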
Each of these models requires minimal changes to the architecture of either the underlying sequence-to-sequence (Seq2seq) dialogue base model or the style classifier, and hence can modularly update the architecture with the latest state-of-the-art dialogue models or style classifiers (and for diverse styles). In addition, we also employ two retrieval-based models, where we output the response which has the highest match with the input context from a set of classifier-picked polite responses or manually-picked generic polite utterances. These two retrieval models serve as parallel investigations on the performance of our three proposed generative models above.

2 (cont.) of the LFT model were added to the Feb 1, 2018 resubmission based on reviewer discussions.

We conducted multiple human evaluations (for style and dialogue quality) on Amazon Mechanical Turk (MTurk) (Buhrmester et al., 2011) for all three models plus the base sequence-to-sequence dialogue model and the retrieval-based models, and show that while the Fusion and the two retrieval models increase the politeness level of responses at the cost of poorer dialogue quality, both our LFT and Polite-RL models can successfully produce polite responses (capturing several politeness strategies discussed by Brown and Levinson (1987)), without sacrificing dialogue coherence and relevance compared to the base Seq2seq model (hence a better balance between politeness and dialogue quality). We also compare the output dialogue politeness levels of the continuous LFT model for three different politeness levels. Finally, we present several detailed qualitative and quantitative analyses, including positive and negative output examples, automatic metric results on output responses, classifier error analysis, and visualization of the RL rewards.

2 Related Works

2.1 Models for Style Transfer

Style Transfer with Parallel Data There have been multiple works on style transfer with parallel data.
These tasks can often be solved by directly applying some variation of a translation-based Seq2seq model discussed in the previous section. For example, Xu et al. (2012) use a phrase-based statistical model, and Jhamtani et al. (2017) use a standard Seq2seq model to convert modern language to Shakespeare-style language by treating style transfer as a translation task. Some labeled sequence transduction methods have also been proposed (Kobus et al., 2017; Yamagishi et al., 2016; Johnson et al., 2017). For example, Kikuchi et al. (2016) are able to control the length of the summarization text by feeding to the Seq2seq base model a label that indicates the intended output length in addition to the source input. Our LFT model also adopts this labeling idea, and is able to handle a similar situation but without parallel data, because by labeling each target sequence in the training set with its politeness classifier score, we are essentially converting non-parallel data to (noisy) parallel data (by using a classifier with high accuracy).

Style Transfer without Parallel Data Several previous works have looked at style transfer without parallel data, in both vision (Gatys et al., 2016; Zhu et al., 2017; Liu and Tuzel, 2016; Liu et al., 2017; Taigman et al., 2016; Kim et al., 2017; Yi et al., 2017), and text (Sennrich et al., 2016a; Hu et al., 2017; Ghosh et al., 2017; Zhao et al., 2017; Mueller et al., 2017; Wang et al., 2017; Luan et al., 2017). Among these models, some are bag-of-words based, i.e., they use style-related keywords to annotate the target sequences in the training set. For example, to control how formal the output sequences are in an EN-DE translation task, Sennrich et al. (2016a) labeled each target sequence based on whether it contains formal or informal verbs and pronouns (honorifics).
To build a language model that generates utterances with the desired style, Ficler and Goldberg (2017) annotated their text with meta-data and keyword/POS-tag based heuristics, while Ghosh et al. (2017) also adopted keyword spotting based on a dictionary of emotional words. The basic ideas of their models are similar to that of our LFT model. However, these keyword-spotting approaches do not fully extend to our politeness generation task, because politeness strategies follow complex patterns of grammar, word order, and phrasing (Danescu-Niculescu-Mizil et al., 2013). For example, the politeness of please depends on where it occurs in a sentence, and what other politeness markers it co-occurs with (e.g., ‘could/would you’ style counterfactual modals vs. ‘can/will you’ style indicative modals). Therefore, our novel polite dialogue models are based on an accurate neural classifier, which is better at capturing several compositional paralinguistic features (as visualized in Aubakirova and Bansal (2016), whose politeness classifier we extend). Moreover, our LFT and Polite-RL models can generate a continuum of style levels based on the continuously-scaled (by the politeness score) label embedding or reinforcement rewards.

Lastly, there have also been style transfer models that rely on the latent representation of text and use variational auto-encoders or cross-alignment to disentangle the representation of content and style in text (Hu et al., 2017; Shen et al., 2017; Zhao et al., 2017; Fu et al., 2018). During inference time, the latent style representation is combined with new content to generate stylized, content-preserving text. Although both fall into the category of style transfer, our task differs in two important aspects from their tasks.
First, as opposed to the task of strict content preservation when rephrasing a sentence to a different style, our task is about maintaining good relevance to the context when adding style, which is especially useful for dialogue-based tasks. Another distinctive trait of our task is that politeness resides in a spectrum rather than a fixed category or topic (e.g., Shakespearean), and our models can treat politeness as a continuum, i.e., controlling the politeness level by adjusting the fusion rate in the Fusion model, the magnitude of the continuous label in the LFT model, or the RL weight in the Polite-RL model.

2.2 Multi-Task Learning and Style Transfer

In order to obtain a persona-based conversational agent, Luan et al. (2017) proposed a multi-task learning (MTL) based approach: they train a Seq2seq model with conversation data and an autoencoder with non-conversational persona-related data from target speakers, and share the decoder parameters of these two models so that the generated responses can be adapted to the style of the target speaker. This way of incorporating MTL into Seq2seq learning was first investigated by Dong et al. (2015) and Luong et al. (2016) to achieve multilingual NMT. In addition, Sennrich et al. (2016b) also employed MTL to improve NMT models with monolingual (non-parallel) data. These approaches are related to our Fusion model, because we use our classifier to obtain noisy polite target sequences (non-parallel data) that a polite language model trains on; next, during inference, we combine the parameters of the language model with a generative dialogue model trained on parallel data. In general, our models are also related to previous works like Johnson et al.
(2017), who adopted labeled sequence transduction methods for MTL tasks, because our task also involves adapting generated responses to different politeness styles and optimizing two sub-tasks' (namely response and politeness generation) loss functions (related to a multi-task setup).

2.3 Politeness Studies

Danescu-Niculescu-Mizil et al. (2013) created the Stanford Politeness Corpus and trained an SVM classifier using a list of useful linguistic features based on strategies from Brown and Levinson's theory of politeness (Brown and Levinson, 1987). Aubakirova and Bansal (2016) recently took an end-to-end neural approach to this politeness classification task by training a CNN model that directly learns to identify polite requests without using any hand-engineered features, while still improving on prediction accuracy. They also visualized what features the CNN model was learning and discovered some new features along the way. Our classifier mainly extends their work by adding a bi-directional LSTM layer (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) before the CNN layer to capture long-distance relationships in the sentence, which leads to higher cross-domain performance.

A related early work in personality-based dialogue is Mairesse and Walker (2007), who studied introvert/extrovert personality language based on templated content and sentence planning (via personality dimensions such as hedges, tag questions, negations, subject implicitness, etc.). Relatedly, Sennrich et al. (2016a) use an English-to-German translation task to present a model that can generate target sequences that are either formal or informal, specifically based on honorifics-related verbs and pronouns.
Our task is more general, taking into account several politeness-related paralinguistic features of Brown and Levinson (1987) and allowing end-to-end trainable stylistic dialogue generation with a polite-to-rude spectrum (based on a politeness classifier, without relying on parallel data). Moreover, our approaches allow simply replacing the politeness classifier with any other emotion- or personality-based language classifier to generate stylistic dialogue for that new style dimension.

3 Politeness Classification Model

In order to develop an accurate politeness classifier for effective use in stylistic dialogue response generation, we extend and improve upon the state-of-the-art CNN model of Aubakirova and Bansal (2016), and propose a bi-directional LSTM followed by a convolutional layer (see Figure 1), in order to both capture long-distance relationships in the sentence as well as windowed-filter based features.

Figure 1: Our LSTM-CNN politeness classifier.

For a sentence v1:n (where each token vi is a d-dim word embedding vector), the LSTM layer first produces hidden states h1:n (where ht is the concatenation of forward and backward hidden states at time step t). A filter m is then applied on a window of u hidden states. This produces a convolution feature ci = f(m · hi:i+u−1 + b), where f is a non-linear function and b is a bias term. The filter is applied to each possible window of hidden states, producing a feature map c = [c1, ..., cn−u+1] ∈ Rn−u+1. The output of the convolutional layer is then fed to a max-pooling layer (Collobert et al., 2011) which gives C = max{c} for the filter. Filters of various sizes are used to obtain multiple features. The result is then passed to a fully-connected softmax layer that outputs probabilities over two labels, namely Polite and Rude.
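The convolution and max-pooling step just described can be sketched in a few lines of pure Python. The filter weights, window size u, and the choice of ReLU as the non-linearity f below are assumptions for the toy example, not the trained model's values:

```python
# Sketch of one filter in the convolutional layer: filter m (window size u)
# slides over the concatenated BiLSTM hidden states h_1..h_n, producing
# c_i = f(m . h_{i:i+u-1} + b); max-pooling then keeps C = max(c).
# Filter weights, u, and the ReLU non-linearity are illustrative choices.

def conv_max_pool(hidden_states, m, b, u):
    feature_map = []
    for i in range(len(hidden_states) - u + 1):
        # Concatenate u consecutive hidden states into one window vector.
        window = [x for h in hidden_states[i:i + u] for x in h]
        score = sum(w * x for w, x in zip(m, window)) + b
        feature_map.append(max(0.0, score))  # f = ReLU
    return max(feature_map)  # max-pooling over c = [c_1, ..., c_{n-u+1}]

# Toy example: n = 4 hidden states of dimension 2, window u = 2.
h = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
m = [0.5, -0.5, 0.25, 0.25]  # filter dimension = u * hidden-state dim = 4
C = conv_max_pool(h, m, b=0.1, u=2)
```

In the full model, many such filters of various sizes each contribute one pooled feature C, and the concatenated features feed the fully-connected softmax layer.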
Our classification model achieves comparable in-domain accuracy and improved cross-domain accuracy over the state-of-the-art results reported in Danescu-Niculescu-Mizil et al. (2013) and Aubakirova and Bansal (2016). We will discuss these results in detail in Section 6.

4 Polite-Style Dialogue Models

In this section, we first describe our base dialogue model, i.e., the core (backbone) dialogue architecture upon which the three proposed politeness models are built, and then present these three models that can generate polite dialogue responses. As a parallel investigation on the performance of our proposed models, we also employ two retrieval-based polite dialogue models toward the end.

4.1 Base Seq2seq Dialogue Model

Our base dialogue model is a simple sequence-to-sequence (Seq2seq) model that consists of a two-layer bi-directional LSTM-RNN encoder to encode the conversation history turns, and a four-layer LSTM-RNN decoder to generate the response. Additive attention from the output of the encoder is applied to the last layer of the decoder. This architecture is almost identical to that proposed by Bahdanau et al. (2015), except with more layers (similar to Shao et al. (2017)). Our base dialogue model achieves perplexity and word error rate results on par with those reported for the popular hierarchical HRED models in Serban et al. (2016), thus serving as a good base model to incorporate style into. Details will be discussed in Section 6.

4.2 Fusion Model

Figure 2: Fusion model: the output probability distributions of the decoder and the polite-LM are linearly mixed to generate the final decoded outputs.

Inspired by the 'late fusion' approach in Venugopalan et al. (2016), our Fusion model (Fig. 2) combines the response generation decoder of the base Seq2seq dialogue model with a language model (polite-LM) trained exclusively on polite utterances.
These utterances are chosen by feeding the classifier all response utterances in the MovieTriples training set, and only keeping those with politeness scores greater than a certain threshold (set to 0.8 in our experiments, as will be discussed in Section 4.5). The polite-LM model is a two-layer LSTM-RNN based on Jozefowicz et al. (2016). During inference time, we used the language