title: Reinforced Rewards Framework for Text Style Transfer
authors: Sancheti, Abhilasha; Krishna, Kundan; Srinivasan, Balaji Vasan; Natarajan, Anandhavelu
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_36

Style transfer deals with algorithms that transfer the stylistic properties of a piece of text to those of another while ensuring that the core content is preserved. There has been a lot of interest in the field of text style transfer due to its wide applicability to tailored text generation. Existing works evaluate style transfer models based on content preservation and transfer strength. In this work, we propose a reinforcement learning based framework that directly rewards the model on these target metrics, yielding a better transfer of the target style. We show the improved performance of our proposed framework, based on automatic and human evaluation, on three independent tasks, wherein we transfer the style of text from formal to informal, high excitement to low excitement, and modern English to Shakespearean English, and vice versa in all three cases. Improved performance of the proposed framework over existing state-of-the-art frameworks indicates the viability of the approach.

Text style transfer deals with transforming a given piece of text in such a way that its stylistic properties change to those of the target text while preserving the core content. This is an active area of research because of its wide applicability in the field of content creation, including news rewriting, generating messages with a particular style to maintain the personality of a brand, etc. The stylistic properties may denote various linguistic phenomena, from syntactic changes [7, 23] to sentiment modifications [4, 10, 18] or the extent of formality in a sentence [16]. Most of the existing works in this area either use copy-enriched sequence-to-sequence models [7] or employ adversarial [4, 15, 18] or much simpler generative approaches [10] based on the disentanglement of style and content in text. On the other hand, more recent works like [19] and [3] perform the task of style transfer without disentangling style and content, as in practice this condition cannot always be met. However, all of these works use a word-level objective function (e.g., cross-entropy) during training, which is inconsistent with the desired metrics (content preservation and transfer strength) to be optimized in style transfer tasks. These metrics are generally calculated at the sentence level, and the use of word-level objective functions is not sufficient. Moreover, the discreteness of these metrics makes it even harder to directly optimize the model over them. Recent advancements in reinforcement learning and its effectiveness in various NLP tasks like sequence modelling [8], abstractive summarization [14], and machine translation [21] have motivated us to leverage reinforcement learning approaches in style transfer tasks. In this paper, we propose a reinforcement learning (RL) based framework which optimizes sequence-level objectives to perform text style transfer. Our reinforced rewards framework is based on a sequence-to-sequence model with attention [1, 12] and copy-mechanism [7] to perform the task of text style transfer.
The sentence generated by this model, along with the ground truth sentence, is passed to a content module and a style classifier, which calculate the metric scores used to obtain the reward values. These rewards are then propagated back to the sequence-to-sequence model in the form of loss terms. The rest of the paper is organized as follows: we discuss related work on text style transfer in Sect. 2. The proposed reinforced rewards framework is introduced in Sect. 3. We evaluate our framework and report the results on the formality transfer task in Sect. 4, on an affective dimension like excitement in Sect. 5, and on a Shakespearean-modern English corpus in Sect. 6. In Sect. 7, we discuss a few qualitative sample outputs. Finally, we conclude the paper in Sect. 8. Style transfer approaches can be broadly categorized as style transfer with a parallel corpus and style transfer with a non-parallel corpus. A parallel corpus consists of input-output sentence pairs with an explicit mapping between them. Since such corpora are not readily available and are difficult to curate, efforts here are limited. [23] introduced a parallel corpus of 30K sentence pairs to transfer Shakespearean English to modern English and benchmarked various phrase-based machine translation methods for this task. [7] use a copy-enriched sequence-to-sequence approach for Shakespearizing modern English and show that it outperforms the previous benchmarks by [23]. Recently, [16] introduced a parallel corpus of formal and informal sentences and benchmarked various neural frameworks to transfer sentences across different formality levels. Our approach contributes to this field of parallel style transfer and extends the work by [7] by directly optimizing the metrics used for evaluating style transfer tasks. Another class of explorations is in the area of non-parallel text style transfer [4, 10, 15, 18], which does not require a mapping between the input and output sentences. [4] compose a non-parallel dataset of paper-news titles and propose models to learn separate representations for style and content using adversarial frameworks. [18] assume a shared latent content distribution across a given corpus and propose a method that leverages refined alignment of latent representations to perform style transfer. [10] define style in terms of attributes (such as sentiment) localized to parts of the sentence and learn to disentangle style from content in an unsupervised setting. Although these approaches perform well on the transfer task, content preservation is generally observed to be low due to the non-parallel nature of the data. Along this line, parallel style transfer approaches have shown better performance in benchmarks despite the data curation challenges [16]. Style transfer models are primarily evaluated on content preservation and transfer strength, but existing approaches do not optimize for these metrics and instead teach the model to generate sentences that match the ground truth. This is partly because training relies on a differentiable objective, and the discreteness of these metrics makes them challenging to differentiate. Leveraging recent advancements in reinforcement learning, we propose a reinforcement learning based text style transfer framework which directly optimizes the model on the desired evaluation metrics.
Though there exists some prior work on reinforcement learning for machine translation [21], sequence modelling [8], and abstractive summarization [14] dealing with model optimization for qualitative metrics like ROUGE [11], these works do not consider style aspects, which are one of the main requirements of style transfer tasks. More recently, efforts [5, 22] have been made to incorporate RL in style transfer tasks in a non-parallel setup. Our work, however, is in the field of parallel text style transfer, which is not much explored. It differs from these related works in that we take care of content preservation and transfer strength with the use of a content module (to ensure content preservation) and a cooperative style discriminator (style classifier), without explicitly separating content and style. We illustrate the improvement in the performance of the framework on the task of transferring text between different levels of formality [16]. Furthermore, we demonstrate the generalizability of the proposed approach by evaluating it on a self-curated excitement corpus as well as a modern English to Shakespearean English corpus [7]. The proposed approach takes an input sentence x = x_1 ... x_l from source style s_1 and translates it to a sentence y = y_1 ... y_m with style s_2, where x and y are represented as sequences of words. If x is given by (c_1, s_1), where c_1 represents the content and s_1 the style of the source, our objective is to generate y = (c_1, s_2), which has the same content as the source but the target style. Our approach is based on a copy-enriched sequence-to-sequence framework [7] which allows the model to retain factual parts of the text while changing the style-specific text using an attention mechanism. At training time, the framework takes the source style sentence and the target style sentence as input to the attention-based sequence-to-sequence encoder-decoder model. The words in the input sentence are mapped into an embedding space, and the sentence is encoded into a latent space by the LSTM encoder. The network learns to pay attention to the words in the source sentence and creates a context vector based on the attention. The decoder is a mixture of an RNN and a pointer (PTR) network, where the RNN predicts the probability distribution over the vocabulary and the pointer network predicts the probability over the words in the input sentence based on the context vector. A weighted average of the two probabilities yields the final probability distribution at time step t, given by

p(y_t | y_{1:t-1}, x) = δ_t p_RNN(y_t | y_{1:t-1}, x) + (1 - δ_t) p_PTR(y_t | y_{1:t-1}, x)

where δ_t is computed based on the encoder outputs and previous decoder hidden states. The decoder generates the transferred sentence by selecting the most probable word at each time step. This model is trained to minimize the cross-entropy loss given by

L_ml = - Σ_{t=1}^{m} log p(y*_t | y*_{1:t-1}, x)

where m is the maximum length of the output sentence and y*_t is the ground truth word at time step t in the transferred sentence. While this framework optimizes for generating sentences close to the ground truth, it does not explicitly teach the network to preserve the content and generate sentences in the target style. To achieve this, we introduce a style classifier and a content module which take in the generated sentence from the sequence-to-sequence model along with the ground truth target sentence and provide a reward for the sentence, as shown in Fig. 1.
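As a minimal sketch of the decoding step just described, the snippet below mixes an illustrative vocabulary distribution and copy distribution with a gate δ and accumulates the word-level loss L_ml; the tensor shapes and function names are our own assumptions, not the authors' implementation.

import torch

def mixed_distribution(p_vocab, p_copy, delta):
    # Weighted average of the RNN (vocabulary) distribution and the
    # pointer (copy) distribution; delta lies in [0, 1].
    # p_vocab, p_copy: (batch, vocab_size); delta: (batch, 1).
    return delta * p_vocab + (1.0 - delta) * p_copy

def ml_loss(p_final, target_ids):
    # Word-level cross-entropy L_ml = -sum_t log p(y*_t | y*_{1:t-1}, x).
    # p_final: (batch, steps, vocab); target_ids: (batch, steps).
    log_p = torch.log(p_final.gather(2, target_ids.unsqueeze(2)).squeeze(2) + 1e-12)
    return -log_p.sum(dim=1).mean()

Greedy decoding then simply takes the argmax over the mixed distribution at each step.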
We leverage the BLEU [13] score to measure the reward for preserving content, and because of the lack of any formal score for transfer strength, we use a cooperative discriminator to score the generated sentence. This score from the discriminator is used as the reward for transfer strength. These rewards are then back-propagated as explicit loss terms to penalize the network for incorrect generation. To preserve the content while transferring the style, we leverage the Self-Critical Sequence Training (SCST) [17] approach and optimize the framework with BLEU scores as the reward. SCST is a policy gradient method for reinforcement learning and is used to train end-to-end models directly on non-differentiable metrics. We use the BLEU score as the reward for content preservation because it measures the overlap between the ground truth and the generated sentences. Teaching the network to favor this results in high overlap with the ground truth and subsequently preserves the content of the source sentence, since the ground truth ensures this preservation. We produce two output sentences y^s and y', where y^s is sampled from the distribution p(y^s_t | y^s_{1:t-1}, x) at each decoding time step and y' (the baseline output) is obtained by greedily maximizing the output distribution at each time step. The BLEU scores of the sampled and greedy sequences, computed against the ground truth, serve as the rewards, and the corresponding content-preservation loss is given by

L_cp = (r(y') - r(y^s)) Σ_{t=1}^{m} log p(y^s_t | y^s_{1:t-1}, x)

where the log term is the log-likelihood of the sampled sequence and the difference term is the difference between the rewards (BLEU scores) r(·) for the greedily decoded y' and the multinomially sampled y^s. Note that our formulation is flexible and does not require the metric to be differentiable, because the rewards are used as weights on the log-likelihood loss. Minimizing L_cp is equivalent to encouraging the model to generate sentences which have a higher reward than the baseline y', thus increasing the reward expectation of the model. The framework can now be trained end to end using this loss function along with the cross-entropy loss to preserve the content of the source sentence in the transferred sentence. To optimize the model to generate sentences which belong to the target style, it is possible to use a similar loss function with the SCST framework [17]. However, that would require a formal measure of the target style aspect. Here, we present an alternate formulation for the case where such a formal measure is not readily available. We train a convolutional neural network based style classifier, as proposed by [9], on the training dataset. This style classifier predicts the likelihood that an input sentence is in the target style; this likelihood is taken as a proxy reward for the style of a sentence and appended to a discriminator-based loss function extended from [6]. Based on the transfer direction, we add the following term to the cross-entropy loss:

L_ts = - log(1 - s(y'))   (high to low level)
L_ts = - log(s(y'))       (low to high level)

In this formulation, y' is the greedily generated output from the decoder and s(y') is the likelihood score predicted by the classifier for y'. When the transfer is from a high to a low level of the style, minimizing L_ts encourages the generation of sentences for which the classifier score is as low as possible. When sentences are transferred from a low to a high level of the style, the formulation ensures that the generated sentences have a score as high as possible.
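A minimal sketch of the two reward-based losses, assuming sentence-level BLEU from NLTK as the content reward and a scalar classifier likelihood for style; the function names and the smoothing choice are our assumptions, not the paper's code.

import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method1

def content_loss(log_p_sampled, sampled, greedy, reference):
    # Self-critical loss L_cp = (r(y') - r(y^s)) * sum_t log p(y^s_t | ...).
    # log_p_sampled: scalar tensor carrying gradients; sampled, greedy,
    # reference: token lists. The rewards only weight the log-likelihood,
    # so BLEU itself never needs to be differentiable.
    r_sampled = sentence_bleu([reference], sampled, smoothing_function=_smooth)
    r_greedy = sentence_bleu([reference], greedy, smoothing_function=_smooth)
    return (r_greedy - r_sampled) * log_p_sampled

def style_loss(s_y, high_to_low):
    # Discriminator-based loss L_ts; s_y is the classifier likelihood s(y')
    # that the greedy output y' belongs to the high level of the style.
    eps = 1e-12
    return -torch.log((1.0 - s_y if high_to_low else s_y) + eps)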
The framework is trained end-to-end using this loss function to generate sentences which belong to the target style. The overall loss function can thus be written as a combination of the three loss terms:

L = α L_ml + β L_cp + γ L_ts

We train various models using this loss function and different training methodologies (setting α = 1.0, β = 0.125, γ = 1.0 after hyper-parameter tuning), as described in the next section. During the inference phase, the model predicts a probability distribution over the vocabulary based on the sentence generated so far, and the word with the highest probability is chosen as the next word until the maximum length of the output sentence is reached. Note that, unlike the training phase, in which both the input and the ground truth transferred sentences are available to the model, only the input sentence is made available at inference. We evaluate the proposed approach on the GYAFC [16] dataset, which is a parallel corpus of formal-informal text. We present the transfer task results in both directions: formal to informal and vice versa. This dataset (from the Entertainment and Music domain) consists of ∼56K informal-formal sentence pairs: ∼52K in the train, ∼1.5K in the test, and ∼2.5K in the validation split. We use both human and automatic evaluation measures for content preservation and transfer strength to illustrate the performance of the proposed approach. Content preservation measures the degree to which the target style outputs have the same meaning as the input style sentence. Following [16], we measure preservation of content using the BLEU [13] score between the ground truth and the generated sentence, since the ground truth ensures that the content of the source style sentence is preserved. For human evaluation, we presented 50 randomly selected model outputs to Mechanical Turk annotators and requested them to rate the outputs on a 6-point Likert [2] scale as described in [16]. Transfer strength measures the degree to which the style transfer was carried out. We reuse the classifiers that we built to provide rewards to the generated sentences (Sect. 3.2). A classifier score above 0.5 indicates that the generated sentence belongs to the target style, and to the source style otherwise. We define accuracy as the fraction of generated sentences which are classified to be in the target style; the higher the accuracy, the higher the transfer strength. For human evaluation, we ask the Mechanical Turk annotators to rate the generated sentences on a 5-point Likert scale as described in [16]. Following [4], who illustrate the trade-off between the two metrics (content preservation and transfer strength), we combine the two evaluation measures and present an overall score for the transfer task, since both measures are central to different aspects of the text style transfer task. The trade-off arises because the best content preservation can be achieved by simply copying the source sentence; however, the transfer strength in that scenario will be the worst. We compute the overall score as

Overall = (BLEU × Accuracy) / (BLEU + Accuracy)

which is similar to the F1-score, since content preservation can be considered as measuring the recall of the amount of source content retained in the target style sentence, and transfer strength acts as a measure of the precision with which the transfer task is carried out. In the above formulation, both BLEU and accuracy scores are normalized to be between 0 and 1.
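For concreteness, a small sketch of the overall score; up to the missing factor of 2 it is the F1-style harmonic mean of the two normalized metrics. The function name is illustrative.

def overall_score(bleu, accuracy):
    # Combines content preservation (BLEU, recall-like) and transfer
    # strength (classifier accuracy, precision-like); both in [0, 1].
    if bleu + accuracy == 0.0:
        return 0.0
    return (bleu * accuracy) / (bleu + accuracy)

# Example: overall_score(0.30, 0.80) ≈ 0.218.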
We first ran an ablation study to demonstrate the improvement in the performance of the model with the introduction of the two loss terms, across settings that differ in the way training is carried out. Below we provide details of each setting.

CopyNMT: trained with L_ml.
TS: trained with L_ml, followed by α L_ml + γ L_ts.
CP: trained with L_ml, followed by α L_ml + β L_cp.
TS+CP: trained with L_ml, followed by α L_ml + β L_cp + γ L_ts.
TS→CP: trained with L_ml, followed by α L_ml + γ L_ts, and finally with α L_ml + β L_cp.
CP→TS: trained with L_ml, followed by α L_ml + β L_cp, and finally with α L_ml + γ L_ts.

Table 1. Ablation study demonstrating the improvement from adding the loss terms on the formality transfer task.

Training with L_ml alone in all of the above settings is done for 10 epochs, with all the hyper-parameters set to the defaults in the off-the-shelf implementation of [7]. Each subsequent stage of the iterative training starts from the model with the best performance on the validation set and runs for 5 more epochs. We can observe from Table 1 that L_ts and L_cp help in improving the accuracy, which measures transfer strength (TS), and the BLEU score, which measures content preservation (CP), respectively, as compared to CopyNMT. When all three loss terms are used simultaneously (TS+CP), the resulting performance lies between TS and CP, indicating that there is a trade-off between the two metrics and that improvement in one metric comes at the cost of the other, as observed by [4]. This phenomenon is also evident from the results of TS→CP and CP→TS, where the network becomes slightly biased towards the latter optimization. Moreover, the improvement of CP→TS and TS→CP over TS and CP, respectively, suggests that incremental training helps teach the framework better. Since performance on both the transfer strength and content preservation metrics plays an important role in the text style transfer task, we chose TS→CP, which has the maximum overall score, over the other models for further analysis. We compare the proposed approach TS→CP against the state-of-the-art cross-aligned autoencoder style transfer approach (Cross-Aligned) by [18] and a Transformer [20] baseline.

Table 2. Comparison of TS→CP with the baselines on the three transfer tasks in both directions. All the scores are normalized to be between 0 and 1.

It can be seen from Table 2 that even though the Transformer model has the best accuracy, it fails to preserve the content. A closer look at the outputs (formal to informal transfer task in Table 4) reveals that it generates sentences in the target style, but the sentences do not preserve the meaning of the input and are sometimes out of context (discussed in Sect. 7). Cross-Aligned performs the worst among all the approaches on the informal to formal transfer task because it generates many unknown tokens and is not able to preserve content. TS→CP, on the other hand, has the highest overall score and performs the best at preserving the content. We also observed that the dataset had many sentences containing proper nouns, such as the names of songs, people, or artists. In such cases, the copy mechanism helps in retaining the proper nouns, whereas the other models are not able to do so. This is evident from the higher BLEU scores for our proposed model. Table 3 presents the human evaluation results aggregated over three annotators per sample.
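The staged recipes above can be summarized as data, as in the sketch below for the best-performing TS→CP setting; the epoch counts and weights are those reported here, while the schedule structure and helper function are our own illustration.

ALPHA, BETA, GAMMA = 1.0, 0.125, 1.0

# Each stage starts from the best validation checkpoint of the previous one.
TS_THEN_CP_STAGES = [
    {"epochs": 10, "weights": {"ml": 1.0}},                 # L_ml pretraining
    {"epochs": 5,  "weights": {"ml": ALPHA, "ts": GAMMA}},  # add style reward
    {"epochs": 5,  "weights": {"ml": ALPHA, "cp": BETA}},   # add content reward
]

def combined_loss(losses, weights):
    # Weighted sum of the loss terms active in the current stage, e.g.
    # combined_loss({"ml": l_ml, "ts": l_ts}, {"ml": ALPHA, "ts": GAMMA}).
    return sum(weights[k] * losses[k] for k in weights)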
From Table 3, it can be seen that in at least 70% of the cases, annotators rated model outputs from TS→CP as better than the three baselines on both evaluated metrics, except for content preservation as compared to CopyNMT on the formal to informal task, wherein both models perform equally well. One reason for this is that both models use the copy-mechanism.

Table 3. Human evaluation results on 50 randomly selected model outputs. The values represent the % of times annotators rated model outputs from TS→CP (R) as better than the baselines CopyNMT (C), Transformer (T), and Cross-Aligned (S) over the metrics. I-F (E-NE) refers to the informal to formal (exciting to non-exciting) task.

To demonstrate the generalizability of our approach on an affective style dimension like excitement (the feeling of enthusiasm and eagerness), we curated our own dataset using reviews from the Yelp dataset, which is a subset of Yelp's businesses, reviews, and user data. We requested human annotators to provide rewrites of given exciting sentences such that they sound as non-exciting/boring as possible. Reviews with a rating greater than or equal to 3 were filtered and considered as exciting, in order to obtain the non-exciting/boring rewrites. We also asked the annotators to rate the given and transferred sentences on a Likert scale from 1 (no excitement at all) to 5 (very high excitement). The dataset thus curated was split into train (∼36K), test (1K), and validation (2K) sets. We evaluate the transfer quality on the content preservation and transfer strength metrics as defined in Sect. 4. For measuring the transfer strength, we train a classifier as described in Sect. 3.2, using the annotations provided by the human annotators to obtain labels for the two styles: sentences with a rating greater than or equal to 3 were considered exciting, and non-exciting otherwise. The transfer task in this case is to convert an input sentence with high excitement (exciting) into a sentence with low excitement (non-exciting) and vice versa. We can observe from Table 2 that model performance on the excitement transfer task is similar to what we observed on the formality transfer task. However, CopyNMT performs best at transferring style on the non-exciting to exciting task because the model has picked up on expressive words ('awesome', 'great', and 'amazing') which help boost the transfer strength. TS→CP (with the highest overall score) consistently outperforms Cross-Aligned on all metrics and in both directions. Table 3 presents the human evaluation results on this transfer task. We notice that humans preferred outputs from our proposed model at least 60% of the time on both measures as compared to the other three baselines. This provides evidence that the proposed RL-based framework indeed helps in generating more content-preserving sentences that align with the target style. Beyond affective style dimensions, our approach can also be extended to other style transfer tasks, such as converting modern English to Shakespearean English. To illustrate the performance of our model on this task, we experimented with the corpus used in [7]. The dataset consists of ∼21K modern-Shakespearean English sentence pairs, with ∼18K in the train, ∼1.5K in the test, and ∼1.2K in the validation split. We use the same evaluation measures as in the previous two tasks to illustrate the model's performance and the generalizability of the approach.
For this task we present only the automatic evaluation results, because manual evaluation is not easy: it requires an understanding of Shakespearean English, and finding such a population is difficult due to limited availability. We can observe from Table 2 that model performance on this transfer task is also similar to what we observed on the earlier two transfer tasks. Although Cross-Aligned has better accuracy than TS→CP, it fails to preserve the content (sample 3 of Table 6). The same holds for the Transformer, which outperforms the others in accuracy but is not able to retain the content (sample 1 of Table 6). TS→CP outperforms the three baselines in preserving the content, with the highest overall score. This establishes the viability of our approach for various types of text style transfer tasks. These experiments further indicate that our proposed reinforcement learning framework improves the transfer strength and content preservation of parallel style transfer frameworks and also generalizes across various stylistic expressions. In this section, we provide a few qualitative samples from the baselines and the proposed reinforcement learning based model. We can observe from the Transformer model outputs for inputs 1 and 2 in the formal to informal column of Table 4 that it generates sentences with the correct target style but does not preserve the content; it either adds random content or deletes required content ('band' instead of 'better' in 1 and 'hot' instead of 'talented' in 2). As mentioned earlier, in sample output 3 of Table 4, Cross-Aligned is unable to retain the content and tends to generate unknown tokens. CopyNMT, even though it is able to preserve content, tends to generate repeated tokens, like 'please' in sample input 2 (informal to formal task), which results in a lower BLEU score than our proposed approach. The Transformer model outputs for the exciting to non-exciting task in samples 1 and 2 of Table 5 miss specific content words like 'environment' and 'alisha', respectively, although it is able to generate the sentences in the target style. Similarly, Cross-Aligned and CopyNMT are also not able to retain the name of the server in sample 2 of Table 5. Sample 2 of the Shakespearean to modern English task and sample 1 of the modern to Shakespearean English task in Table 6 provide evidence of the high accuracy but lower BLEU scores of the Transformer model. From sample 2 of the Shakespearean to modern English transfer task, we can observe that Cross-Aligned, although it can generate the sentence in the target style, is not able to preserve entities like 'father' and 'child'. On the other hand, TS→CP can not only generate sentences in the target style but is also able to retain the entities. There are a few cases where CopyNMT is better at preserving the content as compared to the other models, for instance, sample 1 of the formal to informal transfer task and sample 3 of the non-exciting to exciting transfer task, since it leverages the copy-mechanism. Another point to notice is the lexical-level changes made to reflect the target style, for example, the use of 'would', 'don't', and 'inform' instead of 'want', 'dono', and 'let me know', respectively, for transforming informal sentences into formal ones. The use of colloquial words like 'u', 'gonna', and 'mama' for converting formal sentences to informal ones can be observed in the sample outputs.
Table 5. Sample model outputs and target style references for the exciting to non-exciting and non-exciting to exciting style transfer tasks. The first line is the source style sentence (input), the second line is the reference output, and the following lines correspond to the outputs from the baselines and the RL-based model.

                Exciting to Non-exciting    Non-exciting to Exciting
Cross-Aligned   The patio is pretty good    Awesome food and great selection of music and music
                The patio is good           Great food and great drinks and live music
TS→CP           The patio is good           Great food, great beers, and great music

Not only lexical-level changes but also structural transformations can be observed, as in 'Please inform me if you find out'. In the case of the excitement transfer task, the use of strong expressive words like 'amazing' and 'great' makes a sentence sound more exciting, while less expressive words such as 'okay' and 'good' make it less exciting. 'Thou' for 'you' and 'hither' for 'here' are more frequently used in Shakespearean English than in modern English. These sample outputs indeed provide evidence that our model is able to learn these lexical- and structural-level differences across various transfer tasks, be it formality, affective dimensions, or beyond. The primary contribution of this work is a reinforced-rewards-based sequence-to-sequence model which explicitly optimizes the content preservation and transfer strength metrics for style transfer with a parallel corpus. Initial results are promising and generalize to other stylistic characteristics, as illustrated in our experimental sections. Leveraging this approach to simultaneously change multiple stylistic properties (e.g., high excitement and low formality) is a subject of further research.

References

1. Neural machine translation by jointly learning to align and translate
2. Likert scales
3. Style transformer: unpaired text style transfer without disentangled latent representation
4. Style transfer in text: exploration and evaluation
5. Reinforcement learning based text style transfer without parallel training corpus
6. Learning to write with cooperative discriminators
7. Shakespearizing modern language using copy-enriched sequence to sequence models
8. Deep reinforcement learning for sequence to sequence models
9. Convolutional neural networks for sentence classification
10. Delete, retrieve, generate: a simple approach to sentiment and style transfer
11. ROUGE: a package for automatic evaluation of summaries. Text Summarization Branches Out
12. Effective approaches to attention-based neural machine translation
13. BLEU: a method for automatic evaluation of machine translation
14. A deep reinforced model for abstractive summarization
15. Style transfer through back-translation
16. Dear sir or madam, may I introduce the GYAFC dataset: corpus, benchmarks and metrics for formality style transfer
17. Self-critical sequence training for image captioning
18. Style transfer from non-parallel text by cross-alignment
19. Multiple-attribute text style transfer
20. Attention is all you need
21. A study of reinforcement learning for neural machine translation
22. Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach
23. Paraphrasing for style