© Copyright 2018
Aaron Jaech

Low-Rank RNN Adaptation for Context-Aware Language Modeling

Aaron Jaech

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

University of Washington
2018

Reading Committee:
Mari Ostendorf, Chair
Hannaneh Hajishirzi
Noah Smith

Program Authorized to Offer Degree: Electrical Engineering

University of Washington

Abstract

Low-Rank RNN Adaptation for Context-Aware Language Modeling

Aaron Jaech

Chair of the Supervisory Committee:
Professor Mari Ostendorf
Electrical Engineering

A long-standing weakness of statistical language models is that their performance drastically degrades if they are used on data that varies even slightly from the data on which they were trained. In practice, applications require the use of adaptation methods to adjust the predictions of the model to match the local context. For instance, in a speech recognition application, a single static language model would not be able to handle all the different ways that people speak to their voice assistants, such as selecting music or sending a message to a friend. An adapted model would condition its predictions on knowledge of who is speaking and what task they are trying to do.

The current standard approach to recurrent neural network language model adaptation is to apply a simple linear shift to the recurrent and/or output layer bias vector. Although this is helpful, it does not go far enough. This thesis introduces a new approach to adaptation, which we call the FactorCell, that generates a custom recurrent network for each context by applying a low-rank transformation. The FactorCell allows for a more substantial change to the recurrent layer weights. Unlike previous approaches, the introduction of a rank hyperparameter gives control over how different or similar the adapted models should be.

In our experiments on several different datasets and multiple types of context, the increased adaptation of the recurrent layer is always helpful, as measured by perplexity, the standard for evaluating language models. We also demonstrate impact on two applications, personalized query completion and context-specific text generation, finding that the enhanced adaptation benefits both. We also show that the FactorCell provides a more effective text classification model, but more importantly the classification results reveal that there are important differences between the models that are not captured by perplexity. The classification metric is particularly important for the text generation application.

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
 1.1 Overview
 1.2 Key Contributions
 1.3 Thesis Overview

Chapter 2: Background
 2.1 General Language Modeling Background
 2.2 Neural Language Modeling
 2.3 Language Model Adaptation
Chapter 3: Exploring Context-Aware RNNs
 3.1 Additive Bias Adaptation
 3.2 Models
 3.3 Data
 3.4 Experiments
 3.5 Comparison to Related Work
 3.6 Summary

Chapter 4: Factor Cell Model
 4.1 FactorCell Model
 4.2 Data
 4.3 Experiments with Different Contexts
 4.4 Analysis for Sparse Contexts
 4.5 Comparison to Related Work
 4.6 Summary

Chapter 5: Personalized Query Auto-Completion
 5.1 Background
 5.2 Model
 5.3 Data
 5.4 Experiments
 5.5 Summary

Chapter 6: Context-Specific Text Generation
 6.1 Text Generation
 6.2 Illustrative Examples
 6.3 Experiments and Analysis
 6.4 Summary

Chapter 7: Conclusions and Future Directions
 7.1 Summary of Contributions
 7.2 Future Directions

Appendix A: Sparse Corrections for Output Layer Adaptation
 A.1 Sparse plus low-rank softmax bias adaptation
 A.2 L1 Penalty for Bias Layer Fine-Tuning
 A.3 Summary

LIST OF FIGURES

1.1 Usage of the terms "Black Friday" and "Super Bowl" on Reddit during an eight-year time period.
2.1 Vocabulary size in large vocabulary continuous speech recognition systems over time.
3.1 The value of the dimension of the LSTM hidden state in an unadapted model that is the strongest indicator for Spanish text, for three different code-switched Tweets.
4.1 Illustration of the FactorCell architecture.
4.2 Accuracy vs. perplexity for different classes of models on the four word-based datasets.
4.3 Log likelihood ratio between a model that assumes a 5-star review and the same model that assumes a 1-star review. Blue indicates a higher 5-star likelihood and red indicates a higher likelihood for the 1-star condition.
4.4 Accuracy vs. perplexity for different classes of models on the two character-based datasets.
4.5 Comparison of the effect of LSTM parameter count and FactorCell rank hyperparameters on perplexity for DBPedia.
4.6 Distribution of a PCA projection of hotel embeddings from the TripAdvisor FactorCell model showing the grouping of the hotels by city.
4.7 Distribution of a PCA projection of the hotel embeddings from the TripAdvisor FactorCell model showing the grouping of hotels by class.
5.1 Perplexity versus MRR on the development data for different classes of models.
5.2 Relative improvement in MRR over the unpersonalized model versus queries seen, using the large size models. The plot uses a moving average of width 9 to reduce noise.
5.3 MRR by prefix and query lengths for the large FactorCell and unadapted models with the first 50 queries per user excluded.
6.1 Context classification accuracy versus generation context-specificity for each type of adaptation on the Yelp data.
6.2 Plot of FactorCell rank and perplexity against generation context-specificity accuracy for 14 FactorCell models on the Yelp restaurant data.
6.3 Context-specificity of hotel class versus FactorCell rank and perplexity in generated reviews using the models learned on the TripAdvisor data.
6.4 Context classification accuracy versus generation context-specificity for each type of adaptation on the TripAdvisor data.

LIST OF TABLES

1.1 Example use cases for language model adaptation.
2.1 Approaches to RNN language model adaptation in prior work. The X indicates the use of the ConcatCell (CC) or the SoftmaxBias (SB) adaptation strategy.
3.1 Number of sentences, vocabulary size, and context variables for the three corpora.
3.2 Summary of Key Hyperparameters.
3.3 Perplexities and Classification Avg. AUCs for Reddit Models.
3.4 Nearest neighbors to selected subreddits in the context embedding space.
3.5 Comparison of perplexities per subreddit.
3.6 Results on Twitter data.
3.7 Results on the SCOTUS data in terms of perplexity and classification accuracy (ACC) for the justice identification task.
3.8 Perplexities for different combinations of context variables on the SCOTUS corpus.
3.9 Sentences generated from the adapted model using beam search under different assumptions for speaker and role contexts.
4.1 Dataset statistics: dataset size in words (* or characters) of the Train, Dev, and Test sets, vocabulary size, number of training documents, and context variables.
4.2 Selected hyperparameters for each dataset. When a range is listed, it means that different values were selected for the FactorCell, ConcatCell, SoftmaxBias, or Unadapted models.
4.3 Perplexity and classification accuracy on the test set for the four word-based datasets.
4.4 Perplexity and classification accuracies for the EuroTwitter and GeoTwitter datasets.
4.5 The top boosted words in the Softmax bias layer for different context settings in a FactorCell model.
5.1 Top five completions for the prefix ba for a cold start model with no previous queries from that user and a warm model that has seen the queries espn, sports news, nascar, yankees, and nba.
5.2 MRR reported for seen and unseen prefixes for small (S) and big (B) models.
5.3 The five queries that have the greatest adapted vs. unadapted likelihood ratio after searching for "high school softball" and "math homework help".
5.4 The five queries that have the greatest adapted vs. unadapted likelihood ratio after searching for "prada handbags" and "versace eyewear".
5.5 The five queries that have the greatest adapted vs. unadapted likelihood ratio after searching for "discount flights" and "yellowstone vacation packages".
6.1 Top completions for the sentence "My boyfriend and I ate here and !" after conditioning on each star rating.
6.2 Top completions for the sentence "This was my first time coming here and the food was " after conditioning on each star rating.
6.3 Top completions for the sentence "I will again" after conditioning on each star rating.
6.4 Automatically judged generation accuracy, mean absolute deviation (MAD), and perplexity for the three methods of adaptation compared to the unadapted baseline using the models learned from the Yelp data.
6.5 Automatically judged generation accuracy, mean absolute deviation (MAD), and perplexity for the three methods of adaptation compared to the unadapted baseline using the models learned from the TripAdvisor data.
A.1 Perplexity on the validation set of models with no adaptation and varying softmax adaptation strategies. Results are not comparable to those in Chapter 4 because of a difference in vocabulary size.

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Prof. Mari Ostendorf, for her excellent mentorship and for her constant encouragement, patience, and support. I have had the pleasure of working with her on many projects during the last five years, and in each case her deep experience has been a valuable asset. I would also like to thank the other members of my thesis committee, Hannaneh Hajishirzi and Noah Smith, with whom I have had the pleasure of collaborating.

My peers and colleagues at the University of Washington have played a key role in my graduate experience.
My interactions with them have been the most enjoyable part of my graduate education. I would like to thank Ji He, Hao Fang, Vicky Zayats, Yi Luan, Hao Cheng, Trang Tran, Kevin Lybarger, Farah Nadeem, Rik Koncel-Kedziorski, and Shobhit Hathi. I am very fortunate to have been able to work alongside these students. I thank Arjun Sondhi for the many fruitful discussions.

I am grateful to the mentors I had during my internships, including Henry Schneiderman, Larry Heck, Eric Ringger, and Hetunandan Kamisetty. Each of them taught me important research skills and changed my perspective on the field. This thesis would not have been possible without the experience I gained from working with them.

I thank my parents Jeff and Rebecca Jaech, who have given me their unwavering support, unconditional love, and constant encouragement during all my many years of schooling.

Lastly, I acknowledge my lovely girlfriend Gayoung Park, who understands the struggles of a Ph.D. student and who has spent many long days and late nights working by my side. I thank her for her help and for reminding me to always have fun.

Chapter 1
INTRODUCTION

1.1 Overview

Language is a highly adaptable means of communication, and as humans we routinely vary our usage of it to match our environment. Changes in language from one setting to another can include large differences in topic, register, politeness, etc., depending on a host of variables such as the social context, the medium of communication, the time and location, and the task at hand. If statistical language models can be made to mimic this contextual adaptability, then they will be useful in a wider range of applications, including speech recognition, machine translation, abstractive summarization, text generation, and more.

Without context awareness, static models are incredibly brittle, meaning they are "extremely sensitive to changes in the style, topic, or genre" (Rosenfeld, 2000). Performance drastically degrades when a model is used in a context other than the one for which it was trained. An early study looked at incorporating text from the Associated Press for modeling contemporaneous articles from the Wall Street Journal (Rosenfeld, 1996). Even though the difference in style between two American news organizations is relatively minor compared to the variations in language that are likely to be observed in other applications, the Associated Press text was practically useless for this task.

Unfortunately, it is often the case that there is a limited amount of text available for learning a language model to characterize a particular context. Therefore, researchers have long sought ways to more effectively use different sources of data. This area of research is often referred to as domain adaptation, characterized by making use of data from multiple contexts to target a particular context. Domain adaptation has been explored for a wide variety of contexts, as shown in Table 1.1.

Figure 1.1: Usage of the terms "Black Friday" and "Super Bowl" on Reddit during an eight-year time period.

As an illustration of the importance of context, Figure 1.1 shows the bursty variation of the frequency of two terms in eight years of text from Reddit. Mentions of these two events increase in likelihood by several orders of magnitude during predictable but narrow windows of time. This illustrates both why models that ignore context are brittle and why there is so much to be gained from adaptation.
Context is often represented by partitioning data into in-domain and out-of-domain sets. The in-domain data is considered to be sampled from the same or a similar distribution as the evaluation data, and the out-of-domain data is everything else. A simple adaptation method is to build separate models for the in-domain corpus and out-of-domain corpora and then choose the interpolation weights to give the best performance on the in-domain data. There can be multiple out-of-domain corpora that are treated separately, but typically no attempt is made to model the similarities or differences between domains/contexts. The problem with this adaptation approach is that the notion of discrete domains is much too crude when the goal is to mimic the fine-grained contextual adaptations that humans typically employ.

Category | Purpose
Topic | Adjust to variations in topic across documents or in speech. This is the most popular use case for adaptation techniques.
Temporal | Changing language usage patterns over time (Rosenfeld, 1995; Osborne et al., 2014; Yogatama et al., 2014)
Geographic | Variations in speaking style in different geographic regions (Eisenstein et al., 2010; Chelba et al., 2015; Halpern et al., 2016)
Modality | Adapt a model trained on written text for use in conversational speech (Bulyko et al., 2003; Jaech and Ostendorf, 2015; Mendels et al., 2015)
Language | Share information between similar languages (Ragni et al., 2016; Östling and Tiedemann, 2017)
TV Programs & YouTube Channels | Adapt to styles and topics of different television shows (Chen et al., 2015; Deena et al., 2016)
Lectures, Talks, & Meetings | Use text from slides, lecture titles, and other written materials to bias the language model in speech recognition (Schwarm et al., 2004; Glass et al., 2007; Hsu and Glass, 2008; Hoang et al., 2016)
Dialog State | Generate an appropriate response given the dialog state (Riccardi and Gorin, 2000; Liu and Lane, 2017)
Personalization | Match the model predictions to the style of each individual from a large group (Tseng et al., 2015; Li et al., 2016)

Table 1.1: Example use cases for language model adaptation

In the newspaper example, instead of relying solely on a single binary variable indicating domain membership (AP vs. WSJ), additional contextual variables could have been used. For example, we could have a variable to indicate which section of the paper the article appeared in, or who the author was, or the date of publication. In general, the context representation can leverage multiple discrete and continuous variables; the more expressive the contextual representation, the greater the ability of the model to adapt to it. Ideally, a context-aware language model could model text using the style of one newspaper and the topic of a different publication. While researchers have tried to do this with class language models, equating word class sequences with style, many real-world applications are best characterized by interacting combinations of several contextual factors. Therefore, this thesis investigates models that can use rich context representations.

Language is often produced in association with some information about its context. For example, in speech recognition, if a user is speaking to a personal assistant then the system might know the time of day or the identity of the task that the user is trying to accomplish.
If the user takes a picture of a sign to translate it with their smart phone, the system would have contextual information related to the geographic location and the user’s preferred language. The probability of certain terms and phrases appearing can change dramatically with respect to geographic location (Chelba and Shazeer, 2015). When adapting to context information, the language model conditions on both the con- text and the previous words in the sequence. It is computing P(w1:n|context). The mech- anism for computing P(w1:n) impacts the approach for accounting for context. Recurrent neural networks (RNNs) have been shown to be very effective language models compared to previous approaches such as n-gram models or maximum entropy models (Mikolov et al., 2010), in part because they are able to make use of arbitrarily long word histories. RNN based language models are the focus of this thesis due to their recent successes and current widespread use. Improvements in adapting RNNs are likely to have a positive impact on multiple tasks. In addition, the continuous-space approach is well-suited to characterizing multiple contextual variables. 5 The standard method of adapting RNN language models is due to Mikolov and Zweig (2012) and involves learning an embedding to represent the context (originally the output of a topic model but any type of learned embedding will work) and including it via concatenation as an additional input to the model. We refer to this method of adapting the recurrent layer as the ConcatCell because of its reliance on concatenation of the context embedding. As we will show later on, when using this adaptation method most of the parameters of the model are static with respect to context. We propose a more powerful mechanism for using a context vector, which we call the FactorCell. Rather than simply using context as an additional input, it is used to control a factored (low-rank) transformation of the recurrent layer weight matrix. The motivation is that allowing a greater fraction of the model parameters to be adjusted in response to the input context will produce a model that is more adaptable and responsive to that context. In addition, we introduce a mechanism for handling new contexts that emerge after the model has been trained and deployed, showing how adaptation can be effective in these scenarios as well. 1.2 Key Contributions The primary goal of this work is to increase the adaptability of recurrent neural network language models. Our main contribution is to propose a novel mechanism, which we call the FactorCell, for using a context embedding vector to transform the weights of the recur- rent layer. This is a fundamentally different way of adapting the recurrent layer. Instead of viewing context as an additional input to the RNN, we create a function that uses con- text information to output the weights of a custom RNN that matches the given context. Moreover, by using precomputation and caching techniques, the FactorCell delivers superior adaptation at little to no extra computational cost. We demonstrate the superiority of the FactorCell model over commonly used methods for RNN adaptation, including several recent approaches that do not adapt at the recurrent layer, preferring to focus on the output bias vector. Experiments on nine datasets with varying domains, contexts, and model and vocabulary sizes confirms that adapting the recurrent layer 6 always helps. We also show that our model beats the current standard method of adapting the recurrent layer. 
The extensive experimentation leads to some useful observations for predicting when context conditioning will be more or less successful for a given dataset. The many prior use cases for adaptation mentioned in Section 1.1 all have in common the requirement that all contexts be known during training. Another contribution is to introduce an online learning method for adapting to contexts that emerge after the model has been deployed. Online learning improves the quality of the adaptation and also widens the set of possible applications. We use a personalized query auto-completion task to demonstrate how this method can be successfully used. Again, our FactorCell model beats the standard approach to adaptation and the performance gap widens as more data becomes available over time. Adaptation is vital for language models to work in real world applications. We show how the FactorCell model impacts multiple applications and discuss potential implications for many more. The first application is the just mentioned personalized query completion. Another important application is context-specific or controllable text generation. We show how the FactorCell model is significantly more controllable than the standard adaptation methods. The gap between our model and the standard approach is not easily closed even when we give the baseline an advantage by doubling the dimensionality of its recurrent layer. Automatic evaluation of context-specific text generation can be difficult. We propose a metric based on text classification that is predictive of human ratings for text generation performance and avoids many of the headaches of prior evaluation techniques. For both the query completion and text generation applications we show through analysis that the FactorCell model has qualitative benefits that are not fully captured by perplexity as an evaluation criterion. 1.3 Thesis Overview The remainder of the thesis proceeds as follows. In Chapter 2, we provide background information on language modeling and adaptation. 7 We review relevant prior work in these areas including the ConcatCell and SoftmaxBias models from Mikolov and Zweig (2012) that serve as the principal baselines for the rest of the thesis. In Chapter 3, we show that the ConcatCell model is effectively a constant additive bias in the recurrent layer. We then explore trade-offs of different architectures for adapting at the recurrent layer and the output layer. We make some observations about what factors make a dataset more or less amenable to adaptation. In Chapter 4, we introduce the FactorCell, a more powerful model for RNN adaptation. By controlling the rank of a low-rank context-dependent weight transformation, the Fac- torCell can be adjusted to allow for more or less sharing of information between contexts depending on the situation. This model remedies the prominent weakness of the Concat- Cell, namely, that it often does not allow its predictions to be changed enough in response to context. We show that the FactorCell beats the ConcatCell in terms of perplexity and that it also does better at capturing the relationship between context and language in text classification experiments. In Chapter 5, we introduce an approach for adapting to new contexts that emerge after the model has been trained and deployed, and apply the FactorCell model to the task of personalized query auto-completion. 
The key result is that stronger adaptability at the recurrent layer enables the model to better take advantage of information from new users' query histories to personalize their predictions.

Chapter 6 deals with the use of language model adaptation for context-specific text generation. We measure context-specificity by checking whether the text sampled from or generated by the model matches the properties of the specified context. We show that when the adaptation is weaker (as in the ConcatCell) then context-specificity suffers; the FactorCell model gives clear wins for context-specificity. We propose a metric based on text classification that is predictive of human ratings for text generation performance and avoids many of the headaches of prior evaluation techniques.

Finally, Chapter 7 concludes the thesis by summarizing the contributions and suggesting future directions for additional research.

Chapter 2
BACKGROUND

This chapter gives the necessary background information on language modeling and prior literature on adaptation. The contributions of the thesis build on these ideas.

2.1 General Language Modeling Background

Language models compute a probability distribution over word sequences $w_{1:n}$ where each $w_i$ is drawn from a vocabulary $V$. Typically, the probability is factored using the chain rule: $P(w_{1:n}) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{1:i-1})$. The $w_{1:i-1}$ term is often referred to as the history.

Dealing with data sparseness is fundamental to language modeling because there will always be many valid word sequences that are not observed in the training data. One way to categorize language models is to look at how they deal with the sparseness or generalization problem: n-gram models use a back-off strategy, maximum entropy models rely on regularization techniques, and the different classes of neural networks generalize by finding low-dimensional continuous-space representations of language.

In the basic n-gram language model only the most recent $n-1$ words from the history are considered (Bahl et al., 1978). In the common case that $n = 3$, known as a trigram model, $P(w_i \mid w_{1:i-1})$ is approximated as $P(w_i \mid w_{i-1}, w_{i-2})$. Even with this simplifying assumption, sparseness is a problem. Estimating the parameters of an n-gram language model using maximum likelihood fails to assign proper estimates to the n-grams that are unobserved in the training data, and there will be many such n-grams. The remedy is to borrow from the probability mass given to some of the observed n-grams and distribute it to unobserved ones (Katz, 1987). The redistribution is done recursively, reducing the size of the n-gram history in each step. The methods used for back-off smoothing have been improved over time
This is the advantage of the class language model formulation. However the gain in reliability is associated with a loss of detail that can lead to mixed results. The difficulty, for several decades, in finding a practical improvement over n-gram lan- guage models, despite their obvious weaknesses, was described as being a “source of consid- erable irritation” to researchers in the field (Jelinek, 1991). A decade later, Rosenfeld (2000) calls it ironic that “the most popular language models (n-grams) take no advantage of the fact that what’s being modeled is language.” And yet, even now with many alternatives to choose from, n-gram models continue to be heavily used. Their strengths are that they make few assumptions other than the Markov property, training the models is as fast as counting n-grams, and they use a vast number of parameters to memorize idiomatic and exceptional expressions. 2.1.1 Vocabulary & Input Representation An important design choice in language modeling is to define the vocabulary. Typically, the vocabulary is set by taking the set of all words that appear at least k times in the training corpus for some small k. Over time, as more powerful computers and bigger text corpora became available, the typical vocabulary size used in applications has likewise increased. Figure 2.1 shows the change in vocabulary size for large vocabulary continuous speech recog- 11 nition systems over time with a doubling in vocabulary size approximately every four years. Similar increases are observed for other applications. Figure 2.1: Vocabulary size in large vocabulary continuous speech recognition systems over time. Exploding vocabulary sizes lead to exponential growth in model size and exponential increases in the time required for speech decoding. This trend is obviously not sustainable. Dealing with the exponential growth is a challenge but also an opportunity for considerable amounts on innovation. Research is proceeding into neural systems that can decode audio directly into text one character at a time (Graves and Jaitly, 2014; Chan et al., 2015; Bahdanau et al., 2016; Maas et al., 2015). The language model can be a character n-gram model trained separately or an LSTM that is part of the larger neural network and trained end-to-end. When training end- to-end, the objective is to minimize the error rate instead of aiming to minimize perplexity. Machine translation is also moving away from using word-based language models (Lee et al., 12 2017; Chung et al., 2016). Google’s machine translation system uses a small vocabulary of around 30,000 tokens that are a mix of words and subwords (Wu et al., 2016). Character-based models are far from a new concept. Subword models have found use before especially when working with highly inflected or low-resource languages (Tucker et al., 1994; Creutz et al., 2007; Saraçlar et al., 2010); however, there has been a recent surge of interest in character- and subword-based models. This thesis makes a point of testing on both word- and character-level models so as to have maximum impact on future applications. 2.1.2 Evaluation & Metrics Language model evaluation is often done on a held-out test set using perplexity as a metric: perplexity = exp(− 1 N N∑ i=1 log p(wi)). (2.1) See Equation 2.1. Perplexity was always recognized as a “crude” metric (Jelinek, 1990, 1991) but is still widely used because it offers a quick and easy task-independent way of evaluating a language model. 
Perplexity is not the only factor that matters when comparing models, however. Models can be interesting for other reasons, such as speed (Brants et al., 2007). In some cases, the models are evaluated using downstream metrics like word error rate in speech recognition or BLEU score in machine translation (Kirchhoff and Yang, 2005) instead of using perplexity. For text generation, human evaluations can be high quality but are costly to perform. Perplexity will be the main evaluation metric used in this thesis due to the desire to be task-independent and the need to control costs, but some experiments with other metrics are included.

Recently, some language modeling papers have used "dynamic evaluation", whereby the model is allowed to continue to train on the test data after making predictions for each segment (Krause et al., 2017). This is a form of online updating, and it helps to adapt to shifts between the training and the test data. Dynamic evaluation makes less sense for certain applications such as speech recognition because it can reinforce errors in the transcription. Language models can be fairly compared without the use of dynamic evaluation. Thus, we do not make use of it in this thesis, except that we do use a form of online updating for one of our experiments that will be introduced in Chapter 5, where the application makes sense.

2.2 Neural Language Modeling

Continuous-space language models such as the neural probabilistic language model (Bengio et al., 2003a) obtain an advantage over n-gram models because they share information between n-grams by projecting to a low-dimensional continuous space. Recurrent neural network (RNN) language models (Mikolov et al., 2010) extend that advantage by permitting the incorporation of information from arbitrarily long word histories.

The RNN language model in its basic form has three layers: an input layer, a recurrent layer, and an output layer. The input layer learns a word embedding matrix $E \in \mathbb{R}^{p \times |V|}$ that consists of a p-dimensional vector for each word in the vocabulary $V$. If the input $w_1, w_2, \ldots, w_n$ is represented as one-hot encoded vectors, then multiplying by $E$ gives a sequence of word embeddings, $e_{w_1}, e_{w_2}, \ldots, e_{w_n} = Ew_1, Ew_2, \ldots, Ew_n$. The recurrent layer uses a weight matrix $W \in \mathbb{R}^{q \times (p+q)}$ and a bias vector $b_1 \in \mathbb{R}^q$ to transform the word embedding for the current step $e_{w_t}$ and its own output from the previous step $h_{t-1}$ into a hidden state vector $h_t$ that summarizes the sequence up to that point. The formula is given by Equation 2.2, where $\sigma$ is the activation function, typically the hyperbolic tangent or a rectified linear unit (ReLU):

$h_t = \sigma(W[e_{w_t}, h_{t-1}] + b_1)$  (2.2)

The output layer uses the hidden state vector $h_t$ to estimate a probability distribution $y_t$ over the vocabulary for $w_{t+1}$:

$y_t = \mathrm{softmax}(E^\top h_t + b_2)$  (2.3)

The bias vector $b_2 \in \mathbb{R}^{|V|}$ acts as a prior on the unigram distribution and is a crucial part of the output layer. If the word embedding size p is not the same as the recurrent layer dimensionality q, then a linear projection can be inserted and the word embedding matrix $E$ from the input layer can be reused in the output layer to save on parameters and increase generalizability (Press and Wolf, 2017; Inan et al., 2016). In this case, the equation for the output layer would be $y_t = \mathrm{softmax}(E^\top P h_t + b_2)$. The parameters of the model, $E$, $W$, $b_1$, and $b_2$, can all be learned jointly via backpropagation through time towards the objective of maximizing the likelihood of the data.
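To make Equations 2.2 and 2.3 concrete, here is a minimal NumPy sketch of a single time step of the basic RNN language model. The dimensions, random initialization, and word IDs are illustrative assumptions only (with p = q so that E can be reused directly in the output layer), and a practical model would use an LSTM or GRU as discussed below rather than this plain recurrence.

```python
import numpy as np

V, d = 1000, 64                              # vocabulary size and embedding/hidden size (illustrative; p = q = d)
rng = np.random.default_rng(0)

E = rng.normal(scale=0.1, size=(d, V))       # word embedding matrix E, one column per word
W = rng.normal(scale=0.1, size=(d, 2 * d))   # recurrent weights acting on [e_w, h_prev]
b1 = np.zeros(d)                             # recurrent layer bias
b2 = np.zeros(V)                             # output (softmax) bias

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rnn_step(w_t, h_prev):
    """One time step of the basic RNN LM: Eq. 2.2, then Eq. 2.3."""
    e_w = E[:, w_t]                                          # embedding lookup
    h_t = np.tanh(W @ np.concatenate([e_w, h_prev]) + b1)    # Eq. 2.2
    y_t = softmax(E.T @ h_t + b2)                            # Eq. 2.3 (E reused since p = q)
    return h_t, y_t

h = np.zeros(d)
for w in [3, 17, 42]:              # hypothetical word IDs
    h, y = rnn_step(w, h)          # y is the predicted distribution over the next word
```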
One limitation of neural language models, as originally proposed, is that the training time scales poorly with the size of the vocabulary. A variety of techniques were developed as workarounds. Neural models were trained to predict only the words from a subset of the vocabulary known as a shortlist (Schwenk, 2004), and the predictions from the neural model were interpolated with an n-gram model that could handle a full-sized vocabulary. Shortly thereafter, hierarchical neural models were developed. These train faster with only a small decrease in perplexity (Morin and Bengio, 2005). Techniques such as importance sampling (Bengio et al., 2003b) and noise contrastive estimation (Mnih and Kavukcuoglu, 2013) make it possible to quickly train full size vocabulary models without a hierarchy. The method used for training large vocabulary models in this thesis is the sampled softmax from Jean et al. (2014). The sampled softmax constrains the vocabulary to a small random subset when computing the loss for each sequence avoiding the need to backpropagate to the full vocabulary for every weight update. For a thorough review of methods for training large vocabulary neural language models, see (Chen et al., 2016). The basic RNN has a flaw known as the vanishing/exploding gradient problem that pre- vents it from learning to use information from far back in the history (Pascanu et al., 2013). In practice, this is mitigated by using alternate architectures such as long short-term memory (LSTM) or the gated recurrent unit (GRU) (Sundermeyer et al., 2012). These architectures use gating mechanisms to control the flow of information and importantly they permit the preservation of information from the hidden state over time. Our experiments make exten- sive use of these RNN variants. Other techniques have been developed to further increase the stability of the RNN. In some of our experiments, we make use of layer normalization 15 which involves normalizing the first and second moments of the W[ewt,ht−1] + b1 term in Equation 2.2 (Ba et al., 2016). In the early 1990s it was noted that n-gram language models benefit from boosting the probability of words observed in previous utterances or in earlier parts of a document because word usage is “bursty” (Jelinek et al., 1991). Once a word is observed in a given document, the probability of seeing it again is greatly increased. Models that boost the probability of recently seen words are called cache language models (Kuhn and De Mori, 1990). These techniques have been extended to RNN language models by allowing the model to look back at the previous hidden states from the same sentence or document (Merity et al., 2017b; Grave et al., 2017a). Usually, state-of-the-art language modeling results are reported both with and without the inclusion of a cache, e.g. Merity et al. (2017a), because the gain from a cache is considered to be orthogonal to other modeling improvements. The experiments in this thesis are reported without the inclusion of a cache in order to focus on our contributions to adaptation. 2.3 Language Model Adaptation There is a long history of adapting n-gram language models starting from early work on mixture modeling (DeMori and Federico, 1999; Bellegarda, 2004). Since this thesis builds on neural models, the review of prior work will only cover these methods. The long-established mixture techniques also apply to neural network models (Irie et al., 2018), but our focus is on techniques that are specific to neural network adaptation. 
We are most interested in cases with explicit representations of context, however, one adaptation method, model fine-tuning, does not require the use of a context embedding. The language model is first trained on general background data and then learning is briefly continued on smaller in-domain data to “fine-tune” the weights (Gangireddy et al., 2016; Zhang et al., 2017). Fine-tuning suffers from the possibility of catastrophic forgetting, where the model loses access to the information it learned from training on the background data (Goodfellow et al., 2013). One way of dealing with catastrophic forgetting is to freeze portions 16 of the model during fine-tuning. Some approaches combine fine-tuning with the addition of a new linear transformation in between the hidden and output layers (Deena et al., 2016), or occasionally a non-linear transformation is used instead (Ma et al., 2017). This is motivated in part by similar approaches for adapting acoustic models in speech recognition (Gemello et al., 2007). Adaptation of neural language models has two parts: representing context information and using the context representation to alter the model predictions. We discuss these next. 2.3.1 Context Representations For neural language model adaptation, context information is typically represented as an embedding vector. Neural networks have the advantage in that they are quite versatile in the types of inputs that they can accept. Another advantage is that the network can learn the context representations as part of the end-to-end language modeling task. Some of the types of context that could be or have been used are 1. Topic model vectors that summarize the topic of long documents (Mikolov and Zweig, 2012). 2. Context is the title of a TED talk and it is represented as an embedding using either bag-of-words features or using an RNN or CNN (Hoang et al., 2016). 3. Context is a one-hot encoded vector indicating an Amazon product identifier and an- other one-hot vector indicating the sentiment of a review of that product (Tang et al., 2016). The context information can be categorical, numeric, textual information, or a combina- tion of these. It could even be composed of attributes that are predicted using a machine learning system based on other features. The majority of experiments in this thesis use categorical context variables but the methods all apply to other cases. 17 We assume the availability of contextual information (metadata or other side information) that is represented as a set of context variables f1:n = f1,f2, . . .fn, from which we produce a k-dimensional representation in the form of an embedding, c ∈ Rk. Each of the context variables, fi, represents some type of information or metadata about the sequence and can be either categorical or numerical. We adopt the strategy from Tang et al. (2016) of combining information from multiple context variables using a simple neural network. This strategy is well-suited for the types of context variables that we will see in our experiments, particularly high dimensional contexts such as hotel identity in a review or user identity in a query completion task. For each context variable fi, we learn an associated embedding matrix Fi, i = 1, . . . ,n. If n = 1 then the embedding can directly be used as the context representation. Otherwise, a single layer neural network is used to combine the embeddings from the individual variables. 
$c = \tanh\left(\sum_i M_i F_i f_i + b_0\right)$  (2.4)

In some cases, the tanh activation function is replaced with the ReLU function instead. The $M_i$ and $b_0$ are parameters learned by the model. The context embedding, $c$, is used for adapting both the hidden and the output layer of the RNN.

2.3.2 Adaptation Mechanisms

Mikolov and Zweig (2012) were the first to propose a method for adapting RNN language models by augmenting both Equations 2.2 and 2.3 with an extra term. The adaptation depends on having a summary of the context information contained in an embedding vector $c \in \mathbb{R}^k$. To adapt the recurrent layer, they concatenate the context embedding $c$ with the word embedding at every step of the input, which we show in the next chapter is equivalent to an additive context-dependent bias. We refer to this form of adaptation as the ConcatCell:

$h_t = \sigma(W[e_{w_t}, h_{t-1}, c] + b_1)$  (2.5)

To adapt the output layer, an adaptation term $Gc$ is added,

$y_t = \mathrm{softmax}(E^\top h_t + Gc + b_2)$  (2.6)

which has the effect of altering the softmax bias in a context-dependent way. We refer to models that use this form of adaptation as the SoftmaxBias model. SoftmaxBias adaptation appears to be a reasonable approach in cases where the context concerns topic, since we know that the unigram distribution is often sufficient to capture topical information. As we will show later on, SoftmaxBias adaptation alone is not sufficient to get the best results. This is particularly obvious when dealing with character-level models, where the unigram distribution carries little information about topic or style.

In the special case where the context variable is discrete and of low cardinality, the bias vector can be adapted by directly learning independent bias vectors for each context, i.e., replacing the context embedding $c$ with a one-hot encoded vector. When the cardinality of the context variables is high, learning independent bias vectors carries a high memory cost. Although it is not often framed in this way, the $Gc$ term acts as a low-rank approximation to the strategy of learning independent bias vectors. We will use both strategies in this thesis, as the situation warrants.

These two adaptation strategies have been adopted for a variety of tasks, including personalization, adapting to television show genres (Chen et al., 2015), adapting to long-range dependencies in a document (Ji et al., 2015), etc. See Table 2.1 for a listing of more prior work, showing which of these methods were employed. As shown in the table, a variety of contexts have been used. Topic-based adaptation is popular, but categorical variables are also used, like product identity or sentiment level. When doing personalization, the context can be an identifier for the person (Li et al., 2016), or alternatively the information can be given as a bag-of-words representation of the person's prior utterances or writings (Wen et al., 2013).

Model | CC | SB | Context
TopicRNN (Dieng et al., 2016) |  | X | Topic model
Context Aware Generation (Tang et al., 2016) | X |  | Product identity and sentiment
Generative Text Classification (Yogatama et al., 2017) | X |  | Miscellaneous
Controlling Style in Text Generation (Ficler and Goldberg, 2017) | X |  | Movie review stylistic features
Context Dependent LM (Mikolov and Zweig, 2012) | X | X | Topic model
Language Model Personalization (Wen et al., 2013) | X | X | Social media text
Multi-genre Speech Recognition (Chen et al., 2015) | X | X | Topic model
Contextual LSTM (Ghosh et al., 2016) | X |  | Topic model
Persona Based Conversation (Li et al., 2016) | X |  | Personalized response generation
Feature-based RNNLM adaptation (Deena et al., 2016) | X |  | Multi-genre broadcast speech

Table 2.1: Approaches to RNN language model adaptation in prior work. The X indicates the use of the ConcatCell (CC) or the SoftmaxBias (SB) adaptation strategy.

Few studies have tested the relative merits of adapting at the recurrent layer versus the output layer. Ji et al. (2015) compares the two approaches, which they refer to as ccDCLM and coDCLM. They find that both give similar perplexities. SoftmaxBias wins by 3% on one dataset and by less than 1% on the other. However, adapting the recurrent layer does better at an auxiliary sentence ordering task. Differences between models that are more prominent in metrics other than perplexity are a theme that we will return to later on in the thesis. They do not test any models that adapt both the recurrent and output layers. Hoang et al. (2016) also consider adapting at the hidden layer vs. at the softmax layer. They report a small advantage for adapting the output layer, but the comparison is only made on a single dataset.

It should be noted that not all models fit cleanly into the above framework, although it is the dominant paradigm. The Hoang et al. (2016) model differs from the SoftmaxBias approach because it uses an extra perceptron layer in the output. Luan et al. (2016) use the recurrent layer bias approach plus an extra context-dependent linear projection in between the recurrent and the output layer. Wen et al. (2015) uses a custom gating architecture to adapt to dialogue states.

Chapter 3
EXPLORING CONTEXT-AWARE RNNS

In this chapter, which draws from our previously published work (Jaech and Ostendorf, 2017), we show that the popular RNN hidden layer adaptation strategy described in the previous chapter corresponds to a static additive bias term. Then, we study the impact of adapting RNNs at the recurrent layer versus the output layer using different architectures and techniques. Starting with the unadapted RNN that makes no use of context information, we consider two mechanisms each for adapting the recurrent layer and the output layer. One of the methods for adapting the recurrent layer is a novel multiplicative rescaling of the hidden state dimensions. Using experiments on three datasets, we make some observations about what factors make a dataset more or less amenable to adaptation. These studies provided the groundwork that led to the proposal of the FactorCell model.

3.1 Additive Bias Adaptation

As described in Section 2.3.2, the standard approach to recurrent layer adaptation is to include (via concatenation) the context embedding as an additional input to the recurrent layer.
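Before the equivalence argument that follows, a minimal NumPy sketch of this baseline adaptation may help: the context embedding c is concatenated with the recurrent-layer input (the ConcatCell of Equation 2.5) and also projected into the output bias (the SoftmaxBias of Equation 2.6). All shapes and weights are illustrative placeholders, not the trained models used in the experiments.

```python
import numpy as np

V, d, k = 1000, 64, 20             # vocab size, word/hidden size, context embedding size (illustrative)
rng = np.random.default_rng(1)

E = rng.normal(scale=0.1, size=(d, V))            # word embeddings (also reused in the output layer)
W = rng.normal(scale=0.1, size=(d, 2 * d + k))    # recurrent weights act on [e_w, h_prev, c]
b1 = np.zeros(d)
G = rng.normal(scale=0.1, size=(V, k))            # low-rank softmax-bias adaptation matrix
b2 = np.zeros(V)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def adapted_step(w_t, h_prev, c):
    e_w = E[:, w_t]
    # ConcatCell (Eq. 2.5): append the context embedding to the recurrent-layer input;
    # for a fixed c this acts as a context-dependent bias on the recurrent layer.
    h_t = np.tanh(W @ np.concatenate([e_w, h_prev, c]) + b1)
    # SoftmaxBias (Eq. 2.6): Gc shifts the output bias toward context-typical words.
    y_t = softmax(E.T @ h_t + G @ c + b2)
    return h_t, y_t

c = rng.normal(size=k)             # context embedding, e.g., produced by Eq. 2.4
h, y = adapted_step(3, np.zeros(d), c)
```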
When the context embedding is constant across the whole sequence, it is easy to show that this concatenation is equivalent to using a context-dependent bias at the recurrent layer:

$h_t = \sigma(\hat{W}[e_{w_t}, h_{t-1}, c] + b_1) = \sigma(W[e_{w_t}, h_{t-1}] + Qc + b_1) = \sigma(W[e_{w_t}, h_{t-1}] + b'_1)$  (3.1)

where $\hat{W} = [W\; Q]$ and $b'_1 = Qc + b_1$ is the context-dependent bias, formed by adding a linear projection of the context embedding. Thus, for this scenario where the context only changes sporadically, concatenating the context embedding with the input to the recurrent layer is the same as using a context-dependent bias vector. Some people perform the concatenation explicitly, despite its inefficiencies compared to directly learning the context-dependent bias, because modern deep learning libraries do not always make it easy to alter the internal workings of the RNN or LSTM.

3.2 Models

We consider two methods each for adapting the recurrent and output layers. In total, there are $2^4 = 16$ possible models that can be constructed by enabling or disabling each of the four adaptations. All of the models with adaptation incorporate a context embedding $c$ using the method described in Section 2.3.1.

3.2.1 Adapting the recurrent layer

We consider two methods for adapting the recurrent layer. The first is the ConcatCell from Equation 3.1. It can be implemented by simply concatenating the context vector $c$ with the word embedding $e_{w_t}$ at each timestep at the input to the recurrent layer, but the effect of this adaptation is to apply a linear additive shift to the recurrent layer bias.

To increase the adaptability of the hidden layer, we introduce a context-dependent multiplicative rescaling of the hidden layer weights. The method is inspired by Ha et al. (2017), where a similar multiplicative scaling is used to dynamically adjust the parameters of a language model in response to the previous words in the sentence. Using this row rescaling technique on top of the additive adaptation from Equation 3.1, the equation becomes

$h_t = \sigma(Cc \odot W[e_{w_t}, h_{t-1}] + Qc + b_1)$  (3.2)

where $C \in \mathbb{R}^{q \times k}$ is a new model parameter and $\odot$ is the elementwise multiplication operator. The elementwise multiplication is a low-cost operation and can even be pre-calculated so that model evaluation can happen with no extra computation compared to a vanilla RNN.

3.2.2 Adapting the output layer

One way of adapting the output layer is to let each context have its own bias vector. This requires the use of a matrix of size $|V| \times |C|$, where $|V|$ is the size of the vocabulary and $|C|$ is the total number of possible contexts. This may be intractable when both $|V|$ and $|C|$ are large. Mikolov and Zweig (2012) use a low-rank factorization of the adaptation matrix, replacing the $|V| \times |C|$ matrix with the product of a matrix $G$ of size $|V| \times k$ and a context embedding $c$ of size $k$:

$y_t = \mathrm{softmax}(E^\top h_t + Gc + b_2)$  (3.3)

The total number of parameters is now a much more manageable $O(|V| + \sum_i |C_i|)$ instead of $O(\sum_i |V||C_i|)$, where $|C_i|$ is the cardinality of the i-th context variable. The advantage of a low-rank adaptation is that it forces the model to share information between similar contexts. The disadvantage is that important differences between similar contexts can be lost. We employ feature hashing to reduce the memory requirements but retain some of the benefits of having an individual bias term for each context-word pair.
The context-word pairs are hashed into buckets, and individual bias terms are learned for each bucket. The hashing technique relies on having direct access to the context variables $f_{1:n}$. Representing context as a latent topic distribution precludes the use of this hashing adaptation. The choice of hashing function is motivated by what is easy and fast to perform inside the TensorFlow computation graph framework. (Newer versions of TensorFlow make it easier to do feature hashing than what we describe here.) If $w$ is a word identifier (ID) and $f_{1:n}$ are context variable IDs, then the hash table index is computed as

$h_i(w, f_i) = (w r_0 + f_i r_i) \bmod l$  (3.4)

where $l$ is the size of the hash table and $r_0$ and the $r_i$'s are all fixed random integers. The value of $l$ is usually set to a large prime number. The function $H : \mathbb{Z} \rightarrow \mathbb{R}$ maps hash indices to hash values and is implemented as a simple array.

Since $l$ is much smaller than the total number of inputs, there will be many hash collisions. Hash collisions are known to negatively affect the perplexity (Mikolov et al., 2011). To deal with this issue, we restrict the hash table to context-word pairs that are observed in the training data. A Bloom filter data structure records which context-word pairs are eligible to have entries in the hash table. The design of this data structure trades off a compact representation of set membership against a small probability of false positives (Bloom, 1970; Talbot and Brants, 2008; Xu et al., 2011). A small number of false positives is relatively harmless in this application, because they do not impair the ability of the Bloom filter to eliminate almost all of the hash collisions. The function $\beta : \mathbb{Z} \rightarrow \{0, 1\}$ is used by the Bloom filter to map hash indices to binary values:

$B(w, f_i) = \prod_{j=1}^{16} \beta(h_{i,j}(w, f_i))$

The hash functions $h_{i,j}$ are defined in the same way as the $h_i$'s above, except that they use distinct random integers and the size of the table, $l$, can be different. Because $\beta$ is a binary function, the product $B(w, f_i)$ will always be zero or one. Thus, any word-context pairs not found in the Bloom filter will have their hash values set to zero. The final expression for the hashed adaptation term is given by

$\mathrm{Hash}(w, f_{1:n}) = \sum_{i=1}^{n} H(h_i(w, f_i)) B(w, f_i)$  (3.5)

$y_t = \mathrm{softmax}(E^\top h_t + Gc + b_2 + \mathrm{Hash}(w_t, f_{1:n}))$  (3.6)

3.3 Data

The experiments make use of three corpora chosen to give a diverse perspective on adaptation in language modeling. Summary information on the training set for each source (Reddit, Twitter, and SCOTUS) is provided in Table 3.1, and each source is discussed individually below. The Reddit and SCOTUS data are tokenized and lower-cased using the standard NLTK tokenizer (Bird et al., 2009).

Source | Size | Vocabulary | Context (Dimensions)
Reddit | 8,000K | 68,000 words | Subreddit (5800)
Twitter | 77K | 194 chars | Language (9)
SCOTUS | 864K | 18,000 words | Case (1765), Speaker (2276), Role (3)

Table 3.1: Number of sentences, vocabulary size, and context variables for the three corpora.

Reddit

Reddit is the world's largest online discussion forum and is comprised of thousands of active subcommunities dedicated to a wide variety of themes. Our training data is 8 million sentences (100 million words) from Reddit comments during the month of April 2015. Only the first sentence from each comment was used. The 68,000-word vocabulary is selected by taking all tokens that occur at least 20 times in the training data. The remaining tokens are mapped to a special UNK token, leaving us with an out-of-vocabulary rate of 2.3%.
3.3 Data

The experiments make use of three corpora chosen to give a diverse perspective on adaptation in language modeling. Summary information on the training set for each source (Reddit, Twitter, and SCOTUS) is provided in Table 3.1, and each source is discussed individually below. The Reddit and SCOTUS data are tokenized and lower-cased using the standard NLTK tokenizer (Bird et al., 2009).

Source  | Size   | Vocabulary   | Context (Dimensions)
Reddit  | 8,000K | 68,000 words | Subreddit (5800)
Twitter | 77K    | 194 chars    | Language (9)
SCOTUS  | 864K   | 18,000 words | Case (1765), Speaker (2276), Role (3)

Table 3.1: Number of sentences, vocabulary size, and context variables for the three corpora.

Reddit

Reddit is the world's largest online discussion forum and is comprised of thousands of active subcommunities dedicated to a wide variety of themes. Our training data is 8 million sentences (100 million words) from Reddit comments during the month of April 2015. Only the first sentence from each comment was used. The 68,000 word vocabulary is selected by taking all tokens that occur at least 20 times in the training data. The remaining tokens are mapped to a special UNK token, leaving us with an out-of-vocabulary rate of 2.3%. The validation and test data each contain one eighth the number of sentences of the training data.

The context variable is the identity of the subreddit, i.e. community, that the comment came from. There are 5,800 subreddits with at least 50 training sentences. The remaining ones are grouped together in an UNK category. The largest subreddit occupies just 4.5% of the data. By using a large number of subreddits, we highlight an advantage of model adaptation, which is to be able to use a single unified model instead of training thousands of separate models for each individual community. Similarly, using context-dependent bias vectors for this data instead of the hash adaptation would require learning 400 million additional parameters.

Twitter

The Twitter training data has 77,000 Tweets (848,000 words), each annotated with one of nine languages: English, German, Italian, Spanish, Portuguese, Basque, Catalan, Galician, and French. The corpus was collected by combining resources from published data for language identification tasks during the past few years. Tweets labeled as unknown, ambiguous, or containing code-switching were not included. The data is unbalanced across languages, with more than 32% of the Tweets being Spanish and the smallest four languages (Italian, German, Basque, and Galician) each representing less than 1.5% of the total. There are 194 unique character tokens in the vocabulary. Graphemes that are surrogate pairs in the UTF-16 encoding, such as emoji, are split into multiple vocabulary tokens. No preprocessing or tokenization is performed on this data except that newlines were replaced with spaces for convenience. The validation and test data have 12,000 and 15,000 Tweets respectively.

SCOTUS

The SCOTUS corpus contains approximately 864,000 utterances (16 million words) of training data spanning arguments from 1990-2011. These are speech transcripts from arguments before the United States Supreme Court. Utterances are labeled with the case being argued (n=1,765), the speaker ID (n=2,276), and the speaker role (justice, advocate, or unidentified). These three context variables are defined in the same way as in Hutchinson et al. (2013), where a small portion of this data was used in language modeling experiments. The vocabulary size is around 18,000 words. Utterances longer than 45 words (90th percentile) were split into smaller utterances. (Occasionally, the advocates go on for hundreds of words without interruption; including these utterances would slow down the training.) The validation and test data were each one eighth the size of the training data.

3.4 Experiments

We used an LSTM with coupled input and forget gates for a 20% reduction in computation time (Greff et al., 2016). Dropout was used as a regularizer on the input and outputs of the recurrent layer as described in Zaremba et al. (2014). For the large vocabulary experiments, we used a sampled softmax loss to speed up training. A summary of the key hyperparameters for each class of experiments is given in Table 3.2. We conducted some preliminary experiments to tune the different hyperparameters for each dataset. Then, we fixed the values of these hyperparameters and only varied the adaptation method in each experiment. The total parameter column in this table is based on the unadapted model.

Parameter     | Reddit | SCOTUS | Twitter
Batch Size    | 400    | 300    | 200
Word Embed.   | 200    | 200    | 30
LSTM Size     | 240    | 240    | 200
Dropout       | 0%     | 15%    | 10%
Neg. Samples  | 100    | 100    | NA
Total Params. | 14M    | 4M     | 300K

Table 3.2: Summary of key hyperparameters.
Adapted models will have more parameters depending on the type of adaptation. When using hash adaptation of the output layer, the size of the Bloom filter is 100 million and the size of the hash table is 80 million. The model is implemented using the Tensorflow library. Optimization is done using Adam with a learning rate of 0.001. Each model trained in under three days using 8 CPU threads.

Although the model is trained as a language model, it can be used as a generative text classifier. The classification rule is

f̂_i = argmax_{f_i} Σ_j log p(w_j | w_{1:j−1}, c_{k≠i}, f_i).

When there are multiple context variables, we treat all but one of them as known values and attempt to identify the unknown one. It is not necessary to compute the probabilities over the full vocabulary to do text classification. The sampled softmax criterion can be used to greatly speed up evaluation of the classifier, provided that a) the same negative samples are reused for each class and b) the number of negative samples is increased to around 10% of the vocabulary.

3.4.1 Reddit Experiments

The size of the subreddit embeddings was set to 25. Table 3.3 gives the perplexities and average AUCs for subreddit detection for different adapted models. The evaluation data contains 60,000 sentences. For comparison, an unadapted 4-gram Kneser-Ney model trained on the same data has a perplexity of 119. The models with the best perplexity do not use multiplicative adaptation of the hidden layer, but it is useful in the detection experiments.

Hidden × | Hidden + | Output LR | Output Hash | PPL  | ∆PPL  | AUC
N        | N        | N         | N           | 75.2 | –     | –
N        | N        | N         | Y           | 69.6 | 7.3%  | 76.5
N        | N        | Y         | N           | 68.0 | 9.5%  | 75.5
N        | N        | Y         | Y           | 66.9 | 11.0% | 78.4
N        | Y        | N         | N           | 68.4 | 9.0%  | 76.1
N        | Y        | N         | Y           | 66.9 | 11.0% | 78.9
N        | Y        | Y         | N           | 68.0 | 9.6%  | 75.3
N        | Y        | Y         | Y           | 66.5 | 11.5% | 78.4
Y        | N        | N         | N           | 69.4 | 7.7%  | 75.9
Y        | N        | Y         | N           | 68.8 | 8.5%  | 75.9
Y        | N        | Y         | Y           | 67.2 | 10.6% | 78.9
Y        | Y        | N         | N           | 69.0 | 8.2%  | 76.7
Y        | Y        | N         | Y           | 67.5 | 10.2% | 79.0
Y        | Y        | Y         | N           | 68.3 | 9.1%  | 75.7
Y        | Y        | Y         | Y           | 67.1 | 10.7% | 79.2

Table 3.3: Perplexities and classification average AUCs for Reddit models. The × and + columns indicate multiplicative and additive adaptation of the hidden layer; LR and Hash indicate low-rank and hash adaptation of the output layer.

We can inspect the context embeddings learned by the model to see if it is exploiting similarities between subreddits in the way that we expect. Table 3.4 lists the nearest neighbors by Euclidean distance to three selected subreddits. The nearest neighbors are intuitively reasonable. For example, the closest subreddits to Pittsburgh are communities created for other big cities and states. The Python subreddit is close to other programming language communities, and the NBA subreddit is close to the communities for individual NBA teams.

Pittsburgh | Python        | NBA
Atlanta    | CSharp        | Warriors
Montana    | JavaScript    | Rockets
MadisonWI  | CPP Questions | Mavericks
Baltimore  | CPP           | NBASpurs

Table 3.4: Nearest neighbors to selected subreddits in the context embedding space.

The number of subreddits is large enough that applying a generative classifier to the full set is impractical. We used a smaller subset of subreddits and, to avoid bias, picked the same ones used in another study (Tran and Ostendorf, 2016). This turns the classification task into a detection task, so the decision rule is slightly different from what was described above. The subreddit detection task involves predicting the subreddit a given comment came from, with eight subreddits to choose from (AskMen, AskScience, AskWomen, Atheism, ChangeMyView, Fitness, Politics, and Worldnews) and nine distractors (Books, Chicago, NYC, Seattle, ExplainLikeImFive, Science, Running, NFL, and TodayILearned).
To make a classification decision we evaluate the perplexity of each comment under the assumption that it belongs to each of the eight subreddits. We use z-score normalization across the eight perplexities to create a score for each class. The predictions are evaluated by averaging the AUC of the eight individual ROC curves. The best model for the classification task uses all four types of adaptation. The multiplicative adaptation of the hidden layer is clearly useful for classification even though it does not help with perplexity. The perplexities for selected large subreddits are listed in Table 3.5. It can be seen that 30 the relative gain from adaptation is largest when the topic of the subreddit is more narrowly focused. The biggest gains were achieved for subreddits dedicated to specific sports, TV shows, or video games. Whereas, the gains were smallest for subreddits like Videos or Funny for which the content tends to be more diverse. The knowledge that a sentence came from a pro-wrestling subreddit effectively provides more information about the text than the analogous piece of knowledge for the Pics or Videos subreddit. This would seem to indicate that further gains could be possible if additional contextual information could be provided. An alternative explanation, that subreddits with fewer sentences in the training data receive more benefit from adaptation, is not supported by the data. 3.4.2 Twitter experiments The Twitter evaluation was done on a set of 14,960 Tweets. The language context embedding vector dimensionality was set to 8. When both the vocabulary and the number of contexts are small, as in this case, there is no danger of hash collisions. We disable the Bloom filter making the hash adaptation essentially equivalent to having context-dependent bias vectors. Table 3.6 reports the results of the experiments on the Twitter corpus. We compute both the perplexity and measure the performance of the models on a language identification task. In terms of perplexity, the best models do not make use of the multiplicative hidden layer adaptation, consistent with the results from the Reddit corpus. In general, the improvement in perplexity from adaptation is small (less than 5%) on this corpus compared to our other experiments where we saw relative improvements two to four times as big. This is likely because the LSTM can figure out by itself which language it is modeling early on in the sequence, capture that in the hidden state, and adjust its predictions accordingly. Our best model, using multiplicative adaptation of the hidden layer, achieves an accuracy of 94.2% on this task. That is a 19% relative reduction in the error rate from the best model without multiplicative adaptation. Sometimes there can be little to no perplexity improvement between the unadapted and adapted models. This can be explained if the provided context variables are mostly redundant 31 Subreddit Base. PPL Adapt. 
PPL ∆PPL Description FlashTV 90.5 68.2 24.6% A popular TV show shield 99.4 77.3 22.2% A tv show GlobalOffensive 97.1 79.3 18.3% A PC video game nba 103.3 86.4 16.3% National Basketball Association SquaredCircle 85.7 71.7 16.3% Professional Wrestling Fitness 50.1 42.3 15.5% Exercise and fitness hockey 85.5 72.4 15.2% Professional hockey leagueoflegends 71.1 61.0 14.3% A PC video game pcmasterrace 71.7 62.0 13.5% PC gaming nfl 84.2 74.0 12.2% National Football League AskWomen 62.1 55.3 10.9% Questions for women news 70.8 65.0 8.2% General news stories and discussion worldnews 85.7 79.7 7.1% Global news discussion AskMen 69.4 66.7 3.9% Questions for men gaming 79.0 76.1 3.7% General video games interest group pics 74.0 71.8 3.0% Funny or interesting pictures videos 62.9 61.1 2.9% Funny or interesting videos funny 72.6 70.8 2.5% Sharing humorous content Table 3.5: Comparison of perplexities per subreddit 32 Hidden Output × + LR Hash PPL Acc. F1 N N N N 6.44 – – N N N Y 6.43 56.1 44.0 N N Y N 6.37 49.7 36.6 N N Y Y 6.34 57.0 44.5 N Y N N 6.23 91.6 84.2 N Y N Y 6.25 92.5 84.4 N Y Y N 6.21 91.4 82.9 N Y Y Y 6.15 92.8 85.2 Y N N N 6.90 93.9 85.6 Y N N Y 6.39 93.6 85.9 Y N Y N 6.28 93.2 85.1 Y N Y Y 6.31 93.7 86.1 Y Y N N 6.28 92.5 84.7 Y Y N Y 6.30 93.7 86.3 Y Y Y N 6.54 94.2 86.3 Y Y Y Y 6.35 93.3 85.9 Table 3.6: Results on Twitter data. 33 given the previous tokens in the sequence. To investigate this further, we trained a logistic regression classifier to predict the language using the state from the LSTM at the last time step on the unadapted model as a feature vector. Using just 30 labeled examples per class it is possible to get 74.6% accuracy. Furthermore, we find that a single dimension in the hidden state of the unadapted model is often enough to distinguish between different languages even though the model was not given any supervision signal. This finding is consistent with previous work that showed that individual dimensions of LSTM hidden states can be strong indicators of concepts like sentiment (Karpathy et al., 2015; Radford et al., 2017). Figure 3.1 visualizes the value of the dimension of the hidden layer that is the strongest indicator of Spanish on three different code-switched tweets. Code-switching is not a part of the training data but it provides a compelling visualization of the ability of the unsupervised model to quickly recognize the language. The fact that it is so easy for the unadapted model to pick-up on the identity of the contextual variable fits with our explanation for the small relative gain in perplexity from the adapted models in these two tasks. Figure 3.1: The value of the dimension of the LSTM hidden state in an unadapted model that is the strongest indicator for Spanish text for three different code-switched Tweets. 34 3.4.3 SCOTUS experiments Table 3.7 lists the results for the experiments on the SCOTUS corpus. The size of the context embeddings are 9, 15, and 8 for the case, speaker, and role variables respectively. For calculating perplexity we use a 60,000 sentence evaluation set. For the classification experiment we selected 4,000 sentences from the test data from eleven different justices and attempted to classify the identity of the justice. The perplexity of the distribution of judges over those sentences is 8.9 (11.0 would be uniform). So, the data is roughly balanced. 
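As a side note, the perplexity of a label distribution quoted above is simply the exponential of its entropy, so numbers like 8.9 versus the uniform 11.0 are easy to reproduce. A minimal sketch follows; the non-uniform counts below are made up for illustration, and only the uniform case matches the reported value exactly.

```python
import numpy as np

def label_perplexity(counts):
    """Perplexity of an empirical label distribution: exp of its entropy."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

print(label_perplexity(np.ones(11)))                     # 11.0 for a uniform 11-way split
print(label_perplexity([600, 550, 500, 450, 400,         # a made-up, mildly skewed split
                        350, 300, 250, 200, 150, 100]))  # lands below 11
```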
When classifying justices, the model is given the case context variable, but we do not make any special effort to filter candidates based on who was serving on the court during that time, i.e. all eleven justices are considered for every case. For both the perplexity and classification metrics, the hash adaptation makes a big dif- ference. The model that uses only hash adaptation and no hidden layer adaptation has a better perplexity than any of the model variants that use both hidden adaptation and low-rank adaptation of the output layer. To ascertain which of the context variables have the most impact, we trained additional models with using different combinations of context variables. The model architecture is the one that uses all four forms of adaptation. Results are listed in Table 3.8. The most useful variable is the indicator for the case. The role variable is highly redundant—almost every speaker only appears in a single role. Therefore, it is not surprising that the speaker variable is more useful to the model than the role. In Table 3.9 we list sentences generated from the fully adapted model (same one as the last line in Table 3.7) using beam search. The value of the context variable for the Case is held fixed while we explore different values for the Speaker and Role variables. Anecdotally, we see that the model captures some information about John Roberts role as chief justice. The model learns that Justice Breyer tends to start his questions with the phrase “I mean” while Justice Kagan tends to start with “Well”. Roberts and Kagan appear in our data both as justices and earlier as advocates. 35 Hidden Output × + LR Hash PPL ∆PPL ACC N N N N 37.3 – – N N N Y 31.2 16.5% 29.6 N N Y N 32.9 12.0% 26.2 N N Y Y 30.0 19.6% 28.4 N Y N N 33.2 11.0% 20.0 N Y N Y 29.9 19.8% 26.8 N Y Y N 32.7 12.4% 25.4 N Y Y Y 29.8 20.3% 31.1 Y N N N 33.3 10.7% 17.1 Y N N Y 29.8 20.1% 28.4 Y N Y N 32.3 13.4% 24.5 Y N Y Y 29.2 21.7% 32.4 Y Y N N 33.2 11.0% 18.9 Y Y N Y 29.8 20.1% 29.6 Y Y Y N 32.2 13.7% 26.1 Y Y Y Y 29.4 21.1% 31.9 Table 3.7: Results on the SCOTUS data in terms of perplexity and classification accuracy (ACC) for the justice identification task. 36 Case Spkr. Role PPL N N N 37.3 N N Y 36.5 N Y N 33.6 N Y Y 33.3 Y N N 31.5 Y N Y 30.3 Y Y N 29.6 Y Y Y 29.4 Table 3.8: Perplexities for different combinations of context variables on the SCOTUS corpus. Spkr. Role Sentence Roberts J. We’ll hear argument first this morning in Ayers. Breyer J. I mean, I don’t think that’s right. Kagan J. Well, I don’t think that’s right. Kagan A. Mr. Chief Justice, and may it please the court: Bork A. --No, I don’t think so, your honor. Table 3.9: Sentences generated from the adapted model using beam search under different assumptions for speaker and role contexts. 37 3.5 Comparison to Related Work The multiplicative rescaling of the recurrent layer weights is used in the Hypernetwork model (Ha et al., 2017). The focus of this model is to allow the LSTM to adjust automatically depending on the context of the previous words. This is different from our work in that we are adapting based on contextual information external to the word sequence. Gangireddy et al. (2016) also use a rescaling of the hidden layer for adaptation but it is done as a fine-tuning step and not during training like our model. The RNNME model from Mikolov et al. (2011) uses feature hashing to train a maximum entropy model alongside an RNN language model. The setup is similar to our method of using hashing to learn context-dependent biases. 
However, there are a number of differences. The motivation for the RNNME model was to speedup training of the RNN, not to compen- sate for the inadequacy of low-rank output layer adaptation, which had yet to be invented. Furthermore, Mikolov et al. (2011) do not use context dependent features in the max-ent component of the RNNME model, nor do they have a method for dealing with hash collisions such as our use of Bloom filters. The idea of having one part of a language model be low-rank and another part to be an additive correction to the low-rank model has been investigated in other work (Eisenstein et al., 2011b; Hutchinson et al., 2013; Parikh et al., 2014). In both of these cases, the correction term is encouraged to be sparse by including an L1 penalty. Our implementation did not promote sparsity in the hash adaptation features but this idea is worth further consideration.4 The hybrid LSTM and count based language model is an alternative way of correcting for a low-rank approximation (Neubig and Dyer, 2016). 3.6 Summary While our results suggest that there is not a one-size-fits-all approach to language model adaptation, it is clear that we improve over the standard adaptation approach. The model 4See Appendix A for a deeper look at this idea. 38 from Mikolov and Zweig (2012), equivalent to using just additive adaptation on the hidden layer and low-rank adaptation of the output layer, is outperformed for all three datasets at both the language modeling and classification tasks. The combined low-rank and hash adaptation of the output layer were consistently required to get the best perplexity. For the classification tasks, the multiplicative hidden layer adaptation is clearly useful, as is the combined low-rank and hash adaptation of the output layer. Importantly, there is not always a strong relationship between perplexity and classification scores. Our results may have implications for work on text generation where it can be more desirable to have more control over the generation rather than the lowest perplexity model. This issue is explored further in Chapter 6. Our investigation of the language context in the Twitter experiments gives a useful take- away: context variables that are easily predictable from the text alone are unlikely to be helpful. More studies are needed to get a more complete understanding about what types of context variables will provide the most benefit. To that end, additional contexts are explored in subsequent chapters. Based on the results from the SCOTUS experiments, we know that an additive transfor- mation of the bias by itself is not always the best way to adapt the recurrent layer. This motivated us to look for a new approach that would give more consistent perplexity gains so that we could be confident recommending its use in most situations. Further investigation led us to develop the FactorCell model, which is the subject of the next chapter. 39 Chapter 4 FACTOR CELL MODEL In this chapter we introduce the FactorCell model for adapting the recurrent layer of an RNN language model.1 This is a major contribution of the thesis. Instead of taking context as an additional input to the model, we conceive of adaptation in a totally new way where the model generates a custom recurrent layer for any context. This is accomplished using a low-rank decomposition in order to control the extent of parameter sharing between contexts, which is important for handling high-dimensional, sparse contexts. 
The experiments in this chapter will show that the FactorCell improves perplexity and that it also has qualitative differences that set it apart from other models.

The FactorCell model generalizes the ConcatCell and remedies one of its major weaknesses. When there is a large amount of data available per context, there is less need to share information between contexts. Conversely, when there are many contexts and less training data per context, it is better to do more parameter sharing. The ConcatCell is not able to trade off between these scenarios: it always shares almost all of its parameters across contexts. In contrast, the FactorCell rank hyperparameter allows complete control over how much sharing there is between contexts.

Aside from perplexity, computation cost is always a consideration. A reliable and easy way to reduce perplexity is to increase the recurrent layer dimension, but in latency-constrained applications (most industry speech recognition systems have strict latency constraints) the recurrent state dimension is limited. By design, the FactorCell permits pre-computation and caching so that its overall computational cost is negligibly more than the much simpler ConcatCell. All of the benefits of the additional adaptation are delivered with no extra latency. (This chapter draws on content from some of our previously published work (Jaech and Ostendorf, 2018a).)

The FactorCell is an alternative to the multiplicative transform explored in Chapter 3; it has the advantage of dedicating more parameters to the adaptation and effecting a bigger change on the recurrent layer weights. Unlike the multiplicative transform, we find that the FactorCell model consistently improves perplexity.

4.1 FactorCell Model

Our model uses adaptation in both the recurrent layer and in the bias vector of the output layer. In this section we describe methods for adapting the recurrent layer and the softmax layer, showing that our proposed model is a generalization of most prior methods.

4.1.1 Adapting the recurrent layer

Our proposed model extends the ConcatCell by using a context-dependent weight matrix W′ = W + A in place of the generic weight matrix W. (We refer to W as generic because it is shared across all context settings.) The adaptation matrix, A, is generated by taking the product of the context embedding vector against a set of left and right basis tensors to produce a rank r matrix. The left and right adaptation basis tensors are given as Z_L ∈ R^{k×(p+q)×r} and Z_R ∈ R^{r×q×k}. The two basis tensors together can be thought of as holding k different rank-r matrices, A_j = Z_{L,j} Z_{R,j}, each the size of W. By taking the product of c with the corresponding tensor modes of Z_L and Z_R (using ×_i to denote the mode-i tensor product, i.e., the product with the i-th dimension of the tensor), the context determines the weighted combination of the k matrices:

A = (c ×_1 Z_L)(Z_R ×_3 c^T).   (4.1)

Figure 4.1: Illustration of the FactorCell architecture.

Figure 4.1 is a visualization of the FactorCell architecture. The number of degrees of freedom of A is controlled by the dimension k of the context vector and the rank r of the k weight matrices. The rank is treated as a hyperparameter and controls the extent to which the model relies on the generic weight matrix W versus behaves in a more context-specific manner. We call this model the FactorCell because the weight matrix has been adapted by adding a factored component. The ConcatCell model is a special case of the FactorCell where Z_L and Z_R are set to zero.
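A small NumPy sketch of Equation 4.1 may help make the tensor products concrete. The dimensions below are placeholders; the point is the contraction pattern: contracting c with the first mode of Z_L and the third mode of Z_R yields two factors whose product is a matrix of rank at most r with the same shape as W.

```python
import numpy as np

# Placeholder sizes: p = word embedding, q = recurrent state, k = context, r = rank.
p, q, k, r = 100, 200, 20, 12

rng = np.random.default_rng(0)
Z_L = rng.normal(scale=0.01, size=(k, p + q, r))  # left adaptation basis tensor
Z_R = rng.normal(scale=0.01, size=(r, q, k))      # right adaptation basis tensor
c = rng.normal(size=k)                            # context embedding

left = np.einsum('k,kdr->dr', c, Z_L)             # c x_1 Z_L   ->  (p+q, r)
right = np.einsum('rqk,k->rq', Z_R, c)            # Z_R x_3 c   ->  (r, q)
A = left @ right                                  # adaptation matrix, shape (p+q, q)

assert A.shape == (p + q, q)
assert np.linalg.matrix_rank(A) <= r              # rank never exceeds the hyperparameter r
```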
In summary, the proposed model is given by:

h_t = σ(W′[e_{w_t}, h_{t−1}] + b′_1)
W′ = W + (c ×_1 Z_L)(Z_R ×_3 c)
b′_1 = Qc + b_1.   (4.2)

If the context is known in advance, W′ can be precomputed, in which case applying the RNN at test time requires no more computation than using an unadapted RNN of the same size. This means that, for a fixed-size recurrent layer, the FactorCell model can have many more parameters than the ConcatCell model with hardly any increase in computational cost. When adapting with the FactorCell method, we find it necessary to also include the shift of the bias in the softmax output as described in Equation 2.6.

4.1.2 LSTM FactorCell Equations

Only trivial changes are needed to use the FactorCell method on an LSTM instead of a vanilla RNN. Here, we list the equations for an LSTM with coupled input and forget gates, which is what was used in our experiments. The weight matrix W from Equation 2.2 is now of size 3q × (p + q) and b is of dimension 3q, where 3 is the number of gates. Likewise, Z_R from Equation 4.1 is made to be of size r × 3q × k. The weight matrix W′ is as defined in Equation 4.2, and after computing its product with the input [e_{w_t}, h_{t−1}], the result is split into three vectors of equal size, i_t, f_t, and o_t:

[i_t, f_t, o_t] = W′[e_{w_t}, h_{t−1}] + b_1,   (4.3)

which are used in the input gate, the forget gate, and the output gate, respectively. Using these three vectors, we perform the gating operations to compute h_t using the memory cell m_t as follows:

f_t ← sigmoid(f_t + 1.0)
m_t = m_{t−1} ⊙ f_t + (1.0 − f_t) ⊙ tanh(i_t)
h_t = tanh(m_t) ⊙ sigmoid(o_t)   (4.4)

Note that Equation 3.1, which shows that a context vector concatenated with the input is equivalent to an additive bias term, extends to Equation 4.3. In other words, in the LSTM version of the ConcatCell model, the context vector effectively introduces an extra bias term for each of the three gates.
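For completeness, here is a minimal NumPy sketch of one step of the coupled-gate LSTM in Equations 4.3-4.4, taking the adapted matrix W′ as a precomputed input (it only needs to be formed once per context, e.g. with the contraction sketched above plus the generic W). Shapes and weights are illustrative placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factorcell_lstm_step(e_wt, h_prev, m_prev, W_adapted, b1):
    """One coupled input/forget gate LSTM step (Equations 4.3-4.4).

    W_adapted is W' = W + A, laid out here with shape (p + q, 3q) so that a
    single product produces the pre-activations for all three gates.
    """
    x = np.concatenate([e_wt, h_prev])               # [e_{w_t}, h_{t-1}]
    i_t, f_t, o_t = np.split(x @ W_adapted + b1, 3)  # Equation 4.3
    f_t = sigmoid(f_t + 1.0)                         # forget gate, biased toward remembering
    m_t = m_prev * f_t + (1.0 - f_t) * np.tanh(i_t)  # coupled input/forget update
    h_t = np.tanh(m_t) * sigmoid(o_t)                # Equation 4.4
    return h_t, m_t

# Toy usage with placeholder sizes.
p, q = 24, 200
rng = np.random.default_rng(0)
W_adapted = rng.normal(scale=0.05, size=(p + q, 3 * q))
b1 = np.zeros(3 * q)
h, m = np.zeros(q), np.zeros(q)
h, m = factorcell_lstm_step(rng.normal(size=p), h, m, W_adapted, b1)
```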
4.2 Data

The experiments make use of six datasets: four targeting word-level sequences and two targeting character sequences. The character studies are motivated by the growing interest in character-level models in both speech recognition and machine translation (Hannun et al., 2014; Chung et al., 2016). By using multiple datasets with different types of context, we hope to learn more about what makes a dataset amenable to adaptation. The datasets range in size from over 100 million words of training data down to 5 million characters of training data for the smallest one. When using a word-based vocabulary, we preprocess the data by lowercasing, tokenizing, and removing most punctuation. We also truncate sentences to be shorter than a maximum length of 60 words for AGNews and DBPedia and 150 to 200 tokens for the remaining datasets. Summary information is provided in Table 4.1, including the training, development, and test data sizes in terms of number of tokens, vocabulary size, number of training documents (i.e., context samples), and the context variables (f_{1:n}).

Name         | Train  | Dev  | Test | Vocab  | Docs. | Context
AGNews       | 4.6M   | 0.2M | 0.3M | 54,492 | 115K  | 4 newspaper sections
DBPedia      | 28.7M  | 0.3M | 3.6M | 84,341 | 555K  | 14 entity categories
TripAdvisor  | 127.2M | 2.6M | 2.6M | 88,347 | 843K  | 3.5K hotels / 5 sentiment
Yelp         | 91.5M  | 0.7M | 7.1M | 57,794 | 645K  | 5 sentiment
EuroTwitter* | 5.3M   | 0.8M | 1.0M | 194    | 80K   | 9 languages
GeoTwitter*  | 51.7M  | 2.2M | 2.2M | 203    | 604K  | latitude & longitude

Table 4.1: Dataset statistics: dataset size in words (* or characters) of train, dev, and test sets, vocabulary size, number of training documents, and context variables.

The largest dataset, TripAdvisor, has over 800 thousand hotel review documents, which adds up to over 125 million words of training data. The first three datasets (AGNews, DBPedia, and Yelp) have previously been used for text classification (Zhang et al., 2015). These consist of newspaper headlines, encyclopedia entries, and restaurant and business reviews, respectively. The context variables associated with these correspond to the newspaper section (world, sports, business, sci & tech) for each headline, the page category on DBPedia (out of 14 options such as actor, athlete, building, etc.), and the star rating on Yelp (from one to five). For AGNews, DBPedia, and Yelp, we use the same test data as in previous work. Our fourth dataset, from TripAdvisor, was previously used for language modeling and comes with two relevant context variables: an identifier for the hotel and a sentiment score from one to five stars (Tang et al., 2016). Some of the reviews are written in French or German, but most are in English. There are 4,333 different hotels, but we group all the ones that do not occur at least 50 times in the training data into a single entity, leaving us with around 3,500. These four datasets use word-based vocabularies.

We also experiment on two Twitter datasets: EuroTwitter and GeoTwitter. EuroTwitter is the same as the Twitter data used in the previous chapter and consists of 80 thousand Tweets labeled with one of nine languages (English, Spanish, Galician, Catalan, Basque, Portuguese, French, German, and Italian). The corpus was created by combining portions of multiple published datasets for language identification, including Twitter70 (Jaech et al., 2016), TweetLID (Zubiaga et al., 2014), and the monolingual portion of Tweets from a code-switching detection workshop (Molina et al., 2016). The GeoTwitter data contains Tweets with latitude and longitude information from England, Spain, and the United States. (Data was accessed from http://followthehashtag.com.) The latitude and longitude coordinates are given as numerical inputs. This is different from the other five datasets, which all use categorical context variables.

4.3 Experiments with Different Contexts

The goal of the experiments is to show that the FactorCell model can deliver improved performance over current approaches for multiple language model applications and a variety of types of context. Specifically, results are reported for context-conditioned perplexity and generative model text classification accuracy, using contexts that capture a range of phenomena and dimensionalities.

Test set perplexity is the most widely accepted method for evaluating language models, both for use in recognition/translation applications and generation. It has the advantage that it is easy to measure and is widely used as a criterion for model fit, but the limitation that it is not directly matched to most tasks that language models are used for. Text classification using the model in a generative classifier is a simple application of Bayes' rule:

ω̂ = argmax_ω p(w_{1:T} | ω) p(ω),   (4.5)

where w_{1:T} is the text sequence and p(ω) is the class prior, which we assume to be uniform. Classification accuracy provides additional information about the power of a model, even if it is not being designed explicitly for text classification. Further, it allows us to directly compare our model's performance against previously published text classification benchmarks.
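The following is a minimal sketch of the generative classifier in Equation 4.5. It assumes a helper conditional_log_prob(w, history, omega) that returns log p(w | w_{1:t-1}, ω) from the adapted language model; that helper is a stand-in introduced only for illustration.

```python
def classify(tokens, class_values, conditional_log_prob):
    """Generative classification by Bayes' rule (Equation 4.5) with a uniform class prior."""
    scores = {
        omega: sum(conditional_log_prob(w, tokens[:t], omega)
                   for t, w in enumerate(tokens))
        for omega in class_values
    }
    return max(scores, key=scores.get)

# Example usage (hypothetical): classify(review_tokens, range(1, 6), lm_log_prob)
# would recover the most likely Yelp star rating for a review.
```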
Although the most effective models for text classification have generally been discriminative, generative models can be competitive when the available training data is small or text samples are short (Yogatama et al., 2017), and we find that the FactorCell makes the generative model more competitive. Note that the use of classification accuracy for evaluation here involves counting errors associated with applying the generative model to independent test samples. This differs from the accuracy criterion used for evaluating context-sensitive language models for text generation based on a separate discriminative classifier trained on generated text (Ficler and Goldberg, 2017; Hu et al., 2017). We discuss this further in Section 4.5 and Chapter 6. The experiments compare the FactorCell model (equations 4.2 and 2.6) to two popular alternatives, which we refer to as ConcatCell (equations 2.5 and 2.6) and SoftmaxBias (equa- tion 2.6). As noted earlier, the SoftmaxBias method is a simplification of the ConcatCell model, which is in turn a simplification of the FactorCell model. The SoftmaxBias method impacts only the output layer and thus only unigram statistics. Since bag-of-word models provide strong baselines in many text classification tasks, we hypothesize that the Soft- maxBias model will capture much of the relative improvement over the unadapted model for word-based tasks. However, in small vocabulary character-based models, the unigram dis- tribution is unlikely to carry much information about the context, so adapting the recurrent layer should become more important in character-level models. We expect that performance gains will be greatest for the FactorCell model for sources that have sufficient structure and data to support learning the extra degrees of freedom. Another possible baseline would use models independently trained on the subset of data for each context. This is the “independent component” case in (Yogatama et al., 2017). This will fail when a context variable takes on many values (or continuous values) or when training data is limited, because it makes poor use of the training data, as shown in that study. While we do have some datasets where this approach is plausible, we feel that its limitations have been clearly established. 46 4.3.1 Implementation Details The RNN variant that we use is an LSTM with coupled input and forget gates (Melis et al., 2018). The different model variants are implemented3 using the Tensorflow library. The model is trained with the standard negative log likelihood loss function, i.e. minimizing cross entropy. Dropout was used as a regularizer in the recurrent connections (Semeniuta et al., 2016). Training is done using the Adam optimizer with a learning rate of 0.001. For the models with word-based vocabularies, a sampled softmax loss is used with a unigram proposal distribution and sampling 150 words at each time-step (Jean et al., 2014). The classification experiments use a sampled softmax loss with a sample size of 8,000 words. This is an order of magnitude faster to compute with a minimal effect on accuracy. AgNews DBPedia EuroTwtr GeoTwtr Trip Yelp Word Embed 150 114-120 35-40 42-50 100 200 LSTM dim 110 167-180 250 250 200 200 Steps 4.1-5.5K 7.5-8.0K 6.0-8.0K 6.0-11.1K 8.4-9.9K 7.2-8.8K Dropout 0.5 1.00 0.95-1.00 0.99-1.00 0.97-1.00 1.00 Ctx. Embed 2 12 3-5 8-24 20-30 2-3 Rank 12 19 2 20 12 9 Table 4.2: Selected hyperparameters for each dataset. 
When a range is listed it means that a different values were selected for the FactorCell, ConcatCell, SoftmaxBias or Unadapted models. Hyperparameter tuning was done based on minimizing perplexity on the development set and using a random search. Hyperparameters included word embedding size e, recurrent state size d, context embedding size k, and weight adaptation matrix rank r, the number of training steps, recurrent dropout probability, and random initialization seed. We conducted more than 700 tuning experiments with iterative refinements. The number of experiments 3Code available at http://github.com/ajaech/calm. 47 per dataset vaires between 74 and 190. The selected hyperparameter values are listed in Table 4.2. For any fixed LSTM size, the FactorCell has a higher count of learned parameters compared to the ConcatCell. However, during evaluation both models use approximately the same number of floating-point operations because W′ only needs to be computed once per sentence. Because of this, we believe limiting the recurrent layer cell size is a fair way to compare between the FactorCell and the ConcatCell. 4.3.2 Word-based Models AGNews DBPedia TripAdvisor Yelp Model PPL ACC PPL ACC PPL ACC PPL ACC Unadapted 96.2 – 44.1 – 51.6 – 67.1 – SoftmaxBias 95.1 90.6 40.4 95.5 48.8 51.9 66.9 51.6 ConcatCell 93.8 89.7 39.5 97.8 48.3 56.0 66.8 56.9 FactorCell 92.3 90.6 37.7 98.2 48.2 58.2 66.2 58.8 Table 4.3: Perplexity and classification accuracy on the test set for the four word-based datasets. Perplexities and classification accuracies for the four word-based datasets are presented in Table 4.3. In each of the four datasets, the FactorCell model gives the best perplexity. For classification accuracy, there is a bigger difference between the models, and the FactorCell model is the most accurate on three out of four datasets and tied with the SoftmaxBias model on AgNews. For DBPedia and TripAdvisor, most of the improvement in perplexity relative to the unadapted case is achieved by the SoftmaxBias model with smaller relative improvements coming from the increased power of the ConcatCell and FactorCell models. For Yelp, the perplexity improvements are small; the FactorCell model is just 1.3% better than the unadapted model. From (Yogatama et al., 2017), we see that for AGNews, much more so than for other 48 datasets, the unigram statistics capture the discriminating information, and it is the only dataset in that work where a naive Bayes classifier is competitive with the generative LSTM for the full range of training data. The fact that the SoftmaxBias model gets the same accuracy as the FactorCell model on this task suggests that topic context may benefit less from adapting the recurrent layer. For the DBPedia and Yelp datasets, the FactorCell model beats previously reported classification accuracies for generative models (Yogatama et al., 2017). However, it is not competitive with state-of-the-art discriminative models on these tasks with the full training set. With less training data, it probably would be, based on the results in (Yogatama et al., 2017). The numbers in Table 4.3 do not adequately convey the fact that there are hyperparame- ters with an effect on perplexity that is greater than the sometimes small relative differences between models. Even the seed for the random weight initialization can have a “major im- pact” on the final performance of an LSTM (Reimers and Gurevych, 2017). We use Figure 4.2 to show how the three classes of models perform across a range of hyperparameters. 
The figure compares perplexity on the x-axis with accuracy on the y-axis with both metrics computed on the development set. Each point in this figure represents a different instance of the model trained with random hyperparameter settings and the best results are in the upper right corner of each plot. The color/shape differences of the points correspond to the three classes of models: FactorCell, ConcatCell, and SoftmaxBias. Within the same model class but across different hyperparameter settings, there is much more variation in perplexity than in accuracy. The LSTM cell size is mainly responsible for this; it has a much bigger impact on perplexity than on accuracy. It is also apparent that the models with the lowest perplexity are not always the ones with the highest accuracy. Notably, improvements in perplexity are associated with a decrease in accuracy for the SoftmaxBias models on the DBPedia, TripAdvisor, and Yelp datasets. In Bowman et al. (2016), it was observed that the more powerful the decoder of a variational auto-encoder was the more likely it was to ignore the prior information given by the encoder. A similar effect 49 Figure 4.2: Accuracy vs. perplexity for different classes of models on the four word-based datasets. is happening here where when the recurrent layer size is increased then the model relies less on the unigram prior provided by the adapted softmax bias vector. See Section 4.3.4 for further hyperparameter analysis. Figure 4.3 is a visualization of the per-word log likelihood ratios between a model as- suming a 5 star review and the same model assuming a 1 star review. Likelihoods were computed using an ensemble of three models to reduce variance. The analysis is repeated for each class of model. Words highlighted in blue are given a higher likelihood under the 5 star assumption. Unigrams with strong sentiment such as “lovely” and “friendly” are well-represented by all three models. The reader may not consider the tokens “craziness” or “5-8pm” to be strong indicators of a positive review but the way they are used in this review is representative of 50 how they are typically used across the corpus. As expected, the ConcatCell and FactorCell model capture the sentiment of multi-token phrases. As an example, the unigram “enough” is 3% more likely to occur in a 5 star review than in a 1 star review. However, “do enough” is 30 times more likely to appear in a 5 star review than in a 1 star review. In this example, the FactorCell model does a better job of handling the word “enough.” 4.3.3 Character-based Models Next, we evaluate the EuroTwitter and GeoTwitter models using both perplexity and a classification task. For EuroTwitter, the classification task is to identify the language. With GeoTwitter, it is less obvious what the classification task should be because the context values are continuous and not categorical. We selected six cities and then assigned each sentence the label of the closest city in that list while still retaining the exact coordinates of the Tweet. There are two cities from each country: Manchester, London, Madrid, Barcelona, New York City, and Los Angeles. Tweets from locations further than 300 km from the nearest city in the list were discarded when evaluating the classification accuracy. 
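The nearest-city labeling just described can be reproduced with a few lines of code. The city coordinates below are approximate stand-ins included only for illustration; the 300 km cutoff matches the rule used for the accuracy evaluation.

```python
import numpy as np

# Approximate city centers (latitude, longitude), for illustration only.
CITIES = {
    "Manchester": (53.48, -2.24), "London": (51.51, -0.13),
    "Madrid": (40.42, -3.70), "Barcelona": (41.39, 2.17),
    "New York City": (40.71, -74.01), "Los Angeles": (34.05, -118.24),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def city_label(lat, lon, max_km=300.0):
    """Nearest of the six cities, or None if the Tweet is too far from all of them."""
    name, dist = min(
        ((city, haversine_km(lat, lon, *coord)) for city, coord in CITIES.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= max_km else None

print(city_label(40.0, -3.9))   # Madrid
print(city_label(48.85, 2.35))  # None: Paris is more than 300 km from all six cities
```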
The classification task is sufficient to investigate the properties of the language model but, unlike some prior work, it is not designed to capture geographic lexical variations in an easily interpretable manner (Eisenstein et al., 2010) nor is it designed to be efficient at geolocation (Han et al., 2014). Perplexities and classification accuracies are presented in Table 4.4. The FactorCell model has the lowest perplexity and the highest accuracy for both datasets. Again, the FactorCell model clearly improves on the ConcatCell as measured by classification accuracy. Consistent with our hypothesis, adapting the softmax bias is not effective for these small vocabulary character-based tasks. The SoftmaxBias model has small perplexity improvements (< 1%) and low classification accuracies. Figure 4.4 compares perplexity and classification accuracy for different hyperparameter settings of the character-based models. Again, we see that it is possible to trade-off some 51 SoftmaxBias ConcatCell FactorCell Figure 4.3: Log likelihood ratio between a model that assumes a 5 star review and the same model that assumes a 1 star review. Blue indicates a higher 5 star likelihood and red is a higher likelihood for the 1 star condition. 52 EuroTwitter GeoTwitter Model PPL ACC PPL ACC Unadapted 6.35 – 4.64 – SoftmaxBias 6.29 43.0 4.63 29.9 ConcatCell 6.17 91.5 4.54 42.2 FactorCell 6.07 93.3 4.52 63.5 Table 4.4: Perplexity and classification accuracies for the EuroTwitter and GeoTwitter datasets. perplexity for gains in classification accuracy. For EuroTwitter, if tuning is done on accuracy rather than perplexity then the accuracy of the best model is as high as 95%. 4.3.4 Hyperparameter Analysis The hyperparameter with the strongest effect on perplexity is the size of the LSTM. This was consistent across all six datasets. The effect on classification accuracy of increasing the LSTM size was mixed. Increasing the context embedding size generally helped with accuracy on all datasets, but it had a more neutral effect on TripAdvisor and Yelp and increased perplexity on the two character-based datasets. For the FactorCell model, increasing the rank of the adaptation matrix tended to lead to increased classification accuracy on all datasets and seemed to help with perplexity on AGNews, DBPedia, and TripAdvisor. Figure 4.5 compares the effect on perplexity of the LSTM parameter count and the Fac- torCell rank hyperparameters. Each point in those plots represents a separate instance of the model with varied hyperparameters. In the right subplot of Figure 4.5, we see that increasing the rank hyperparameter improves perplexity. This is consistent with our hypothesis that increasing the rank can let the model adapt more. The variance is large because differences in other hyperparameters (such as hidden state size) also have an impact. In the left subplot we compare the performance of the FactorCell with the ConcatCell as 53 Figure 4.4: Accuracy vs. Perplexity for different classes of models on the two character-based datasets. the size of the word embeddings and recurrent state change. The x-axis is the size of the W recurrent weight matrix, specifically 3(e + d)d for an LSTM with 3 gates. Since the adapted weights can be precomputed, the computational cost is roughly the same for points with the same x-value. For a fixed-size hidden state, the FactorCell model has a better perplexity than the ConcatCell. 
Since performance can be improved both by increasing the recurrent state dimension and/or by increasing rank, we examined the relative benefits of each. The perplexity of a FactorCell model with an LSTM size of 120K will improve by 5% when the rank is increased from 0 to 20. To get the same decrease in perplexity by changing the size of the hidden state would require 160K parameters, resulting in a significant computational advantage for the FactorCell model. Using a one-hot vector for adapting the softmax bias layer in place of the context em- bedding when adapting the softmax bias vector tended to have a large positive effect on accuracy leaving perplexity mostly unchanged. Recall from Section 2.3 that if the number of values that a context variable can take on is small then we can allow the model to choose between using the low-dimensional context embedding or a one-hot vector. This option is 54 Figure 4.5: Comparison of the effect of LSTM parameter count and FactorCell rank hyper- parameters on perplexity for DBPedia. not available for the TripAdvisor and the GeoTwitter datasets because the dimensionality of their one-hot vectors would be too large.4 The method of adapting the softmax bias is the main explanation for why some ConcatCell models performed significantly above/below the trendline for DBPedia in Figure 4.2. We experimented with an additional hyperparameter on the Yelp dataset, namely the inclusion of layer normalization (Ba et al., 2016). (We had ruled-out using layer normaliza- tion in preliminary work on the AGNews data before we understood that AGNews is not representative, so only one task was explored here.) Layer normalization significantly helped the perplexity on Yelp (≈ 2% relative improvement) and all of the top-performing models on the held-out development data had it enabled. 4.4 Analysis for Sparse Contexts The TripAdvisor data is an interesting case because the original context space is high di- mensional (3500 hotels × 5 user ratings) and sparse. Since the model applies end-to-end learning, we can investigate what the context embeddings learn. In particular, we looked at location (hotels are from 25 cities in the United States) and class of hotel, neither of which 4Another option for the TripAdvisor data is to use the hashing method from Section 3.2.2. 55 are input to the model. All of what it learns about these concepts come from extracting information from the text of the reviews. To visualize the embedding, we used a 2-dimensional principal component analysis (PCA) projection of the embeddings of the 3500 hotels. We found that the model learns to group the hotels based on geographic region; the projected embeddings for the largest cities are shown in Figure 4.6, plotting the 1.5σ ellipsoid of the Gaussian distribution of the points. (Actual points are not shown to avoid clutter.) Not only are hotels from the same city grouped together, cities that are close geographically appear close to each other in the embedding space. Cities in the Southwest appear on the left of the figure, the West coast is on top and the East coast and Midwest is on the right side. This is likely due in part to the impact of the region on activities that guests may mention, but there also appears to be a geographic sampling bias in the hotel class that may impact language use. Figure 4.7 shows the projected hotel embeddings in the same space as Figure 4.6 except that the points are now colored based on the hotel class. 
Class is a rating from an independent agency that indicates the level of service and amenities that customers can expect to receive at a hotel. Whereas, the star rating is the average score given to each establishment by the customers who reviewed it. Hotel class does not determine star rating although they are correlated (r = 0.54). The dataset does not contain a uniform sample of hotel classes from each city. The hotels included from Boston, Chicago, and Philly are almost exclusively high class and the ones from L.A. and San Diego happen to be low class, so the embedding distributions also reflect hotel class: lower class hotels towards the top left and higher class hotels towards the bottom right. The visualization for the ConcatCell and SoftmaxBias models are similar. Another way of understanding what the context embeddings represent is to compute the softmax bias projection Gc and examine the words that experience the biggest increase in probability. We show three examples in Table 4.5. In each case, the top words are strongly related to geography and include names of neighborhoods, local attractions, and other hotels in the same city. The top boosted words are relatively unaffected by changing the rating. 56 Figure 4.6: Distribution of a PCA projection of hotel embeddings from the TripAdvisor FactorCell model showing the grouping of the hotels by city. 57 Figure 4.7: Distribution of a PCA projection of the hotel embeddings from the TripAdvisor FactorCell model showing the grouping of hotels by class. 58 (Recall that the hotel identifier and the user rating are the only two inputs used to create the context embedding.) This table combined with the other visualizations indicates that location effects tend to dominate in the output layer, which may explain why the two models adapting the recurrent network seem to have a bigger impact on classification performance. Hotel City Class Rating Top Boosted Words Amalfi Chicago 4.0 5 amalfi, chicago, allegro, burnham, sable, michigan, acme, conrad, tal- bott, wrigley BLVD Hotel Suites Los Angeles 2.5 3 hollywood, kodak, highland, univer- sal, reseda, griffith, grauman’s, bev- erly, ventura Four Points Sheraton Seattle 3.0 1 seattle, pike, watertown, deca, nee- dle, pikes, pike’s monorail, uw, safeco Table 4.5: The top boosted words in the Softmax bias layer for different context settings in a FactorCell model. 4.5 Comparison to Related Work The studies that most directly relate to our work are neural models that correspond to special cases of the more general FactorCell model, including those that leverage what we call the SoftmaxBias model (Dieng et al., 2016; Tang et al., 2016; Yogatama et al., 2017; Ficler and Goldberg, 2017) and others that use the ConcatCell approach (Mikolov and Zweig, 2012; Wen et al., 2013; Chen et al., 2015; Ghosh et al., 2016). The FactorCell model is distinguished by having an additive (factored) context-dependent transformation of the recurrent layer weight matrix. A related additive context-dependent transformation has been proposed for log-bilinear sequence models (Eisenstein et al., 2011a; 59 Hutchinson et al., 2015), but these are less powerful than the RNN. Factored tensors have been successfully used in other NLP applications such as dependency parsing (Lei et al., 2014). A somewhat different use of low-rank factorization has previously been used to reduce the parameter count in an LSTM LM (Kuchaiev and Ginsburg, 2017), finding that the reduced number of parameters leads to faster training. 
Much of the work on context-adaptive neural language models has focused on incorpo- rating document or topic information (Mikolov and Zweig, 2012; Ji et al., 2015; Ghosh et al., 2016; Dieng et al., 2016), where context is defined in terms of word or n-gram statistics. Our work differs from these studies in that the context is defined by a variety of sources, including discrete and/or continuous metadata, which is mapped to a context vector in end-to-end training. Context-sensitive language models for text generation tend to involve other forms of context similar to the objective of our work, including speaker characteristics (Luan et al., 2016; Li et al., 2016), dialog act (Wen et al., 2015), sentiment and other fac- tors (Tang et al., 2016; Hu et al., 2017), and style (Ficler and Goldberg, 2017). Our work is distinctive in assessing performance over a broad variety of context variables. 4.6 Summary In summary, this chapter has introduced a new model for adapting (or controlling) a lan- guage model depending on contextual metadata. The FactorCell model extends prior work with context-dependent RNNs by using the context vector to generate a low-rank, factored, additive transformation of the recurrent cell weight matrix. Experiments with six tasks show that the FactorCell model matches or exceeds performance of alternative methods in both perplexity and text classification accuracy. Findings hold for a variety of types of context, including high-dimensional contexts, and the adaptation of the recurrent layer is particularly important for character-level models. For many contexts, the benefit of the FactorCell model comes with essentially no additional computational cost at test time, since the transforma- tions can be pre-computed. Analyses of a dataset with a high-dimensional sparse context vector show that the model learns context similarities to facilitate parameter sharing. 60 An adapted language model needs to memorize information about the unique language patterns of each context. One way of thinking about the difference between the ConcatCell and the FactorCell models is to ask where in the model is that information encoded. The context embedding c has too few bits to hold much information by itself. The same is true for the ConcatCell’s Q matrix from Equation 3.1. The information must be held inside the shared weight matrix W. This exposes a weakness of the ConcatCell model. If only one context is being used at a time why should the weights for all the contexts be active all the time? The FactorCell model has many extra parameters stored in the ZL and ZR tensors. It is able to offload context specific information to these locations and only use it when it is needed. The models evaluated here were tuned to minimize perplexity, as is typical for language modeling. In analyses of performance with different hyperparameter settings, we find that perplexity is not always positively correlated with accuracy, but the criteria are more often correlated for approaches that adapt the recurrent layer. While not surprising, the results raise concerns about using perplexity as the sole evaluation metric for context-aware lan- guage models. More work is needed to understand the relative utility of these objectives for language model design, which we address in Chapter 6. In real applications, it is possible for the meaning of the contexts to shift over time and the adapted model becomes out-of-date. 
To fix this, the model needs to be replenished by re-training with new data or it can be updated continuously using online learning. We address the latter of these two strategies in the next chapter. 61 Chapter 5 PERSONALIZED QUERY AUTO-COMPLETION 5.1 Background This chapter demonstrates the benefits of the FactorCell model on the real-world task of query auto-completion (QAC).1 QAC is a feature used by search engines that provides a list of suggested queries for the user as they are typing. For instance, if the user types the prefix “mete” then the system might suggest “meters” or “meteorite” as completions. This feature can save the user time and reduce cognitive load (Cai et al., 2016). Most approaches to QAC are extensions of the Most Popular Completion (MPC) algo- rithm (Bar-Yossef and Kraus, 2011). MPC suggests completions based on the most popular queries in the training data that match the specified prefix. One way to improve MPC is to consider additional signals such as temporal information (Shokouhi and Radinsky, 2012; Whiting and Jose, 2014) or information gleaned from a users’ past queries (Shokouhi, 2013). This chapter deals with the latter of those two signals, i.e. personalization. Personaliza- tion relies on the fact that query likelihoods are drastically different among different people depending on their needs and interests. Recently, (Park and Chiba, 2017) suggested a significantly different approach to QAC. In their work, completions are generated from a character LSTM language model instead of by ranking completions retrieved from a database, as in the MPC algorithm. This approach is able to complete queries whose prefixes were not seen during training and has significant memory savings over having to store a large query database. Building on this work, we consider the task of personalized QAC using an LSTM language model, combining the obvious advantages of personalization with the effectiveness of the 1The content of this chapter draws from our previously published work (Jaech and Ostendorf, 2018b). 62 language model in handling rare and previously unseen prefixes. The model must learn how to extract information from a user’s past queries and use it to adapt the generative model for that person’s future queries. User information is held in the form of a low- dimensional embedding that represents the person’s interests and latent demographic factors. The experiments demonstrate that by allowing a greater fraction of the parameters to change in response to the user embeddings, the FactorCell has an advantage over the traditional approach to RNN language model adaptation that increases as more examples from the user are seen. This task is different from the experiments described in Chapter 4 because new contexts (users) can emerge after the model has been trained and deployed. We introduce a mechanism to do online learning of the user embeddings, something that has not been addressed in prior work on language model adaptation. Table 5.1 provides an anecdotal example from the trained FactorCell model to demonstrate the intended behavior. The table shows the top five completions for the prefix “ba” in a cold start scenario and again after the user has completed five sports related queries. In the warm start scenario, the “baby names” and “babiesrus” completions no longer appear in the top five and have been replaced with “basketball” and “baseball”. 
Returning to the example in Table 5.1: in the online learning scenario, the FactorCell model makes efficient use of the user query history to quickly improve the quality of its completions.

     Cold Start          Warm Start
1    bank of america     bank of america
2    barnes and noble    basketball
3    babiesrus           baseball
4    baby names          barnes and noble
5    bank one            baltimore

Table 5.1: Top five completions for the prefix "ba" for a cold start model with no previous queries from that user and a warm model that has seen the queries espn, sports news, nascar, yankees, and nba.

While the standard implementation of MPC cannot handle unseen prefixes, there are variants that do have that ability. Park and Chiba (2017) find that the neural LM outperforms MPC even when MPC has been augmented with the approach from Mitra and Craswell (2015) for handling rare prefixes. There has also been work on personalizing MPC (Shokouhi, 2013; Cai et al., 2014). We did not compare against these specific models because our goal was to show how personalization can improve the already-proven generative neural model approach. RNNs have also previously been used for the related task of next query suggestion (Sordoni et al., 2015). Wang et al. (2018) show how spelling correction can be integrated into an RNN language model query auto-completion system and how the completions can be generated in real time using a GPU. Our method of updating the model during evaluation resembles work on dynamic evaluation for language modeling (Krause et al., 2017), but differs in that only the user embeddings (latent demographic factors) are updated. If the rest of the model is allowed to update, then either some user embeddings become stale or there is a large memory cost to hold a different version of the model for each person. One possibility would be to allow the full model to update and also implement a policy to invalidate stale user embeddings, but that is beyond the scope of this work.

5.2 Model

Adaptation depends on learning an embedding for each user, which we discuss in Section 5.2.1, and then using that embedding to adjust the weights of the recurrent layer, discussed in Section 5.2.2.

5.2.1 Learning User Embeddings

During training, we learn an embedding for each of the users. We think of these embeddings as holding latent demographic factors for each user. Users who have fewer than 15 queries in the training data (around half the users but less than 13% of the queries) are grouped together as a single entity, user1, leaving k users. The k × m user embedding matrix U, where m is the user embedding size, is learned via back-propagation as part of the end-to-end model. The embedding for an individual user is the i-th row of U and is denoted by ui. The user embedding plays the same role as the context embedding from Section 2.3.1 that is elsewhere denoted by c.

It is important to be able to apply the model to users that are not seen during training. This is done by online updating of the user embeddings during evaluation. When a new person, user k+1, is seen, a new row is added to U and initialized to u1. Each person's user embedding is updated via back-propagation every time they select a query. When doing online updating of the user embeddings, the rest of the model parameters (everything except U) are frozen. The learning rate for the online updating should be different from the one used for training the rest of the model. When training the full model, the goal is to converge to a global minimum.
During online updating the user embeddings do not converge to a fixed point but continue to track the query history. The optimal learning rate must be found using validation data. 5.2.2 Recurrent Layer Adaptation We consider three model architectures which differ only in the method for adapting the recurrent layer. First is the unadapted LM, analogous to the model from Park and Chiba (2017), which does no personalization. The second architecture is the ConcatCell, which concatenates a user embedding to the character embedding at every step of the input to the recurrent layer. For the third model, we test the FactorCell’s ability to let the user embedding transform the weights of the recurrent layer. Unlike the previous chapter, we do not incorporate the ConcatCell adaptation of the recurrent layer bias when using the FactorCell model. We found that this was unnecessary in preliminary experiments. When operating at the character-level, the unigram distribution is not particularly in- formative of the differences between users. This is unlike the word-level models, where the 65 unigram distribution can carry topic information. For this reason, we do not employ any of the techniques from Chapter 4 for adapting the softmax bias vector. 5.3 Data The experiments make use of the AOL Query data collected over three months in 2006 (Pass et al., 2006). The first six of the ten files were used for training. This contains approximately 12 million queries from 173,000 users for an average of 70 queries per user (median 15). A set of 240,000 queries from those same users (2% of the data) was reserved for tuning and validation. From the remaining files, one million queries from 20,000 users are used to test the models on a disjoint set of users. The chronological ordering of the queries is ignored during training but respected during testing. Our results are not directly comparable to Park and Chiba (2017) or Mitra and Craswell (2015) due to differences in the partitioning of the data and the method for selecting random prefixes. Prior work partitions the data by time instead of by user. Splitting by users is necessary in order to properly test personalization over longer time ranges. 5.4 Experiments 5.4.1 Experiment Details The vocabulary consists of 79 characters including special start and stop tokens. Models were trained for six epochs. The Adam optimizer is used during training with a learning rate of 10−3 (Kingma and Ba, 2014). The language model is a single-layer character-level LSTM with coupled input and forget gates and layer normalization (Melis et al., 2018; Ba et al., 2016). We do experiments on two model configurations: small and large. The small models use an LSTM hidden state size of 300 and 20 dimensional user embeddings. The large models use a hidden state size of 600 and 40 dimensional user embeddings. Both sizes use 24 dimensional character embeddings. For the small sized models, we experimented with different values of the FactorCell rank hyperparameter between 30 and 50 dimensions 66 finding that bigger rank is better. The large sized models used a fixed value of 60 for the rank hyperparemeter. During training only and due to limited computational resources, queries are truncated to a length of 40 characters. The model2 is implemented using Tensorflow. When updating the user embeddings during evaluation, we found that it is easier to use an optimizer without momentum. 
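A minimal sketch of this online update is shown below, written with NumPy and plain, momentum-free gradient steps. The gradient function is a stand-in for backpropagation through the character LSTM, and the function names and learning rate are our own simplifications rather than the exact implementation.

```python
import numpy as np

def add_new_user(U):
    """Cold start: when user k+1 appears, append a new row to U initialized to
    u1, the embedding shared by the pooled low-count training users (row 0 here)."""
    return np.vstack([U, U[0].copy()])

def online_update(U, user_id, grad_fn, lr=0.05):
    """After the user selects a query, take one momentum-free gradient step on
    that user's embedding only; every other model parameter stays frozen."""
    U[user_id] -= lr * grad_fn(U[user_id])
    return U

# Hypothetical usage with a stand-in gradient function; a real system would
# backpropagate the query's negative log likelihood through the frozen LSTM
# to obtain the gradient with respect to the user embedding.
U = np.zeros((6, 20))                        # 6 known users, 20-dim embeddings
U = add_new_user(U)                          # an unseen user appears at test time
stub_grad = lambda u: 0.01 * np.ones_like(u)
U = online_update(U, user_id=6, grad_fn=stub_grad)
```

Because only one row of U changes per step, the per-user state is tiny and the rest of the model can be shared across all users.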
We use Adadelta (Zeiler, 2012) and tune the online learning rate to give the best perplexity on a held-out set of 12,000 queries, having previously verified (See Figure 5.1) that perplexity is a good indicator of performance on the QAC task. The learning rate is the only parameter that needs to be tuned for online learning. Prefixes are selected uniformly at random with the constraint that they contain at least two characters in the prefix and that there is at least one character in the completion. To generate completions using beam search, we use a beam width of 100 and a branching factor of 4. Results are reported using mean reciprocal rank (MRR), the standard method of evaluating QAC systems. It is the mean of the reciprocal rank of the true completion in the top ten proposed completions. The reciprocal rank is zero if the true completion is not in the top ten. Neural models are compared against an MPC baseline. Following (Park and Chiba, 2017), we remove queries seen less than three times from the MPC training data to avoid excessive memory usage. 5.4.2 Results The training objective, minimizing perplexity, differs from the evaluation metric, maximizing the mean reciprocal rank of the query completion. Fortunately, the mismatch between met- rics is not large. As shown in Figure 5.1, perplexity correlates with MRR on the development data. Table 5.2 compares the performance of the different models against the MPC baseline on a test set of one million queries from a user population that is disjoint with the training set. 2Code is available at http://github.com/ajaech/query completion. 67 Figure 5.1: Perplexity versus MRR on the development data for different classes of models. Results are presented separately for prefixes that are seen or unseen in the training data. If the prefix was not seen in the training data then the query is guaranteed to be relatively rare. Consistent with prior work, the neural models do better than the MPC baseline. The personalized models are better than the unadapted one, and the FactorCell model is the best overall in both the big and small sized experiments. Figure 5.2 shows the relative improvement in MRR over an unpersonalized model versus the number of queries seen per user. Both the FactorCell and the ConcatCell show continued improvement as more queries from each user are seen, and the FactorCell outperforms the ConcatCell by an increasing margin over time. In the long run, we expect that the system will have seen many queries from most users. Therefore, the right side of Figure 5.2, where the FactorCell is up to 2% better than the ConcatCell, is more representative of the relative performance of the two systems. Since the data was collected over a limited time frame and half of all users have fifteen or fewer queries, the results in Table 5.2 do not reflect the full benefit of personalization. Figure 5.3 shows the MRR for different prefix and query lengths. We find that longer 68 Size Model Seen Unseen All MPC .292 .000 .203 Unadapted .292 .256 .267 (S) ConcatCell .296 .263 .273 FactorCell .300 .264 .275 Unadapted .324 .286 .297 (B) ConcatCell .330 .298 .308 FactorCell .335 .298 .309 Table 5.2: MRR reported for seen and unseen prefixes for small (S) and big (B) models. Figure 5.2: Relative improvement in MRR over the unpersonalized model versus queries seen using the large size models. Plot uses a moving average of width 9 to reduce noise. 
69 Figure 5.3: MRR by prefix and query lengths for the large FactorCell and unadapted models with the first 50 queries per user excluded. prefixes help the model make longer completions and (as expected) shorter completions have higher MRR. Comparing the personalized model against the unpersonalized baseline, we see that the biggest gains are for short queries and prefixes of length one or two. We found that one reason why the FactorCell outperforms the ConcatCell is that it is able to pick up sooner on the repetitive search behaviors that some users have. This commonly happens for navigational queries like when someone searches for the name of their favorite website once or more per day. At the extreme tail there are users who search for nothing but free online poker. Both models do well on these highly predictable users but the FactorCell is generally a bit quicker. We conducted an analysis to better understand what information is represented in the user embeddings and what makes the FactorCell different from the ConcatCell. From a cold start user embedding we ran two queries and allowed the model to update the user embedding. Then, we ranked the most frequent 1,500 queries based on the ratio of their likelihood from before and after updating the user embeddings. Tables 5.3, 5.4, and 5.5 show the queries with the highest relative likelihood of the adapted vs. unadapted models after two related search queries: “high school softball” and 70 FactorCell ConcatCell 1 high school musical horoscope 2 chris brown high school musical 3 funnyjunk.com homes for sale 4 funbrain.com modular homes 5 chat room hair styles Table 5.3: The five queries that have the greatest adapted vs. unadapted likelihood ratio after searching for “high school softball” and “math homework help”. “math homework help” for Table 5.3, “Prada handbags” and “Versace eyewear” for Table 5.4, and “discount flights” and “Yellowstone vacation packages” for Table 5.5. In all cases, the FactorCell model examples are more semantically coherent than the ConcatCell examples. In the first case, the FactorCell model identifies queries that a high school student might make, including entertainment sources and a celebrity entertainer popular with that demographic. In the second case, the FactorCell model chooses retailers that carry woman’s apparel and those that sell home goods. While these companies’ brands are not as luxurious as Prada or Versace, most of the top luxury brand names do not appear in the top 1,500 queries and our model may not be capable of being that specific. There is no obvious semantic connection between the highest likelihood ratio phrases for the ConcatCell; it seems to be focusing more on orthography than semantics (e.g. “home” in the first example). In the last case, the FactorCell suggests a map and four different airlines and the ConcatCell suggests a flight tracker (unlikely to be useful to someone looking to buy a ticket), a bank, and three reformulations of the original “discount flights” query. Not shown are the queries which experienced the greatest decrease in likelihood. For the “high school” case, these included searches for travel agencies and airline tickets—websites not targeted towards the high school age demographic. 71 FactorCell ConcatCell 1 neiman marcus craigslist nyc 2 pottery barn myspace layouts 3 jc penney verizon wireless 4 verizon wireless jensen ackles 5 bed bath and beyond webster dictionary Table 5.4: The five queries that have the greatest adapted vs. 
unadapted likelihood ratio after searching for “prada handbags” and “versace eyewear”. FactorCell ConcatCell 1 yahoo maps flight tracker 2 delta airlines suntrust bank 3 alaska airlines cheap tickets 4 us airways airline tickets 5 southwest airlines cheap flights Table 5.5: The five queries that have the greatest adapted vs. unadapted likelihood ratio after searching for “discount flights” and “yellowstone vacation packages”. 72 5.5 Summary Our experiments show that the LSTM model can be improved using personalization. The method of adapting the recurrent layer clearly matters and we obtained an advantage by using the FactorCell model. The reason the FactorCell does better is in part attributable to having two to three times as many parameters in the recurrent layer as either the ConcatCell or the unadapted models. By design, the adapted weight matrix W′ only needs to be computed at most once per query and is reused many thousands of times during beam search. As a result, for a given number of floating point operations, the FactorCell model outperforms the ConcatCell model for LSTM adaptation. The cost for updating the user embeddings is similar to the cost of the forward pass and depends on the size of the user embedding, hidden state size, FactorCell rank, and query length. In addition, we note that updates can be less frequent to reduce costs and in most cases there will be time between queries for updates. We showed that language model personalization can be effective even on users who are not seen during training. The benefits of personalization are immediate and increase over time as the system continues to leverage the incoming data to build better user representations. The approach can easily be extended to include time as an additional conditioning factor. We leave the question of whether the results can be improved by combining the language model with MPC for future work. 73 Chapter 6 CONTEXT-SPECIFIC TEXT GENERATION Context-specific generation is when a generative model is designed such that stylistic, topical, or other properties of the text can be specified when sampling from the model. One of the weaknesses of RNN language models is that it is difficult to control the content or the style of the text that they produce. It is natural to expect that if a language model is conditioned on a particular context then it will produce text that is appropriate for that context. Conditioning, or adapting to, context is one way of controlling generation (Ficler and Goldberg, 2017) and that is the application that we explore in this chapter. We conduct experiments and analysis using the Yelp restaurant review and the TripAd- visor hotel review datasets from Section 4.2. These experiments reuse the models that were trained for the experiments in Section 4.3, including models adapted using the SoftmaxBias, ConcatCell, and FactorCell strategies. The focus of the experiments in this chapter is to highlight differences in context-specificity between these models by generating reviews from the models and checking if the text is appropriate for the given context. The generated reviews are judged using a classifier that has been trained for that purpose. The automatic judgments are supplemented by human ratings to confirm the reliability of the classifier. There are ways of producing context-specific text without generating it from a language model such as by retrieving examples with matching context from the training data or combining retrieval with a neural edit model (Li et al., 2018). 
Our context-specificity metric will strongly reward these types of models, or any model that produces easy-to-classify content, even if that content is unnatural or an outlier. Retrieval-based methods may work well in certain applications, but others require the use of a language model. This chapter focuses specifically on language models. We emphasize that a context-specificity metric by itself is not enough to judge the quality of a language model. Low perplexity is important for good generation. However, a language model that has low perplexity and is also capable of context-specific generation will have more applications than a non-context-specific model.

6.1 Text Generation

To generate text from an RNN, we use the model to predict the probability of observing each possible word as the next token in the sequence, starting by inputting a special start-of-sentence token for w1. The next-token probabilities are given by the yt vector from Equation 2.3. One option is to take the next word as the highest probability entry in yt, known as greedy decoding. Making greedy decisions can lead us down a path that is globally suboptimal and may not result in finding the sequence with the highest overall likelihood. Instead, using an algorithm known as beam search, several possibilities are considered for the next word and are placed in a queue that holds the most likely word sequences found so far.

Beam search can find many high probability sentences, but it often suffers from a lack of diversity: the highest probability sequences will all have considerable overlap with each other. This is not desirable in text generation because we generally want multiple texts with non-trivial differences to pick from, and also because excessive repetition is tedious for human readers. A simple tweak that leads to more diverse generation is to stochastically expand the beam by sampling l items from the multinomial distribution given by yt instead of deterministically picking the top l words. When we use this technique, we refer to it as stochastic beam search. We can trade off between the stochastic and deterministic versions of beam search by using a temperature parameter T in the softmax function as follows:

    softmax(x_i) = exp(-x_i / T) / sum_j exp(-x_j / T) .    (6.1)

Lowering the temperature T increases the concentration of probability mass at the head of the distribution. If the temperature is too high, then the generated text can be nonsensical. If it is too low, then the generation lacks diversity. In the limit as T goes to zero, it becomes equivalent to greedy decoding. So, finding the right balance is important.

Stochastic beam search deals with repetition between sampled texts. RNN language models also tend to use the same phrase repetitively within a single sentence or document, e.g. a restaurant review that says "The food was good and the service was great and the food was good and...". To deal with this, we use a heuristic that does not consider any beam expansion that would create a repeated trigram. More sophisticated techniques and heuristics exist, such as adding random penalties to subsets of the vocabulary (Juuti et al., 2018) or training special models to enforce relevance and avoid contradiction and repetition (Holtzman et al., 2018), but banning repeated trigrams is sufficient for our purpose of making a comparison between adaptation methods.
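The generation procedure described in this section can be summarized in a short sketch. The code below applies the temperature directly to the logits (equivalent up to the sign convention of Equation 6.1), samples a few candidate tokens per hypothesis, and bans any expansion that repeats a trigram; the next_token_logits interface is an assumed stand-in for a forward pass of the RNN.

```python
import numpy as np

def tempered_softmax(logits, T=1.0):
    """Temperature softmax (cf. Equation 6.1): lower T concentrates probability
    mass at the head of the distribution."""
    z = logits / T
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def has_repeated_trigram(tokens):
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return len(trigrams) != len(set(trigrams))

def stochastic_beam_step(beams, next_token_logits, rng, T=1.0,
                         branching=4, beam_size=8):
    """Expand each hypothesis by sampling `branching` tokens from the tempered
    distribution (instead of taking the top ones deterministically), discard
    expansions that create a repeated trigram, and keep the best `beam_size`."""
    candidates = []
    for tokens, score in beams:
        probs = tempered_softmax(next_token_logits(tokens), T)
        draws = rng.choice(len(probs), size=branching, replace=False, p=probs)
        for tok in draws:
            extended = tokens + [int(tok)]
            if has_repeated_trigram(extended):
                continue                      # the no-repeated-trigram heuristic
            candidates.append((extended, score + float(np.log(probs[tok]))))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

# Hypothetical usage with a dummy model over a ten-token vocabulary:
rng = np.random.default_rng(0)
dummy_logits = lambda tokens: np.zeros(10)
print(stochastic_beam_step([([1, 2], 0.0)], dummy_logits, rng))
```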
6.2 Illustrative Examples

To start, we provide anecdotal examples in Tables 6.1, 6.2, and 6.3 illustrating the behavior of each adaptation technique when generating Yelp restaurant reviews given a star rating. In each case, a sentence template was selected and beam search decoding was used to find the highest probability word sequence that fits the gap in the template. This was repeated once for each of the possible star ratings from one to five.

The sentence completions from the FactorCell model are better matched to the specified star ratings. Table 6.2 is probably the best example, with the intensity of the selected adjectives matching the corresponding star ratings. On the other hand, when using the ConcatCell model there is not a clear distinction between ratings. The ConcatCell picks the same completion for multiple star ratings ("loved it" in Table 6.1 and "great!" in Table 6.2), and in Table 6.3 it gets the wrong polarity for the 5-star rating completion.

Context    FactorCell          ConcatCell          SoftmaxBias
5 stars    it was amazing      loved it !          it was amazing !
4 stars    loved it !          loved it !          it was delicious
3 stars    it was great !      loved it !          had a blast
2 stars    it was horrible !   had a blast         was disappointed
1 star     it was horrible !   it was horrible !   was disappointed

Table 6.1: Top completions for the sentence "My boyfriend and I ate here and ___ !" after conditioning on each star rating.

Context    FactorCell   ConcatCell   SoftmaxBias
5 stars    amazing !    great !      delicious
4 stars    great !      great !      delicious
3 stars    good !       great !      delicious
2 stars    just meh     mediocre     mediocre
1 star     awful        mediocre     mediocre

Table 6.2: Top completions for the sentence "This was my first time coming here and the food was ___" after conditioning on each star rating.

Context    FactorCell      ConcatCell      SoftmaxBias
5 stars    be back         never go here   never go here
4 stars    be back         go              never go here
3 stars    go back         go              never go here
2 stars    never go here   never           never go here
1 star     never go here   never           never go

Table 6.3: Top completions for the sentence "I will ___ again" after conditioning on each star rating.

6.3 Experiments and Analysis

6.3.1 Yelp Restaurant Reviews

To quantify the differences between models, we sampled 2,000 Yelp reviews from 36 different models. The models differ in their adaptation method but also in their LSTM size and other randomly chosen hyperparameters, including word embedding size, number of training epochs, context embedding size, and FactorCell rank. Having variation in the hyperparameter settings allows us to analyze the relationship between perplexity and context-specificity, among other things. The generation used stochastic beam search with a temperature of 1.0, a beam width of 8, and a total beam size of 120. Only the first 70 words of the review were generated, and the beam was constrained to disallow repeated trigrams as a heuristic to avoid excessive repetition.

A discriminative classifier was trained to predict star rating on the same data that was used to train the language models. We used the fastText model because it is near state-of-the-art and it can be efficiently trained and evaluated (Grave et al., 2017b). This classifier was used to judge the generated reviews to see if they conform to their given ratings. This strategy was used by Hu et al. (2017) to evaluate "controllable" text generation.
It is also similar to the text generation evaluation used by Ficler and Goldberg (2017), except that for some attributes they used hand-crafted decision rules instead of a statistical classifier, and for other attributes deemed too difficult to classify they used human evaluation.

There are multiple ways to report the judgment of the discriminative classifier. The most straightforward is accuracy, which is the percentage of times that the most probable rating according to the classifier matched the context given to the language model. In addition to this metric, we report results in terms of mean absolute deviation (the average absolute difference between the desired rating and the predicted rating). Achieving perfect accuracy is not realistic for this task because there is some natural ambiguity between adjacent star ratings. The fastText classifier has an accuracy of 63% and a mean absolute deviation (MAD) of 0.44 stars on the Yelp test data.

Figure 6.1 plots the context classification metric from Chapter 4 (accuracy of classifying actual reviews using the language model as a classifier) against the generation context-specificity metric (accuracy of classifying automatically generated text with a classifier trained on actual reviews). Each point in this plot is an independent model trained with random hyperparameters. The context classification accuracy is highly predictive of the generation context-specificity. Thus, it is not surprising that the FactorCell models are the most controllable, followed by the ConcatCell models and then the SoftmaxBias models.

Figure 6.1: Context classification accuracy versus generation context-specificity for each type of adaptation on the Yelp data.

As shown in Figure 6.2, there is a strong correlation between FactorCell rank and context-specificity (r = 0.8) but no relationship (r = 0.00) with perplexity. Having low perplexity is obviously important for high-quality generation, but low perplexity by itself is not an indicator that the generated text will match the specified conditions. We included some additional models here that were not used in the experiments in Chapter 4 and were trained with bigger (up to double the size) LSTM hidden states. This is to address the question of whether the gap between models is reduced simply by training a bigger model. The bigger LSTM states do help with perplexity, but we find little to no impact on context-specificity.

Figure 6.2: Plot of FactorCell rank and perplexity against generation context-specificity accuracy for 14 FactorCell models on the Yelp restaurant data.

Table 6.4 lists the perplexity, automatically assessed generation accuracy, and the mean absolute deviation for each of the four classes of models.

Model        ACC    MAD    PPL
Unadapted    21.3   1.63   67.9
SoftmaxBias  41.1   .88    68.1
ConcatCell   59.6   .44    66.8
FactorCell   69.5   .33    68.1

Table 6.4: Automatically judged generation accuracy, mean absolute deviation (MAD), and perplexity for the three methods of adaptation compared to the unadapted baseline using the models learned from the Yelp data.

Using human raters on a subset of the generated reviews, we confirm that the automatic judgments are correlated with human judgments with r = 0.823. Nine raters judged a set of 145 reviews for this analysis. The raters were able to guess the intended star rating 64% of the time for the FactorCell compared to only 59% of the time for the ConcatCell.
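The judging pipeline used above can be sketched as follows. The sketch assumes the fasttext Python package with its __label__ convention for supervised training data; the file name and preprocessing are placeholders rather than the exact pipeline behind the reported numbers.

```python
import fasttext

def judge_generations(train_file, generations):
    """Train a fastText classifier on real labeled reviews (lines of the form
    '__label__5 <review text>') and judge generated reviews against the star
    rating each one was conditioned on.

    `generations` is a list of (intended_stars, review_text) pairs.
    Returns (accuracy, mean absolute deviation)."""
    clf = fasttext.train_supervised(input=train_file)
    correct, abs_dev = 0, 0.0
    for intended, text in generations:
        labels, _ = clf.predict(text.replace("\n", " "))
        predicted = int(labels[0].replace("__label__", ""))
        correct += int(predicted == intended)
        abs_dev += abs(predicted - intended)
    n = len(generations)
    return correct / n, abs_dev / n

# Hypothetical usage:
# acc, mad = judge_generations("yelp_reviews.train", sampled_reviews)
```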
6.3.2 TripAdvisor Hotel Reviews

We conduct a similar analysis using the TripAdvisor hotel review data and models from Chapter 4, checking whether the hotel class property matches the generated text. Hotel class is a rating that reflects the quality of the service and amenities offered by the hotel and is measured in half-integer increments. Only the identity of the hotel and the star rating (sentiment) of each review are provided during training, so any information the model learns about class must be derived from the text of the review.

Hotel reviews are sampled using a temperature of 1.0, a beam width of 4, and a beam size of 50. Only the first 72 to 100 words of each review were generated, and 1,000 samples were collected from 54 independent models. When generating the hotel reviews, we pick a random hotel identifier for each one and fix the sentiment at five stars. Like before, these models use a range of hyperparameters in order to study the relationship between the capacity of the model and its performance.

The fastText classifier predicts hotel class with an accuracy of 53% and a mean absolute deviation of 0.36 on the TripAdvisor test data. Using this classifier to judge the generated reviews, we find that the FactorCell is the most specific for the hotel class. The complete metrics are in Table 6.5.

Model        ACC    MAD    PPL
Unadapted    21.3   .77    51.9
SoftmaxBias  30.1   .61    48.7
ConcatCell   30.4   .57    48.8
FactorCell   32.2   .53    48.9

Table 6.5: Automatically judged generation accuracy, mean absolute deviation (MAD), and perplexity for the three methods of adaptation compared to the unadapted baseline using the models learned from the TripAdvisor data.

Besides their lack of context-specificity, the unadapted models also suffer from a lack of diversity. The unadapted models tend to default to generating reviews that match the largest region in the data, New York City. Up to 35% of the generated reviews from the unadapted models mention the Empire State Building (compared to 2% to 3% for the adapted models), even though only 30% of the training data are from hotels in New York City and only 1.6% of that 30% even mention the Empire State Building. Conditioning on context information makes it easier to guide the generation away from the mode and towards more diverse samples.

As shown in Figure 6.3, increasing the FactorCell rank helps with context-specificity of the hotel class, just as it did with controlling the star rating of the Yelp reviews in Figure 6.2. However, instead of a lack of correlation with perplexity, we see that models with better perplexity also do better at context-specificity. As with the Yelp reviews, the accuracy of the models when used as generative text classifiers is predictive of their performance for context-specific generation. We show this relationship in Figure 6.4.

We used the TripAdvisor models to test whether the difference in context-specificity between the models was due to the quality of the representation learned in the hotel embeddings. We trained a gradient boosted decision tree to predict the hotel class using the hotel embedding as features. As we saw in Figure 4.7, the hotel embeddings naturally group together by class even though the hotel class labels were not available during training. Using five-fold cross validation on the 2,254 embeddings, we found little to no difference between models, with both the ConcatCell and FactorCell correctly predicting the class in 67.1% of cases versus 65.3% for the SoftmaxBias model.
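The embedding probe just described is straightforward to reproduce with scikit-learn. The sketch below uses default estimator settings and random stand-in data in place of the learned hotel embeddings, so the shapes and the 30-dimensional embedding size are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def probe_hotel_class(embeddings, hotel_class, n_folds=5, seed=0):
    """Five-fold cross-validated accuracy of a gradient boosted decision tree
    predicting hotel class from the learned hotel embeddings."""
    clf = GradientBoostingClassifier(random_state=seed)
    return cross_val_score(clf, embeddings, hotel_class, cv=n_folds).mean()

# Stand-in data shaped like the real probe: 2,254 hotels, half-integer classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(2254, 30))                       # hotel embeddings (assumed dim)
y = rng.choice(np.arange(1.0, 5.5, 0.5), size=2254)   # hotel class labels
print(probe_hotel_class(X, y))
```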
82 Figure 6.3: Context-specificity of hotel class versus FactorCell rank and perplexity in gener- ated reviews using the models learned on the TripAdvisor data. Figure 6.4: Context classification accuracy versus generation context-specificity for each type of adaptation on the TripAdvisor data. 83 6.4 Summary Recall that the ConcatCell model only adapts the recurrent layer bias vector. Assuming 200-dimensional word embeddings and a 400 dimensional LSTM, only 1,200 or 0.17% of the recurrent layer parameters change depending on the context. In contrast, a FactorCell model with the same embedding size and recurrent layer dimension and a rank of 25 has 6.25% of its recurrent layer parameters changing depending on context. That’s more than 37 times as many as the ConcatCell. When so few of the parameters are adapted, the result is a blurring of the distinctions between different contexts. We saw this qualitatively in the restaurant review sentence completions and quantitatively when measuring the context-specificity of the restaurant review star rating and the hotel review class. In Chapter 4, we used a context classification metric whereby the language model was used as a generative text classifier on held-out data in order to test its ability to understand the relationship between context and language. The experiments in this chapter show that the context classification metric is a useful predictor of text generation context-specificity. This means that we can assess the relative context-specificity of models independently of the generation hyperparameters (beam size, temperature, etc.) and without the need to construct a classifier or use human ratings to do judgments. This is useful because there are cases where a high quality classifier may not be readily available. For human ratings, not only are they expensive and time-consuming to collect, but there are also certain contexts where humans may be poor judges without special experience or training. Our method of evaluating context-specificity of the generated text has some limitations. Because the fastText classifier is trained on the same data as the language model, it has the potential to reward the language model for memorizing portions of the training data instead of generating novel sentences. Additionally, as we mentioned in the introduction to this chapter, context-specificity alone is not enough to evaluate a language model because it does not measure the ability of the model to generate well-formed text that is free from excessive repetitions and self-contradictions. We know from Chapter 4 that the FactorCell 84 model does have a perplexity that is comparable or better than baseline methods. However, a more comprehensive assessment is needed to take into account all of the desired attributes. 85 Chapter 7 CONCLUSIONS AND FUTURE DIRECTIONS The fact that language changes so dramatically from one situation to the next is both a challenge and an opportunity. It is a challenge because changes in context that may seem trivial to us as humans are enough to break a language model. It is an opportunity because dramatic changes means that large benefits can be accrued through adaptation. As an illus- tration, the 2017 Amazon Alexa prize encouraged people to speak to their device using an open-ended conversational style. This was a change from the task-oriented style that peo- ple normally used to speak with their home assistants. Initially, a generic language model was used for the conversational interactions. 
When a language model targeting conversa- tional speech was introduced the recognition word error rate dropped by one third (Ram et al., 2018). Adjusting the language model was crucial to enabling people to have long conversations with Alexa. Adaptation is vital to the success of language modeling in real applications and is the reason why this work can have high impact. In this chapter we summarize the contributions of the thesis and suggest possible directions for future work. 7.1 Summary of Contributions Our main contribution is the introduction of the FactorCell model, a new way of thinking about how to adapt the recurrent layer. The FactorCell model moves beyond the paradigm set by Mikolov and Zweig (2012) and used many times since then. The benefits of the FactorCell model include: • A re-conception of the mechanism for recurrent layer adaptation (using context to transform the model rather than as an additional input); 86 • Permitting context to affect a greater change in the model parameters (and therefore its predictions) than what is possible using prior methods; • Including a rank hyperparameter that smoothly trades off between the case where most information is shared between contexts and the opposite situation where the contexts are mostly modeled independently; • Maintaining the property of the ConcatCell approach that the adapted weights can be precomputed and cached resulting in negligible computational overhead compared to an unadapted baseline; and • Off-loading of context specific information from the main recurrent weight matrix to special adaptation tensors, which helps prevent the blurring of the boundary between related contexts. We presented experimental results on nine distinct datasets. Three of them used character- level vocabularies and the other six operated at the word level. One dataset comprised tran- scripts of spoken language; some involved informal written language; and some were more formal. The biggest dataset had over 100 million words of text for training and the smallest was just 5 million characters. In every situation where it was tested, the FactorCell model gave the best perplexity. However the benefits are greater for some contexts. The diversity of data helped to increase the robustness of our conclusions and led us to make some guide- lines about when different aspects of adaptation become more or less important. Specifically, adapting the softmax bias is most effective for topic-based contexts and potentially less ef- fective when the recurrent layer dimensionality is large. Also, the contexts that are most amenable to adaptation are specific and are not easily predicted from a short window of text. An idea that we encountered more than once is that there are differences between models that are not captured by perplexity. In Chapters 3 and 4, we saw that a more flexible adap- tation of the recurrent layer helped the models better distinguish contexts in classification 87 experiments and Chapter 6 showed how models that are better at distinguishing contexts can be used for context-specific generation. There were two places where we saw clear qualitative differences between the FactorCell and the ConcatCell. In Chapter 5, the FactorCell discovered semantic associations between search queries whereas the ConcatCell failed to find such associations and sometimes seemed to focus more on orthography. 
In Chapter 6, the ConcatCell blurred the boundary between star rating levels (associated with sentiment) and the FactorCell was able to make cleaner distinctions. Taken together, these indicate the FactorCell provides a greater benefit than the improvement in perplexity would suggest. This is likely because the gains are on less frequently used words, and perplexity is dominated by frequent words. We showed how online learning lets the adapted model take advantage of contexts that emerge after training. The method of online adaptation that we introduced in Chapter 5 is generic; it will work for multiple adaptation strategies, including the ConcatCell. However, the FactorCell makes better use of the data stream to continue to increase the quality of its predictions over time. We made some observations about what factors make a dataset more or less amenable to adaptation. These include, from Chapter 3, that context variables that are easily predictable from the text alone are unlikely to be helpful and, from Chapter 4, that for topic related contexts adapting the softmax bias may be more successful. We also found that having a larger sized recurrent layer can lessen the impact of adapting the softmax bias. 7.2 Future Directions There are several promising future directions that build on the work in this thesis. In all six tasks that are explored in Chapter 4, all context factors are available for all training and testing samples. In some scenarios, it may be possible for some context factors to be missing. A simple solution for handling this is to use the expected value for the missing variable(s), since this is equivalent to using a weighted combination of the adaptation matrices for the different possible values of the missing variables. 88 The experiment scenarios all used metadata to specify context, since this type of context can be more sensitive to data sparsity and has been less studied. In contrast, in many prior studies of language model adaptation, context is specified in terms of text samples, such as prior user queries, prior sentences in a dialog, other documents related in terms of topic or style, etc. The FactorCell framework introduced here is also applicable to this type of context, but the best encoding of the text into an embedding (e.g. using bag of words, sequence models, etc.) is likely to vary with the application and remains an area of study for future work. The generation experiments that we conducted were generating text from scratch with no objective other than to obey the constraints imposed by conditioning on the context. In applications, such as machine translation, the generation is conditioned on a source sequence in an encoder-decoder model (also called sequence-to-sequence, or seq2seq) (Cho et al., 2014). The FactorCell model is applicable in these situations as well and one future direction is to test how it performs in that setting. Prior work has already shown that a simple language model that incorporates contextual data can provide gains in machine translation (Drexler et al., 2014). The data we used consisted of one or two context variables such as latitude and longitude or a hotel identifier and star rating. In these cases, it was reasonable to expect the context embedding to summarize the relevant information from both variables. If there were many more context variables then that assumption may no longer hold. 
More work is needed to find an appropriate dataset that has multiple context variables and to investigate appropriate methods of providing the context information to the FactorCell model. In Chapter 3, we showed that using hashing to adapt the output layer bias benefits per- plexity for high dimensional contexts. This indicates that the low-rank adaptation of the bias does not fully model the context-dependent language. We suggested that the model could benefit from a sparse correction to the low-rank adaptation and that could be accom- plished by including an L1 penalty term to the adaptation parameters. This idea is partially explored in Appendix A, but more work is needed to investigate the benefits of sparsity in 89 language model adaptation. Our focus was on language modeling, but RNNs are widely used in many other natural language processing tasks and in other domains that have little or nothing to do with language such as acoustic modeling (Graves et al., 2006), time-series analysis, music generation (Goel et al., 2014), and more. For example, some work on other natural language processing tasks, such as spoken language understanding, have already made use of the ConcatCell style adaptation (Mesnil et al., 2015). Personalization would be of high utility in these applications. This thesis looked at query completion, but there are often text prediction applications that this work is relevant to, including next word prediction for mobile text input and augmentative communication for people with disabilities. For augmentative communications, the method would also apply to icon-based communication, which relies on a language model with icons instead of words or characters (Dudy and Bedrick, 2018). It is difficult to predict what future state-of-the-art language models will look like. Re- current neural networks have been reliable winners for the past few years. There is active work on alternate architectures that remedy some of the perceived weaknesses (mostly lack of easy parallelism) of the RNN, such as the Quasi-Recurrent Neural Network (Bradbury et al., 2017) and the Transformer Network (Vaswani et al., 2017). It seems likely that the adaptation techniques from this thesis will apply to those architectures as well, but it remains to be tested. 90 BIBLIOGRAPHY Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450. Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Yoshua Bengio, et al. 2016. End-to- end attention-based large vocabulary speech recognition. In Proc. ICASSP, pages 4945– 4949. Lalit Bahl, James Baker, Paul Cohen, Fred Jelinek, Burn Lewis, and R Mercer. 1978. Recog- nition of continuously read natural corpus. In Proc. ICASSP, volume 3, pages 422–424. Ziv Bar-Yossef and Naama Kraus. 2011. Context-sensitive query auto-completion. In Proc. WWW, pages 107–116. Jerome R Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8):1279–1296. Jerome R Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93–108. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003a. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155. Yoshua Bengio, Jean-Sébastien Senécal, et al. 2003b. Quick training of probabilistic neural nets by importance sampling. In Proc. AISTATS. Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. 
O’Reilly Media. 91 Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Com- munications of the ACM, 13(7):422–426. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21. Association for Computational Linguistics. James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. Proc. ICLR. Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proc. EMNLP. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479. Ivan Bulyko, Mari Ostendorf, and Andreas Stolcke. 2003. Getting more mileage from web text sources for conversational speech language modeling using class-dependent mixtures. In Proc. HLT-NAACL, pages 7–9. Fei Cai, Maarten De Rijke, et al. 2016. A survey of query auto-completion in information retrieval. Foundations and Trends in Information Retrieval, 10(4):273–363. Fei Cai, Shangsong Liang, and Maarten De Rijke. 2014. Time-sensitive personalized query auto-completion. In Proc. CIKM, pages 1599–1608. William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. 2015. Listen, attend and spell. arXiv preprint arXiv:1508.01211. Ciprian Chelba and Noam Shazeer. 2015. Sparse non-negative matrix language modeling 92 for geo-annotated query session data. In Proc. Automatic Speech Recognition and Under- standing (ASRU), 2015 IEEE Workshop on, pages 8–14. Ciprian Chelba, Xuedong Zhang, and Keith B Hall. 2015. Geo-location for voice search language modeling. In Proc. Interspeech, pages 1438–1442. Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocab- ulary neural language models. In Proc. ACL. Xie Chen, Tian Tan, Xunying Liu, Pierre Lanchantin, Moquan Wan, Mark JF Gales, and Philip C Woodland. 2015. Recurrent neural network language model adaptation for multi- genre broadcast speech recognition. In Proc. InterSpeech. Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representa- tions using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. ACL. Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Ṽesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas Stolcke. 2007. Morph- based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing (TSLP), 5(1):3. Salil Deena, Madina Hasan, Mortaza Doulaty, Oscar Saz, and Thomas Hain. 2016. Com- bining feature and model-based adaptation of RNNLMs for multi-genre broadcast speech recognition. In Proc. Interspeech, pages 2343–2347. Renato DeMori and Marcello Federico. 1999. Language model adaptation. In Computational models of speech pattern processing, pages 280–303. 93 Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. TopicRNN: a recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702. 
Jennifer Drexler, Pushpendre Rastogi, Jacqueline Aguilar, Benjamin Van Durme, and Matt Post. 2014. A Wikipedia-based corpus for contextualized machine translation. In Proc. LREC, pages 3593–3596. Shiran Dudy and Steven Bedrick. 2018. Compositional language modeling for icon-based augmentative and alternative communication. Northwest NLP Regional Workshop. J. Eisenstein, A. Ahmed, and E. Xing. 2011a. Sparse additive generative models of text. In Proc. ICML. Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. 2011b. Sparse additive generative models of text. In Proc. ICML. Jacob Eisenstein, Brendan O’Connor, Noah A Smith, and Eric P Xing. 2010. A latent variable model for geographic lexical variation. In Proc. EMNLP, pages 1277–1287. Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proc. EMNLP Workshop on Stylistic Variation. Siva Reddy Gangireddy, Pawel Swietojanski, Peter Bell, and Steve Renals. 2016. Unsuper- vised adaptation of recurrent neural network language models. In Proc. Interspeech, pages 2333–2337. Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface, and Renato De Mori. 2007. Linear hidden transformations for adaptation of hybrid ANN/HMM models. Speech Com- munication, 49(10-11):827–835. Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. 2016. Contextual LSTM models for large scale NLP tasks. arXiv preprint arXiv:1602.06291. 94 James R Glass, Timothy J Hazen, D Scott Cyphers, Igor Malioutov, David Huynh, and Regina Barzilay. 2007. Recent progress in the MIT spoken lecture processing project. In Proc. Interspeech, pages 2553–2556. Kratarth Goel, Raunaq Vohra, and JK Sahoo. 2014. Polyphonic music generation by mod- eling temporal dependencies using a RNN-DBN. In Proc. International Conference on Artificial Neural Networks, pages 217–224. Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017a. Improving neural language models with a continuous cache. In Proc. ICLR, volume abs/1612.04426. Edouard Grave, Tomáš Mikolov, Armand Joulin, and Piotr Bojanowski. 2017b. Bag of tricks for efficient text classification. In EACL. Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connec- tionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, pages 369–376. ACM. Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, volume 14, pages 1764–1772. Klaus Greff, Rupesh K Srivastava, Jan Koutńık, Bas R Steunebrink, and Jürgen Schmid- huber. 2016. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems, 28:2222–2232. David Ha, Andrew Dai, and Quoc V Le. 2017. Hypernetworks. In Proc. ICLR. Yoni Halpern, Keith Hall, Vlad Schogol, Michael Riley, Brian Roark, Gleb Skobeltsyn, and 95 Martin Baeuml. 2016. Contextual prediction models for speech recognition. In Proc. Interspeech. Bo Han, Paul Cook, and Timothy Baldwin. 2014. Text-based twitter user geolocation pre- diction. J. Artif. Intell. Res., 49:451–500. Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. 
arXiv preprint arXiv:1412.5567.
Cong Duy Vu Hoang, Trevor Cohn, and Gholamreza Haffari. 2016. Incorporating side information into recurrent neural network language models. In Proc. HLT-NAACL.
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, and Yejin Choi. 2018. Learning to write by learning the objective. https://openreview.net/forum?id=r1lfpfZAb
Bo-June Paul Hsu and James Glass. 2008. N-gram weighting: reducing training data mismatch in cross-domain language model estimation. In Proc. EMNLP, pages 829–838.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proc. ICML, pages 1587–1596.
B. Hutchinson, M. Ostendorf, and M. Fazel. 2015. A sparse plus low-rank exponential language model for limited resource scenarios. IEEE Trans. Audio, Speech and Language Processing, 23(3):494–504.
Brian Hutchinson, Mari Ostendorf, and Maryam Fazel. 2013. Exceptions in language as learned by the multi-factor sparse plus low-rank language model. In Proc. ICASSP, pages 8580–8584.
Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
Kazuki Irie, Shankar Kumar, Michael Nirschl, and Hank Liao. 2018. RADMM: Recurrent adaptive mixture model with applications to domain robust language modeling.
Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A Smith. 2016. Hierarchical character-word models for language identification. In EMNLP Workshop on NLP for Social Media.
Aaron Jaech and Mari Ostendorf. 2015. Leveraging Twitter for low-resource conversational speech language modeling. arXiv preprint arXiv:1504.02490.
Aaron Jaech and Mari Ostendorf. 2017. Improving context aware language models. arXiv preprint arXiv:1704.06380. http://ajaech.me/adeeehllnorru
Aaron Jaech and Mari Ostendorf. 2018a. Low-rank RNN adaptation for context-aware language modeling. TACL.
Aaron Jaech and Mari Ostendorf. 2018b. Personalized language model for query auto-completion. In Proc. ACL.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. In Proc. ACL-IJCNLP.
Fred Jelinek. 1990. Self-organized language modeling for speech recognition. Readings in speech recognition, pages 450–506.
Fred Jelinek. 1991. Up from trigrams. In Proc. Eurospeech.
Fred Jelinek, Bernard Mérialdo, Salim Roukos, and M. Strauss. 1991. A dynamic language model for speech recognition. In Proc. HLT.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. arXiv preprint arXiv:1511.03962.
Mika Juuti, Bo Sun, Tatsuya Mori, and N Asokan. 2018. Stay on-topic: Generating context-specific fake restaurant reviews. arXiv preprint arXiv:1805.02400.
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. In Proc. ICLR.
Slava Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Katrin Kirchhoff and Mei Yang. 2005. Improved language modeling for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 125–128.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proc. ICASSP, volume 1, pages 181–184.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2017. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432.
Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722.
Roland Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583.
Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. TACL, 5:365–378.
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi S. Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proc. ACL.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proc. ACL.
Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437.
Bing Liu and Ian Lane. 2017. Dialog context language modeling with recurrent neural networks. In Proc. ICASSP, pages 5715–5719.
Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. LSTM based conversation models. arXiv preprint arXiv:1603.09457.
Min Ma, Michael Nirschl, Fadi Biadsy, and Shankar Kumar. 2017. Approaches for neural-network language model adaptation. In Proc. Interspeech, pages 259–263.
Andrew L Maas, Ziang Xie, Dan Jurafsky, and Andrew Y Ng. 2015. Lexicon-free conversational speech recognition with neural networks. In Proc. NAACL.
Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proc. ICLR.
Gideon Mendels, Erica Cooper, Victor Soto, Julia Hirschberg, Mark Gales, Kate Knill, Anton Ragni, and Haipeng Wang. 2015. Improving speech recognition and keyword search for low resource languages using web data. In Proc. Interspeech.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017a. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017b. Pointer sentinel mixture models. In Proc. ICLR.
Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.
Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Automatic Speech Recognition and Understanding (ASRU), IEEE Workshop on, pages 196–201.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.
Tomáš Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In Proc. SLT, pages 234–239.
Bhaskar Mitra and Nick Craswell. 2015. Query auto-completion for rare prefixes. In Proc. CIKM, pages 1755–1758.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Proc. NIPS, pages 2265–2273.
Giovanni Molina, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Nicolas Rey-Villamizar, Mona Diab, and Thamar Solorio. 2016. Overview for the second shared task on language identification in code-switched data. In Second Workshop on Computational Approaches to Code Switching, pages 40–49.
Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proc. AISTATS, volume 5, pages 246–252.
Graham Neubig and Chris Dyer. 2016. Generalizing and hybridizing count-based and neural language models. In Proc. EMNLP.
Miles Osborne, Ashwin Lall, and Benjamin Van Durme. 2014. Exponential reservoir sampling for streaming language models. In Proc. ACL.
Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proc. EACL.
Ankur P Parikh, Avneesh Saluja, Chris Dyer, and Eric P Xing. 2014. Language modeling with power low rank ensembles. In Proc. EMNLP.
Dae Hoon Park and Rikio Chiba. 2017. A neural language model for query auto-completion. In Proc. SIGIR, pages 1189–1192.
Razvan Pascanu, Tomáš Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proc. ICML, pages 1310–1318.
Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. InfoScale, 152:1.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proc. EACL.
Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
Anton Ragni, Edgar Dakin, Xie Chen, Mark J F Gales, and Kate M Knill. 2016. Multi-language neural network language models. In Proc. Interspeech.
Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. In Proc. NIPS.
Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proc. EMNLP.
Giuseppe Riccardi and Allen L Gorin. 2000. Stochastic language adaptation over time and state in natural spoken dialog systems. IEEE Transactions on Speech and Audio Processing, 8(1):3–10.
Roni Rosenfeld. 1995. Optimizing lexical and ngram coverage via judicious use of linguistic data. In Proc. Eurospeech.
Roni Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187–228.
Roni Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278.
Murat Saraçlar, Tunga Güngör, et al. 2010. Morphology-based and sub-word language modeling for Turkish speech recognition. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5402–5405.
Sarah E. Schwarm, Ivan Bulyko, and Mari Ostendorf. 2004. Adaptive language modeling with varied sources to cover new vocabulary items. IEEE Trans. Speech and Audio Processing, 12:334–342.
Holger Schwenk. 2004. Efficient training of large neural networks for language modeling. In 2004 IEEE International Joint Conference on Neural Networks, volume 4, pages 3059–3064.
Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2016. Recurrent dropout without memory loss. In Proc. COLING.
Noam Shazeer, Joris Pelemans, and Ciprian Chelba. 2015. Sparse non-negative matrix language modeling for skip-grams. In Proc. Interspeech, pages 1428–1432.
Milad Shokouhi. 2013. Learning to personalize query auto-completion. In Proc. SIGIR, pages 103–112.
Milad Shokouhi and Kira Radinsky. 2012. Time-sensitive query auto-completion. In Proc. SIGIR, pages 601–610.
Manhung Siu and Mari Ostendorf. 2000. Variable n-grams and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing, 8(1):63–75.
Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proc. CIKM, pages 553–562.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Proc. Interspeech.
David Talbot and Thorsten Brants. 2008. Randomized language models via perfect hash functions. In Proc. ACL.
Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. 2016. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.
Trang Tran and Mari Ostendorf. 2016. Characterizing the language of online communities and its relation to community reception. In Proc. EMNLP.
Bo-Hsiang Tseng, Hung-yi Lee, and Lin-Shan Lee. 2015. Personalizing universal recurrent neural network language model with user characteristic features by social network crowdsourcing. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 84–91.
Roger CF Tucker, Michael J Carey, and Eluned S Parris. 1994. Automatic language identification using sub-word models. In Proc. ICASSP, volume 1, pages I–301.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NIPS, pages 6000–6010.
Po-Wei Wang, J. Zico Kolter, Vijai Mohan, and Inderjit S. Dhillon. 2018. Realtime query completion via deep language models. In Proc. ICLR.
Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proc. EMNLP.
Tsung-Hsien Wen, Aaron Heidel, Hung-yi Lee, Yu Tsao, and Lin-Shan Lee. 2013. Recurrent neural network based language model personalization by social network crowdsourcing. In Proc. Interspeech, pages 2703–2707.
Stewart Whiting and Joemon M Jose. 2014. Recent and robust query auto-completion. In Proc. WWW, pages 971–982.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Puyang Xu, Sanjeev Khudanpur, and Asela Gunawardana. 2011. Randomized maximum entropy language models. In Proc. Automatic Speech Recognition and Understanding (ASRU), IEEE Workshop on, pages 226–230.
Dani Yogatama, Chris Dyer, Wang Ling, and Phil Blunsom. 2017. Generative and discriminative text classification with recurrent neural networks. arXiv preprint arXiv:1703.01898.
Dani Yogatama, Chong Wang, Bryan R Routledge, Noah A Smith, and Eric P Xing. 2014. Dynamic language models for streaming text. TACL, 2:181–192.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Weinan Zhang, Ting Liu, Yifa Wang, and Qingfu Zhu. 2017. Neural personalized response generation as domain adaptation. arXiv preprint arXiv:1701.02073.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proc. NIPS, pages 649–657.
Arkaitz Zubiaga, Inaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno-Fernández. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. In Proc. TweetLID@SEPLN, pages 1–11.

Appendix A

SPARSE CORRECTIONS FOR OUTPUT LAYER ADAPTATION

In Chapter 3, we saw that adapting the output layer with a low-rank term combined with a hash table that stored coefficients for individual words was helpful. The low-rank term exploits similarities between contexts to boost the probability of words when appropriate. We hypothesized that there may be certain words that are unique to a particular context and are not well modeled by a low-rank adaptation. The purpose of the hashed coefficients was to serve as a correction term to the low-rank adaptation. In this appendix, we investigate a sparse correction term, where sparsity is achieved with an L1 penalty during training, as an alternative way of modeling exceptions to the low-rank adaptation.

A.1 Sparse plus low-rank softmax bias adaptation

We test two methods of forming embeddings for low-rank adaptation of the output layer. One is to use a neural network, as in Equation 2.4, to make a single embedding that summarizes all context variables. The second is to learn individual embedding matrices for each of the n context variables. In this case, Equation 2.6 is replaced with

y_t = \mathrm{softmax}\Big(E^\top h_t + \sum_{i=1}^{n} G_i c_i + b_2\Big), \qquad (A.1)

which has an individual adaptation term for each context variable. The sparse correction to the low-rank adaptation bias has a similar form, except that the low-dimensional context embeddings c_i are replaced by one-hot encoded vectors o_i:

y_t = \mathrm{softmax}\Big(E^\top h_t + \sum_{i=1}^{n} G_i c_i + \sum_{i=1}^{n} A_i o_i + b_2\Big). \qquad (A.2)

The A_i matrices have rank equal to the cardinality of the i-th context variable and are therefore much larger than the G_i matrices, whose rank is set by the dimensionality of c_i. To encourage sparsity in the A_i matrices, we apply a soft-threshold operator:

u(s, \lambda) = \begin{cases} s + \lambda & s < -\lambda \\ 0 & -\lambda \le s \le \lambda \\ s - \lambda & \lambda < s \end{cases} \qquad (A.3)

This has essentially the same effect as adding an L1 penalty term to the objective, except that it sets coefficients exactly to zero when they fall within the threshold range. During tuning, we search for the optimal penalty λ_i for each of the four context variables. The soft-thresholding operation u(A_i o_i, λ_i) is applied element-wise to each entry of A_i o_i.

We use the same TripAdvisor data as in Chapter 4, except that instead of two context variables (hotel identifier and review sentiment), we have four: hotel identifier, month of stay, year of stay, and region. There are 2,918 hotels, 14 years (1999 to 2012), 12 months, and 25 regions (major cities in the United States). We hypothesized that the more detailed context would have more specialized language, which might have more potential to benefit from a sparse correction. Because of the success with the hash table, we hypothesized that sparse terms would be most useful at the output layer; thus, we use no adaptation at the recurrent layer in the following experiments. The vocabulary size is fixed at 10,000 words (11% of the size used on the same data in Chapter 4) so that the model can be trained quickly with a full softmax loss; a sampled softmax loss might interfere with the regularization applied to the softmax bias vector.
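To make the structure of Equations A.2 and A.3 concrete, the following is a minimal PyTorch-style sketch of the adapted output layer. The class and parameter names, tensor shapes, initialization, and the choice to apply the soft threshold inside the forward pass are illustrative assumptions, not the implementation used in these experiments.

```python
import torch
import torch.nn.functional as F


def soft_threshold(s, lam):
    # Elementwise soft-threshold u(s, lambda) from Eq. (A.3): entries inside
    # [-lambda, lambda] become exactly zero; everything else shrinks by lambda.
    return torch.sign(s) * torch.clamp(s.abs() - lam, min=0.0)


class SparseLowRankSoftmaxBias(torch.nn.Module):
    """Sketch of the adapted softmax in Eq. (A.2): a low-rank term G_i c_i per
    context variable plus a sparse one-hot correction u(A_i o_i, lambda_i)."""

    def __init__(self, hidden_dim, vocab_size, context_cards, context_dims, lambdas):
        super().__init__()
        # E^T h_t: output word embeddings shared with the softmax layer
        self.E = torch.nn.Parameter(0.01 * torch.randn(hidden_dim, vocab_size))
        self.b2 = torch.nn.Parameter(torch.zeros(vocab_size))
        # low-dimensional embedding c_i and low-rank matrix G_i for each context variable
        self.ctx_emb = torch.nn.ModuleList(
            [torch.nn.Embedding(card, dim) for card, dim in zip(context_cards, context_dims)])
        self.G = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.zeros(vocab_size, dim)) for dim in context_dims])
        # full-rank table A_i per context variable; a row lookup implements A_i o_i
        self.A = torch.nn.ModuleList(
            [torch.nn.Embedding(card, vocab_size) for card in context_cards])
        for a in self.A:
            torch.nn.init.zeros_(a.weight)
        self.lambdas = lambdas  # one soft-threshold penalty per context variable

    def forward(self, h_t, context_ids):
        # h_t: [batch, hidden_dim]; context_ids: list of [batch] integer tensors
        logits = h_t @ self.E + self.b2
        for i, ids in enumerate(context_ids):
            c_i = self.ctx_emb[i](ids)                # [batch, context_dims[i]]
            logits = logits + c_i @ self.G[i].t()     # low-rank term G_i c_i
            a_i = self.A[i](ids)                      # [batch, vocab_size], i.e. A_i o_i
            logits = logits + soft_threshold(a_i, self.lambdas[i])  # sparse correction
        return F.log_softmax(logits, dim=-1)
```

Because each o_i is one-hot, the product A_i o_i reduces to selecting one row of A_i, which is how the sketch realizes the sparse correction term.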
We trained 59 models using a random search strategy for hyperparameter tuning. All models used a fixed word embedding and GRU dimension of 72 and a batch size of 64. Using a smaller GRU dimension, as we do here, places more importance on the softmax layer and gives the adaptation a better chance of having an impact. The hyperparameters that varied between models were the dimensionality of the embeddings for the four context variables, the λ_i for each context variable, and whether each of the two forms of adaptation was used or disabled. For a model that uses no low-rank adaptation, the sparse matrix used to adapt the bias for the hotel identifier variable contains 29 million parameters, 96% of the total. This is intentionally over-parameterized so that the regularization has a chance to be useful. All of our results are on the validation set, not the test set.

  Low-rank   Sparse   Perplexity
  No         No       44.0
  Yes        No       41.0
  No         Yes      41.4
  Yes        Yes      41.2

Table A.1: Perplexity on the validation set for models with no adaptation and with varying softmax adaptation strategies. Results are not comparable to those in Chapter 4 because of the difference in vocabulary size.

As seen in Table A.1, the best model used low-rank adaptation and no sparse correction. In theory, including the sparse correction should be no worse than leaving it out, as long as the λ_i are properly tuned. The fact that adding the sparse correction on top of the low-rank adaptation hurt performance indicates that more experiments are needed for proper tuning. Among the models that did use a sparse correction term, the ones that worked best had λ_i equal to zero or so close to zero that the regularization had no effect. There was no apparent benefit to an L1 penalty on this dataset with this size of model and vocabulary.

A second finding is that using separate embeddings for each context variable, as in Equation A.1, is better for adapting the softmax layer than combining them with a neural network hidden layer, as in Equations 2.4 and 2.6. The difference was small (perplexities of 41.0 versus 41.3) but consistent. This indicates that the optimization of the context embedding parameters could be improved.

In predictive text applications, language models are used to suggest the next words that someone might want to type, speeding up typing on a mobile device. Top-3 next-word prediction accuracy is a better-suited metric for this task than perplexity. We computed top-3 accuracy for the language models in these experiments and found that the "sparse" adaptation strategy (not truly sparse, since the tuned weight on the L1 regularization term was near zero) had a larger effect on accuracy than on perplexity. The best model, which used both low-rank and sparse adaptation, had a top-3 accuracy of 46.35%, compared to 46.27% for the best model using only low-rank adaptation. The difference is small but statistically significant (p < 0.001).

A.2 L1 Penalty for Bias Layer Fine-Tuning

We mentioned model fine-tuning as a type of adaptation in Section 2.3.1. In this section, we test whether including an L1 penalty on the bias term is helpful when fine-tuning a pre-trained language model to match the style of a small set of data from another domain. We created small datasets for fine-tuning by selecting the subset of TripAdvisor reviews that mention the word "Hilton" and the subset written about hotels in Boston. A third dataset was created by selecting a random subset of (mostly restaurant) reviews from the Yelp dataset, and a fourth used only the Yelp reviews written about a dentist office.

For the pre-trained model, we used the best unadapted TripAdvisor model from Section A.1. Fine-tuning was done by continuing training of that model on the new, smaller dataset until it began to over-fit. In a realistic scenario there would not be a large validation set available to check for over-fitting, but in this experiment we are only aiming for a proof of concept. We tried selectively freezing layers of the pre-trained model and also varied the size of the training data and the scale of the L1 penalty. In each case, the best result was obtained with zero penalty and early stopping as the only regularizer.
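To make this setup concrete, the sketch below shows one of the configurations described above in PyTorch-style code: all pre-trained weights frozen except the softmax bias, an L1 penalty on that bias, and early stopping on validation loss. The model interface (a call that returns the mean cross-entropy loss, and a `softmax_bias` attribute), along with all hyperparameter values, are hypothetical placeholders rather than the code used for these experiments.

```python
import torch


def finetune_bias_l1(model, train_batches, valid_batches,
                     l1_scale=1e-4, lr=1e-3, max_epochs=20, patience=2):
    """Fine-tune only the softmax bias of a pre-trained LM with an L1 penalty,
    stopping early when validation loss stops improving. `train_batches` and
    `valid_batches` are assumed to be lists of (input, target) tensor pairs."""
    for p in model.parameters():               # freeze the pre-trained weights ...
        p.requires_grad_(False)
    model.softmax_bias.requires_grad_(True)    # ... except the output-layer bias
    opt = torch.optim.Adam([model.softmax_bias], lr=lr)

    best_val, epochs_without_improvement = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_batches:
            # cross-entropy plus an L1 penalty restricted to the softmax bias
            loss = model(x, y) + l1_scale * model.softmax_bias.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(model(x, y).item() for x, y in valid_batches) / len(valid_batches)
        if val < best_val:
            best_val, epochs_without_improvement = val, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # early stopping
                break
    return best_val
```

Setting l1_scale to zero recovers the early-stopping-only configuration that gave the best results in these experiments.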
A.3 Summary

After conducting these experiments, we were unable to find a benefit from using an L1 penalty term to learn a sparse correction to a low-rank model when adapting the softmax bias. There are multiple possible explanations for this negative result.

• It is possible that the data we used is not well suited to this technique. With a dataset of a different size or vocabulary, a sparse correction might be helpful.

• Our use of random search for tuning the regularization penalties could be improved. We might have seen a different result with better tuning.

• We purposefully created a model that was over-parameterized. Over-parameterization can lead to over-fitting unless regularization is used. Using the L1 penalty is equivalent to placing a Laplacian prior distribution on the parameters, which encourages them to stay close to zero on average. However, even without regularization, the parameter values do not grow too large during any practical amount of training. The training data keeps the bias values from growing too large in the positive direction; in the negative direction, the bias terms for words that never occur in a particular context do not approach negative infinity because their gradients shrink exponentially faster than the parameters themselves. It turns out that stopping gradient descent after a finite amount of time is a more effective means of regularizing these parameters than applying an L1 penalty.

• We hypothesized that sparse corrections would be most useful at the softmax layer, but it is possible that they are helpful at the recurrent layer or for adapting the word embeddings.

• We assumed that not adapting the recurrent layer would give the sparse softmax bias adaptation a better chance of success. It is possible that this assumption is incorrect and that a sparse correction is only helpful after the low-rank component has been accounted for by adapting the recurrent layer.

In summary, we did not find a use for L1 regularization in adapting the softmax bias vector. More work is needed to confirm these experiments and to understand why.