Efficient Contextual Representation Learning With Continuous Outputs

Liunian Harold Li†, Patrick H. Chen∗, Cho-Jui Hsieh∗, Kai-Wei Chang∗
†Peking University  ∗University of California, Los Angeles
liliunian@pku.edu.cn, patrickchen@g.ucla.edu, {chohsieh, kwchang}@cs.ucla.edu

Abstract

Contextual representation models have achieved great success in improving various downstream natural language processing tasks. However, these language-model-based encoders are difficult to train due to their large parameter size and high computational complexity. By carefully examining the training procedure, we observe that the softmax layer, which predicts a distribution over the target words, often induces significant overhead, especially when the vocabulary size is large. Therefore, we revisit the design of the output layer and consider directly predicting the pre-trained embedding of the target word for a given context. When applied to ELMo, the proposed approach achieves a 4-fold speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks. Further analysis shows that the approach maintains the speed advantage under various settings, even when the sentence encoder is scaled up.

1 Introduction

In recent years, text representation learning approaches, such as ELMo (Peters et al., 2018a), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), and GPT-2 (Radford et al., 2019), have been developed to represent generic contextual information in natural languages by training an encoder with a language model objective on a large unlabelled corpus. During the training process, the encoder is given part of the text and asked to predict the missing pieces. Prior studies show that the encoders trained in this way can capture generic contextual information of the input text and improve a variety of downstream tasks significantly.

However, training contextual representations is known to be a resource-hungry process. For example, ELMo is reported to take about 2 weeks to train on a one-billion-token corpus with a vocabulary of 800,000 words using three GPUs.[1] This slow training procedure hinders the development cycle, prevents fine-grained parameter tuning, and makes training contextual representations inaccessible to the broader community. Recent work also raises concerns about the environmental implications of training such large models (Strubell et al., 2019). In addition, the success of these models stems from the large amount of data they are trained on. It is challenging, if not impossible, to train a contextual representation model on a larger corpus with tens or hundreds of billions of tokens.

In this work, we explore how to accelerate contextual representation learning. We identify the softmax layer as the primary cause of inefficiency. This component takes up a considerable portion of all trainable parameters (80% for ELMo) and consumes a huge amount of training time. However, it is often not needed in the final model, as the goal of contextual representation learning is to build a generic encoder. Therefore, it is rather a waste to allocate extensive computational resources to the softmax layer.

Inspired by Kumar and Tsvetkov (2019), we consider learning contextual representation models with continuous outputs. In the training process, the contextual encoder is learned by minimizing the distance between its output and a pre-trained target word embedding.
The constant time complexity and small memory footprint of the output layer perfectly serve our desire to decouple learning contexts and words and to devote most computational resources to the contextual encoder. In addition, we combine the approach with open-vocabulary word embeddings such that the model can be trained without the need to pre-define a closed word set as the vocabulary. We also provide an alternative interpretation of learning contextual encoders with continuous outputs that sheds light on how the pre-trained embedding could affect the performance of the model.

[1] https://github.com/allenai/bilm-tf/issues/55

We conduct a comprehensive empirical study to analyze the proposed approach and several existing methods that were originally proposed to reduce the complexity of the output layer in language models, such as the adaptive softmax and the subword methods. We incorporate these approaches into ELMo and compare them in terms of training speed and performance on five downstream tasks. We demonstrate that the proposed approach effectively reduces the training time and the number of trainable parameters while maintaining competitive performance compared with the baselines. Our approach also exhibits a consistent computational advantage under different conditions (e.g., with different vocabulary sizes, different sentence encoders, and different numbers of GPUs).

Source code is available at https://github.com/uclanlp/ELMO-C.

2 Background and Related Work

Contextual representation We review contextual representation models from two aspects: how they are trained and how they are used in downstream tasks.

CoVe (McCann et al., 2017) uses the source language encoder from a machine translation model as a contextual representation model. Peters et al. (2018a) advocate for the use of larger unlabelled corpora and propose ELMo, which combines a forward and a backward LSTM-based (Hochreiter and Schmidhuber, 1997) language model, whereas GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) build a language model with the Transformer (Vaswani et al., 2017). BERT (Devlin et al., 2019) introduces the masked language model and provides deep bidirectional representations.

There are two existing strategies for applying pre-trained contextual representations to downstream tasks: 1) feature-based and 2) fine-tuning. In the feature-based approach, fixed features are extracted from the contextual encoder (e.g., ELMo, CoVe) and inserted as an input into a task-specific model. In the fine-tuning approach, the contextual encoder is designed as a part of the network architecture for downstream tasks, and its parameters are fine-tuned with the downstream task. BERT is designed for the fine-tuning approach but is also evaluated with the feature-based approach. GPT-2 is a scaled-up version of GPT and exhibits strong performance under zero-shot settings.
Speeding up language model training Considerable efforts have been devoted to accelerating the training of language models. One line of research focuses on developing faster sequence encoder architectures, such as CNNs (Kim et al., 2016; Dauphin et al., 2017), QRNN (Bradbury et al., 2016), SRU (Lei et al., 2018), and the Transformer (Vaswani et al., 2017). These architectures have been extensively used for learning language representations (Radford et al., 2018; Devlin et al., 2019; Tang et al., 2018). Another line of work focuses on the large-vocabulary issue, as a large and ever-growing vocabulary results in an intractable softmax layer. Our work falls into the second line, and we review existing solutions in detail.

Several studies for language modeling focus on directly reducing the complexity of the softmax layer. Following Kumar and Tsvetkov (2019), we group them into two categories: sampling-based approximations and structural approximations. Sampling-based approximations include the sampled softmax (Bengio and Senécal, 2003) and NCE (Mnih and Teh, 2012). The sampled softmax approximates the normalization term of the softmax by sampling a subset of negative targets, and NCE replaces the softmax with a binary classifier. On the other hand, structural approximations, such as the hierarchical softmax (Morin and Bengio, 2005) and the adaptive softmax (Grave et al., 2016), form a structural hierarchy to avoid expensive normalization. The adaptive softmax, in particular, groups words in the vocabulary into either a short-list or clusters of rare words. For frequent words, a softmax over the short-list suffices, which reduces computation and memory usage significantly. The adaptive softmax has been shown to achieve results close to those of the full softmax while maintaining high GPU efficiency (Merity et al., 2018).

Regarding contextual representation models, ELMo used the sampled softmax, whereas GPT and BERT resorted to a subword method. Specifically, they used WordPiece (Wu et al., 2016) or BPE (Sennrich et al., 2016) to split words into subwords, and the language models were trained to take subwords as input and also predict subwords. This method is efficient and scalable, as the subword vocabulary can be kept small. One potential drawback of these subword-level language models, however, is that they produce representations for fragments of words. Therefore, it takes extra effort to generate word-level representations (see the discussion in Section 4.2).

The high cost of the softmax layer has also been noted in the sentence representation learning literature. Following the success of Word2Vec (Mikolov et al., 2013), methods such as SkipThought (Kiros et al., 2015) have been developed to learn distributed sentence representations by predicting the context sentences of a given sentence, which involves sequentially decoding words of the target sentences. Jernite et al. (2017) and Logeswaran and Lee (2018) notice the inefficiency of the softmax layer during decoding and propose to use discriminative instead of generative objectives, eliminating the need for decoding. However, these approaches are not directly applicable to contextual representation learning.
3 Approach

A contextual representation model, at its core, is a language model pre-trained on a large unlabeled corpus. In the following, we review the objective of language models and the architectures of existing contextual representation models. We then introduce the proposed model.

Language model objective Given a set of text sequences as the training corpus, we can construct a collection of word-context pairs (w, c), and the goal of a language model is to predict the word w based on the context c. In a forward language model, the context c is defined as the previous words in the sequence, whereas for a backward language model, the context of a word is defined as the following words. For a masked language model, some words in the input sentence are masked (e.g., replaced by a [MASK] token) and the objective is to predict the masked words from the remainder. Different contextual representation models optimize different objectives; for example, ELMo trains a forward and a backward language model, and BERT trains a masked language model.

Model architecture A typical neural language model consists of three parts: 1) an input layer, 2) a sequence encoder, and 3) a softmax layer. Given a word-context pair (w, c), the input layer uses a word embedding or a character-CNN model (Kim et al., 2016) to convert the input words in c into word vectors. Then the sequence encoder embeds the context into a context vector $c \in \mathbb{R}^m$ using a multi-layer LSTM (Hochreiter and Schmidhuber, 1997), a Gated CNN (Dauphin et al., 2017), or a Transformer (Vaswani et al., 2017). The softmax layer then multiplies the context vector c with an output word embedding[2] $W \in \mathbb{R}^{V \times m}$ and uses a softmax function to produce a conditional distribution p(w|c) over the vocabulary of size V. In a language model, the learning objective l(w, c) for (w, c) is then expressed as
\[
l(w, c) = -\log p(w \mid c) = -\log \operatorname{softmax}(cW^{\top})
        = -c \cdot w + \log \sum_{w'} \exp(c \cdot w'), \tag{1}
\]
where $w \in \mathbb{R}^m$ is the row of W corresponding to the target word w and the second term sums over the vocabulary. After the model is trained, the contextual representations are generated from the latent states of the sequence encoder. For example, ELMo combines the hidden states of the LSTMs to generate a contextualized embedding for each word in a sentence. We refer the reader to Peters et al. (2018a) for details.

Note that the size of W and the computational complexity of the second term in Eq. (1) scale linearly with the vocabulary size V. Therefore, when V is large, the softmax layer becomes the speed bottleneck.

[2] The dimension of the original output from the sequence encoder may not match the dimension of the output word embedding. In that case, a projection layer is added after the original sequence encoder to ensure that the two dimensions match.

Our approach The scaling issue of the softmax also occurs in other language generation and sequence-to-sequence models. In the literature, several approaches have been proposed to approximate the softmax layer or bypass it with a subword method (see Section 2). Recently, Kumar and Tsvetkov (2019) propose to treat the context vector as a continuous output and directly minimize the distance between the context vector and the pre-trained word embedding associated with the target word,
\[
l(w, c) = d(c, w). \tag{2}
\]
The distance function d could be the L2 distance $\lVert c - w \rVert_2$, the cosine distance $\frac{c \cdot w}{\lVert c \rVert \lVert w \rVert}$, or a probabilistic distance metric.
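To make the contrast between Eq. (1) and Eq. (2) concrete, here is a minimal PyTorch sketch of the two objectives. This is our own illustration, not the authors' released code; the tensor names and sizes are placeholders.

```python
import torch
import torch.nn.functional as F

V, m, B = 50_000, 512, 32              # small V for illustration; ELMo's vocabulary is ~800K
c = torch.randn(B, m)                  # context vectors produced by the sequence encoder
targets = torch.randint(0, V, (B,))    # indices of the target words

# Softmax objective, Eq. (1): the B x V logit matrix and the trainable
# output embedding W both scale with the vocabulary size V.
W = torch.nn.Parameter(torch.randn(V, m))
loss_softmax = F.cross_entropy(c @ W.t(), targets)

# Continuous-output objective, Eq. (2) with a cosine distance: only B rows of a
# frozen pre-trained embedding are touched, and nothing in this branch is trained.
pretrained_emb = torch.randn(V, m)     # stand-in for a frozen embedding (e.g., FastText)
w = pretrained_emb[targets]            # (B, m) target word vectors
loss_cont = (1.0 - F.cosine_similarity(c, w, dim=-1)).mean()
```

The softmax branch materializes a B × V logit matrix and carries V × m trainable weights, whereas the continuous branch only gathers B target vectors of dimension m.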
We argue that the idea of learning with continuous outputs particularly suits contextual representation learning. As the goal is to obtain a strong contextual encoder, it makes sense to use a pre-trained output word embedding and decouple learning the contextual encoder from learning the output embedding. In the remainder of this section, we discuss the computational efficiency of the proposed approach and its combination with open-vocabulary word embeddings. We also provide an alternative way to interpret training contextual encoders with continuous outputs.

3.1 Computational Efficiency

The continuous output layer has a reduced arithmetic complexity and trainable parameter size. We illustrate these improvements and how they contribute to reducing the overall training time of a contextual representation model in the following. For comparison, we include the sampled softmax, the adaptive softmax, and the subword method in the discussion.

3.1.1 Learning with Continuous Outputs

Arithmetic complexity The arithmetic complexity (i.e., FLOPs) of evaluating the loss with continuous outputs (i.e., Eq. 2) is O(m), as we only need to calculate the distance between two m-dimensional vectors. The complexity of the sampled softmax is proportional to the number of negative samples per batch, and when the vocabulary is huge, a large number of negative samples is needed (Jozefowicz et al., 2016). For the adaptive softmax, the time complexity is determined by the capacities of the short-list and the rare-word clusters, which grow sub-linearly with the vocabulary size. The complexity of the subword method is determined by the subword vocabulary size. In contrast, the time spent on the continuous output layer and loss evaluation remains constant with respect to the vocabulary size and is negligible.

Trainable parameter size The output word embedding usually takes up a huge part of the parameters of a language model. For example, the softmax layer of ELMo trained on the One Billion Word Benchmark (Chelba et al., 2013) takes up more than 80% of the trainable parameters of the entire model. Even if an approximation such as the sampled softmax is used, the number of trainable parameters is not reduced. Approaches like the adaptive softmax reduce the embedding dimension for rare words, which shrinks the trainable parameter size, but the size remains substantial: for a model trained on the same corpus (Grave et al., 2016), the adaptive softmax still amounts to 240 million parameters, whereas the sequence encoder has only around 50 million parameters. In contrast, we learn the contextual encoder with Eq. (2) using a pre-trained word embedding, reducing the trainable parameters besides the encoder from tens or hundreds of millions to zero.
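As a rough sanity check on these numbers (our own back-of-the-envelope estimate, assuming an output embedding dimension of m = 512, matching ELMo's projection size), the softmax output embedding alone contains
\[
V \times m \approx 800{,}000 \times 512 \approx 4.1 \times 10^{8}
\]
weights, which by itself accounts for roughly 80% of a model with about 500M trainable parameters (cf. Table 2), whereas the continuous output layer contributes no trainable weights at all.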
3.1.2 Overall Training Time

We now discuss how the efficiency improvements to the output layer contribute to reducing the overall training time, in the context of synchronous stochastic gradient descent training on multiple GPUs. In general, the following three factors determine the training time.

Arithmetic complexity The arithmetic complexity of a model includes the complexity of the forward and backward propagation through the input layer, the sequence encoder, and the output layer. It also includes the overhead of the optimization algorithm, such as gradient clipping and model updates. The complexity of this optimization overhead is often proportional to the number of parameters that need updating. With the continuous output layer, not only the arithmetic complexity but also the optimization overhead is reduced.

GPU memory consumption The training time is also affected by GPU memory consumption, as lower GPU memory consumption allows a larger batch size. For the same amount of data and hardware resources, a larger batch size means better parallelism and less training time. Our approach has a small GPU memory footprint, due to the reductions in arithmetic complexity (fewer intermediate results to keep) and trainable parameter size (fewer parameters to store). As a result, training with continuous outputs is 2 to 4 times more memory-efficient than training with the softmax layer (see Section 5.2).

Note that because the output word embedding is fixed, we can keep it in main memory and only load the required part into GPU memory (see the sketch at the end of this subsection). Although this incurs the overhead of moving part of the output word embedding from CPU to GPU memory at each iteration, the benefit of better parallelism often dominates this communication overhead on mainstream hardware, where GPU memory is comparatively limited. We also note that a larger batch size may lead to difficulty in optimization. Several methods have been developed to ease this large-batch training issue (Goyal et al., 2017; You et al., 2018), and we show that they are sufficient for resolving the optimization difficulty in our experiments (Section 4).

Communication cost To train large neural network models, using multiple GPUs is almost a necessity, and one way to scale up current systems is to increase the number of GPUs used. In such cases, the communication cost across GPUs needs to be taken into consideration. The cost arises from synchronizing the parameters and their gradients across GPUs, which is proportional to the number of parameters that need to be updated. For the sampled softmax, due to the use of sparse gradients, the communication cost is proportional to the number of sampled words. For the adaptive softmax and the subword language model, the communication cost is proportional to the trainable parameter size. The continuous output layer, on the other hand, incurs little communication cost across GPUs.
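The sketch below illustrates the CPU-resident embedding scheme mentioned under GPU memory consumption above. It is our own minimal illustration with placeholder sizes, not the released implementation: the frozen output embedding stays in pinned host memory, and only the rows needed for the current batch are copied to the GPU.

```python
import torch
import torch.nn.functional as F

m, V = 512, 800_000
# Frozen pre-trained output embedding kept in pinned CPU memory
# (about 1.6 GB of host RAM at fp32 for these sizes).
output_emb = torch.randn(V, m).pin_memory()

def continuous_output_loss(context_vecs, target_ids):
    """context_vecs: (B, m) tensor on the GPU; target_ids: (B,) word indices on the CPU."""
    # Gather only the B rows needed for this batch and ship them to the GPU.
    batch_targets = output_emb[target_ids]                                  # (B, m), CPU
    batch_targets = batch_targets.to(context_vecs.device, non_blocking=True)
    return (1.0 - F.cosine_similarity(context_vecs, batch_targets, dim=-1)).mean()
```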
3.2 Open-Vocabulary Training

We use an open-vocabulary word embedding as both the input and the output layer embedding. Open-vocabulary word embeddings, such as the FastText embedding and the MIMICK model (Pinter et al., 2017), utilize character or subword information to provide word embeddings. They can represent an unlimited number of words with a fixed number of parameters. As a result, we can train contextual encoders with an open vocabulary; that is, we do not need to pre-define a closed word set as the vocabulary, and the model can be trained on any sequence of words.

Open-vocabulary input layer To be easily applied to various tasks, the contextual encoder usually has an open-vocabulary input layer. ELMo uses a character-CNN, but it is relatively slow. Thus we use a pre-trained open-vocabulary word embedding as the input layer of the contextual encoder, reducing the time complexity of the input layer to a negligible level. This also aligns with the main spirit of our approach, which is to spend computational resources on the most important part, the sequence encoder.

Open-vocabulary output layer For the softmax layer, including efficient variants such as the adaptive softmax, the output vocabulary has to be pre-defined so that the normalization term can be calculated. Because the softmax layer's arithmetic complexity and parameter size grow with the vocabulary size, the vocabulary is often truncated to avoid expensive computation. With the continuous output layer, we can conduct training on an arbitrary sequence of words, as long as the output embeddings for those words can be derived. This can be achieved by using the open-vocabulary embedding. This feature is especially attractive if we are training on domains or languages with a long-tail word distribution, such as the biomedical domain, where truncating the vocabulary may not be acceptable.
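To illustrate how such open-vocabulary target embeddings can be obtained in practice, the snippet below uses the fasttext Python package with a hypothetical pre-trained model file; it is our own sketch, not part of the paper's pipeline. Because FastText composes vectors from character n-grams, even unseen words receive a target vector.

```python
import fasttext
import numpy as np

# Hypothetical path to a pre-trained FastText binary (e.g., a Common Crawl model).
ft = fasttext.load_model("cc.en.300.bin")

# Target vectors exist for frequent, rare, and even out-of-vocabulary words,
# so the continuous output layer never requires a truncated, closed vocabulary.
words = ["the", "angiogenesis", "pseudohypoparathyroidism"]
targets = np.stack([ft.get_word_vector(w) for w in words])   # shape (3, 300)
```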
3.3 Interpretation of Learning Contextual Encoders with Continuous Outputs

In the following, we justify the intuition behind learning with continuous outputs and discuss how the pre-trained word embedding affects the performance of the model.

Language models essentially model the word-context conditional probability matrix $A \in \mathbb{R}^{N \times V}$, where $A_{c,w} = p(w \mid c)$, N is the number of all possible contexts, and V is the vocabulary size (Levy and Goldberg, 2014; Yang et al., 2017). The continuous output layer can be viewed as modeling A after using the word embedding as a projection matrix. To illustrate this, consider the global objective of the layer with the cosine distance:[3]
\[
L = \sum_{(w,c)} \#(w,c)\, l(w,c)
  = -\sum_{(w,c)} \#(w,c)\, c \cdot w
  = -\sum_{c} \#(c)\, c \cdot \sum_{w} p(w \mid c)\, w
  = -\sum_{c} \#(c)\, c \cdot \sum_{w} A_{c,w}\, w,
\]
where #(w, c) is the number of occurrences of the pair (w, c) in the corpus and #(c) is the number of occurrences of the context c. We assume all vectors (c and w) are normalized. To optimize the inner product between c and $\sum_{w} p(w \mid c)\, w$, we essentially align the direction of the context vector c with the expected word vector under context c, $\sum_{w} p(w \mid c)\, w = \mathbb{E}_{w \sim p(w \mid c)}[w]$. In other words, given a word embedding matrix $W \in \mathbb{R}^{V \times d}$, our approach aligns c with the corresponding row $(AW)_{c,:}$ of AW. Therefore, the objective can be viewed as conducting multivariate regression to approximate $(AW)_{c,:}$ given the context.

[3] For simplicity, we take the cosine distance as a running example, but the conclusions hold for other distance functions.

Based on this view, the performance of the contextual representation model depends on how much information of the original matrix A is preserved after projection with W. In the spirit of PCA (Jolliffe, 2011), to keep the variance of A, we would like $(AW)_{c,:}$ and $(AW)_{c',:}$ to be distant from each other if c and c′ are very different. Therefore, a pre-trained word embedding, which maps words with different meanings to different positions in space, is a natural choice for the projection matrix W and can help preserve much of the variance of A.

4 Experiment

We accelerate ELMo with the proposed approach and show that the resulting model, ELMO-C, is computationally efficient and maintains competitive performance, compared with the original ELMo model (ELMO), an ELMo variant with the adaptive softmax (ELMO-A[4]), and another variant with the subword method (ELMO-Sub).

[4] We include ELMO-A instead of a model with the sampled softmax because the adaptive softmax has been shown to have superior performance (Grave et al., 2016).

4.1 Setup

Models In the following, we introduce the models in detail. Table 1 provides a summary.

| Model | Input | Sequence Encoder | Output |
|---|---|---|---|
| ELMO | CNN | LSTM | Sampled Softmax |
| ELMO-C (OURS) | FASTTEXTCC | LSTM w/ LN | CONT w/ FASTTEXTCC |
| ELMO-A | FASTTEXTCC | LSTM w/ LN | Adaptive Softmax |
| ELMO-Sub | Subword | LSTM w/ LN | Softmax |
| ELMO-CONEB | FASTTEXTONEB | LSTM w/ LN | CONT w/ FASTTEXTONEB |
| ELMO-CRND | FASTTEXTCC | LSTM w/ LN | CONT w/ Random Embedding |
| ELMO-CCNN | Trained CNN | LSTM w/ LN | CONT w/ Trained CNN |
| ELMO-CCNN-CC | Trained CNN | LSTM w/ LN | CONT w/ FASTTEXTCC |
| ELMO-CCC-CNN | FASTTEXTCC | LSTM w/ LN | CONT w/ Trained CNN |

Table 1: Specifications of the variants of ELMo compared in Sections 4 and 5. CONT means the model has continuous outputs. LN means layer normalization.

The original ELMo consists of a character-CNN as the input layer, a forward and a backward LSTM with projection as the sequence encoder, and a sampled softmax as the output layer. Adagrad (Duchi et al., 2011) is used as the optimizer. We conduct experiments using the re-implementation of ELMO in AllenNLP (Gardner et al., 2018) and build the other models upon it.

The key difference between ELMO-C and ELMO is that ELMO-C produces continuous outputs and we train it with the cosine distance loss. A FastText embedding trained on Common Crawl (Mikolov et al., 2017) (FASTTEXTCC) is used as the output embedding. Based on preliminary experiments, we also make three minor changes: 1) we use FASTTEXTCC as the input layer, as it is more efficient than the character-CNN model; 2) we add a layer norm (Ba et al., 2016) after the projection layer of the LSTM to improve the convergence speed; 3) we use Adam with the learning rate schedule from Chen et al. (2018) to help training with a large batch size.

Our main goal is to study how different output layers affect the training speed and performance, which cannot be achieved by comparing only ELMO-C and ELMO, due to the aforementioned minor changes to ELMO-C. Therefore, we introduce two additional baseline models (ELMO-A and ELMO-Sub), which differ from ELMO-C in a minimal way. Specifically, their sequence encoders and training recipes are kept the same as ELMO-C. Thus ELMO-C, ELMO-A, and ELMO-Sub can be directly compared.

| | ELMOORG | BASE | FASTTEXTCC | ELMO | ELMO-A | ELMO-Sub | ELMO-C |
|---|---|---|---|---|---|---|---|
| Time | − | − | − | 14 x 3 | 5.7 x 4 | 3.9 x 4 | 2.5 x 4 |
| Batch | − | − | − | 128 | 256 | 320 | 768 |
| Params | − | − | − | 499M | 196M | 92M | 76M |
| SNLI | 88.7 | 88.0 | 87.7 | 88.5 | 88.9 | 87.1 | 88.8 |
| Coref | NA | NA | 68.90 | 72.9 | 72.9 | 72.4 | 72.9 |
| SST-5 | 54.7 | 51.4 | 51.30 ± 0.77 | 52.96 ± 2.26 | 53.58 ± 0.77 | 53.02 ± 2.08 | 53.80 ± 0.73 |
| NER | 92.22 | 90.15 | 90.97 ± 0.43 | 92.51 ± 0.28 | 92.28 ± 0.20 | 92.17 ± 0.56 | 92.24 ± 0.10 |
| SRL | 84.6 | 81.4 | 80.2 | 83.4 | 82.7 | 82.4 | 82.4 |

Table 2: Computational efficiency of the main competing models and their performance on five NLP benchmarks. Time is the overall training time in Days x Cards format. Batch is the maximal batch size per card. Params is the number of trainable parameters in millions. Due to the small test sizes for NER and SST-5, we report the mean and standard deviation across three runs. Our approach (ELMO-C) exhibits better computational efficiency and shows comparable performance compared with ELMO, ELMO-A, and ELMO-Sub.
ELMO-A uses the adaptive softmax as its output layer. We carefully choose the hyper-parameters of the adaptive softmax to obtain an efficient yet strong baseline. It has only half the parameters of the one reported in Grave et al. (2016) but achieves a perplexity of 35.8, lower than ELMO's 39.7.

ELMO-Sub takes subwords as input and also predicts subwords. Thus, unlike the other models, its vocabulary consists of around 30,000 subwords created using BPE (Sennrich et al., 2016). For this reason, a lookup-table-style embedding rather than FASTTEXTCC is used as its input layer, and a vanilla softmax is used as its output layer. Its input and output embeddings are tied and trained from scratch.

For reference, we also list the results of ELMo and the baseline reported in Peters et al. (2018a) as ELMOORG and BASE. However, these models are evaluated under different configurations. Finally, we also include FASTTEXTCC, a (non-contextual) word embedding model, as another baseline.

All contextual representation models are trained on the One Billion Word Benchmark for 10 epochs, and the experiments are conducted on a workstation with 8 GeForce GTX 1080 Ti GPUs, 40 Intel Xeon E5 CPUs, and 128G of main memory.

Downstream benchmarks We follow Peters et al. (2018a) and use the feature-based approach to evaluate contextual representations on downstream benchmarks. ELMo was evaluated on six benchmarks, and we conduct evaluations on five of them; SQuAD (Rajpurkar et al., 2016) is not included, for implementation reasons.[5] In the following, we briefly review the benchmarks and task-specific models. For details, please refer to Peters et al. (2018a).

[5] The SQuAD experiment in Peters et al. (2018a) was conducted with an implementation in TensorFlow. The experiment setting is not currently available in AllenNLP (https://github.com/allenai/allennlp/pull/1626), nor can it be easily replicated in PyTorch.

• SNLI (Bowman et al., 2015): The textual entailment task seeks to determine whether a "hypothesis" can be entailed from a "premise". The task-specific model is ESIM (Chen et al., 2017).

• Coref: Coreference resolution is the task of clustering mentions in text that refer to the same underlying entities. The data set is from the CoNLL 2012 shared task (Pradhan et al., 2012) and the model is from Lee et al. (2018). Note that we use an improved version of the Coref system (Lee et al., 2017) used in Peters et al. (2018a).

• SST-5 (Socher et al., 2013): The task involves selecting one of five labels to describe a sentence from a movie review. We use the BCN model from McCann et al. (2017).

• NER: The CoNLL 2003 NER task (Sang and De Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four different entity types. The model is a biLSTM-CRF from Peters et al. (2018a), similar to Collobert et al. (2011).

• SRL: Semantic role labeling models the predicate-argument structure of a sentence. It seeks to answer "who did what to whom". The model is from He et al. (2017) and the data set is from Pradhan et al. (2013).

For SNLI, SST-5, NER, and SRL, we use the same downstream models as in Peters et al. (2018a), re-implemented in AllenNLP.[6]

[6] For SRL, the score reported by AllenNLP is lower than the score reported by the official CoNLL script.
For Coref, Peters et al. (2018a) use the model from Lee et al. (2017), and we use an improved model (Lee et al., 2018) from the same authors. For all tasks, the hyper-parameters are tuned to maximize performance for the original ELMo, and all models are tested under the same configurations.

4.2 Main Results

We report the main results in Table 2. Our approach (ELMO-C) enjoys a substantial computational advantage while maintaining competitive or even superior performance, compared with ELMO, ELMO-A, and ELMO-Sub.

Model efficiency For model efficiency, the statistics of ELMO are reported by the original authors, who used three GTX 1080 Tis. We train ELMO-A, ELMO-Sub, and ELMO-C using four GTX 1080 Tis. Roughly speaking, compared with ELMO, ELMO-C is 4.2x faster and 6x more memory-efficient. To give a clear view of the speedup the CONT layer brings, we compare ELMO-C with ELMO-A, which differs from ELMO-C only in the output layer. Still, ELMO-C has a 2.28x speed advantage and is 3x more memory-efficient. Compared with ELMO-Sub, our approach holds a 1.56x speed advantage and is 2x more memory-efficient. The results here only show the overall efficiency of our approach under the setting of ELMo; a detailed analysis of the efficiency is provided in Section 5.2.

Performance on downstream tasks ELMO-C works especially well on semantic-centric tasks, such as SNLI, Coref, and SST-5. However, for tasks that require a certain level of syntactic information, such as NER and SRL (He et al., 2018), ELMO-C suffers from a slight performance degradation compared with ELMO, but it remains competitive with ELMO-A and ELMO-Sub. We suspect that this degradation is related to the pre-trained embedding and conduct further analysis in Section 5.1.

In addition, we notice that the performance of ELMO-Sub is unstable. It shows satisfactory performance on SST-5, NER, and SRL, but it lags behind on Coref and even fails to outperform FASTTEXTCC on SNLI. ELMO-Sub provides subword-level contextual representations, which we suspect is the cause of the unstable performance. Specifically, to get the representation for a word when evaluating on word-level tasks, we follow Devlin et al. (2019) and use the representation of its first subword. This could be sub-optimal if a precise word-level representation is desired (e.g., when the suffix of a word is an important feature). These results are consistent with the observation of Kitaev and Klein (2018), who find that a special design is needed to apply BERT to constituency parsing because of the subword segmentation. However, we note that the scope of our experiment is limited. It is likely that when ELMO-Sub is scaled up or used with the fine-tuning method, the aforementioned issue is alleviated; we leave that to future work.

5 Analysis

We conduct analysis regarding the effect of the pre-trained word embedding on the performance of the contextual encoder. We also investigate the contributions of different factors to the overall training time and study the speedup of our approach under various conditions.

5.1 Effect of the Pre-trained Embedding

We show the effect of the pre-trained embedding by introducing several variants of ELMO-C (summarized in Table 1) and list their performance in Table 3.
Quality of the pre-trained embedding We aim to understand how the quality of the pre-trained output word embedding W affects the performance of the contextual encoder. To study this, we train a FastText word embedding on the One Billion Word Benchmark, a much smaller corpus than Common Crawl. We then train an ELMO-C variant, ELMO-CONEB, by using this embedding in the input and output layers. ELMO-CONEB holds up surprisingly well: it is competitive with ELMO-C on SNLI, Coref, and SST-5, while being inferior on NER and SRL. This motivates us to take a step further and use a completely random output word embedding.

| Task | ELMO-C | ELMO-CONEB | ELMO-CRND | ELMO-CCNN | ELMO-CCNN-CC | ELMO-CCC-CNN |
|---|---|---|---|---|---|---|
| SNLI | 88.8 | 88.4 | 88.4 | 88.2 | 88.0 | 88.4 |
| Coref | 72.9 | 73.0 | 72.4 | 72.9 | 72.8 | 72.6 |
| SST-5 | 53.80 ± 0.73 | 52.70 ± 0.90 | 53.01 ± 1.67 | 53.38 ± 0.68 | 54.33 ± 1.26 | 54.16 ± 0.96 |
| NER | 92.24 ± 0.10 | 92.03 ± 0.47 | 91.99 ± 0.35 | 92.24 ± 0.36 | 92.04 ± 0.33 | 91.93 ± 0.53 |
| SRL | 82.4 | 82.2 | 82.9 | 82.8 | 83.4 | 83.3 |

Table 3: Performance of ablation models on five NLP benchmarks. ELMO-C is included for reference.

We replace the output embedding of ELMO-C with a random embedding matrix, in which each element is drawn from a standard normal distribution. We denote this model as ELMO-CRND. We find that this model performs well (Table 3), with only a mild performance drop compared to ELMO-C. The performance of ELMO-CRND shows the robustness of the proposed approach and demonstrates that the deep LSTM is expressive enough to fit a complex output space. However, we find that the pre-trained input word embedding is still indispensable: using a randomly initialized input embedding leads to brittle performance (e.g., 85.8 on SNLI).

Pre-trained CNN layer as word embedding In Section 4, we observed that models using the FastText embedding as input (ELMO-C and ELMO-A) performed worse than ELMo on SRL, a task relying heavily on syntactic information. We suspect that the FastText embedding is weaker at capturing syntactic information than the character-CNN trained in ELMo (Peters et al., 2018b). To verify this, we train ELMO-C using the trained CNN layer from ELMo as the input layer (ELMO-CCNN-CC) or the output embedding (ELMO-CCC-CNN). We observe that the two models exhibit notably better performance on SRL (see Table 3). We also consider an ELMO-CCNN model, which uses the CNN layer as both the input and output embedding. On SRL, ELMO-CCNN performs favorably compared to ELMO-C but slightly worse than ELMO-CCNN-CC or ELMO-CCC-CNN. We suspect that this is because ELMO-CCNN-CC and ELMO-CCC-CNN benefit from having different kinds of embeddings in the input layer and the output layer.

5.2 Computational Efficiency

Next, we study the computational efficiency of the continuous output layer against several baselines from two aspects. First, in Section 3.1, we discussed three factors governing the overall training time of the model: 1) arithmetic complexity, 2) GPU memory consumption, and 3) communication cost. We aim to study how each factor affects the overall training time of each model. Second, in the above experiments, we focused on ELMo with the LSTM as the sequence encoder. We wonder whether the continuous output layer can deliver attractive speedups for sequence encoders of different types and sizes.
We investigate the continuous output layer (CONT) and three common baseline output layers: 1) the subword-level language model (SUBWORD), 2) the adaptive softmax layer (ADAPTIVE), and 3) the sampled softmax layer (SAMPLED). Additionally, we include a variant of the sampled softmax, denoted as FIXED, where the output word embedding is initialized with the FastText embedding and fixed during training. This output layer is similar to a special case of CONT with a ranking loss, where the model encourages its output to be close to the target word embedding but far from a negative sample. In total, we study five different output layers. For several output layers, the trade-off between computational efficiency and model performance is controlled by their hyper-parameters. We choose hyper-parameters close to those reported in the literature to strike a balance between speed and performance.

5.2.1 Speedup Breakdown

We pair the five different output layers with the same input layer (a fixed word embedding) and sequence encoder (ELMo's LSTM with projection). We then test the training speed of these models under three scenarios, which are designed to reflect the individual effects of the arithmetic complexity, the GPU memory consumption, and the communication cost:

• S1 (small batch): We use one GPU card and set the batch size to 1. The asynchronous execution feature of the GPU is disabled. The time needed to finish one batch is reported.

• S2 (large batch): We use one GPU card and the maximal batch size. The time needed to finish training on one million words for each model is reported.

• S3 (multiple GPUs): We use 4 GPU cards and the maximal batch size. The time needed to finish training on one million words for each model is reported.

In Table 4, we report the training speed of the models under each scenario.[7] In addition, we report the parameter size and the maximal batch size on one GPU card. For ADAPTIVE and SAMPLED, the vocabulary size also affects the training speed, so we test them under three different vocabulary sizes:[8] 40K, 800K, and 2,000K.

[7] CONT under S3 is slightly slower than the ELMO-C model reported in Section 4.2. This is because, when training the ELMO-C model reported in Section 4.2, we actually train a forward ELMO-C on two cards and a backward ELMO-C on two other cards, which reduces the communication cost by half. This optimization is only applicable to our approach in the setting of ELMo and does not work for the other baseline methods. In this experiment, we disable it for a fair comparison.

[8] The 2,000K vocabulary is created on the tokenized 250-billion-word Common Crawl corpus (Panchenko et al., 2017) and covers words that appear more than 397 times.

| Output Layer | Vocab | Params | Batch | S1 (small batch) | S2 (large batch) | S3 (multiple GPUs) |
|---|---|---|---|---|---|---|
| CONT | ∞ | 76M | 640 | 0.47s | 115.28s | 34.58s |
| FIXED | ∞ | 76M | 512 | 1.17x | 1.24x | 1.24x |
| SUBWORD | ∞ | 92M | 320 | 1.09x | 1.53x | 1.55x |
| ADAPTIVE | 40K | 97M | 384 | 1.08x | 1.30x | 1.34x |
| ADAPTIVE | 800K | 196M | 256 | 1.16x | 1.47x | 1.89x |
| ADAPTIVE | 2000K | 213M | 192 | 1.25x | 1.82x | 2.49x |
| SAMPLED | 40K | 96M | 512 | 1.07x | 1.18x | 1.30x |
| SAMPLED | 800K | 483M | 256 | 1.15x | 1.35x | 1.91x |
| SAMPLED | 2000K | 1102M | 64 | 1.16x | 2.35x | 16.09x |

Table 4: Statistics on the computational efficiency of different models. For CONT, we report the actual training time in seconds. For the other models, we report the relative training time compared with CONT. Params: number of trainable parameters of the whole model in millions. Batch: maximal batch size per card.

Arithmetic complexity The arithmetic complexity of the models is reflected by the speed under S1, where GPU memory is always abundant and the arithmetic complexity is the dominating factor. CONT holds a mild advantage (1.07x-1.25x) over the baseline models, which is expected because the LSTM layers in ELMo are quite slow, and that undermines the advantage of the continuous output layer.
For ELMO-Sub, the small yet non-negligible softmax layer adds overhead to the arithmetic complexity. FIXED, ADAPTIVE, and SAMPLED have similar arithmetic complexity, but ADAPTIVE has the highest complexity when the vocabulary size is large (e.g., 2,000K).

GPU memory consumption The effect of GPU memory consumption can be observed by comparing the statistics under S1 and S2. The difference between S2 and S1 is that under S2 the parallel computing capacity of the GPU is fully utilized. For CONT, its GPU memory efficiency helps it gain a larger speedup under S2, especially against common baselines such as SUBWORD, ADAPTIVE, and SAMPLED. For ELMO-Sub, in addition to the overhead from the softmax layer, breaking words into subwords leads to longer sequences, which increases the training time by 1.1x; thus it is 1.53x slower than CONT under S2. SAMPLED suffers from its huge parameter size and exhibits poor scalability with respect to the vocabulary size (2.35x slower when the vocabulary size reaches 2,000K).

Communication cost The effect of the communication cost across GPUs can be observed by comparing the statistics under S2 and S3. As the communication cost and the GPU memory consumption are both highly dependent on the parameter size, the observations are similar.

| Output Layer | LSTM | LSTMX2 | TRANS BASE | ELMO | TRANS LARGE | GPT |
|---|---|---|---|---|---|---|
| CONT | 3.97s | 10.42s | 15.87s | 34.58s | 48.55s | 43.53s |
| FIXED | 1.93x | 1.32x | 1.52x | 1.24x | 1.37x | 1.14x |
| SUBWORD | 2.32x | 1.49x | 1.78x | 1.55x | 1.72x | 1.44x |
| ADAPTIVE | 4.58x | 2.20x | 2.62x | 1.89x | 3.28x | 2.33x |
| SAMPLED | 2.50x | 1.60x | 2.91x | 1.91x | OOM | 8.31x |

Table 5: Time needed to finish training on one million words for each model, using 4 GPU cards and the maximal batch size. For CONT, we report the actual training time in seconds. For the other models, we report the relative training time compared with CONT. OOM means that the GPU memory is not sufficient. CONT shows a substantial speedup over common baselines under all scenarios.

5.2.2 The Continuous Output Layer with Different Sequence Encoders

For this experiment, we pair the output layers with different sequence encoders and investigate their training time. We start from a single-layer LSTM with a hidden size of 2048 (LSTM) and a two-layer version (LSTMX2), both reported in Grave et al. (2016); both are smaller than the sequence encoder used in ELMO. We then scale up to the forward and backward Transformer reported in Peters et al. (2018b) (TRANS BASE) and the multi-layer LSTM with projection in ELMO (ELMO). Finally, we test two larger Transformers: TRANS LARGE, a scaled-up version of TRANS BASE, and a uni-directional Transformer (denoted as GPT) with the same size as BERT_BASE (Devlin et al., 2019) and GPT (Radford et al., 2018), respectively. For all models except GPT, the lengths of the input sequences are fixed at 20; for GPT, we use input sequences of length 512, following its original setting. For ADAPTIVE and SAMPLED, we fix the vocabulary size at 800K.
We report the training time of each model using four GPU cards and the maximal batch size (S3) in Table 5. We find that the continuous output layer remains attractive even when the sequence encoder is as large as GPT; in that case, the speedup of CONT over SUBWORD, ADAPTIVE, and SAMPLED is still substantial (1.44x - 8.31x). In addition, we observe that for sequence encoders of the same type, the more complex they get, the less speedup CONT enjoys, which is expected. For instance, from LSTM to LSTMX2, the speedup of CONT decreases noticeably. However, the speedup the continuous output layer brings also depends on the architecture of the sequence encoder. For instance, though TRANS BASE and TRANS LARGE are more complex than LSTMX2, CONT enjoys a larger speedup with these Transformers. Profiling the training process of sequence encoders such as the LSTM and the Transformer on GPU devices is an interesting research topic but is beyond the scope of this study.

6 Conclusion

We introduced an efficient framework for learning contextual representations without the softmax layer. Our experiments with ELMo show that the approach significantly accelerates the training of current models while maintaining competitive performance on various downstream tasks.

Acknowledgments

We wish to thank the anonymous reviewers, the editor, Mark Yatskar, Muhao Chen, Xianda Zhou, and members of the UCLA NLP lab for helpful comments. We also thank Yulia Tsvetkov and Sachin Kumar for help with implementing the continuous output layer, as well as Jieyu Zhao, Kenton Lee, and Nelson Liu for providing reproducible source code for experiments. This work was supported by National Science Foundation grants IIS-1760523 and IIS-1901527.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Yoshua Bengio and Jean-Sébastien Senécal. 2003. Quick training of probabilistic neural nets by importance sampling. In AISTATS.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In ICML.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In ACL.

Shexia He, Zuchao Li, Hai Zhao, and Hongxiao Bai. 2018. Syntax for semantic role labeling, to be, or not to be. In ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.

Ian Jolliffe. 2011. Principal component analysis. In International Encyclopedia of Statistical Science, Springer Berlin Heidelberg.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.

Nikita Kitaev and Dan Klein. 2018. Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760.

Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. In ICLR.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In NAACL-HLT.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In EMNLP.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In ICLR.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2017. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In ICML.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS.

Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. Building a web-scale dependency-parsed corpus from CommonCrawl. arXiv preprint arXiv:1710.01779.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL-HLT.

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In EMNLP.

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword RNNs. In EMNLP.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Blog.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Shuai Tang, Hailin Jin, Chen Fang, Zhaowen Wang, and Virginia de Sa. 2018. Speeding up context-based sentence representation learning with non-autoregressive convolutional decoding. In Workshop on Representation Learning for NLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018.