key: cord-0584088-ihyi3mnj authors: Liška, Adam; Kočiský, Tomáš; Gribovskaya, Elena; Terzi, Tayfun; Sezener, Eren; Agrawal, Devang; d'Autume, Cyprien de Masson; Scholtes, Tim; Zaheer, Manzil; Young, Susannah; Gilsenan-McMahon, Ellen; Austin, Sophia; Blunsom, Phil; Lazaridou, Angeliki title: StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models date: 2022-05-23 journal: nan DOI: nan sha: 750448e5d852ec0a2e4f7f809f16a1470b2b479b doc_id: 584088 cord_uid: ihyi3mnj

Knowledge and language understanding of models evaluated through question answering (QA) has usually been studied on static snapshots of knowledge, like Wikipedia. However, our world is dynamic, evolves over time, and our models' knowledge becomes outdated. To study how semi-parametric QA models and their underlying parametric language models (LMs) adapt to evolving knowledge, we construct a new large-scale dataset, StreamingQA, with human-written and generated questions asked on a given date, to be answered from 14 years of time-stamped news articles. We evaluate our models quarterly as they read new articles not seen in pre-training. We show that parametric models can be updated without full retraining, while avoiding catastrophic forgetting. For semi-parametric models, adding new articles into the search space allows for rapid adaptation; however, models with an outdated underlying LM under-perform those with a retrained LM. For questions about higher-frequency named entities, parametric updates are particularly beneficial. In our dynamic world, the StreamingQA dataset enables a more realistic evaluation of QA models, and our experiments highlight several promising directions for future research.

Question answering (QA) allows us to interrogate models for their language understanding, knowledge, and reasoning abilities, while also being useful in various knowledge-oriented applications such as personal assistants or web search. The questions that people ask span all of our knowledge and can be about any point in history, although they are often about the most recent events of the last few weeks or days. Consider the examples in Table 1, which ask about events as distant as four years earlier or as recent as the day the question was asked. As the world and knowledge evolve, we need our QA models to adapt to new information, to not forget the past, and to maintain an up-to-date world model so that our interaction with such systems remains meaningful. To evaluate and improve models' ability to adapt, we need a dataset with temporal grounding of both questions and knowledge: the dates when questions were asked and the publication dates of documents. As the currently available QA datasets are not suitable for this, we propose a novel dataset and subsequently perform a systematic study of adaptation in state-of-the-art QA models.

Previous research has often focused on answering questions about individual passages or books (Rajpurkar et al., 2016; Kočiský et al., 2018), or about static structured knowledge (Jia et al., 2021) or unstructured knowledge corpora such as Wikipedia (Lee et al., 2019a). More recent work has considered answering questions about knowledge with temporal grounding of facts in a knowledge graph (Saxena et al., 2021; Jia et al., 2021) or in news articles (Wang et al., 2021; Dhingra et al., 2021). We present a new dataset, StreamingQA, that provides temporal context for both the questions and the knowledge required to answer them.
The dataset contains questions written by annotators or generated with a large-scale LM. The questions are answerable from a streaming knowledge corpus of time-stamped English WMT news articles published between 2007 and 2020 (see Figure 1). Having temporal metadata for questions and articles enables us to ingest new knowledge periodically, in a streaming setup, and to evaluate on questions asked during each period. We consider questions about recent and past knowledge separately to measure adaptation and forgetting. Moreover, question dates allow us to ask questions with relative time specifications (e.g., "3 months ago"), which are under-represented in existing QA datasets. Lastly, the news domain, compared to the often-used Wikipedia, provides more realistic challenges for open-book retrieval, with redundant, noisy, and sometimes conflicting information.

Previous work demonstrated that large LMs struggle with temporal generalization, a type of domain shift that occurs when a model at test time needs to understand new knowledge, named entities, and topics (Lazaridou et al., 2021; Röttger & Pierrehumbert, 2021). In this work, we leverage StreamingQA to quantify similar adaptation in existing parametric (closed-book) and semi-parametric (open-book) QA models, which today are frequently based on such LMs. Our findings suggest that parametric adaptation improves QA performance for open-book approaches like RAG (Lewis et al., 2020a) and FiD (Izacard & Grave, 2020) (Section 4.2). Moreover, a more granular, frequency-based analysis (Section 4.3) suggests that parametric and semi-parametric approaches to adaptation are complementary: parametric adaptation improves accuracy on questions about frequent knowledge/named entities, where semi-parametric, dense retrieval-based methods can under-perform. In contrast, semi-parametric adaptation helps with less frequent knowledge, where parametric LMs struggle and where retrieval is less confused by the redundancy and ambiguity of information associated with more frequent names. In the closed-book setup, we find that incremental fine-tuning works reasonably well without causing catastrophic forgetting (Section 4.1), and it also results in substantially lower computational costs than full model retraining. Lastly, we also establish benchmarks for less computationally intensive QA tasks (Section 4.5): one-step adaptation and the usual static open-book QA setup.

In this section, we introduce a new QA dataset and a task to evaluate models' adaptation to and forgetting of knowledge. We require, in addition to questions, temporal metadata: a question date (when the question could have been asked) and a knowledge date (when an article that answers this question was published). Using this metadata enables us to evaluate how well the model understands new knowledge that becomes available incrementally at evaluation time (see Figure 1). We also use the timestamps to split the training and evaluation sets into non-overlapping historical periods. See Table 2 for examples of questions.
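To make this temporal metadata concrete, the sketch below shows one way a StreamingQA example and a corpus document could be represented; the field names are our own illustration, not the dataset's actual schema.

from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class NewsDocument:
    # A time-stamped article from the knowledge corpus.
    publication_date: date   # the knowledge date
    text: str

@dataclass
class StreamingQAExample:
    # A question grounded at a point in time.
    question_date: date      # when the question could have been asked
    question: str
    answers: List[str]       # one or more reference answers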
To construct the StreamingQA dataset, we consider 14 years (2007-2020) of English WMT news articles (Akhbardeh et al., 2021), together with their publication dates, as our knowledge corpus (approx. 11M articles). Specifically, given an article, we first generate a question date, that is, the date when we want the question to be asked (Section 2.2), and subsequently either (i) automatically generate questions (Section 2.3), or (ii) ask human annotators to write questions (Section 2.4). Lastly, to reduce noise, we apply automatic and human filtering to the questions, and collect additional reference answers (Section 2.5). We present additional statistics in Section 2.6.

We consider a streaming task (Section 2.1): we split questions into four quarterly sets over 2020 based on their question dates, where questions in each quarter are to be answered from articles published up to and including that quarter. For the adaptation and forgetting analysis, we further split the quarterly evaluation sets into the recent subset and the past subset. The dataset is constructed to have an approximately equal number of each. As recent questions cover 2020 and past questions cover all history uniformly, the overall distribution is biased towards the present, as we would expect in an actual QA system. The recent subset asks about articles published within roughly one month before the question date, while the past subset asks about articles from the full history up to the question date.

Formally, we consider a set of questions with question dates and reference answers, Q = {(d_q,i, q_i, a_i)}, and a knowledge corpus of publication dates and documents, C = {(d_c,j, c_j)}. For a given time period t = [t_s, t_e] (e.g., January to March 2020), we consider the questions asked during that period, Q_=t, about the corresponding subset of the knowledge corpus published until then, C_≤t = {(d_c,j, c_j) ∈ C : d_c,j ≤ t_e}. To answer a question in Q_=t, we generate an answer using the corresponding knowledge corpus, p(a_i | q_i, d_q,i, C_≤t).

We need to generate question dates that are plausible with respect to article dates and make sure that the dates and events are consistent. For the evaluation sets, to create recent subset questions we sample a document with a publication date d_c ∼ U[Dec 2019, Dec 2020] and a question date d_q ∼ U[d_c, d_c + 30 days]. To create past subset questions, we sample a document with a publication date d_c ∼ U[2007, Dec 2020] and a question date d_q ∼ U[max(d_c, Jan 2020), max(d_c, Jan 2020) + 365 days]. The articles are thus distributed uniformly within a month, or across all available history prior to the question date (we filter out samples with a question date not in 2020), for the two subsets respectively. For the training and validation sets, we consider question dates in [2007, Dec 2019] and aim for a similar article distribution given a question date.

We use automatic question generation as a scalable way to obtain questions grounded at different points in time. Questions are generated through few-shot prompting of a large LM (Rae et al., 2021), given an evidence document and a target answer drawn from named entities, dates, and prepositional phrases contained as spans in the document. A challenge with creating questions for open-domain QA is that they need to be specific enough when considered in the context of all articles in the knowledge corpus. We first over-generate question-answer pairs and then apply heuristic filters to eliminate trivial and/or low-quality candidates (Appendix A.4). For questions included in the past subset, we append absolute or relative time specifications to the question text, e.g., "3 months ago" or "in May 2017" (Appendix A.5), unless the text already contains such a specification or the answer is a date.
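The following is a minimal sketch of the date sampling and streaming evaluation just described (restricting the corpus to C_≤t and selecting the questions Q_=t for a quarter). The helper names and the dict-based question representation are our own simplification, not the authors' code.

import random
from datetime import date, timedelta

def random_date(start: date, end: date) -> date:
    # Uniformly sample a date in [start, end].
    return start + timedelta(days=random.randint(0, (end - start).days))

def sample_recent_question_date(pub_date: date) -> date:
    # Recent subset: the question is asked within 30 days of the article.
    return random_date(pub_date, pub_date + timedelta(days=30))

def sample_past_question_date(pub_date: date) -> date:
    # Past subset: the question is asked within a year of max(d_c, Jan 2020).
    anchor = max(pub_date, date(2020, 1, 1))
    return random_date(anchor, anchor + timedelta(days=365))

def corpus_up_to(corpus, t_end: date):
    # C_<=t: articles published up to the end of the evaluation period.
    return [(d_c, doc) for (d_c, doc) in corpus if d_c <= t_end]

def questions_in_period(questions, t_start: date, t_end: date):
    # Q_=t: questions asked during the evaluation period.
    return [q for q in questions if t_start <= q["question_date"] <= t_end]

# Quarterly evaluation over 2020: answer Q_=t from C_<=t for each quarter.
quarters_2020 = [
    (date(2020, 1, 1), date(2020, 3, 31)),
    (date(2020, 4, 1), date(2020, 6, 30)),
    (date(2020, 7, 1), date(2020, 9, 30)),
    (date(2020, 10, 1), date(2020, 12, 31)),
]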
Human annotators were asked to write questions about a news article provided together with its publication date and the desired question date. We chose annotators whose first language is English, who are based in the US or the UK, and who have a university education. Each annotator was asked to write up to five questions and answers about the details of the events described in the article, and to frame these questions as if they were asking another person. Each participant created about 15 questions on average. We explicitly asked the annotators to include enough context to make the questions as unambiguous as possible for the open-book setup. The full details of our study design, including compensation rates, were reviewed by DeepMind's independent ethical review committee. All participants provided informed consent prior to completing tasks and were reimbursed for their time. It is our policy that researchers must pay workers/participants at least the living wage for their location.

We filtered both generated and human-written questions for quality in two stages. First, we asked annotators to label questions as good or bad, similar to Kwiatkowski et al. (2019), keeping only factual, unambiguous, grammatical questions. To judge ambiguity, we additionally provided the question date and asked annotators not to assume a particular location (e.g., the US). To include a question, we require 3 annotators to agree. Secondly, we asked annotators to answer each question given the original passage, its publication date, and the question date. Annotators first selected the parts of the passage that supported the answer, and then wrote a short answer in their own words. We did not require the answers to be sub-strings of the passage. We only kept questions where annotators could provide answers, and obtained additional references in the same way.

The StreamingQA dataset contains about 28k generated questions and about 8.8k human-written questions for evaluation, and 100k and 10k questions for training and validation, respectively. See Table 3 for details. In Figure 2, we see that human-written questions and answers tend to be somewhat longer. Based on the distribution of first question words, we have a diverse set of both human-written and generated questions, with the latter slightly biased towards "Which"/"Where"/"When", likely due to the prompting of the large LM. Many written questions start with "In", which often reflects annotators providing temporal context so that questions stand on their own without the article in the open-book setting. Examining the answer types on the written evaluation sets by automatic labeling, we have about 46.9%, 40.6%, and 12.4% of named entity, phrase, and date answers, respectively (see Appendix A.2). About 6% of evaluation reference answers are seen among answers to questions in the training set, and 23% of training questions have an answer contained in the evaluation reference answers.

To evaluate how well parametric and semi-parametric approaches to QA ingest and understand unseen information, we consider an auto-regressive, left-to-right Transformer-XL (TXL) language model as our parametric, closed-book QA model (CB), and use it as the underlying LM for our RAG-style (Lewis et al., 2020b) semi-parametric, open-book QA model (OB). We also consider a more recent open-book model based on Fusion-in-Decoder (FiD) that uses a T5 (Raffel et al., 2020) sequence-to-sequence model. We use the standard metrics for QA: F1 and exact match (EM), after normalizing the answers in the same way as Rajpurkar et al. (2016). These are suitable for the, on average, short answers in our dataset.
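As a reference point, here is a common reimplementation of the SQuAD-style normalization, exact match, and token-level F1 (Rajpurkar et al., 2016); it is a sketch of the standard metrics, not code from the StreamingQA release, and taking the maximum over several reference answers is the usual convention rather than something stated here.

import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    # Lowercase, strip punctuation and articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def max_over_references(metric, prediction: str, references) -> float:
    return max(metric(prediction, ref) for ref in references)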
[article text]"), and for questions we add question dates ("Today is Wednesday, May 6, 2020. [question text]"). We consider three setups of TXL training: TXL STALE model is trained on WMT articles until the end of December 2019 (approx. 10.1M articles), and is missing knowledge required for answering questions in the recent subset. TXL RETR. model is re-trained from scratch on all WMT articles until the end of December 2020, i.e., including the evaluation period (approx. 11.4M articles). Lastly, TXL FT model is TXL STALE that we additionally iteratively fine-tune 8 on the 2020 monthly article sets (each approx. 100k articles). TXL STALE and TXL RETR. are trained for 200k steps on 32 TPUv3, whereas fine-tuning is performed for 10k steps. Our TXL has 18 layers and 1,280 hidden units, resulting in 448M parameters, roughly 30% larger than GPT2-Medium S ta le 2 0 2 0 -0 3 2 0 2 0 -0 6 2 0 2 0 -0 9 2 0 2 0 -1 2 R e tr a in e d Figure 3 . Left: F1 score on the whole evaluation dataset of CB+STALE, CB+RETR., and CB+FT fine-tuned on articles published until the specified cut-off dates. Right: The effect of a temporal lag between the final training month of CB+FT and question dates for generated questions, relative to CB+RETR.. and BERT-large. We set the TXL sequence length to 1,024, and the memory cache length to 384 during model pretraining, and use a SentencePiece vocabulary of 50,259 tokens (Kudo & Richardson, 2018) . Analogously, for T5 base, we fine-tune the pre-trained vanilla T5 model (Raffel et al., 2020) for 300k steps on WMT articles until the end of 2019, and subsequently iteratively fine-tune for 4k steps on the 2020 monthly splits. The retrained version is fine-tuned for 300k steps on articles until the end of 2020 starting from the vanilla T5 checkpoint. We use the closed-book QA task to examine a language model's knowledge. The task is to answer questions without any additional context provided, p(a i |q i ). We fine-tune each of the pre-trained TXL LMs (stale, fine-tuned, and fully retrained) for question answering on the StreamingQA training set, using 4 TPUv3, and select the best checkpoint, as measured by F1, on the validation set; both sets contain knowledge and questions asked in or before 2019. The QA models, CB +STALE , CB +FT , and CB +RETR. , are then subsequently evaluated on the evaluation sets from 2020. Answers are sampled using greedy decoding. In this task, we answer questions given a knowledge corpus of articles, p(a i |d q,i , q i , C ≤t ). We use WMT news articles, sliced into 6-sentence chunks as our knowledge corpus, resulting in 42.1M (up to 2019) and 47.6M (up to 2020) passages. The OB model is a variation of the Retrieval Augmented Generation model (RAG-sequence; Lewis et al. (2020b) ) with a TXL-based generator, the same LMs as for our closed-book experiments. As a retriever we use Dense Passage Retrieval (DPR; Karpukhin et al. (2020) ), trained on question/passage pairs from our training set (including the question and publication dates), with embedding size of 768. We retrieve 20 passages. We also consider the Fusionin-Decoder model (FID; Izacard & Grave (2020)), which was shown to outperform RAG on a number of QA tasks, with the same pre-trained DPR retriever as OB but with 64 retrieved passages. In contrast to RAG, the FID's generator attends to all retrieved passages at the same time instead of individually. 
In this section we analyse the performance of the closed-book and open-book models on StreamingQA. We have three closed-book models: CB+STALE, the iteratively fine-tuned CB+FT, and the retrained CB+RETR. In order to adapt the open-book models to new information from 2020, we always include new articles in the search index and then either keep the stale generator (OB+IU, FID+IU), fine-tune the generator on new articles (OB+IU+FT, FID+IU+FT), or use a retrained generator (OB+IU+RETR., FID+IU+RETR.).

Iterative LM fine-tuning improves performance on StreamingQA but lags retraining. We consider all questions in our evaluation sets (recent+past) and evaluate the CB models in Figure 3 (left). First, we observe that CB+STALE is outperformed by each of the CB+FT models: fine-tuning is able to incorporate new information for the half of the questions that are only answerable from 2020 documents. With each additional month of documents, the CB+FT models perform better for all answer types (named entities, phrases, dates; Appendix B.1). The improved performance is not simply due to more data, but is driven by better accuracy on the recent subset, while performance on the past subset remains mostly unchanged (Appendix B.1). Secondly, we observe that CB+RETR. outperforms or is on par with all other models, and so the vanilla adaptation that we consider for CB+FT should be improved to bridge the gap from fine-tuning to retraining.

Adaptation and forgetting. We use question dates to split Eval-Generated and Eval-Written into quarterly sets. To understand how adaptation to new information is offset by forgetting of past information, we investigate the effect of the temporal lag between the question date and the end date of the knowledge in the underlying LM. Note that the question date and the knowledge date are on average much closer for the recent subset (a few weeks) than for the past subset (years), and so adaptation to new articles is more crucial for recent subset performance. When the lag is negative, the model's knowledge ends before the question date, and the model may therefore lack the necessary information (for example, an answer from a model trained until March 2020 to the question "What does Donald Trump, US president, call his 2020 plan to expedite the development of a COVID-19 vaccine?", asked on May 30, 2020, is bucketed into -1Q). When the lag is positive, the questions are in the past with respect to the most recent information in the model, and in these settings some previous information needed to answer these questions might have been overwritten; forgetting may occur. We aggregate the model answers for each lag, and plot the corresponding F1 relative to that of CB+RETR. in Figure 3 (right). As we fine-tune and the lag between the model knowledge and the question month increases, performance on the past subset slightly deteriorates until we under-perform CB+RETR. by about 5%. On the recent subset, performance first improves significantly, and then, as we pass the question quarter and continue fine-tuning on further data, we start seeing minor forgetting. Similar conclusions hold for the written questions (Appendix B.1).
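A small sketch of the lag bookkeeping used in this analysis: the lag, in quarters, between the end of the model's training data and the question date, so that a model trained through March 2020 has a lag of -1Q for a question asked in May 2020. The helpers below are our own illustration of that bucketing.

from collections import defaultdict
from datetime import date

def quarter_index(d: date) -> int:
    # Quarters counted from an arbitrary origin.
    return d.year * 4 + (d.month - 1) // 3

def lag_in_quarters(model_knowledge_end: date, question_date: date) -> int:
    # Negative: the model has not yet seen the question's period.
    # Positive: the question lies in the model's past (forgetting may matter).
    return quarter_index(model_knowledge_end) - quarter_index(question_date)

# The example from the text: trained through March 2020, asked May 30, 2020.
assert lag_in_quarters(date(2020, 3, 31), date(2020, 5, 30)) == -1

def aggregate_f1_by_lag(records):
    # records: iterable of (model_knowledge_end, question_date, f1).
    buckets = defaultdict(list)
    for model_end, q_date, f1 in records:
        buckets[lag_in_quarters(model_end, q_date)].append(f1)
    return {lag: sum(scores) / len(scores) for lag, scores in buckets.items()}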
Open-book QA: adaptation and forgetting. Similarly to the closed-book experiment, we examine adaptation and forgetting by aggregating model answers by the temporal lag between the evaluation set and the end of the model knowledge. We consistently observe that on the recent subset of both generated and written questions (Figure 4), the open-book models (OB, FID) have a steep adaptation rate (from -1Q to 0Q) for all model variants, including just adding new articles into the search index without LM fine-tuning (OB+IU, FID+IU). For all of the models, we see almost no forgetting on the recent subsets. In Figure 5, for generated past questions, there is no forgetting (see Appendix B.2 for written questions). Note that a small fraction of questions from the past subset reference articles from 2020, so seeing 2020 knowledge slightly helps compared to a lagging model.

Figure 6. QA performance given question (left) or answer (right) named entity frequency quartiles.

Updating the generator on top of the index update helps further (FID+IU+FT, OB+IU+RETR., FID+IU+RETR., where both the index and the generator are updated). We observe this in Figure 4, for lags of 0Q, 1Q, 2Q, and 3Q, where the models have the required knowledge for answering. We see improvements for FID+IU+FT (vs FID+IU) on the recent subset of generated and written questions, and on the past subset (Figure 5) at 0Q, followed by minor forgetting. Moreover, retraining the generator improves performance for all models and all subsets. For the FID/T5 models, we see FID+IU+FT performing somewhat better than FID+IU+RETR. on the recent generated questions and worse on the past subset, suggesting that fine-tuning on the recent data improved performance on the corresponding knowledge. Section 4.3 explores why fine-tuning the generator helps.

Seeing above that fine-tuning or retraining the generator helps, we want to understand for which questions an updated generator is particularly important compared to a stale one. Lazaridou et al. (2021) previously demonstrated that one driving factor behind deteriorating temporal LM performance is the changing frequency of words, particularly named entities. We analyze QA performance by the frequency of named entities appearing in (a) questions and (b) answers, computed over the knowledge corpus up to 2019, and in 2020 only, respectively. First, closed-book performance is substantially better for questions that contain frequent named entities (see Appendix A.3 for examples): F1 is higher by an absolute 10% (Figure 6, left), likely because higher-frequency named entities in the knowledge corpus provide a stronger learning signal for the parametric model. Second, open-book performance does not show a strong dependency on the frequency in the question, but this is due to two offsetting factors: DPR retrieval recall becomes worse with increasing frequency (recall@1 decreases by an absolute 5%), while the generator performance improves. Therefore, at lower frequencies better performance is driven by non-parametric adaptation through the updated search space, and at higher frequencies by parametric adaptation. Figure 6 (right) shows performance as a function of answer named entity frequencies in 2020: CB+RETR. outperforms CB+STALE by 10-15% for more frequent answers, suggesting that one reason for the better performance of updated generators is more accurate modeling of word frequencies in 2020.
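A sketch of the frequency breakdown described above: count named-entity occurrences in the corpus, assign each question to a frequency quartile, and average F1 per quartile. The NE extraction is assumed to come from an external NER step, and taking the maximum frequency over a question's entities is our simplification rather than a detail given in the paper.

from collections import Counter
import numpy as np

def entity_frequencies(corpus_entities):
    # corpus_entities: iterable of named-entity strings found in the corpus.
    return Counter(corpus_entities)

def mean_f1_by_frequency_quartile(questions, freqs):
    # questions: list of (question_entities, f1) pairs.
    q_freqs = [max((freqs[e] for e in ents), default=0) for ents, _ in questions]
    edges = np.quantile(q_freqs, [0.25, 0.5, 0.75])
    buckets = {i: [] for i in range(4)}
    for (ents, f1), f in zip(questions, q_freqs):
        buckets[int(np.searchsorted(edges, f, side="right"))].append(f1)
    return {q: float(np.mean(v)) if v else None for q, v in buckets.items()}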
Figure 7. Temporal accuracy, measured as the gold passage timestamp (ts) minus the retrieved passage ts, in days, of the DPR trained on news, for generated questions (recent vs past), with and without temporal information.

Temporal retrieval. Using dates improves DPR performance for recent questions. In Figure 7, the median temporal difference between retrieved and gold articles is 41 and 1101 days for DPR with and without dates, respectively. Improved temporal accuracy translates into better recall overall: recall@20 for the recent generated questions is 57% and 43% for DPR with and without dates, respectively. For past questions we do not see improvements, which suggests that the model may incorrectly interpret the two time specifications, i.e., the prepended question date and the absolute or relative time specification in the question text. See Appendix B.3 for more.

Time specification in questions. Generated questions in the past subset may contain an absolute or relative time specification in the question text, and we generally find that the open-book models perform best on questions without a time specification, followed by absolute and then relative specifications. For example, for FID+IU+FT, F1 is 0.711, 0.469, and 0.359, respectively, and 0.441 overall.

Static and updated questions. In a preliminary analysis of "static" questions (about knowledge that will likely not change) and "updated" questions (about knowledge that might change), we observed that the open-book models generally performed worse on "updated" questions (e.g., 0.435 vs 0.494 F1 for FID+IU+FT on past, generated). The evaluation sets have 6.7%-11.2% of likely static questions based on majority agreement. The recall of retrieved documents is better for the static questions.

The StreamingQA dataset allows us to consider two further tasks, and we provide benchmarks to encourage research in these directions: a one-step streaming setting (reported using the model fine-tuned iteratively on 12 months of news articles) in Figure 8 (solid bars), and the usual static open-book QA setup (diagonal-line bars), both evaluated on all 2020 questions. There is still a large gap to human performance; moreover, the dataset poses challenges for retrieval and for reading comprehension of news articles (compare models with gold evidence versus retrieved evidence; cross-pattern bars). For the human benchmark we collected a fourth annotator answer. See Appendix B.4 for EM and a table with all metrics.

QA datasets. We summarize previous QA work on understanding of knowledge with temporal context in Section 1 and provide a dataset comparison table in Appendix A.1.

Question generation for QA. Automatic question generation trained with supervision has been explored in QA for data augmentation (Dong et al., 2019; Sultan et al., 2020), or as a way to enrich knowledge bases for QA-pair retriever models. Here we instead leverage the few-shot generation capabilities of large LMs (Rae et al., 2021; Brown et al., 2020) to generate questions, and we use them for both training and evaluation.

Open-domain QA. Progress in neural information retrieval (Karpukhin et al., 2020; Lee et al., 2019b) enables open-domain QA models that are trained end-to-end, as both the retriever and the reader are differentiable (Guu et al., 2020; Lewis et al., 2020b; Izacard & Grave, 2020; Sachan et al., 2021). Recent work in the domain has focused on improving performance by combining information from multiple documents efficiently (Sachan et al., 2021; Izacard & Grave, 2020) and on performance analysis of dense retrievers, for instance when dealing with named entities (Sciavolino et al., 2021; Liu et al., 2021).

Continual learning and distribution shift in LMs and downstream tasks. Continual learning in language is a long-standing research topic (Carlson et al., 2010; Parisi et al., 2018) that has recently seen an increase in interest.
Lazaridou et al. (2021) show that the performance of Transformer-XL deteriorates when it is evaluated on data published after the training period, and use dynamic evaluation (Krause et al., 2017) to partially make up for this degradation. Lazaridou et al. (2021) and Hu et al. (2020) release large-scale benchmarks for studying temporal adaptation in the language modeling task. Jang et al. (2021) propose new metrics for knowledge updates and establish strong baselines. In contrast, we focus on studying adaptation in the downstream task of question answering: we demonstrate that deterioration in perplexity translates into worse downstream performance and that adaptation through unsupervised fine-tuning or access to retrieval improves QA performance. Röttger & Pierrehumbert (2021) study temporal adaptation of BERT models for a classification task and find that unsupervised temporal adaptation does not help downstream performance as much, and that task-specific temporal adaptation is needed. Amba Hombaiah et al. (2021) propose new incremental methods for online BERT training using vocabulary expansion. In the context of semi-parametric models, Khandelwal et al. (2020) and Lewis et al. (2020a) describe flexible approaches to adaptation through updating the information in the retrieval component.

In order to enable a more realistic evaluation of QA models, we introduced the first QA dataset and task for studying adaptation to new information over time in open- and closed-book settings, with temporally non-overlapping training and evaluation sets. As language models grow bigger, the cost of keeping them up-to-date increases, and therefore the adaptation ability of the models becomes more important. Our experimental results show that open-book QA models allow for fast and flexible adaptation through adding new articles into the search space, with fine-tuning or retraining of the generator generally further improving performance. The ability to inject new knowledge through the search space depends on retrieval accuracy, and more up-to-date parametric LMs are capable of compensating for retrieval errors. Additionally, our results show that iteratively fine-tuning the generator of the FID QA model improves performance and that costly retraining from scratch is not necessary. We leave it for future work to better understand and close the performance gap between retrained and stale generators.

Future work. StreamingQA highlights challenges of temporal reasoning and invites further research in this area: the past subset contains questions with relative time specifications, where retrieval struggles to extract relevant passages. For fine-tuning, we consider a vanilla setup without delving deeply into more sophisticated continual learning approaches. Future work should take an in-depth look at how best to adapt QA models, and at the question of what to compress into weights versus what to add to the search space. While we study adaptation to new knowledge, retrieving conflicting information due to updated knowledge (e.g., "How many seasons are in Game of Thrones?") is another important direction we did not tackle here.

Toxic content is a concern in both human-created and automatically generated content. We provide a discussion and describe our filtering of such content here. Our setup poses particular challenges as our questions and answers are based on news. First, answers in the dataset follow the information in the articles regardless of the factual basis of those articles.
While most of the news articles in WMT are from reputable news sources, news in general can contain content that may be considered toxic, such as graphic descriptions of crimes, or certain quotes or opinions. Second, as some questions and answers are generated using a large language model, there is a risk that it may generate toxic content; however, we note that the generation process is constrained by conditioning on the article and a substring answer, and by the subsequent automatic filtering. Third, our dataset is intended to evaluate the adaptation of models to new information in news over time, and it may therefore not be applicable to settings where the assumptions we made do not apply.

We aimed to create a balanced process that identifies most of the toxic content while decreasing the risk of removing false positives. To identify toxic content, we used the Perspective API, which provides classifiers for several categories of toxic content (identity attack, insult, threat, profanity, sexually explicit, severe toxicity). We decided to use the specific classifiers instead of the generic toxicity classifier because our initial annotations indicated that the specific classifiers perform better. Removing content needs to be done with care, as these classifiers do produce false positives (e.g., people, [Republic of] Niger, shoot, death, abuse, balls [in sports], and [last] names which bear phonetic similarity to insults), and removing too many such examples may cause harm by decreasing the representation of some groups (e.g., Black, Muslim, Jewish, and LGBTQIA+ minorities).

Through manual annotation of the questions with the highest toxicity scores, we determined thresholds for removing questions as follows: for each 0.05 band of scores from 1 to 0 (e.g., [1.0, 0.95], [0.95, 0.90], ...), we remove the questions in each band until two subsequent bands contain fewer than 30% toxic questions (as judged by two annotators on a sample of 50 per band). The first of these two subsequent bands is also removed. We annotated more than 5.5k examples throughout this process. As the annotation judgements for filtering were made by a small group of annotators, we cannot claim that the annotation had perfect representativity, nor that the annotators had full cultural context from all possible views. For our manual annotation, we adapted the Perspective API classifier definitions in a minor way (see Appendix A.6). This filtering resulted in removing about 0.57% of questions (0.60%, 0.61%, 0.43%, and 0.65% from Train, Valid, Eval-Generated, and Eval-Written, respectively), with thresholds of 0.75 for identity attack, 0.80 for insult, 0.65 for profanity, 0.55 for severe toxicity, 0.85 for sexually explicit, and 0.90 for threat. Subsequently, we estimated that 0.5% of toxic questions remain (based on a sample of 1k questions). We provide the automatic toxicity scores as part of the data release. This approach was formed with input from DeepMind's ethics and safety teams, and with guidance from our multidisciplinary leadership group, which advises on societal impacts associated with research.
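A sketch of the band-based threshold selection just described: walk down the 0.05-wide score bands from 1.0 and stop once two consecutive bands are judged less than 30% toxic, removing everything above that point (including the first of the two low-toxicity bands). The per-band toxicity rates come from human judgements in the actual procedure; the numbers below are purely illustrative.

def select_removal_threshold(band_toxicity_rates, band_width=0.05, stop_rate=0.30):
    # band_toxicity_rates: fraction of toxic questions judged in each band,
    # ordered from the highest-score band [1.0, 0.95) downwards.
    # Returns the score above which questions are removed.
    for i in range(len(band_toxicity_rates) - 1):
        if band_toxicity_rates[i] < stop_rate and band_toxicity_rates[i + 1] < stop_rate:
            # The first of the two low-toxicity bands is still removed,
            # so the threshold sits at that band's lower edge.
            return round(1.0 - (i + 1) * band_width, 2)
    return 0.0  # every band exceeded the stop rate

# Illustrative per-band rates; here the threshold comes out at 0.80.
rates = [0.90, 0.70, 0.50, 0.28, 0.22, 0.10]
threshold = select_removal_threshold(rates)
assert threshold == 0.80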
Figure 9. Answer type proportions in the StreamingQA evaluation sets.

A.1. Comparison of StreamingQA with other related datasets. In Table 4, we present a comparison of StreamingQA with other related datasets, including the KG-based datasets of Jia et al. (2018) and TimeQuestions (Jia et al., 2021), and TORQUE (Ning et al., 2020). We provide further detail on the answer types in our evaluation sets in Figure 9.

Figure 10. Question and article publication date distribution for Eval-Written; Eval-Generated is similar.

See Table 5 for examples of questions with high- and low-frequency named entities.

In order to remove trivial and/or low-quality questions, we apply the following filters: (i) we remove questions that contain their answer as a sub-span; (ii) we few-shot prompt a large LM for QA and ensure that it generates the original target answer given the evidence document and the generated question, exactly for named entity or date answers, and with 40% word overlap for phrases; (iii) we additionally perform a Google Search via the Google Search API with the question text and evidence publication date as the query, and keep only questions for which the answer is present in the top 10 search results; and (iv) for phrase-answer questions, we only keep questions that contain a named entity, hence eliminating questions that are too generic.

For all questions we include an absolute or a relative time specification, choosing randomly between the two. For the absolute time specification we include the month and the year of the article's publication date, for example, "in May 2017". For the relative time specification, we compute the difference between the question date and the article publication date, and include no specification (< 7 days), "a week ago" (< 14 days), "N weeks ago" (≤ 8 weeks), "N months ago" (< 2 years), or "N years ago" otherwise.

Figure 11. F1 score on the whole dataset of models fine-tuned on data until different cut-off dates, by answer type.

We have adapted the Perspective API definitions of the classifiers to better fit our domain for the manual annotation used to determine the filtering thresholds.

• IDENTITY ATTACK: Negative, discriminatory, stereotyping, or hateful against a group of people based on criteria including (but not limited to) race or ethnicity, religion, gender, nationality or citizenship, disability, age, or sexual orientation. In addition, we also flag a QA if it unnecessarily strengthens negative, discriminatory, stereotyping, or hateful representations of minorities, or if the implication of the QA could be viewed as potentially negative, discriminatory, stereotyping, or hateful and the veracity is questionable, i.e., we would need to fact-check.
• INSULT: Inflammatory, insulting, or negative language towards a person or a group of people. Not necessarily identity-specific.
• PROFANITY: Swear words, curse words, or other obscene or profane language.
• SEVERE TOXICITY: A very hateful, aggressive, disrespectful comment, or one otherwise very likely to make a user leave a discussion or give up on sharing their perspective. This attribute is much less sensitive to milder forms of toxicity, such as comments that include positive uses of curse words.
• SEXUALLY EXPLICIT: Contains references to lewd content. References sexual acts or body parts that are unnecessarily graphic or detailed.
• THREAT: Language that is threatening or encouraging violence or harm, including self-harm. Language that is unnecessarily graphic or detailed when reporting on a violent incident.

We present the F1 scores broken down by answer type in Figure 11, and the effect of the temporal lag between model knowledge and question dates for written questions in Figure 12.
Perplexity vs closed-book QA performance. An interesting point of comparison is between closed-book QA performance and the perplexity of the underlying LM on test documents. As TXL_FT is fine-tuned on more months, we expect its perplexity on evidence documents of the recent subset to decrease, while its perplexity on evidence documents of the past subset should either stay the same (in the optimal scenario) or increase, if the model forgets. Figure 13 shows these two effects.

Figure 13. Relationship between adaptation and forgetting of the LM (solid) and the CB QA models (dashed), in terms of normalized perplexity and normalized F1. Red lines show fine-tuned vs stale adaptation/improvements to recent articles on the recent subset. Green lines show fine-tuned vs retrained forgetting of past articles on the past subset. LM forgetting is expressed as TXL_STALE perplexity / TXL_FT perplexity and F1 deterioration as CB+FT F1 / CB+STALE F1, whereas LM adaptation is expressed as TXL_RETR. perplexity / TXL_FT perplexity and F1 improvement as CB+FT F1 / CB+RETR. F1.

Figure 15. Negative log likelihood of masked span prediction for evaluation documents of the T5. We show the vanilla T5, the T5 fine-tuned on WMT up to 2019 ("Stale"), the monthly fine-tuned T5, and the T5 trained on all WMT up to 2020. We see forgetting on the past subset, and a slight recency bias on the recent subset, compared to retraining.

We show adaptation and forgetting on past written questions in Figure 14, metrics for all subsets in Table 6, and the masked span prediction performance for evaluation documents of the T5 model in Figure 15. Figure 16 shows the temporal distribution of gold and retrieved passages for recent questions in Q4 2020: DPR with timestamps matches the temporal distribution of the gold passages much more closely. Interestingly, the model seems to get somewhat confused about the year: the second, lower spike is in Q4 2019. We provide the EM in Figure 17 and all the metrics in Table 7.

References
Findings of the 2021 Conference on Machine Translation (WMT21)
Synthetic QA corpora generation with roundtrip consistency
Dynamic language models for continuously evolving content
Toward an architecture for never-ending language learning
A dataset for answering time-sensitive questions
Transformer-XL: Attentive language models beyond a fixed-length context
Time-aware language models as temporal knowledge bases
Unified language model pre-training for natural language understanding and generation
REALM: Retrieval-augmented language model pre-training
Continual learning with web-scale natural language
Leveraging passage retrieval with generative models for open domain question answering
Towards continual knowledge learning of language models
Complex temporal question answering on knowledge graphs
Dense passage retrieval for open-domain question answering
Generalization through memorization: Nearest neighbor language models
The NarrativeQA reading comprehension challenge
Dynamic evaluation of neural sequence models
A simple and language independent subword tokenizer and detokenizer for neural text processing
Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics
Mind the gap: Assessing temporal generalization in neural language models
Latent retrieval for weakly supervised open domain question answering
Retrieval-augmented generation for knowledge-intensive NLP tasks
PAQ: 65 million probably-asked questions and what you can do with them
Challenges in generalization in open domain question answering
TORQUE: A reading comprehension dataset of temporal ordering questions
Continual lifelong learning with neural networks: A review
Scaling language models: Methods, analysis & insights from training Gopher
Exploring the limits of transfer learning with a unified text-to-text transformer
SQuAD: 100,000+ questions for machine comprehension of text
Temporal adaptation of BERT and performance on downstream document classification: Insights from social media
End-to-end training of multi-document reader and retriever for open-domain question answering
Question answering over temporal knowledge graphs

We thank our human annotators for helping create a large part of the dataset. We also thank John Aslanides and Kevin McKee for advice on the initial setup of the human annotation data collection. We would particularly like to thank Boxi Wu for her support in helping to organize panels with experts to advise on our toxicity filtering approach. Lastly, we would like to acknowledge Dani Yogatama and Lisa Anne Hendricks, who acted as our internal reviewers, and to thank Chris Dyer for his continuous input.