title: What's New? Summarizing Contributions in Scientific Literature
authors: Hayashi, Hiroaki; Kryściński, Wojciech; McCann, Bryan; Rajani, Nazneen; Xiong, Caiming
date: 2020-11-06

With thousands of academic articles shared on a daily basis, it has become increasingly difficult to keep up with the latest scientific findings. To overcome this problem, we introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work, making it easier to identify the key findings shared in articles. For this purpose, we extend the S2ORC corpus of academic articles, which spans a diverse set of domains ranging from economics to psychology, by adding disentangled "contribution" and "context" reference labels. Together with the dataset, we introduce and analyze three baseline approaches: 1) a unified model controlled by input code prefixes, 2) a model with separate generation heads specialized in generating the disentangled outputs, and 3) a training strategy that guides the model using additional supervision coming from inbound and outbound citations. We also propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs. Through a human study involving expert annotators, we show that in 79% of cases our new task is considered more helpful than traditional scientific paper summarization.

With the growing popularity of open-access academic article repositories, such as arXiv or bioRxiv, disseminating new research findings has become nearly effortless. Through such services, tens of thousands of scientific papers are shared by the research community every month. At the same time, the unreviewed nature of the mentioned repositories and the sheer volume of new publications have made it nearly impossible to identify relevant work and keep up with the latest findings. Scientific paper summarization, a subtask within automatic text summarization, aims to assist researchers in their work by automatically condensing articles into a short, human-readable form that contains only the most essential information. In recent years, abstractive summarization, an approach where models are trained to generate fluent summaries by paraphrasing the source article, has seen impressive progress. State-of-the-art methods leverage large, pre-trained models (Raffel et al., 2019; Lewis et al., 2020), define task-specific pre-training strategies, and scale to long input sequences (Zhao et al., 2020; Zaheer et al., 2020). Available large-scale benchmark datasets, such as arXiv and PubMed (Cohan et al., 2018), were automatically collected from online archives and repurpose paper abstracts as reference summaries. However, the current form of scientific paper summarization, where models are trained to generate paper abstracts, has two caveats: 1) abstracts often contain information which is not of primary importance, and 2) the vast majority of scientific articles come with human-written abstracts, making the generated summaries superfluous. To address these shortcomings, we introduce the task of disentangled paper summarization. The new task's goal is to generate two summaries simultaneously, one strictly focused on the summarized article's novelties and contributions, the other introducing the context of the work and previous efforts.
In this form, the generated summaries can target the needs of diverse audiences: senior researchers and field experts who can benefit from reading the summarized contributions, and newcomers who can quickly get up to speed with the intricacies of the addressed problems by reading the context summary and get a perspective on the latest findings from the contribution summary. For this task, we introduce a new large-scale dataset by extending the S2ORC corpus of scientific papers, which spans multiple scientific domains and offers rich citation-related metadata. We organize and process the data, and extend it with automatically generated contribution and context reference summaries, to enable supervised model training. We also introduce three abstractive baseline approaches: 1) a unified, controllable model manipulated with descriptive control codes (Fan et al., 2018; Keskar et al., 2019), 2) a one-to-many sequence model with a branched decoder for multi-head generation (Luong et al., 2016; Guo et al., 2018), and 3) an information-theoretic training strategy leveraging supervision coming from the citation metadata (Peyrard, 2019). To benchmark our models, we design a comprehensive automatic evaluation protocol that measures performance across three axes: relevance, novelty, and disentanglement. We thoroughly evaluate and analyze the baseline models and investigate the effects of the additional training objective on the model's behavior. To motivate the usefulness of the newly introduced task, we conducted a human study involving expert annotators in a hypothetical paper-reviewing setting. The results show that disentangled summaries were found more helpful than abstract-oriented outputs in 79% of cases. Code, model checkpoints, and data preparation scripts introduced in this work are available at https://github.com/salesforce/disentangled-sum.

Recent trends in abstractive text summarization show a shift of focus from designing task-specific architectures trained from scratch (See et al., 2017; Paulus et al., 2018) to leveraging large-scale Transformer-based models pre-trained on vast amounts of data (Liu & Lapata, 2019; Lewis et al., 2020), often in multi-task settings (Raffel et al., 2019). A similar shift can be seen in scientific paper summarization, where state-of-the-art approaches utilize custom pre-training strategies and tackle the problem of summarizing long documents (Zhao et al., 2020; Zaheer et al., 2020). Other methods, at a smaller scale, seek to utilize the rich metadata associated with scientific articles and combine it with graph-based methods (Yasunaga et al., 2019). In this work, we combine these two lines of work and propose models that benefit from pre-training procedures but also take advantage of task-specific metadata. Popular large-scale benchmark datasets in scientific paper summarization (Cohan et al., 2018) were automatically collected from open-access paper repositories and consider article abstracts as the reference summaries. Other forms of supervision have also been investigated for the task, including author-written highlights (Collins et al., 2017), human annotations and citations (Yasunaga et al., 2019), and transcripts from conference presentations of the articles (Lev et al., 2019). In contrast, we introduce a large-scale, automatically collected dataset with more fine-grained references than abstracts, which also offers rich citation-related metadata.
Update summarization (Dang & Owczarzak) defines a setting in which a collection of documents with partially overlapping information, some of which are considered prior knowledge, is summarized. The goal of the task is to focus the generated summaries on the novel information. Work in this line of research mostly focuses on novelty detection in news articles (Bysani, 2010; Delort & Alfonseca, 2012) and timeline summarization (Martschat & Markert, 2018; Chang et al., 2016) in the news and social media domains. Here, we propose a novel task that is analogous to update summarization in that it also requires contrasting the source article with the content of other related articles which are considered pre-existing knowledge.

Given a source article D, the goal of disentangled paper summarization is to simultaneously summarize the contribution y_con and context y_ctx of the source article. Here, contribution refers to the novelties introduced in the article D, such as new methods, theories, or resources, while context represents the background of the work D, such as a description of the problem or previous work on the topic. The task inherently requires a relative comparison of the article with other related papers to effectively disentangle its novelties from pre-existing knowledge. Therefore, we also consider two sets of citations, inbound citations C_I and outbound citations C_O, as potential sources of useful information for contrasting the article D with its broader field. Inbound citations refer to the set of papers that cite D, i.e. relevant future papers, while outbound citations are the set of papers that D cites, i.e. relevant previous papers. With its unique set of goals, the task of disentangled paper summarization poses a novel set of challenges for automatic summarization systems to overcome: 1) identifying salient content of D and related papers from C_I and C_O, 2) comparing the content of D with each document from the citations, and 3) summarizing the article along the two axes: contributions and context.

Table 1 (dataset statistics; column headers not recovered from the extraction):
Train: 805152 / 6351 / 925 / 877 / 136 / 236
Valid: 36129 / 6374 / 922 / 875 / 135 / 236
Test: 54242 / 6350 / 927 / 892 / 136 / 237

Current benchmark datasets used for the task of scientific paper summarization, such as arXiv and PubMed (Cohan et al., 2018), are limited in size and domain coverage, and lack citation metadata. Thus, we construct a new dataset based on the S2ORC corpus, which offers a large collection of scientific papers spanning multiple domains along with rich citation-related metadata, such as citation links between papers and annotated citation spans. Specifically, we carefully curate the data available in the S2ORC corpus and extend it with new reference labels.

Data Curation: Some papers in the S2ORC corpus do not contain the complete set of information required by our summarization task: paper text, abstract, and citation metadata. We remove such instances and construct a paper summarization dataset in which each example a) has an abstract and body text, and b) has at least 5 inbound and 5 outbound citations, C_I and C_O respectively. In cases where a paper has more than 20 incoming or outgoing citations, we sort them in descending order by their respective citation counts and keep the top 20 most relevant articles.
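To make the curation step concrete, the following is a minimal sketch of the filter described above, assuming a hypothetical record layout for S2ORC entries; field names such as num_citations are illustrative, not the corpus's actual schema.

```python
MIN_CITATIONS, MAX_CITATIONS = 5, 20

def curate(papers):
    """Keep papers with an abstract, body text, and enough citations."""
    kept = []
    for paper in papers:
        if not paper.get("abstract") or not paper.get("body_text"):
            continue
        inbound = paper.get("inbound", [])
        outbound = paper.get("outbound", [])
        if len(inbound) < MIN_CITATIONS or len(outbound) < MIN_CITATIONS:
            continue
        # Cap each citation set at the 20 most cited papers.
        key = lambda p: p["num_citations"]
        paper["inbound"] = sorted(inbound, key=key, reverse=True)[:MAX_CITATIONS]
        paper["outbound"] = sorted(outbound, key=key, reverse=True)[:MAX_CITATIONS]
        kept.append(paper)
    return kept
```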
Citation Span Extraction: Each article in the set of inbound and outbound citations can be represented by its full text, abstract, or the span of text associated with the citation. In this study, we follow Qazvinian & Radev (2008) and Cohan & Goharian (2015) in representing citations with the sentences in which the citation occurs. Thus, an outbound citation is represented by a sentence from the source paper. Usually, such sentences directly refer to the cited paper and place its content in relation to the source paper. Analogously, an inbound citation is represented by sentences from the citing paper and relates its content with the source paper.

Reference Generation: Our approach relies on the availability of reference summaries for both contributions and contexts. However, such annotations are not provided or easily extractable from the S2ORC corpus, and collecting expert annotations is infeasible due to the associated costs. Therefore, we apply a data-driven approach to automatically extract contribution and context reference summaries from the available paper abstracts. First, we manually label 400 abstracts sampled from the training set. Annotations are done at the sentence level with binary labels indicating contribution- and context-related sentences. This procedure yields 3341 sentences with associated binary labels, which we refer to as gold standard references. Next, we fine-tune an automatic sentence classifier using the gold standard data. As our classifier we use SciBERT (Beltagy et al., 2019), which after fine-tuning achieves 86.3% accuracy in classifying contribution and context sentences on a held-out test set. Finally, we apply the fine-tuned classifier to generate reference labels for all examples in our dataset, which we refer to as silver standard references. The statistics of the resulting dataset are shown in Table 1.

Our goal is to build an abstractive summarization system that can generate contribution and context summaries based on the source article. To achieve the necessary level of controllability, we propose two independent approaches building on encoder-decoder architectures:

CONTROLCODE (CC): A common approach to controlling model-generated text is to condition the generation procedure on a control code associated with the desired output. Previous work on controllable generation (Fan et al., 2018; Keskar et al., 2019) showed that prepending a special token or descriptive prompt to the model's input during training and inference is sufficient to achieve fine-grained control over the generated content. Following this line of work, we modify our training instances by prepending textual control codes, contribution: or context:, to the summarized articles. During training, all model parameters are updated for each data instance and the model is expected to learn to associate the provided prompt with the correct output mode. The approach does not require changes to the architecture, making it straightforward to combine with existing large-scale, pre-trained models. The architecture is shown on the left of Figure 1.
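As an illustration, here is a minimal sketch of the CONTROLCODE input construction, using the tokenizer of the distilBART checkpoint referenced later in this work; the exact separator between the control code and the article text is an assumption.

```python
from transformers import AutoTokenizer

# Tokenizer of the distilBART checkpoint used as our backbone (see footnote 7).
tokenizer = AutoTokenizer.from_pretrained("sshleifer/student_cnn_6_6")

def build_input(article_text, mode):
    """Prepend the textual control code selecting the generation mode."""
    assert mode in ("contribution", "context")
    # The "<code>: <article>" layout is an assumption about the exact format.
    return tokenizer(f"{mode}: {article_text}", truncation=True,
                     max_length=1024, return_tensors="pt")
```

At inference time, the same article can be encoded twice, once per control code, to obtain both summaries from a single model.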
MULTIHEAD (MH): An alternative way of controlling generation is to explicitly allocate layers within the model for the desired control aspects. Prior work investigating multi-task models (Luong et al., 2016; Guo et al., 2018) showed the benefits of combining shared and task-specific layers within a single, multi-task architecture. Here, the encoder shares all parameters between the two generation modes, while the decoder shares all parameters apart from the final layer, which splits into two generation branches. During training, each branch is individually updated with gradients from the associated mode. The model shares the softmax layer weights between the output branches under the assumption that token-level vocabulary distributions are similar in the two generation modes due to the common domain. This approach is presented on the right of Figure 1.

Peyrard (2019) proposed an information-theoretic perspective on text summarization which decomposes the criteria of a good summary into redundancy, relevance, and informativeness. Among these criteria, informativeness measures the user's degree of surprise after reading a summary given their background knowledge, and can be formally defined as:

    Inf = Σ_i P_D(ω_i) log ( P_D(ω_i) / P_K(ω_i) )

where ω_i is a primitive semantic unit, P_K is the probability of the unit under the user's knowledge, P_D is the probability of the unit with respect to the source document, and i is an index over all semantic units within a summary. As defined by Peyrard (2019), informativeness is in direct correspondence with contribution summarization. Paper contributions are novel contents introduced to the community, which cause surprisal given the general knowledge about the state of the field. Therefore, in this work we explore utilizing this measure as an auxiliary objective that is optimized during training. We define the semantic unit ω_i as the summary itself, which enables a simple interpretation of the corresponding probabilities. We estimate P_D as the likelihood of the summary given the paper content, P_D(ω_i) = p(y | D). Since each paper is associated with a unique context and background knowledge, we treat the background knowledge as all relevant papers published before the source paper, i.e., the outbound citations C_O. Therefore, P_K is estimated as the likelihood of the summary given the previous work, P_K(ω_i) = p(y | C_O). We formulate the informativeness function as:

    INF = log p(y | D) − log p(y | C_O)

where the conditioning depends on the generation mode of the model, and we aim to maximize it during the training procedure. Combined with a cross-entropy loss L_CE, we obtain the final objective which we aim to minimize during training:

    L = L_CE − λ · INF

where λ is a scaling hyperparameter determined through cross-validation.
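A minimal sketch of this combined objective follows, assuming a HuggingFace-style sequence-to-sequence model whose .loss is the mean token-level negative log-likelihood; batch field names are illustrative, and the mean loss is used as a length-normalized stand-in for the summary log-likelihoods.

```python
def disentangled_training_loss(model, batch, lam=0.05):
    """Sketch of Eq. 3: L = L_CE - lam * INF, with INF = log p(y|D) - log p(y|C_O).

    Assumed batch layout: `doc_ids` holds the tokenized source paper,
    `cite_ids` the concatenated outbound-citation sentences, and `labels`
    the tokenized reference summary.
    """
    out_doc = model(input_ids=batch["doc_ids"], labels=batch["labels"])
    out_cite = model(input_ids=batch["cite_ids"], labels=batch["labels"])
    log_p_doc = -out_doc.loss    # approximates log p(y | D)
    log_p_cite = -out_cite.loss  # approximates log p(y | C_O)
    inf = log_p_doc - log_p_cite  # Eq. 2, to be maximized
    return out_doc.loss - lam * inf
```

Minimizing this loss increases the summary likelihood given the source paper while decreasing its likelihood given the outbound citations, which is the intended effect of maximizing INF.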
In this section, we describe the experimental environment and report automatic evaluation results. We consider four model variants:
• CC, CC+INF: the CONTROLCODE model without and with the informativeness objective,
• MH, MH+INF: the MULTIHEAD model without and with the informativeness objective.

Figure 2: Diagram illustrating the evaluation protocol assessing summaries along 3 axes: relevance, purity, and disentanglement.

We perform automatic evaluation of the system outputs (s_con, s_ctx) against the silver standard references (y_con, y_ctx). For this purpose, we have designed a comprehensive evaluation protocol, shown in Figure 2, based on existing metrics, which evaluates the performance of models across 3 dimensions:

Relevance: Generated summaries should closely correspond with the available reference summaries. We measure the lexical overlap and semantic similarity between (s_con, y_con) and (s_ctx, y_ctx) using ROUGE (R-i) (Lin, 2004) and BERTScore (Zhang et al., 2020; BS), respectively.

Purity: A generated contribution summary should closely correspond with its respective reference summary, but should not overlap with the context reference summary. We measure the lexical overlap between s_con and (y_con, y_ctx) using NouveauROUGE_con (N_con-i) (Conroy et al., 2011). The metric reports an aggregate score defined as a linear combination of the two components:

    N_con-i = α_{i,1} · R-i(s_con, y_con) + α_{i,2} · R-i(s_con, y_ctx)

where the weights α_{i,j} were set by the original authors to favor outputs with maximal overlap with the related reference and minimal overlap with the unrelated one. Analogously, we calculate N_ctx-i in the reverse direction between s_ctx and (y_ctx, y_con). Purity P-i is defined as the average novelty in both directions:

    P-i = (N_con-i + N_ctx-i) / 2

Disentanglement: Generated contribution and context summaries should have minimal overlap. We measure the degree of lexical overlap and semantic similarity between (s_con, s_ctx) using ROUGE and BERTScore, respectively. To maintain consistency across metrics (higher is better), we report disentanglement scores as complements of the associated metrics:

    D-i = 1 − R-i(s_con, s_ctx),    D-BS = 1 − BS(s_con, s_ctx)
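As a worked illustration of the complement computation, the following sketch uses the open-source rouge-score package as a stand-in for the SummEval toolkit used in our experiments; only the D-i scores are shown.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def disentanglement(s_con, s_ctx):
    """D-i = 1 - R-i(s_con, s_ctx): higher means less overlap between the two summaries."""
    scores = _scorer.score(s_con, s_ctx)
    return {f"D-{name[-1]}": 1.0 - score.fmeasure
            for name, score in scores.items()}
```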
Our models build upon distilBART^6, a Transformer-based (Vaswani et al., 2017), pre-trained sequence-to-sequence architecture distilled from BART (Lewis et al., 2020). Specifically, we used a model with 6 self-attention layers in both the encoder and decoder. Weights were initialized from a model fine-tuned on a news summarization task.^7 For the MULTIHEAD model, the final layer of the decoder was duplicated and initialized with identical weights. We fine-tuned on the training set for 80000 gradient steps with a fixed learning rate of 3.0 × 10^-5 and chose the best checkpoints in terms of ROUGE-1 scores on the validation set. The loss scaling hyperparameter λ (Eq. 3) was set to 0.05 and 0.01 for the CONTROLCODE and MULTIHEAD models, respectively. Input and output lengths were set to 1024 and 200, respectively. At inference time, we decoded using beam search with beam size 5. The evaluation was performed using the SummEval toolkit (Fabbri et al., 2020). In Table 2 we report results from the automatic evaluation protocol described in Subsection 5.1.

Relevance: Across most models and metrics, relevance scores for context generation are higher than those for contribution summarization. Manual inspection revealed that in some cases generated context summaries also include article contribution information, while this effect was not observed in the reverse situation. Considering that silver standard annotations may contain noisy examples with incorrectly separated references, we suspect that the higher ROUGE scores for context summaries may be caused by noisy predictions coinciding with noisy references. Examples of such summaries are shown in Appendix E. We also observe that informativeness-guided models (+INF) perform on par with their respective base versions, and the additional training objective does not affect performance on the relevance metric. This corroborates Peyrard (2019), who defines informativeness and relevance as orthogonal criteria.

Purity: While the informativeness objective was designed to improve the novelty of generated summaries, results show an opposite effect, where informativeness-guided models slightly underperform their base counterparts. The true reason for such behavior is unknown; however, it might indicate that the outbound citations C_O are not a good approximation of reference context summaries y_ctx, or that the relationship between the two is weak. This effect is more evident in the Medical and Biology domains, which are the two most frequent domains in the dataset.

^6 We did not observe a substantial difference in performance between distilBART and BART.
^7 Model weights are available at https://huggingface.co/sshleifer/student_cnn_6_6.

Disentanglement: Results indicate that CONTROLCODE-based models perform better than MULTIHEAD approaches in terms of generating disentangled outputs. This comes as a surprise given that the CC models share all parameters between the two generation modes, but might indicate that the two tasks contain complementary training signals. We also noticed that both informativeness-guided models performed better in terms of D-1. Based on both the purity and disentanglement evaluations, we suspect that the informativeness objective does guide the models to output more disentangled summaries (second term in Eq. 2), but the signal is not strong enough to focus on generating the appropriate content (first term in Eq. 2). It is also clear that the MULTIHEAD model benefits more from the additional training objective.

To better understand the strengths and shortcomings of our models, we performed a qualitative study of model outputs. Table 3 shows an example of generated summaries compared with the original abstract of the summarized article. Our model successfully separates the two generation modes and outputs coherent and easy-to-follow summaries. The contribution summary clearly lists the novelties of the work, while the context summary introduces the task at hand and explains its importance. In comparison, the original abstract briefly touches on many aspects: the context, methods used, and contributions, but also offers details that are not of primary importance, such as details about the simulation environment. More generally, the described trends hold across summaries generated by our models. The model outputs are fluent, abstractive, offer good separation between modes, and are on topic. However, the factual correctness of summaries could not be assessed due to the highly specialized content and language of the summarized articles. An artifact noticed in a few instances of the inspected outputs was leakage of contribution information into context summaries. Other examples of generated summaries are included in Appendix E.

Taking advantage of the rich metadata associated with the S2ORC corpus, we analyze the performance of models across the 10 most frequent scientific domains. Table 4 shows the results of contribution summarization using the CONTROLCODE model. While ROUGE-1 scores oscillate around 40 points for most academic fields, the results indicate that summarizing documents from the Medical domain is particularly difficult, with models scoring about 7 points below average.

Table 3: Generated samples compared with the original and generated abstracts of the associated paper. The second row shows the output decoded from distilBART fine-tuned on our dataset, and the third row shows the outputs from the CONTROLCODE model. Our model successfully generates disentangled content, thus making it easier to follow than the abstract.

Original Abstract: Energy optimization in buildings by controlling the Heating Ventilation and Air Conditioning (HVAC) system is being researched extensively. In this paper, a model-free actor-critic Reinforcement Learning (RL) controller is designed using a variant of artificial recurrent neural networks called Long-Short-Term Memory (LSTM) networks. Optimization of thermal comfort alongside energy consumption is the goal in tuning this RL controller. The test platform, our office space, is designed using SketchUp. Using OpenStudio, the HVAC system is installed in the office.
The control schemes (ideal thermal comfort, a traditional control and the RL control) are implemented in MATLAB. Using the Building Control Virtual Test Bed (BCVTB), the control of the thermostat schedule during each sample time is implemented for the office in EnergyPlus alongside local weather data. Results from training and validation indicate that the RL controller improves thermal comfort by an average of 15% and energy efficiency by an average of 2.5% as compared to other strategies mentioned.

Generated Abstract: Despite the advances in research on HVAC control algorithms, most field equipment is controlled using classical methods that include hysteresis/on/off and Proportional Integral and Derivative (PID) controllers. These classical methods do not perform optimally. The high thermal inertia of buildings induces large time delays in the building dynamics, which cannot be handled efficiently by the simple on/off controllers. However, due to the high non-linearity in building dynamics coupled with uncertainties such as weather, energy pricing, etc., these PID controllers require extensive retuning or auto-tuning capabilities, which increases the difficulty and complexity of the control problem. In this work, we introduce novel control algorithms from a branch of machine learning called reinforcement learning. From a controls perspective, reinforcement learning algorithms can be considered as direct adaptive optimal control. Like optimal control, reinforcement training algorithms minimize the cumulative sum of costs over a time horizon. Unlike traditional optimization algorithms can learn optimal control actions

Contribution: In this work, we introduce novel control algorithms from a branch of machine learning called reinforcement learning. In our current approach, the impetus is thermostat control. Instead of traditional on/off heating and cooling control, reinforcement learning is utilized to set this schedule to obtain improved Predicted Mean Vote (PMV)-based thermal comfort at an optimal energy expenditure. Hence, a thermostats schedule is computed using an RL controller. The results show that the Q-learning algorithm can learn to adapt to time-varying and nonlinear system dynamics without explicit identification of the plant model in both systems and controls.

Context: The Heating, Ventilation and Air Conditioning (HVAC) systems can account for up to 50% of total building energy demand. In the hopes of moving toward a greener, more energy-efficient future, a significant improvement in energy efficiency is needed to achieve this goal. Despite the advances in research on HVAC control algorithms, most field equipment is controlled using classical methods that include hysteresis/on/off and Proportional Integral and Derivative controllers. However, due to the high nonlinearity in building dynamics coupled with uncertainties such as weather, energy pricing, etc., these PID controllers require extensive retuning or auto-tuning capabilities, which increases the difficulty and complexity of the control problem. The high thermal inertia of buildings induces large time delays in the building dynamics, which cannot be handled efficiently by the simple on/off controllers.

Manual inspection of instances with low scores (R-1 < 20) exposed that contribution summaries in the Medical domain are highly quantitative (e.g. "Among these treated ... retinopathy was noted in X%").
While other domains such as Biology also suffer from the same phenomenon, low-scoring quantitative summaries were 1.9 times more frequent in Medicine than in Biology. An investigation into the domain distribution in our dataset (Appendix) revealed that Biology and Medicine are the two best represented fields in the corpus, with Biology having over twice as many examples. We hypothesize that the poor performance of models stems from the fact that generating such quantitative summaries requires a deeper, domain-specific understanding of the source document, and the available in-domain training data is insufficient to achieve that goal.

To assess the usefulness of the newly introduced task to the research community, we conducted a human study involving expert annotators. The study aimed to compare disentangled paper summaries with traditional, abstract-based summaries in a hypothetical paper-reviewing setting. Judges were shown both types of summaries side by side and asked to pick the one which would be more helpful for conducting the paper review. Abstract-based summaries were generated by a model with a configuration identical to the models previously introduced in this work, trained to generate full abstracts using the same training corpus. Annotators who participated in this study hold graduate degrees in technical fields and are active in the research community; however, they were neither involved in nor familiar with this work prior to this experiment. The study used 100 examples, out of which 50 were decoded on the test split of the adapted S2ORC dataset, while the other 50 were generated in a zero-shot fashion from articles in the CORD dataset, a recently introduced collection of papers related to COVID-19. Results in Table 5 show the proportion of all examples where the annotators preferred the disentangled summaries over the generated abstracts. The numbers indicate a strong preference from the judges for disentangled summaries, in the case of both S2ORC and CORD examples. The values on CORD samples are slightly higher than those on S2ORC; we suspect this is because the annotators were less familiar with the topics described in Covid-related publications and would require more help to review such articles.

In this paper, we propose disentangled paper summarization, a new task in scientific paper summarization where models simultaneously generate contribution and context summaries. With the task in mind, we introduced a large-scale dataset with fine-grained reference summaries and rich metadata. Along with the data, we introduced three abstractive baseline approaches to solving the new task and thoroughly assessed them using a comprehensive evaluation protocol designed for the task at hand. Through human studies involving expert annotators, we motivated the usefulness of the task in comparison to the current scientific paper summarization setting. Together with this paper, we release the code, trained model checkpoints, and data preprocessing scripts to support future work in this direction. We hope this work will positively contribute to creating AI-based tools for assisting scientists in the research process.

Different writing styles might locate and express contributions in different ways. To understand the global tendency of contribution locations in a paper, we take each sentence from the paper texts in the training set and annotate contributions using the learned sentence classifier. We then group the sentences into 10 bins according to their relative location in the papers they belong to and construct a distribution which summarizes the proportion of sentences labeled as contributions in each bin.
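A minimal sketch of this analysis follows, assuming the fine-tuned SciBERT sentence classifier is exposed through the transformers pipeline API; the checkpoint path, label name, and data layout are all illustrative.

```python
from collections import Counter
from transformers import pipeline

# Hypothetical path to the fine-tuned SciBERT sentence classifier.
classify = pipeline("text-classification", model="path/to/finetuned-scibert")

def contribution_histogram(papers, n_bins=10):
    """Fraction of sentences labeled 'contribution' per relative-location bin."""
    labeled, totals = Counter(), Counter()
    for sentences in papers:  # each paper is assumed to be a list of body sentences
        for idx, sent in enumerate(sentences):
            b = min(int(n_bins * idx / len(sentences)), n_bins - 1)
            totals[b] += 1
            # 'contribution' is an assumed label name for the fine-tuned model.
            if classify(sent)[0]["label"] == "contribution":
                labeled[b] += 1
    return [labeled[b] / max(totals[b], 1) for b in range(n_bins)]
```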
Figure 3 shows the percentages of such sentences for each bin. The graph shows that no bin position in the papers tends to describe contributions more than 50% of the time. Surprisingly, the first 10% of the papers have the lowest chance of describing the contributions, which runs counter to the general idea that papers tend to discuss the introduction and highlights of the work at the beginning.

As discussed in Section 3.1, labels for contribution or context are populated automatically using a classifier, which is expected to contain mistakes. Therefore, we created a gold standard evaluation set by manually annotating 100 samples in the test set and report the evaluation results in Table 6. A sharp drop in ROUGE scores for the context summaries is due to some examples receiving zero scores for generated context summaries when the manual annotation judged that none exist. The overall trend of the CONTROLCODE model outperforming the MULTIHEAD model is still observed in this evaluation. More noticeably, we observe a reverse tendency when the two models are trained with the informativeness objective: the MULTIHEAD model specifically enjoyed a significant improvement in terms of novelty and disentanglement.

In addition to the various automatic evaluations, we performed a human evaluation of disentanglement to understand which models human annotators prefer. We use Best-Worst scaling (Kiritchenko & Mohammad, 2017) over 4-tuples of summaries on 50 random samples from the test set and have 3 annotators pick the best and the worst contribution and context summary pairs in terms of disentanglement. The rating in Table 7 shows the percentage a model is chosen as the best minus the percentage a model is chosen as the worst; ratings range from -1 to 1. A similar trend to the automatic disentanglement evaluation is observed here as well, in that introducing the informativeness objective significantly improves the MULTIHEAD model, while it adversely affects the CONTROLCODE model.

We show additional samples generated from the CONTROLCODE model in Tables 8, 9, and 10. A common failure case of all our models is a lack of disentanglement between the two summaries. While samples start on topic, the model for context summary generation starts generating the contributions at some point. This phenomenon can be conspicuously observed in Tables 8 and 10.

Original Abstract: Domain adaption (DA) allows machine learning methods trained on data sampled from one distribution to be applied to data sampled from another. It is thus of great practical importance to the application of such methods. Despite the fact that tensor representations are widely used in Computer Vision to capture multi-linear relationships that affect the data, most existing DA methods are applicable to vectors only. This renders them incapable of reflecting and preserving important structure in many problems. We thus propose here a learning-based method to adapt the source and target tensor representations directly, without vectorization. In particular, a set of alignment matrices is introduced to align the tensor representations from both domains into the invariant tensor subspace.
These alignment matrices and the tensor subspace are modeled as a joint optimization problem and can be learned adaptively from the data using the proposed alternative minimization scheme. Extensive experiments show that our approach is capable of preserving the discriminative power of the source domain, of resisting the effects of label noise, and works effectively for small sample sizes, and even one-shot DA. We show that our method outperforms the state-of-the-art on the task of cross-domain visual recognition in both efficacy and efficiency, and particularly that it outperforms all comparators when applied to DA of the convolutional activations of deep convolutional networks.

Generated Abstract: The difficulty of securing an appropriate and exhaustive set of training data, and the tendency for the domain of application to drift over time, often lead to variations between the distributions of the training (source) and test (target) data. In Machine Learning this problem is labeled domain mismatch. Failing to model such a distribution shift may cause significant performance degradation. Domain adaptation (DA) techniques capable of addressing this problem of distribution shift have thus received significant attention recently [24]. The assumption underpinning DA is that, although the domains differ, there is sufficient commonality to support adaptation. Many approaches have modeled this commonality by learning an invariant subspace, or set of subspaces. These methods are applicable to vector data only, however. Applying these methods to structured high-dimensional representations (e.g., convolutional activations) thus requires that the data be vectorized first. Although this solves the algebraic issue, it does not solve the underlying problem. Tensor

Contribution: To address these issues, we propose a novel approach termed Tensor-Aligned Invariant Subspace Learning (TAISL) to learn an invariant tensor subspace that is able to adapt the tensor representations directly. By introducing a set of alignment matrices, the tensors from the source domain are aligned to an underlying tensor space shared by the target domain. Instead of executing a holistic adaptation (where all feature dimensions would be taken into account), our approach performs mode-wise partial adaptation where each mode is adapted separately to avoid the curse of dimensionality. We also propose an alternating minimization scheme which allows the problem to be effectively optimized by off-the-shelf solvers. Extensive experiments on cross-domain visual recognition demonstrate the following merits of our approach: i) it effectively reduces the domain discrepancy and preserves the discriminative power of the original representations; ii) it is applicable to small sample size adaptation, even when there is only one source

Context: Deep convolutional neural networks (CNNs) represent the state-of-the-art method for a substantial number of visual tasks. The activations of such CNNs, and the interactions between them, are naturally represented as tensors, meaning that DA should also be applied using this representation. However, after vectorization, many existing approaches become sensitive to the scarcity of source data (compared to the number of dimensions) and noise in the labels. The proposed direct tensor method uses much lower dimensional entities, thus avoiding these estimation problems. To address these issues we propose to learn an invariant tensor subspace that is able to adapt the tensor representations directly.
We show in Section 5 that the proposed method outperforms all comparators in DA of the Convolutional Activation of CNNs. Higher-order tensor modeling offers us an opportunity to investigate multiple interactions and couplings that capture the commonality and differences between domains. Following this idea, a novel approach

Table 9: Generated sample. The context summary contains a slight erroneous decoding of contribution information.

Original Abstract: Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.

Generated Abstract: The Imagination-based Planner (IBP) is a model-based agent which learns from experience all aspects of the planning process: how to construct, evaluate, and execute a plan. The IBP learns when to act versus when to imagine, and if imagining, how to select states and actions for aggregating the proposed actions and evaluations into a useful plan. Through training, it effectively develops a planning algorithm tailored to the target problem. The learned algorithm allows it to flexibly explore, and exploit regularities in, the state and action spaces. In two experiments we evaluated a continuous IBP implementation on a challenging continuous control task, and a discrete IBP in a maze-solving problem. A fully learnable plan-based planning agent for continuous control. An agent that learns to construct a plan via model-inspired imagination was trained to perform "conditional" and "adaptive computation," which results in a dynamic computational graph recently [7] trained neural networks trained to

Contribution: In this paper we introduce the Imagination-based Planner (IBP), a model-based planning agent which learns from experience all aspects of the planning process: how to construct, evaluate, and execute a plan. The IBP learns when to act versus when to imagine, and if imagining, how to select states and actions to evaluate which will help minimize its external task loss and internal resource costs. Through training, it effectively develops a planning algorithm tailored to the target problem. In two experiments we evaluated a continuous IBP implementation on a challenging continuous control task, and a discrete IBP in a maze-solving problem.
Context: Model-based planning involves proposing sequences of actions, evaluating them under a model of the world, and refining these proposals to optimize expected rewards. Several key advantages of model-free methods are that models support generalization to states not previously experienced, help express the relationship between present actions and future rewards, and can resolve states which are aliased in value-based approximations. These advantages are especially pronounced in problems with complex and stochastic environmental dynamics, sparse reward, and restricted trial-and-error experience. Yet even with an accurate model, planning is often very challenging because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Existing techniques for model based planning are most effective in small-scale problems, often require background knowledge of the domain, and use pre-defined solution strategies. Here we introduce the Imagination-based Planner (IBP), a model-based agent which learns from experience all aspects of the planning

Table 10: Generated sample. More than half of the context summary describes contributions.

Original Abstract: We provide a generalization of discriminative lexicalized shift reduce parsing techniques for phrase structure grammar to a wide range of morphologically rich languages. The model is efficient and outperforms recent strong baselines on almost all languages considered. It takes advantage of a dependency based modelling of morphology and a shallow modelling of constituency boundaries.

Generated Abstract: This paper shows that an efficient lexicalized phrase structure parser - modelling both dependencies and morphology - already significantly improves parsing accuracy. But we also show that an additional modelling of spans and constituency provides additional robustness that contributes to yield state of the art results on almost all languages considered, while remaining quite quite efficient. Moreover given the availability of existing multi-view treebanks (Bhatt et al., 2009; Seddah et al. 2014), our proposed solution only requires a lightweight infrastructure to achieve multilingual parsing without requiring costly languaged dependent modifications such as feature engineering. This method was used for the construction of the dataset for the SPMRL 2013 shared task. We first review the properties of multiview treebanks for a number of morphologically rich languages for which they are annotated both for constituents and dependencies that have the property to be token-wise aligned. These double annotations are typically obtained by converting a constituency or dependency annotation into the other annotation

Contribution: In this paper, we propose a lexicalized phrase structure parsing model that can effectively take morphology into account. We show that an efficient lexicalization of both dependencies and morphology significantly improves multilingual state-of-the-art results on almost all languages considered, while remaining quite efficient. Moreover, given the availability of existing multi-view treebanks, our proposed solution only requires a lightweight infrastructure to achieve multilingual parsing without requiring costly feature engineering such as feature engineering.

Context: Most state of the art multilingual parsers are weighted by discriminative models.
Most state-of-the-art multilingual parsing methods rely on lexicalized phrase structure parsing techniques, which have recently been shown to improve performance in a variety of languages including free word order languages like English or Chinese. In this paper we show that an efficient lexicalized parser - modelling both dependencies and morphology - already significantly improves parsing accuracy. But it also shows that an additional modelling of spans and constituency provides additional robustness that contributes to yield state-of-the-art results on almost all languages considered, while remaining quite efficient. Moreover, given the availability of existing multi-view treebanks (Bhatt et al., 2009; Seddah et al. 2013), our proposed solution only requires a lightweight infrastructure to achieve multilingual parsing without requiring costly feature engineering such as feature engineering.

References:
SciBERT: A pretrained language model for scientific text
Detecting novelty in the context of progressive summarization
Timeline summarization from social media with life cycle models
Scientific article summarization using citation-context and article's discourse structure
A discourse-aware attention model for abstractive summarization of long documents
A supervised approach to extractive summarisation of scientific papers
Nouveau-ROUGE: A novelty metric for update summarization
Overview of the TAC 2008 update summarization task
DualSum: A topic-model based approach for update summarization
SummEval: Re-evaluating summarization evaluation
Controllable abstractive summarization
Soft layer-specific multi-task summarization with entailment and question generation
CTRL: A Conditional Transformer Language Model for Controllable Generation
Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation
TalkSumm: A dataset and scalable annotation method for scientific paper summarization based on conference talks
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
ROUGE: A package for automatic evaluation of summaries
Text summarization with pretrained encoders
S2ORC: The Semantic Scholar Open Research Corpus
Multi-task sequence to sequence learning
A temporally sensitive submodularity framework for timeline summarization
A deep reinforced model for abstractive summarization
A simple theoretical model of importance for summarization
Scientific paper summarization using citation summary networks
Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Get to the point: Summarization with pointer-generator networks
Attention is all you need. CoRR, abs/1706.03762
The COVID-19 Open Research Dataset
HuggingFace's Transformers: State-of-the-art natural language processing
ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks
Big Bird: Transformers for longer sequences
PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization
BERTScore: Evaluating text generation with BERT
SEAL: Segment-wise extractive-abstractive long-form text summarization

The authors thank Wenhao Liu, Divyansh Agarwal, Sharvin Shah, and Tania Lopez-Cantu for assisting with annotations.