Title: Towards Fine-grained Causal Reasoning and QA
Authors: Linyi Yang; Zhen Wang; Yuxiang Wu; Jie Yang; Yue Zhang (* These authors contributed equally to this work.)
Date: 2022-04-15

Understanding causality is key to the success of NLP applications, especially in high-stakes domains. Causality comes in various perspectives, such as enable and prevent, that, despite their importance, have been largely ignored in the literature. This paper introduces a novel fine-grained causal reasoning dataset and presents a series of novel predictive tasks in NLP, such as causality detection, event causality extraction, and Causal QA. Our dataset contains human annotations of 25K cause-effect event pairs and 24K question-answering pairs within multi-sentence samples, where each sample can have multiple causal relationships. Through extensive experiments and analysis, we show that the complex relations in our dataset bring unique challenges to state-of-the-art methods across all three tasks and highlight potential research opportunities, especially in developing "causal-thinking" methods.

Towards the goal of building more powerful AI systems that go beyond making predictions from statistical correlations (Kaushik et al., 2021; Srivastava et al., 2020; Li et al., 2021), causality has received much research attention in recent years (Gao et al., 2019a; Schölkopf et al., 2021; Feder et al., 2021; Scherrer et al., 2021). In NLP, understanding fine-grained causal relations between events in a document is essential for language understanding and is beneficial to various applications, such as information extraction (Gao et al., 2019a), question answering (Chen et al., 2021b), and machine reading comprehension (Chen et al., 2021a), especially in high-stakes domains such as medicine and finance.

Despite a large body of work on automatic causality detection and reasoning over text (Khoo et al., 1998; Mirza et al., 2014; Chang and Chen, 2019; Mariko et al., 2020b), relatively little work has considered the plethora of fine-grained causal concepts (Talmy, 1988; Wolff et al., 2005). For example, the spread of COVID-19 has led to a boom in online shopping (i.e., cause), but it has also deterred (i.e., prevent) people from going to shopping centres. Previous work has focused only on the "cause" relation. However, as suggested by the literature in classical psychology (Wolff and Song, 2003), a single "cause" relationship cannot cover the richness of causal concepts in real-world scenarios; instead, it is important to understand possible fine-grained relationships between two events from different causal perspectives, such as enable and prevent. In practice, being able to recognize these fine-grained relationships not only benefits the causal inference task (Keith et al., 2020) but also facilitates the construction of event evolutionary graphs by providing more possible relation patterns between events (Li et al., 2018). Motivated by Wolff and Song (2003), we extend the causal reasoning task in NLP from a shallow "cause" relationship to three possible fine-grained relationships when constructing our dataset, including cause, enable, and prevent. Formally, the enable relationship can be expressed as a sufficient but not necessary condition between events, while the cause relationship typically refers to a necessary and sufficient condition.
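As a concrete illustration of this taxonomy (ours, not part of the dataset release), labeled event pairs can be encoded as simple triples. The first two triples below restate the COVID-19 example above, while the enable triple is a hypothetical example added for completeness.

```python
# Illustrative encoding of fine-grained causal relations between event pairs.
# The cause/prevent triples restate the COVID-19 example from the text; the
# enable triple is hypothetical and not taken from the FCR dataset.
from dataclasses import dataclass

@dataclass
class CausalTriple:
    cause: str
    relation: str  # "cause", "enable", or "prevent" (plus passive "_by" variants)
    effect: str

examples = [
    CausalTriple("the spread of COVID-19", "cause", "a boom in online shopping"),
    CausalTriple("the spread of COVID-19", "prevent", "people going to shopping centres"),
    CausalTriple("a large delivery network", "enable", "same-day shipping"),  # hypothetical
]

for t in examples:
    print(f"{t.cause} --[{t.relation}]--> {t.effect}")
```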
Based on the hand-labeled, fine-grained cause-effect pairs extracted from the text corpus, we construct a fine-grained causal reasoning (FCR) dataset, which consists of 25,193 cause-effect pairs as well as 24,486 question-answering pairs, in which almost all questions are "why" and "what-if" questions concerning the three fine-grained causalities. To the best of our knowledge, FCR is the first human-labeled fine-grained event causality dataset. We define a series of novel tasks based on it, including causality detection, fine-grained causality extraction, and Causal QA. In contrast to the 94% F1 score achieved by state-of-the-art models on traditional cause-effect detection (e.g., FinCausal (Mariko et al., 2020b)), experimental results show a significant gap between machine and human ceiling performance (74.1% vs. 90.53% accuracy) on our fine-grained task, providing evidence that current large statistical models still struggle to solve causal reasoning problems.

Table 1 compares our dataset with datasets in the domains of both event causality and question answering (QA). In general, neither cause-effect detection datasets nor QA datasets consider fine-grained causal reasoning tasks. The FinCausal (Mariko et al., 2020a) dataset is the most relevant to ours; it is a relatively small dataset built from the EDGAR database*, focuses only on the simple "cause" relation, and does not contain QA tasks. We release our dataset and code on GitHub*.
* https://www.sec.gov/edgar/
* https://github.com/YangLinyi/Fine-grained-Causal-Reasoning

Fine-grained Causal Reasoning. There is a rich literature on causal inference techniques using non-text datasets (Pearl, 2009; Morgan and Winship, 2015; Keith et al., 2020; Feder et al., 2021), and a line of work focuses on discovering causal relationships between events from textual data (Gordon et al., 2012; Mirza and Tonelli, 2016; Du et al., 2021). Previous efforts center on graph-based event causality detection (Tanon et al., 2017; Du et al., 2021) and event-level causality detection (Mariko et al., 2020a; El-Haj et al., 2021; Gusev and Tikhonov, 2021). However, causal reasoning over text with a special focus on fine-grained causality between events has received relatively little attention. A contrast between our dataset and previous causality detection datasets is shown in Fig. 1. As can be seen, given the same passage "COVID-19 has accelerated change in online shopping, and given Amazon's ... it will result in economic returns for years to come and offering more competitive prices compared to an offline business that brings pressures for the offline business recruitment.", previous work can extract facts such as "COVID-19 causes an increase in online shopping", yet cannot detect the subsequent event of Amazon "offering more competitive prices", nor the further negative influence on offline business recruitment, both of which can be valuable for predicting future events.

Causal QA. Our QA task is similar to the machine reading comprehension setting (Huang et al., 2019), where algorithms make a multiple-choice selection given a passage and a question. Nevertheless, we focus on causal questions, which turn out to be more challenging.
Existing popular question answering datasets (Sun et al., 2019) mainly focus on what, who, where, and when questions, making their usage scenarios somewhat limited. SQuAD (Rajpurkar et al., 2016, 2018) consists of factual questions concerning Wikipedia articles, and SQuAD 2.0 additionally involves unanswerable questions. Although some datasets contain causal reasoning tasks (Lai et al., 2017; Sun et al., 2019), none of them consider answering questions with text spans. Span-based question answering has gained wide interest in recent years (Yang et al., 2018; Huang et al., 2019; Lewis et al., 2021b). The answers in DROP (Dua et al., 2019) may come from different spans of a passage and require combining them to obtain the correct answer. Compared with these datasets, none of which features causal reasoning and span-based QA simultaneously, our dataset is the first to leverage fine-grained human-labeled causality to design a Causal QA task consisting of "Why" and "What-if" questions.

We collected an analyst report dataset from Yahoo Finance*, which contains 6,786 well-processed articles published between December 2020 and July 2021. Each instance corresponds to a specific analyst report on a U.S.-listed company, highlighting the strengths and weaknesses of the company. The original FCR dataset consists of 6,786 articles with 54,289 sentences.
* We have received written consent from Yahoo Finance.

We employ editors from a crowd-sourcing company to complete several human annotation tasks. Several pre-processing steps requiring crowd-sourcing effort were carried out to prepare the raw dataset, including (1) a binary classification task for causality detection; (2) a span labeling task to mark the cause and effect as text chunks (a given instance may contain multiple causal relations) and to give each event pair a fine-grained causality label among cause, cause_by, enable, enable_by, prevent, prevent_by, and irrelevant, where the suffix "_by" means the effect comes before its cause; and (3) a re-writing task to generate the question-answering dataset from the labeled event triples.

Causality Detection. We first focus on a binary classification task for causality detection; accordingly, we removed sentences labeled as non-causal, leaving only those text sequences (one or two sentences) considered to contain at least one causal relation.

Fine-grained Event Causality. Given the sentences, each containing at least one event causality, human annotators are required to highlight all event causalities and give each instance a fine-grained label. As shown in Fig. 2 (a), a single sentence can have more than one event causality, each of which is stored as a ⟨cause, relation, effect⟩ triple.

Causal QA. As shown in Fig. 2 (b), we design a novel and challenging causal reasoning QA task based on the fine-grained causality labels. We expand each ⟨cause, relation, effect⟩ triple to generate a plausible question-answer pair. Different templates are designed for different types of questions. For example, the active causal relations (CAUSE, ENABLE, and PREVENT) are typically used to generate why-questions, while the corresponding passive causal relations are used to generate what-if questions, as sketched below.
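The paper does not spell out the exact template wording, so the following is a hypothetical sketch of template-based question generation from ⟨cause, relation, effect⟩ triples; it only illustrates how active relations can map to why-questions (with the cause span as the answer) and passive "_by" relations to what-if questions (with the effect span as the answer).

```python
# Hypothetical sketch of template-based question generation from labeled
# <cause, relation, effect> triples; the template wording is illustrative and
# not the templates used to construct the FCR dataset.
ACTIVE = {"cause", "enable", "prevent"}            # mapped to why-questions
PASSIVE = {"cause_by", "enable_by", "prevent_by"}  # mapped to what-if questions

def generate_qa(cause: str, relation: str, effect: str):
    """Return one (question, answer) pair for a labeled event triple."""
    relation = relation.lower()
    if relation in ACTIVE:
        # The effect is stated; ask why it happened, and answer with the cause span.
        return f"Why did the following happen: {effect}?", cause
    if relation in PASSIVE:
        # "_by" relations mention the effect before its cause; ask what follows
        # from the cause, and answer with the effect span.
        return f"What happens if {cause}?", effect
    raise ValueError(f"Unknown relation: {relation}")

question, answer = generate_qa(
    cause="a first mover in the local-market daily deals space",
    relation="enable",
    effect="Groupon has captured a leadership position",
)
print(question)
print(answer)
```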
Quality Control. To ensure high quality, we restricted participation to experienced human labelers with relevant track records. For each task, we conducted pilot tests before the crowd-sourcing work officially began, receiving feedback from quality inspectors and revising the instructions accordingly. We filter out sentences concerning estimates of stock price movement, due to the inherently sensitive and uncertain nature of the complex financial market. After the first round of annotation (half of the data), we manually organized spot checks on 10% of the samples in the dataset and revised incorrect labels. After review, we revised roughly 3% of instances and excluded labelers with an error rate above 10% from participating in the second round of annotation. Finally, the inter-annotator agreement (IAA) is 91% for fine-grained causality labels, and the IAA F1 score is 94% for causal question-answer pairs.

In total, we obtained a dataset of 51,025 instances (21,046 of which contain at least one causal relation) with fine-grained labels of cause-effect relations, which were subsequently divided into training, validation, and testing sets for the following experiments. It is worth noting that we sort the dataset in chronological order because future data should not be available when making predictions. The primary data statistics of the FCR dataset are shown in Table 2 for the three tasks. We observe no significant difference in the average number of tokens between positive and negative examples for Tasks 1 and 2, which makes it difficult for predictive models to learn from shortcut features (e.g., instance length) during training (Sugawara et al., 2018, 2020; Lai et al., 2021). Furthermore, our dataset contains 846 multi-sentence samples, and 3,017 text chunks contain more than one causal relation in a single instance, which requires a complex reasoning process to reach the correct answer, even for a human. Most importantly, unlike other QA datasets (Sugawara et al., 2018, 2020) that can easily benefit from test-train overlap, as revealed by Lewis et al. (2021a), our dataset is sorted in chronological order so that the test data is unlikely to coincide with the training set. This allows us to gain greater insight into the extent to which models can actually generalize.

Our dataset contains multi-sentence instances with fine-grained causality labels and meta-information (company names and publication dates). We list the sector distribution in Fig. 3 to show that our dataset contains cause-effect pairs from different domains, although it was collected from a single source (financial reports). The three largest sectors are Consumer Cyclical, Industrial, and Technology, while instances from Utilities companies form the smallest group in our dataset. The use of the meta-information is two-fold. First, we choose the three largest domains to perform out-of-domain evaluations (see Appendix B). Second, company names are used for generating question templates.

The pipeline of our experiments is shown in Fig. 4. We define three tasks on our FCR dataset and build strong benchmark results for each. First, as a prerequisite, models are evaluated on a binary classification task to predict whether a given text sequence contains a causal relation (Task 1).
Second, we set up a joint event extraction and fine-grained causality classification task, identifying the text chunks that describe the cause and effect, respectively, and the fine-grained causality category each pair belongs to (Task 2). Finally, we design a question answering task for answering the challenging "why" and "what-if" questions (Task 3).

For the binary classification task, an instance is labeled positive if it contains at least one causal relation and negative otherwise. The input data is extracted directly from the raw dataset, which contains 846 multi-sentence samples. We include multi-sentence samples in addition to single sentences because causality can be found in multi-sentence contexts.

In Task 2, we use [CLS] and [SEP] to mark an event's beginning and end positions, respectively. For example, we have "[CLS] Better card analytics, increased capital markets and M&A offerings, and bolt-on acquisitions [SEP] should help drive [CLS] growth in fee income [SEP]", in which "Better card analytics, increased capital markets and M&A offerings, and bolt-on acquisitions" is a "cause" event and "growth in fee income" is an "effect" event. In total, there are 33,634 event triples with seven-class labels used in our experiments. Then, we conduct cause-effect extraction within samples. We treat cause-effect extraction as a multi-span event extraction task, since complex causal scenarios may contain multiple causes or effects within a single instance. We set the label of the first token of a cause or effect to "B", the remaining tokens within the detected text chunk to "I", and all other words in the instance to "O". The results of event causality extraction are reserved for generating causal questions.

Both extractive and generative methods are evaluated. For the extractive QA task, we adopt the same approach as previous Transformer-based QA work (Kayesh et al., 2020). In particular, we first convert the context C = (c_1, c_2, ..., c_l) and question Q = (q_1, q_2, ..., q_l) into a single sequence X = [CLS] c_1 c_2 ... c_l [SEP] q_1 q_2 ... q_l [SEP], which is passed to a pre-trained Transformer encoder to predict the answer span boundary (start and end). We consider both classical deep learning models, CNN-Text (Kim, 2014) and HAN (Yang et al., 2016), and Transformer-based models downloaded from Huggingface*, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and SpanBERT (Joshi et al., 2020), as predictive models. In addition, we perform the causal reasoning QA task by leveraging six Transformer-based pre-trained models built on the Transformer architecture (Vaswani et al., 2017) with the framework provided by Huggingface*, including BERT-base, BERT-large, RoBERTa-base, RoBERTa-large, RoBERTa-base-with-squad, and RoBERTa-large-with-squad*. Furthermore, we fine-tune pre-trained seq2seq models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) on the QA pairs as benchmark methods for the generative QA task. In particular, we consider T5-small, T5-base, T5-large, BART-base, and BART-large models for building the benchmark results.

We present and discuss the results of Tasks 1-3 on the FCR dataset in this section. For hold-out evaluation, we split our dataset into mutually exclusive training/validation/testing sets in an 8:1:1 ratio for all tasks.
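To make the extractive QA formulation above concrete, here is a minimal sketch using the Hugging Face transformers API (our own illustration, not the authors' released training code). The question wording is hypothetical, and an encoder whose QA head has not been fine-tuned on the FCR training split will return an arbitrary span.

```python
# Minimal sketch of the extractive QA setup: the context and question are packed
# into one sequence and a pre-trained encoder predicts the answer-span boundary.
# This is an illustration, not the authors' training code; without fine-tuning
# the QA head on the FCR training split, the predicted span is arbitrary.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "roberta-base"  # any of the encoders listed above could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

context = ("Better card analytics, increased capital markets and M&A offerings, "
           "and bolt-on acquisitions should help drive growth in fee income.")
question = "Why should growth in fee income happen?"  # illustrative question wording

# X = [CLS] c_1 ... c_l [SEP] q_1 ... q_l [SEP], following the formulation above
inputs = tokenizer(context, question, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)  # start_logits and end_logits over input positions

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1],
                          skip_special_tokens=True)
print(answer)
```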
Predictive models and data splitting strategies are kept the same across these tasks when building the benchmark results. In line with best practice, model hyper-parameters are tuned on the validation set, and both validation and test results are reported. We use Adam as the optimizer and decay the learning rate as the number of training steps increases, training all models until convergence. The Macro F1-score and accuracy are used for evaluating the event causality analysis tasks, and exact match and F1-score are used for Causal QA. The Macro F1-score is defined as the mean of label-wise F1-scores, Macro-F1 = (1/N) * sum_{i=1}^{N} F1_i, where i is the label index and N is the number of classes.

The causality detection results are shown in Table 3. Although Transformer-based methods achieve much better results than the other methods (CNN and HAN using ELMo embeddings) at judging whether an instance contains at least one causal relationship, with RoBERTa-Large obtaining the highest F1 score of 84.64, they remain significantly below human performance (84.64 vs. 94.32). Human performance is reported by quality inspectors from the crowd-sourcing company. It is worth noting that the best results on the FinCausal (Mariko et al., 2020b) dataset reach human-level performance (F1 = 97.75), providing indirect evidence that our dataset is more challenging due to its more complex causality instances.

The results of the fine-grained event causality extraction task are shown in Table 4. We find that the SpanBERT and RoBERTa models achieve the best performance on event causality extraction (F1 = 86.82 and EM = 60.26) and fine-grained classification (F1 = 68.99 and ACC = 74.09), respectively. Nevertheless, all methods perform dramatically worse on the more challenging joint task, where a prediction is judged correct only if the event extraction and classification results exactly match the ground truth. Although the SpanBERT-large model achieves the highest EM of 21.78 on the test set, there is still much room for improvement. We also find that Transformer-based models (Vaswani et al., 2017) with larger parameter sizes do not improve performance on these tasks, comparing test accuracies of 71.72 and 69.85 on the fine-grained classification task. This suggests that simply increasing parameter size may not help causal reasoning tasks.

A more detailed error analysis using the best-performing RoBERTa-Large model is given in Table 5. The model performs relatively well in terms of F1 score when predicting simple causal relations, namely Irrelevant (84.40), Cause (74.00), Cause_by (79.62), and Prevent (76.34), but worse when predicting complex relations, namely Enable (62.61) and Enable_by (41.49), and the category with few examples, Prevent_by (64.46). This indicates that fine-grained causal reasoning poses a unique challenge for pre-trained models.

We provide both quantitative and qualitative analyses for Causal QA. In addition, we compare the best performance on our dataset with that on other datasets. The results of Causal QA are given in Table 6, although to what extent models can actually benefit from additional data for generalization is hard to evaluate.
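As a reference for the metrics above, here is a minimal sketch (our own illustration, not the authors' scoring script) of macro F1 for the classification tasks and of exact match and token-level F1 for span answers.

```python
# Minimal sketch of the evaluation metrics described above; not the authors'
# released scoring script.
from collections import Counter
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    # Mean of label-wise F1 scores: (1/N) * sum_i F1_i
    return f1_score(y_true, y_pred, average="macro")

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(macro_f1(["cause", "enable", "prevent"], ["cause", "prevent", "prevent"]))
print(exact_match("targeted marketing", "analyzing the data"))   # False
print(token_f1("the production of toxic proteins",
               "production of toxic proteins"))                  # ~0.89
```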
We are interested in better understanding the difficulty of the Causal QA task compared to other popular datasets in terms of prediction performance. We list the best-performing models on several popular datasets in Table 7. In general, we find that reasoning-based tasks are harder than other tasks, as reflected in the relatively low accuracy achieved by state-of-the-art methods. LogiQA is more challenging than our dataset (39.3 vs. 85.6 in accuracy) because it requires heavy logical reasoning rather than identifying causal relations from text. Moreover, we find that the state-of-the-art result on our dataset (RoBERTa-SQuAD) is dramatically worse than the best performance on other datasets (EM = 90.9 on SQuAD2.0 vs. EM = 61.6 on Causal QA). This may suggest that the model tends to output partially correct answers but fails to output completely correct ones, although further research is required, as the model can also be easily perturbed by the length of an event. Meanwhile, human performance is still ahead of the best-performing model on the causal reasoning QA task. Thus, we argue that Causal QA is worth investigating with more "causal-thinking" methods in the future.

Table 7: Best-performing models on several popular QA datasets.
Dataset | Method | F1 | ACC | EM
SQuAD1.1 (Rajpurkar et al., 2016) | LUKE (Yamada et al., 2020) | 95.7 | - | 90.6
SQuAD2.0 (Rajpurkar et al., 2018) | IE-Net (Gao et al., 2019b) | 93.2 | - | 90.9
DROP (Dua et al., 2019) | QDGAT | 88.4 | - | -
HotpotQA (Yang et al., 2018) | BigBird-etc (Zaheer et al., 2020)

Table 8: Qualitative analysis of "Why" and "What-if" question answering based on the best-performing RoBERTa-Large model. The company name can be found in the meta-information of our dataset. Cause and Effect are extracted from the original context. The inputs to the model consist of the context and the question.

Examples of Correct Predictions
(Relation: Enable) Context: As a first mover in the local-market daily deals space, Groupon has captured a leadership position, but not robust profitability. Question: What enable Groupon capture a leadership position? Gold answer: A first mover in the local-market daily deals space. Predicted answer: A first mover in the local-market daily deals space.
(Relation: Prevent_By) Context: In neurology, RNA therapies can reach their intended targets via intrathecal administration into spinal fluid, directly preventing the production of toxic proteins. Question: What will be prevented if intrathecal administration into spinal fluid? Gold answer: The production of toxic proteins. Predicted answer: The production of toxic proteins.

Examples of Incorrect Predictions
(Relation: Enable_By) Context: ... Through analyzing the data and applying artificial intelligence, the advertisers can improve the efficiency of advertisements through targeted marketing for Tencent ... Question: What can help advertisers to improve the efficiency of advertisements? Gold answer: Analyzing the data and applying artificial intelligence. Predicted answer: Targeted marketing.
(Relation: Cause_By) Context: Given expectations for more volatile equity, as well as some disruption as Brexit moves forward, it remains doubtful that flows will improve, a negative 3%-5% annual organic growth... Question: Why a negative 3%-5% annual organic growth happened? Gold answer: Given ... as well as some disruption as Brexit moves forward. Predicted answer: It remains doubtful that flows will improve.

Table 8 presents a qualitative analysis for Causal QA, where we highlight the question and answer parts extracted from the raw context. Human labelers label the gold answers, while the BERT-based model generates the output answers. The first two questions are answered correctly, while the last two instances show two typical error-prone patterns. In the first incorrect example, the model outputs "targeted marketing" using the keyword "through" but fails to give the gold answer "analyzing the data and applying artificial intelligence".
This could be because the model fails to distinguish between the same word appearing in two different positions. The last example shows that the model tends to output the answer closest to the question in the context instead of considering the whole sentence; the real reasons, "equity and credit markets" and "Brexit", are ignored because they are relatively far from the question position.

We explored the efficacy of current state-of-the-art methods for causal reasoning tasks by considering a novel fine-grained reasoning setting and developing a dataset with rich human labels. Experimental results using state-of-the-art pre-trained language models provide evidence that there is much room for improvement on causal reasoning tasks, and a need to design better solutions, beyond correlation discovery, for event causality analysis and Why/What-if QA tasks.

This paper honors the ACL Code of Ethics. Publicly available financial analysis reports are used to extract fine-grained event relationships. No private data or non-public information was used. All annotators received labor fees corresponding to the amount of corpus they annotated. The code and data are open-sourced under the Creative Commons Attribution-NonCommercial-ShareAlike (CC-BY-NC-SA) license.

It has been shown that sector-relevant features from a given domain can become spurious patterns in other domains, leading to performance decay under distribution shift (Ovadia et al., 2019). We use instances from the three sectors with the largest numbers of samples in our dataset to conduct out-of-domain generalization tests. These observations are in line with recent work revealing that current deep neural models mostly memorize training instances yet struggle to predict on out-of-distribution data (Gururangan et al., 2018; Kaushik et al., 2020; Srivastava et al., 2020). To evaluate whether, and to what extent, methods can generalize to out-of-distribution data, we show the results of the out-of-domain test in Table 9 and Table 10. In particular, the model achieves the best performance when the training and test sets are extracted from articles about companies in the same domain. In the out-of-domain test, the model shows varying degrees of performance decay on both tasks. For example, in the fine-grained causality classification task, the model trained on data from the Consumer Cyclical domain achieves an F1 score of 59.69 when tested on Consumer Cyclical data, dropping to 47.95 when tested on Technology companies. Moreover, in the cause-effect extraction task, the model trained on data from the Consumer Cyclical domain achieves an F1 score of 86.26 when tested on itself, dropping to 85.20 when tested on Technology. This shows that the domain-relevant patterns learned by the model do not transfer well between domains.

References
What does this word mean? Explaining contextualized embeddings with natural language definition
Question directed graph attention network for numerical reasoning over text
Probing into the root: A dataset for reason extraction of structural events from financial documents
Routledge, and William Yang Wang. 2021b. FinQA: A dataset of numerical reasoning over financial data
MuTual: A dataset for multi-turn dialogue reasoning
BERT: Pre-training of deep bidirectional transformers for language understanding
ExCAR: Event graph knowledge enhanced explainable causal reasoning
DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
2021. Proceedings of the 3rd Financial Narrative Processing Workshop. Association for Computational Linguistics
Causal inference in natural language processing: Estimation, prediction, interpretation and beyond
Modeling document-level causal structures for event causal relation identification
Intra-ensemble in neural networks
SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning
Annotation artifacts in natural language inference data
HeadlineCause: A dataset of news headlines for detecting causalities
Cosmos QA: Machine reading comprehension with contextual commonsense reasoning
DAGN: Discourse-aware graph network for logical reasoning
SpanBERT: Improving pre-training by representing and predicting spans
Learning the difference that makes a difference with counterfactually-augmented data
Explaining the efficacy of counterfactually augmented data
Answering binary causal questions: A transfer learning based approach
Text and causal inference: A review of using text to remove confounding from causal estimates
Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing
Convolutional neural networks for sentence classification
RACE: Large-scale reading comprehension dataset from examinations
Why machine reading comprehension models learn shortcuts?
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
Question and answer test-train overlap in open-domain question answering datasets
Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021b. PAQ: 65 million probably-asked questions and what you can do with them. arXiv
CausalBERT: Injecting causal knowledge into pre-trained models with minimal supervision
Constructing narrative event evolutionary graph for script event prediction
Guided generation of cause and effect
LogiQA: A challenge dataset for machine reading comprehension with logical reasoning
Challenges in generalization in open domain question answering
RoBERTa: A robustly optimized BERT pretraining approach. arXiv e-prints
Hugues de Mazancourt, and Mahmoud El-Haj. 2020a. The financial document causality detection shared task
Yagmur Ozturk, Hanna Abi Akl, and Hugues de Mazancourt. 2020b. Data processing and annotation schemes for FinCausal shared task
Annotating causality in the TempEval-3 corpus
CATENA: Causal and temporal relation extraction from natural language texts
Counterfactuals and causal inference
Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift
Causality
Exploring the limits of transfer learning with a unified text-to-text transformer
Know what you don't know: Unanswerable questions for SQuAD
SQuAD: 100,000+ questions for machine comprehension of text
Choice of plausible alternatives: An evaluation of commonsense causal reasoning
Learning neural causal models with active interventions
Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward causal representation learning
Robustness to spurious correlations via human annotations
What makes reading comprehension questions easier?
Assessing the benchmarking capacity of machine reading comprehension datasets
DREAM: A challenge data set and models for dialogue-based reading comprehension
Force dynamics in language and cognition
Completeness-aware rule learning from knowledge graphs
Attention is all you need
Can generative pre-trained language models serve as knowledge bases for closed-book QA?
Expressing causation in English and other languages
Models of causation and the semantics of causal verbs
LUKE: Deep contextualized entity representations with entity-aware self-attention
HotpotQA: A dataset for diverse, explainable multi-hop question answering
Hierarchical attention networks for document classification
Big Bird: Transformers for longer sequences

The annotation platform used in this work is shown in Fig. 5. Below, we provide the detailed annotation instructions used for training the human labelers, and we show the annotation of some real examples stored in our dataset.

This is an annotation task related to event causality. In this task, you are asked to find all the cause-effect pairs and the fine-grained event relationship types in the given passages.
1. Please read your assigned examples carefully.
2. A sentence is considered to contain causality if at least two events occur in it and the two events are causally related.
3. If a sentence contains causality, mark it as Positive; otherwise, mark it as Negative.
4. For a positive sentence, first find all the events that occur in the sentence, and then pair the events to see if they constitute a causal relationship. The relationship must be one of Cause, Enable, and Prevent.
5. "A causes B" means B always happens if A happens. "A enables B" means A is a possible way for B to happen, but not necessarily. "A prevents B" means A and B cannot happen at the same time.
6. Remember to annotate all event causality pairs. If there are no more pairs, proceed to the next passage.

Here are some annotation examples; please read them before starting your annotation.

Example 1: Moreover, we do not think that DBK's investment banking operation has the necessary scale and set-up to outcompete peers globally or within Europe.
Answer: Negative.
Explanation: This sentence contains no causal relationship between events.

Example 2: In our view, customers are likely to stay with VMware because of knowledge of its product ecosystem as well as the risks and complexities associated with changing virtual machine providers.
Answer: Positive.
Explanation: This is a causal sentence. There are two events (highlighted in yellow on the annotation platform). You should first annotate the two events and then label their relationship using one of Cause, Enable, and Prevent. Here the relationship is Cause.

Example 3: Depressed realized prices due to lack of market access have forced capital spending cuts, stalling the growth potential of the company's oil sands assets.
Answer: Positive.
Explanation: This is a causal sentence with four events. You need to mark all four events and then pair them up to see whether they are related; if so, determine what kind of relationship they belong to.