Event Time Extraction with a Decision Tree of Neural Classifiers

Nils Reimers†, Nazanin Dehghani‡∗, Iryna Gurevych†
† Ubiquitous Knowledge Processing Lab (UKP) and Research Training Group AIPHES, Department of Computer Science, Technische Universität Darmstadt
‡ School of Electrical and Computer Engineering, University of Tehran
www.ukp.tu-darmstadt.de
∗ Work done during the author's internship in the research training group AIPHES at UKP Lab, TU Darmstadt.

Abstract

Extracting from text the information about when an event happened is challenging. Documents report not only on current events but also on past and future events, and the relevant time information for an event is often scattered across the document. In this paper we present a novel method to automatically anchor events in time. To our knowledge, it is the first approach that takes temporal information from the complete document into account. We created a decision tree that applies neural network based classifiers at its nodes. We use this tree to incrementally infer, in a stepwise manner, the time frame in which an event happened. We evaluate the approach on the TimeBank-EventTime Corpus (Reimers et al., 2016), achieving an accuracy of 42.0% compared to an inter-annotator agreement (IAA) of 56.7%. For events that span a single day we observe an accuracy improvement of 33.1 percentage points compared to the state-of-the-art CAEVO system (Chambers et al., 2014). Without retraining, we apply this model to the SemEval-2015 Task 4 on automatic timeline generation and achieve an improvement of 4.01 points F1-score compared to the state-of-the-art. Our code is publicly available at https://github.com/ukplab/tacl2017-event-time-extraction.

1 Introduction

Knowing when an event happened is useful for many applications, for example in time-aware information retrieval, text summarization, automated timeline generation, and automatic knowledge base population. Many facts in a knowledge base are only true for a certain time period, for example the presidency of a person. Hence, knowledge base population can benefit substantially from high-quality event and event time extraction (Surdeanu, 2013); throughout this paper, we refer to the temporal information about when an event happened as the event time. Inherent to events is their connection to time: Allan (2002) defines an event as "something that happens at some specific time and place".

The challenges for automatic event time extraction are manifold. The temporal information in news articles that states when an event happened is, in most cases, neither in the same sentence as the event nor in a neighboring sentence (Reimers et al., 2016). It can be mentioned far before or far after the event mention. Even worse, for more than 60% of events, the specific day on which the event happened is not mentioned at all. However, from world knowledge and causal relations, the reader can infer a lot of temporal information about those events and can often infer that the event happened before or after some specific point in time.

In this paper we describe a new classifier for automatic event time extraction. We use the TimeBank-EventTime Corpus (Reimers et al., 2016) to train and evaluate our proposed architecture. In contrast to other corpora on temporal relations, the annotation of the TimeBank-EventTime Corpus does not restrict where, and in which form, temporal information for an event must be provided. The annotators were allowed to take the whole document into account and were asked to answer, to the best of their ability, the question at which date or in which time period the event happened.
The event time annotation for some sample events is shown in the following:

• He was [sent](1980-05-26) into space on May 26, 1980. He [spent](beginPoint=1980-05-26, endPoint=1980-06-01) six days aboard the Salyut 6 spacecraft.

• [...] two areas [expected](beginPoint=before 1998-02-06, endPoint=before 1998-02-06) to be hardest [hit](after 1998-01-01 before 1998-01-31) when the effects of the Asian crisis [...].

This annotation imposes several challenges for an automatic approach:
1. The number of possible labels is infinite, as date values are part of the labels.
2. Due to the diverse types of events and the varying temporal information for events, the structure of the labels varies.
3. Temporal information from the whole document must be taken into account.
4. For 12.6% of the events, the event time label is a combination of several temporal clues. For example, the annotator may combine the information that a person went missing on the 15th with the information that the person went missing in August, even though the 15th of August is nowhere explicitly mentioned in the text.

The main contribution of this paper is a novel combination of a decision tree with neural network classifiers at its nodes to solve the afore-mentioned challenges. To our knowledge, this is the first system that works on the complete document and can extract long-range relations between events and temporal expressions. Further, it is the first system that focuses on extracting begin and end points for events that span multiple days. Evaluated on the TimeBank-EventTime Corpus (Reimers et al., 2016), it achieves an accuracy of 42.0% compared to an inter-annotator agreement (IAA) of 56.7%. Compared to the state-of-the-art CAEVO system (Chambers et al., 2014), we observe a substantial improvement in accuracy of 33.7 percentage points for events that happened on a single day. For Multi-Day Events, we observe an accuracy of 24.3% using a strict metric.

We show that the proposed model generalizes well to new tasks and textual domains. We applied it without re-training to the SemEval-2015 Task 4 on automatic timeline generation. There, it achieves an improvement of 4.01 points F1-score compared to the state-of-the-art.

2 Related Work

We start with a review of common annotation schemes for capturing temporal information for events in documents. Afterwards, we present related work on automatically extracting temporal information for events.

2.1 Annotation of Events and Temporal Information

One of the most widely used specifications for events and temporal expressions is TimeML (Saurí et al., 2004). It provides specifications for the annotation of events, temporal expressions, and the temporal links (TLINKs) between them. An event is defined as a term for situations that happen or occur. Temporal expressions, such as times, dates, or durations, are annotated and their temporal values are normalized using the definitions of Ferro (2002). A TLINK is a relation between two events, between an event and a temporal expression, or between two temporal expressions.
TimeML defines 14 different relation types; however, most corpora that use the TimeML specification restrict the relations to a smaller set.

A prominent corpus using the TimeML specifications is the TimeBank Corpus (Pustejovsky et al., 2003), which was also the basis for the three shared tasks TempEval-1 (Verhagen et al., 2007), TempEval-2 (Verhagen et al., 2010), and TempEval-3 (UzZaman et al., 2013).

A drawback of TLINKs is the quadratic growth of possible TLINKs with the number of events and temporal expressions, resulting in more than 10,000 possible TLINKs for several documents in the TimeBank Corpus. As the annotation of such a large number of TLINKs would be impractical, their annotation is always restricted in some form. For the TimeBank Corpus, only salient TLINKs were annotated. Which links are salient is not well defined, and a low agreement between annotators can be observed. The three TempEval shared tasks tried to improve the coverage and added some further temporal links for mentions in the same sentence. Denser annotations were applied by Bramsen et al. (2006), Kolomiyets et al. (2012), Do et al. (2012), and Cassidy et al. (2014). While Bramsen et al., Kolomiyets et al., and Do et al. only annotated some temporal links, Cassidy et al. annotated all Event-Event, Event-Time, and Time-Time pairs in the same sentence as well as in the directly succeeding sentence, leading to the densest annotation for the TimeBank Corpus. They used six different relation types: BEFORE, AFTER, INCLUDES, IS INCLUDED, SIMULTANEOUS, and VAGUE, where VAGUE encodes that the annotators were not able to make a statement on the temporal relation of the pair.

2.2 Existing Event Time Extraction Systems

Most automatic approaches use the previously introduced TLINKs to train and evaluate systems for extracting temporal information about events. For a new document, the system first extracts the temporal relations between events and temporal expressions. In a post-processing step, those TLINKs are used to retrieve the information when an event happened.

Extracting the relations is often formulated as a pair-wise classification task: each pair of events and/or temporal expressions is examined and classified according to the available relation classes. Ensuring transitivity is a big challenge when the task is formulated this way. One simple but nonetheless frequently used solution is to automatically infer all temporal relations that can be derived from transitivity. Some systems have tried to take advantage of global information to ensure transitivity using Markov logic networks or integer linear programming (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; UzZaman and Allen, 2010). However, the gains were minor.

Chambers et al. (2014) propose CAEVO, a sieve-based architecture that blends multiple classifiers into a precision-ranked cascade of sieves. The system was trained and evaluated on the TimeBank-Dense Corpus and creates a dense TLINK annotation for all pairs of events and/or temporal expressions in the same and in adjacent sentences. The code is publicly available at http://www.usna.edu/Users/cs/nchamber/caevo/.

A bottleneck of current systems is the limitation of TLINKs to pairs that are in the same or in adjacent sentences. According to Reimers et al. (2016), 28.3% of the events happen at the document creation time (DCT). For the remaining 71.7% of events, the event time must be inferred via TLINKs.
However, for 58.7% of those events, the most informative temporal expression, defined as the temporal expression that gives the reader the information at which date, or in which time frame, the event happened, is neither in the same sentence nor in the previous or next sentence. In conclusion, for 42.1% of all the events in a text, it would be necessary to take long-range TLINKs into account to correctly retrieve the event time. Extending existing systems to take long-range relations into account is difficult due to a lack of training and evaluation data.

3 Event Time Annotation

We use the TimeBank-EventTime Corpus (Reimers et al., 2016) to evaluate our architecture for automatic event time extraction. The TimeBank-EventTime Corpus does not use the concept of TLINKs; instead, for every event, the annotators were asked to anchor the event in time as precisely as possible.

The annotation distinguishes between events that happened on a Single Day and Multi-Day Events that span multiple days. For Single Day Events, the annotators provide the day the event happened in the format YYYY-MM-DD. In case the exact date is not mentioned in the document, the annotators were asked to anchor the event in time as precisely as possible using the notations before YYYY-MM-DD and after YYYY-MM-DD. Before denotes that the event must have happened before the stated date, and after denotes that the event must have happened after the stated date. A combination of before and after is possible. For Multi-Day Events, the annotators were asked to provide the begin and the end point of the event. As for Single Day Events, they were allowed to use the before and after notation in case the explicit begin/end point is not mentioned in the document.

The annotated corpus contains news articles and TV broadcast transcripts from various sources, written mainly between January and April 1998. The shortest document has five sentences, the longest 63 sentences. A label distribution can be found in Reimers et al. (2016).

4 Automatic Event Time Extraction

In this section we first present our hierarchical tree approach to automatically infer the event times in a document. In Section 4.3 we present two baselines that we use for comparison: the first uses dense TLINKs extracted by the CAEVO system, and the second is a reduced version of the presented tree approach.

4.1 Event Time Extraction using Trees

We use the tree structure depicted in Figure 1 to extract the event time for a given target event. The structure was inspired by how the annotators labeled the data. When annotating the text, the first decision is typically whether the event is a Single Day Event or a Multi-Day Event. If it is a Single Day Event, the next question is whether the event happened at the Document Creation Time (DCT) or not. As the annotated data comes from the news domain, a large fraction of events (48.28% of the Single Day Events) happened at the document creation time. If the event did not happen at the DCT, the annotator scanned the text to decide whether the date on which the event happened is explicitly mentioned or not. If it is not mentioned, the annotator used the before and after notation to define the time frame in which the event happened as precisely as possible. For Multi-Day Events, the process is similar for determining the begin and end point of the event.
The first classifier is a binary classifier that decides whether the event is a Single Day or a Multi-Day Event. In case it is a Single Day Event, the next classifier decides the relation between the event and the Document Creation Time (DCT). If the event happened at the DCT, the architecture stops. If the event happened before or after the DCT, the next classifier is invoked to detect which temporal expressions are relevant. For all relevant temporal expressions, it is then determined whether the event happened simultaneously with, before, or after the temporal expression. The final step (2.4) outputs a single event time by narrowing down the information it receives from the relation to the DCT (2.1) and the pool of relevant temporal expressions and relations (2.3).

For Multi-Day Events the process is similar; however, the system must return the begin and the end point. The system runs three processes in parallel: it extracts the relations to relevant time expressions for the begin point (3.1.1 and 3.1.2); it extracts the relation to the DCT (3.2); and it extracts the relations to relevant time expressions for the end point (3.3.1 and 3.3.2). There are three possible relations between a Multi-Day Event and the DCT: the event started and ended before the DCT; it started and ended after the DCT; or it started before the DCT and ended after the DCT. This information is taken into account in steps 3.1.3 and 3.3.3 when producing a single begin point and end point for the given event.

Figure 1: Tree structure used to extract the temporal information for an event. Rectangles are local classifiers based on deep convolutional neural networks, except for the Narrow Down rectangles, which are simple rule-based classifiers.

4.2 Local Classifiers

This section describes the different local classifiers applied in our tree structure. For all except the Narrow Down classifier, we used the convolutional neural network architecture (LeCun, 1989) depicted in Figure 2. The Narrow Down classifier is a simple, hand-crafted, rule-based classifier described in Section 4.2.6.

4.2.1 Neural Network Architecture

We use the same neural network architecture with slightly different configurations for the different local classifiers. The architecture is depicted in Figure 2 and is described in the following sections. It is based on the design proposed by Zeng et al. (2014), which achieves state-of-the-art performance on relation classification tasks (Zeng et al., 2014; dos Santos et al., 2015). The neural network applies a convolution over the word representations and position embeddings of the input text, followed by a max-over-time pooling layer. We call the output of this layer the Input Text Features. Those Input Text Features are merged with the word embeddings of the event and time expression tokens. The merged input is fed into a hidden layer using either the hyperbolic tangent tanh(·) or a rectified linear unit (ReLU) as activation function; the choice of activation function is a hyperparameter and was optimized on a development set. The final layer is either a single sigmoid neuron, in the case of binary classification, or a softmax layer. To avoid overfitting, we used two dropout layers (Srivastava et al., 2014), the first before the dense hidden layer and the second after it. The dropout percentages were set as hyperparameters.

Figure 2: The neural network architecture used for the different local classifiers.

Word Embeddings. We used the pre-trained word embeddings presented by Levy and Goldberg (2014). The embedding layer of our neural networks maps each token from the input text to its respective word embedding. Out-of-vocabulary tokens are replaced with a special UNKNOWN token, for which the word embedding was randomly initialized.
Position Embeddings. Collobert et al. (2011) propose the use of position embeddings to keep track of how close words in the input text are to certain target words. For each input text, we specify certain words as targets. For example, we specify the event and the temporal expression as target words and train the network to learn the temporal relation between them. Each word in the input text is then augmented with its relative distances to the targets. Let $pos_1, pos_2, \ldots$ be the positions of the target words in the input text. Then, a word at position $j$ is augmented with the features $j - pos_1, j - pos_2, \ldots$. These augmented position features are then mapped in the embedding layer to a randomly initialized vector. The dimension of this vector is a hyperparameter of the network.

The word embeddings and the position embeddings are concatenated to form the input for the convolutional layer. In the case of two target words, the input for the convolutional layer is

$$\mathrm{emb}_{\mathrm{output}} = \{[we_{w_1}, pe_{1-pos_1}, pe_{1-pos_2}],\ [we_{w_2}, pe_{2-pos_1}, pe_{2-pos_2}],\ \ldots,\ [we_{w_n}, pe_{n-pos_1}, pe_{n-pos_2}]\}$$

with $we_{w_j}$ the embedding of the $j$-th word in the input text and $pe_{j-pos_k}$ the embedding for the distance between the $j$-th word and the target word $k$.

Convolutional & Max-Over-Time Layer. A challenge for the classifier is the variable length of the input text and the fact that important information can appear anywhere in it. To tackle this issue, we use a convolutional layer to compute a distributed vector representation of the input text. Let us define a vector $x_k$ as the concatenation of the word and position embeddings for position $k$ as well as for the $m$ positions to the left and to the right:

$$x_k = \big([we_{w_{k-m}}, pe_{(k-m)-pos_1}, pe_{(k-m)-pos_2}] \,\|\, \ldots \,\|\, [we_{w_k}, pe_{k-pos_1}, pe_{k-pos_2}] \,\|\, \ldots \,\|\, [we_{w_{k+m}}, pe_{(k+m)-pos_1}, pe_{(k+m)-pos_2}]\big)$$

The convolutional layer multiplies all $x_k$ by a weight matrix $W_1$ and applies the activation function component-wise. After that, a max-over-time is applied, i.e., the max-function is applied component-wise. The $j$-th entry of the convolutional and max-over-time layer output is defined as:

$$[\mathrm{conv}_{\mathrm{output}}]_j = \max_{1 \le k \le n}\,[\tanh(W_1 x_k)]_j$$
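To make these two operations concrete, the following NumPy sketch (not the authors' code; the window size, filter count, and random inputs are placeholders) computes the Input Text Features exactly as defined above:

import numpy as np

def conv_max_over_time(emb, W1, m=1):
    """emb: (n, d) concatenated word+position embeddings; W1: (filters, (2m+1)*d)."""
    n, d = emb.shape
    # pad so that every position k has m neighbors on both sides
    padded = np.vstack([np.zeros((m, d)), emb, np.zeros((m, d))])
    # x_k: concatenation of the embeddings in the window around position k
    windows = np.stack([padded[k:k + 2 * m + 1].reshape(-1) for k in range(n)])
    # apply the filter matrix W1 and the tanh activation to every window
    conv = np.tanh(windows @ W1.T)              # shape (n, filters)
    # max-over-time: component-wise maximum over all positions
    return conv.max(axis=0)                     # shape (filters,)

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 310))    # e.g., 300-dim word + two 5-dim position embeddings
W1 = rng.normal(size=(100, 3 * 310))
text_features = conv_max_over_time(emb, W1, m=1)
print(text_features.shape)         # (100,)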
Lexical Features. Previous approaches rely heavily on lexical features. For example, the CAEVO system (Chambers et al., 2014) uses, for the classification of event-time edges, the token, the lemma, the POS tag, the tense (simple, perfect, or progressive), the grammatical aspect (in TimeBank: past, present, future), and the class of the event (in TimeBank: occurrence, perception, reporting, aspectual, state, i_state, i_action), as well as the parse tree between event and time expression. In our evaluation, we did not observe that these features have a significant impact on the performance. Hence, we decided to use the event and time tokens as the only features besides the dense vector representation of the input text. For multi-token expressions, we only use the first token. Our architecture focuses on extracting the event time when event annotations and temporal expressions are provided. In order to evaluate the accuracy of this isolated step, we decided to use the provided annotations in the corpus. The baselines we compared against use these gold annotations as well.

Output. The distributed vector representation of the input text and the embeddings of the event/time tokens are concatenated and passed through a dense layer. As the activation function, we allowed either the hyperbolic tangent or the rectified linear unit (ReLU); the choice is a parameter of the network. The final layer is either a single sigmoid neuron, in the case of binary classification, or a softmax layer that computes the probabilities of the different tags.

4.2.2 Single vs. Multi-Day Event Classification

The first local classifier, which decides whether an event is a Single Day Event or a Multi-Day Event, uses the event word as the target word.

4.2.3 DCT Classification

A Single Day Event can happen before the document was created (Before class), on the same day (Simultaneous class), or at least one day after the document was created (After class). The configuration of this local classifier is as in the previous section. Note that, to classify the relation to the DCT, it was in most cases not important to know the concrete Document Creation Time. Therefore, we did not pass the DCT as a value to the network. For Multi-Day Events, we decided to group the events into three categories: first, events that began and ended before the Document Creation Time (Before class); second, events that began before the DCT and ended after it (Includes class); and third, events that will begin and end after the DCT (After class).

4.2.4 Detecting Relevant Time Expressions

If the event did not happen at the DCT, it is important to take the surrounding text, and potentially the whole document, into account to figure out at which date the event happened. For our classifier, we assume that temporal expressions are already detected in the document. To detect temporal expressions, tools like HeidelTime (https://github.com/HeidelTime) can be used, which achieves an F1-score of 0.919 on extracting temporal expressions in the TimeBank Corpus (Strötgen and Gertz, 2015).

As an intermediate step towards detecting when an event happened, we first decide whether a temporal expression is relevant for the event or not. We define a temporal expression to be relevant if the (normalized) value of the temporal expression is part of the event time annotation. The value of the temporal expression can either be the event time itself, or it can appear in the before or after notation. The classifier is executed for all event-temporal expression pairs. The input text for the distributed text representation is the text between the event and the temporal expression.

4.2.5 Temporal Relation Classification

Given the relevant temporal expressions for an event from the previous step, the next local classifier establishes the temporal relation between the event and the temporal expression. For a given relevant event-temporal expression pair, it outputs BEFORE when the event happened before the temporal expression, AFTER when it happened after it, or SIMULTANEOUS when it happened on the mentioned date. This local classifier has the same configuration as the network used to detect relevant temporal expressions.
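As an illustration of how the shared architecture from Section 4.2.1 could be assembled, the following sketch uses the Keras functional API. It reflects our reading of the description, not the released implementation; all layer sizes, the window size, and the dropout rates are placeholder hyperparameters, and relative distances are assumed to be shifted into a non-negative range before the embedding lookup.

from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Dense, Dropout, Flatten, concatenate)
from keras.models import Model

max_len, vocab_size = 100, 50000                 # placeholder sizes
word_dim, pos_dim, n_filters, n_classes = 300, 10, 100, 3

text      = Input(shape=(max_len,), name="input_text")     # token ids
dist_ev   = Input(shape=(max_len,), name="dist_to_event")  # shifted distances to the event
dist_tx   = Input(shape=(max_len,), name="dist_to_timex")  # shifted distances to the timex
event_tok = Input(shape=(1,), name="event_token")          # first token of the event
time_tok  = Input(shape=(1,), name="time_token")           # first token of the time expression

word_emb = Embedding(vocab_size, word_dim)                  # pre-trained in the paper
pos_emb  = Embedding(2 * max_len, pos_dim)                  # one shared table is a simplification

# word + position embeddings per token, convolution, then max-over-time pooling
x = concatenate([word_emb(text), pos_emb(dist_ev), pos_emb(dist_tx)])
x = Conv1D(n_filters, kernel_size=3, activation="tanh")(x)  # window of m = 1 to each side
input_text_features = GlobalMaxPooling1D()(x)

# merge the Input Text Features with the event and time token embeddings
merged = concatenate([input_text_features,
                      Flatten()(word_emb(event_tok)),
                      Flatten()(word_emb(time_tok))])
h = Dropout(0.25)(merged)
h = Dense(100, activation="tanh")(h)                        # tanh vs. ReLU is a tuned hyperparameter
h = Dropout(0.25)(h)
out = Dense(n_classes, activation="softmax")(h)             # a single sigmoid unit for binary nodes

model = Model(inputs=[text, dist_ev, dist_tx, event_tok, time_tok], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")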
4.2.6 Narrow Down Classifier

The goal of the Narrow Down classifier, which is used in steps 2.4, 3.1.3, and 3.3.3 in Figure 1, is to derive the final label given the information on the relevant temporal expressions, their relation to the event, and the relation to the document creation time. For most events in the corpus, this information was unambiguous, e.g., only one temporal expression was classified as relevant for the event. The proposed approach returns multiple relevant temporal expressions only for a small fraction of events; this number was too small to train and validate a learning algorithm for this stage. Hence, we decided to implement a straightforward, rule-based classifier, depicted in Algorithm 1.

It takes all relations to relevant temporal expressions as well as the relation to the Document Creation Time to derive the final output. If a SIMULTANEOUS relation exists, the classifier stops and the corresponding temporal expression is used as the event time. If no such relation exists, a frequency distribution of the linked dates is created for BEFORE as well as for AFTER relations. For example, when the system extracts three relevant BEFORE relations to different mentions of date1 throughout the text and two relevant BEFORE relations to different mentions of date2, the system chooses date1 as the slot filler for the before property. If there are as many relevant BEFORE relations for date1 as for date2, the system chooses the lowest (earliest) date for the before property (lines 13-18). For AFTER relations, we use the same logic, except that we choose the largest (latest) date (line 23).

Algorithm 1 Narrow Down Classifier
 1: function NARROWDOWN(times)
 2:   fd_before, fd_after = FreqDistribution()
 3:   for [relation, time] in times do
 4:     if relation is SIMULTANEOUS then
 5:       return time
 6:     else if relation is BEFORE then
 7:       fd_before.new_sample(time)
 8:     else if relation is AFTER then
 9:       fd_after.new_sample(time)
10:     end if
11:   end for
12:   // fd_before elements have the fields .num = #samples and .time = time value
13:   if fd_before.size > 0 then
14:     // find the largest number of samples of a time
15:     max_samples = fd_before.max(_.num)
16:     // take the minimum over all times having max_samples
17:     before_time = fd_before.filter(_.num == max_samples).min(_.time)
18:   end if
19:   if fd_after.size > 0 then
20:     // find the largest number of samples of a time
21:     max_samples = fd_after.max(_.num)
22:     // take the maximum over all times having max_samples
23:     after_time = fd_after.filter(_.num == max_samples).max(_.time)
24:   end if
25:   return after + after_time + before + before_time
26: end function
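For illustration, Algorithm 1 can be rendered as the following small, runnable Python function (a re-implementation sketch based on the pseudocode, not the released code). Dates are handled as ISO-formatted strings, so lexicographic min/max coincides with the earliest/latest date.

from collections import Counter

def narrow_down(times):
    """times: list of (relation, date) pairs, e.g. [("BEFORE", "1998-02-06"), ...]."""
    fd_before, fd_after = Counter(), Counter()
    for relation, time in times:
        if relation == "SIMULTANEOUS":
            return time                          # an exact date wins immediately
        elif relation == "BEFORE":
            fd_before[time] += 1
        elif relation == "AFTER":
            fd_after[time] += 1

    label = ""
    if fd_after:
        top = max(fd_after.values())             # most frequently linked dates
        label += "after " + max(t for t, c in fd_after.items() if c == top)
    if fd_before:
        top = max(fd_before.values())
        label += (" " if label else "") + "before " + min(
            t for t, c in fd_before.items() if c == top)
    return label

print(narrow_down([("BEFORE", "1998-02-06"), ("BEFORE", "1998-02-06"),
                   ("AFTER", "1998-01-01")]))
# -> 'after 1998-01-01 before 1998-02-06'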
4.3 Baselines

We use two baselines against which we compare our system. The first baseline is the system presented in Reimers et al. (2016). It is based on the multi-pass architecture CAEVO introduced by Chambers et al. (2014) and extracts all TLINKs between event mentions and temporal expressions. The system by Chambers et al. applies multiple rules and trained classifiers to extract those TLINKs; the different stages are ranked by precision and executed consecutively. A shortcoming of the system is that it does not produce temporal information for events that lasted more than a day. Hence, it cannot be used to distinguish between Single Day and Multi-Day Events, nor can it extract the begin/end points for Multi-Day Events. This baseline uses the extracted relations for Single Day Events and generates the set of tuples in which the event is involved. We use the Narrow Down classifier from Section 4.2.6 to extract the final label. When all extracted relations are of type VAGUE, the baseline returns that it cannot infer the time for the event.

The second baseline is a reduced version of the hierarchical tree. For this baseline, we first apply the classifier that decides whether an event is a Single Day or a Multi-Day Event. When it is a Single Day Event, we classify the relation to the document creation time (DCT) (classifier 2.1). When the event did not happen at the DCT, we link it to the closest temporal expression in the document. For Multi-Day Events, we only run classifier 3.2 to extract the relation to the DCT. When the event happened before the DCT, we set the begin and end point to BEFORE DCT; when it happened after the DCT, we set both to AFTER DCT; and when the relation was Includes, we set the begin point to BEFORE DCT and the end point to AFTER DCT.

5 Experimental Setup

We conduct our experiments on the TimeBank-EventTime Corpus (Reimers et al., 2016). The corpus comprises 36 documents and 1498 annotated events. We use the same split into training, development, and test set as Chambers et al. (2014), resulting in 22 documents for training, 5 documents for hyperparameter optimization, and 9 documents for the final evaluation. Using this split allows a fair comparison to the CAEVO system. Hyperparameters for the individual local classifiers were chosen using random search (Bergstra and Bengio, 2012) with at least 1000 iterations per local classifier.

6 Experimental Results

We evaluate our system using two different metrics. The strict metric requires an exact match between the predicted label and the gold label. A disadvantage of this metric is that it does not allow partial agreement; the strict agreement between two annotators is fairly low for events whose exact date was not mentioned.

In order to allow partial matches, we also use a relaxed metric, which judges two different labels as an error only if they are mutually exclusive. Two labels are mutually exclusive if there is no event date that could satisfy both labels at the same time. If the event happened on August 5th, 1998, the two annotations before 1998-08-31 and after 1998-08-01 before 1998-08-31 would both be satisfied; these two different labels would therefore be considered correct. In contrast, the two annotations after 1998-02-01 and before 1997-12-31 can never be satisfied at the same time and are therefore mutually exclusive.

The score of the relaxed metric must be seen in combination with the strict metric: a system could trick the relaxed metric by returning a before date that lies far in the future, which results in a high relaxed score but a negligible strict score. Future research is necessary to judge the quality of different kinds of partial matches and to design an appropriate metric.
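As an illustration of how such a check can be implemented for Single Day labels, the following sketch maps each label to a date interval and tests whether the intersection is empty. It reflects our reading of the metric, not the official evaluation code, and treating the before/after bounds as exclusive is an assumption.

from datetime import date, timedelta

def bounds(label):
    """Map a label such as 'after 1998-01-01 before 1998-01-31' to (lower, upper)."""
    tokens = label.split()
    if tokens[0] not in ("after", "before"):           # exact date, e.g. '1998-08-05'
        d = date.fromisoformat(tokens[0])
        return d, d
    lower, upper = date.min, date.max
    for key, value in zip(tokens[0::2], tokens[1::2]):
        d = date.fromisoformat(value)
        if key == "after":                              # strictly after (assumed)
            lower = max(lower, d + timedelta(days=1))
        elif key == "before":                           # strictly before (assumed)
            upper = min(upper, d - timedelta(days=1))
    return lower, upper

def mutually_exclusive(label_a, label_b):
    lo_a, up_a = bounds(label_a)
    lo_b, up_b = bounds(label_b)
    return max(lo_a, lo_b) > min(up_a, up_b)            # empty intersection

print(mutually_exclusive("before 1998-08-31", "after 1998-08-01 before 1998-08-31"))  # False
print(mutually_exclusive("after 1998-02-01", "before 1997-12-31"))                    # True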
6.1 System Performance

Following the recommendations of Reimers and Gurevych (2017), we train the system with 25 different random seed values and compute the mean performance score and the standard deviation. Table 1 shows the results in comparison to the observed inter-annotator agreement (IAA). The inter-annotator agreement is based on two full annotations of the corpus; the chance-corrected agreement is α = 0.617 using Krippendorff's α (Krippendorff, 2004). The two annotations were merged into a final gold label annotation of the corpus, which we used for training and evaluation.

                         System           IAA
Single vs. Multi-Day     78.2% ± 1.33     81.8%
Single Day (Strict)      74.6% ± 1.04     80.5%
Single Day (Relaxed)     92.5% ± 0.60     98.0%
Multi-Day (Strict)       24.5% ± 1.61     52.0%
  Begin (Strict)         28.5% ± 0.73     63.8%
  End (Strict)           66.5% ± 1.02     74.9%
Multi-Day (Relaxed)      74.6% ± 0.55     94.6%
  Begin (Relaxed)        94.9% ± 0.38     98.6%
  End (Relaxed)          80.2% ± 0.73     96.1%
Overall Acc. (Strict)    42.0% ± 1.21     56.7%
Overall Acc. (Relaxed)   84.6% ± 0.71     95.3%

Table 1: Accuracy for the different stages of our system in comparison to the observed inter-annotator agreement (IAA). The strict metric requires an exact match between the labels. The relaxed metric requires that the two annotations are not mutually exclusive.

The accuracy of distinguishing between Single Day and Multi-Day Events is 78.2% on the test set, in comparison to an inter-annotator agreement of 81.8%. The overall performance is 42.0%, compared to an IAA of 56.7% using the strict metric. For Multi-Day Events, we observe an accuracy with the strict metric of 24.5%, compared to an IAA of 52.0%. Breaking this down into begin- and end-point extraction, we observe a much lower accuracy for the begin point extraction of just 28.5%, compared to 66.7% accuracy for the end point extraction. However, using the relaxed metric, we see an accuracy of 94.9% for the begin point and 80.2% for the end point. We can conclude that the extraction of the begin point works well; however, in a large set of cases (66.7%) the extracted begin point is less precise than the gold annotation.

The baseline based on the CAEVO system of Chambers et al. (2014) can only be applied to Single Day Events, as TLINK types that define the start or the end of an event do not exist. We ran this baseline on all events that were correctly identified as Single Day Events. The performance of this baseline is depicted in Table 2. For the proposed approach, we observe a performance increase from 41.2% to 74.6%. For 18.3% of the events, the label retrieved by the proposed approach was less precise than the gold label; an example of a less precise label would be before 1998-12-31 while the gold label was before 1998-08-15. A clearly wrong label was observed for 7.1% of the generated labels.

A big disadvantage of a dense TLINK annotation is the restriction of TLINKs to events and temporal expressions that are in the same, or in adjacent, sentences. For 32.0% of the events, the baseline was not able to infer any event time information. As our system outputs a label for every event, we see a slightly increased number of wrong labels in comparison to the baseline.

Single Day Events    Ours     CAEVO
Exact match          74.6%    41.2%
Less precise         18.3%    21.5%
Wrong label           7.1%     5.4%
Cannot infer time       -     32.0%

Table 2: Distribution of the retrieved labels for the proposed system and for the baseline. Less precise are labels where the time frame in which the event happened is larger than for the gold label. Wrong label are labels which are in clear contradiction to the gold standard.
Table 3 compares the proposed system against the reduced tree that only classifies the type of the event (Single Day or Multi-Day) and the relation to the document creation time. We observe a significant drop in accuracy for Single Day Events, indicating that classifying only the relation to the document creation time is insufficient for this task.

System          SD       MD       Overall
Full system     74.6%    24.3%    42.0%
Reduced tree    40.4%    19.6%    24.2%
CAEVO           41.2%    -        18.1%

Table 3: Comparison of the accuracy (strict metric) for Single Day Events (SD), Multi-Day Events (MD), and overall. Reduced tree uses only the local classifiers 1, 2.1, and 3.2.

6.2 Error Analysis

Error propagation is an important factor in a decision tree. Table 4 shows the accuracy of the different local classifiers, compared to a Majority Vote baseline. For all local classifiers we see a large performance increase over the baseline. We observe the lowest accuracy for the classifiers of the begin point (3.1.1 and 3.1.2). This is in line with the previous observation of the low accuracy for begin point labels as well as with the low IAA for begin point annotations.

The root classifier, which decides whether the event is a Single Day or a Multi-Day Event, is the most critical classifier: it is responsible for 21.7% of the erroneously labeled events. However, with an accuracy of 78.3% it is already fairly close to the IAA of 81.6%, and it is unclear whether this classifier could be improved substantially.

                      System    Majority Vote
1. Event Type         78.3%     54.5%
Single Day Event
  2.1. DCT Rel.       84.2%     55.6%
  2.2. Relevant       79.1%     66.0%
  2.3. Relation       81.0%     72.7%
Multi-Day Event
  3.1. Begin Point
    3.1.1. Relevant   79.0%     68.9%
    3.1.2. Relation   63.1%     42.9%
  3.2. DCT Rel.       65.2%     46.8%
  3.3. End Point
    3.3.1. Relevant   83.8%     65.1%
    3.3.2. Relation   85.1%     79.0%

Table 4: Accuracy for the different local classifiers vs. a Majority Vote baseline. Local classifiers are numbered as depicted in Figure 1.

As mentioned in the introduction, the annotators were not restricted to the dates that are explicitly mentioned in the document but could also create new dates. For example, in the sentence

It's the [second day](1998-03-06) of an [offensive](beginPoint=1998-03-05) ...

it is clear to the annotator that the offensive started on 1998-03-05. However, this date is not explicitly mentioned in the text; only the date 1998-03-06 is mentioned. We call such dates out-of-document dates. Handling those cases is extremely difficult, and our system is currently not capable of creating such out-of-document dates. Table 5 shows the fraction of labels affected by those dates.

                     Out-of-document dates
Single Day Events          3.0%
Multi-Day Events          24.1%
  Begin Point             17.0%
  End Point                9.9%
Overall                   12.6%

Table 5: Percentage of labels in the test set affected by out-of-document dates.

As the table shows, 12.6% of the event time labels are affected by out-of-document dates. An especially high percentage of such dates is observed for the begin point of Multi-Day Events. In many of these cases, the document states either an explicit or a rough estimate of the duration of the event. In the previous example, the text stated that the offensive had already lasted for two days. In other examples, the document gives the information that the event started in recent years or that it lasted for roughly 2 1/2 years.

6.3 Ablation Test

Table 6 presents the changes in accuracy in percentage points when individual components of the proposed system are changed. We observe a slight drop of 2.3 percentage points if bidirectional LSTM networks with 100 recurrent units are used instead of convolutional neural networks. LSTM networks have shown state-of-the-art performance for other NLP tasks; for this task, however, they were not able to improve the performance. One reason could be the comparably small training set of 22 documents.
A further disadvantage of the BiLSTM networks was the significantly longer training time, which prohibited running an extensive hyperparameter tuning.

Configuration              Accuracy
Full system                42.0%
BiLSTM instead of CNN      -2.3
Rnd. word embeddings       -7.7
No input text feature      -9.7
No position feature        -3.9
No narrow down             -1.3

Table 6: Change in accuracy (strict metric) in percentage points when replacing individual components of the architecture.

An important factor for the performance was the pre-trained word embeddings. Replacing those with randomly initialized embeddings decreased the performance by 7.7 percentage points. As before, we think this is due to the small training size: a large number of test tokens do not appear in the training set, and several tokens appear only infrequently in it. Hence, the network is not able to learn meaningful representations for those words.

Our system successfully uses the text between the event and the temporal expression (Input Text Features) for classifying the relation between them. Removing this part of the architecture decreases the accuracy by 9.7 percentage points. Further, it appears that not only the token itself, but also the position of the token relative to the event/time token is taken into account: removing this position information from the input text feature reduces the performance by 3.9 percentage points.

Replacing the Narrow Down classifier with a classifier that randomly selects one of the relevant temporal expressions reduces the performance by only 1.3 percentage points; for most events, only one relevant temporal expression was extracted. We also analyzed the parameter settings of the top five performing local classifiers at each stage. The activation function (tanh vs. ReLU) appears to have a negligible impact on the performance.

6.4 Event Timeline Construction

We evaluated our system on the shared task SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering (Minard et al., 2015). The goal is to construct an event timeline for a target entity given a set of 30 documents from Wikinews on certain topics. We use the setting of Track B, where the events are provided. We used HeidelTime to detect and normalize time expressions. We then ran our system out of the box, i.e., without retraining it for the new dataset.

For the shared task, an event can occur either on a specific day, in a specific month, or in a specific year. Events that cannot be anchored in time are removed from the evaluation. We implemented simple rules that transform our system output to the format of the shared task: if an event is simultaneous with a specific time expression, we output this date. If our system returns that the event happened after one date and before another, we output the year and month if both dates are in the same month; if both dates are in the same year but in different months, we output the year. Events with predicted timespans of more than one year are rejected. For Multi-Day Events, we only use the begin point, as only this information was annotated for this shared task.
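A sketch of these transformation rules as we read them follows (the function and its handling of open-ended labels are our assumptions, not the exact implementation used for the submission):

from datetime import date

def to_timeline_anchor(label):
    """label: an exact date 'YYYY-MM-DD' or a span 'after YYYY-MM-DD before YYYY-MM-DD'."""
    tokens = label.split()
    if tokens[0] not in ("after", "before"):
        return tokens[0]                                    # simultaneous with a mentioned date
    values = dict(zip(tokens[0::2], tokens[1::2]))
    if "after" not in values or "before" not in values:
        return None                                         # open-ended span: cannot be anchored (assumed)
    start = date.fromisoformat(values["after"])
    end = date.fromisoformat(values["before"])
    if (start.year, start.month) == (end.year, end.month):
        return "%04d-%02d-xx" % (start.year, start.month)   # same month -> year and month
    if start.year == end.year:
        return "%04d-xx-xx" % start.year                    # same year -> year only
    return None                                             # span of more than one year: reject

print(to_timeline_anchor("2010-10-05"))                          # '2010-10-05'
print(to_timeline_anchor("after 2010-10-01 before 2010-10-30"))  # '2010-10-xx'
print(to_timeline_anchor("after 2010-02-01 before 2010-11-30"))  # '2010-xx-xx'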
Two teams participated in the shared task (GPLSIUA and HeidelToul). Currently, the best published performance was achieved by Cornegruta and Vlachos (2016) with an F1-score of 28.58. Our system improves the total F1-score by 4.01 points, as depicted in Table 7.

System         Airbus   GM      Stock   Total
Our approach   30.37    28.83   38.01   32.59
Cornegruta     25.65    26.64   32.35   28.58
GPLSIUA 1      22.35    19.28   33.59   25.36
HeidelToul 2   16.50    10.94   25.89   18.34

Table 7: Performance of our system on the SemEval-2015 Task 4 Track B for the topics Airbus, General Motors, and stock market.

A challenge for our system is the different anchoring of events in time: while our system can bound an event by two arbitrary dates, SemEval-2015 Task 4 anchors events only at a specific day, month, or year. When our system returns the event time value after 2010-10-01 and before 2010-11-30, we had to decide how to anchor this event for the generated timeline. For such an event, three final labels would be plausible: 2010-10-xx, 2010-11-xx, and 2010-xx-xx. A similar challenge occurs for events that received a label like before 2010-11-30: if we anchor it in 2010-11-xx, we must be certain that the event happened in November; similarly, if we anchor it in 2010-xx-xx, we must be certain that the event happened in 2010. Such information cannot be inferred directly from the returned label of our system. As only 30 documents on a single topic were provided for training, we could not tune the transformation accordingly. A manual analysis revealed that this transformation caused around 15% of the errors.

7 Conclusion

Event time extraction is a challenging classification task, as the set of labels is infinite and the label depends on information that is scattered across the document. The presented classifier is able to take the whole document into account and to infer the date on which an event happened. We applied the system to the TimeBank-EventTime Corpus and achieved an accuracy of 42.0% in comparison to an inter-annotator agreement of 56.7% using a strict metric. For 74.6% of the Single Day Events, the exact event time could be extracted; this is a 33.1 percentage point improvement over the state-of-the-art approach by Chambers et al. (2014). We demonstrated the generalizability of the approach by applying it to the SemEval-2015 Task 4 on timeline generation, where it improved the F1-score by 4.01 percentage points compared to the state-of-the-art.

Acknowledgements

This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1. We would like to thank the TACL editors and reviewers for their effort and the valuable feedback we received from them.

References

James Allan. 2002. Topic Detection and Tracking: Event-based Information Organization, pages 1–16. Kluwer Academic Publishers, Norwell, MA, USA.

James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-parameter Optimization. J. Mach. Learn. Res., 13:281–305, February.

Philip Bramsen, Pawan Deshpande, Yoong Keok Lee, and Regina Barzilay. 2006. Inducing Temporal Graphs. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 189–198, Stroudsburg, PA, USA. Association for Computational Linguistics.

Taylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014. An Annotation Framework for Dense Event Ordering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 501–506, Baltimore, Maryland, USA. Association for Computational Linguistics.

Nathanael Chambers and Dan Jurafsky. 2008.
Jointly combining implicit constraints improves temporal ordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 698–706, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense Event Ordering with a Multi-Pass Architecture. Transactions of the Association for Computational Linguistics, 2:273–284.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Savelie Cornegruta and Andreas Vlachos. 2016. Timeline extraction using distant supervision and joint inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1936–1942.

Quang Xuan Do, Wei Lu, and Dan Roth. 2012. Joint Inference for Event Timeline Construction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 677–687, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cícero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 626–634.

Lisa Ferro. 2002. TIDES. Instruction Manual for the Annotation of Temporal Expressions. Technical report, MITRE Technical Report.

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2012. Extracting Narrative Timelines As Temporal Dependency Structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 88–97, Stroudsburg, PA, USA. Association for Computational Linguistics.

Klaus Krippendorff. 2004. Content Analysis: An Introduction to Its Methodology (second edition). Sage Publications.

Yann LeCun. 1989. Generalization and network design strategies. Elsevier.

Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 302–308.

Anne-Lyse Minard, Manuela Speranza, Eneko Agirre, Itziar Aldabe, Marieke van Erp, Bernardo Magnini, German Rigau, and Ruben Urizar. 2015. SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015, pages 778–786.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, pages 647–656, Lancaster, UK.

Nils Reimers and Iryna Gurevych. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark, September.
Nils Reimers, Nazanin Dehghani, and Iryna Gurevych. 2016. Temporal Anchoring of Events for the TimeBank Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Volume 1: Long Papers, pages 2195–2204. Association for Computational Linguistics, August.

Roser Saurí, Jessica Littman, Robert Gaizauskas, Andrea Setzer, and James Pustejovsky. 2004. TimeML Annotation Guidelines, Version 1.2.1.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January.

Jannik Strötgen and Michael Gertz. 2015. A Baseline Temporal Tagger for all Languages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 541–547, Lisbon, Portugal, September. Association for Computational Linguistics.

Mihai Surdeanu. 2013. Overview of the TAC 2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. In Proceedings of the TAC-KBP 2013 Workshop, Gaithersburg, Maryland, USA.

Naushad UzZaman and James F. Allen. 2010. TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 276–283, Stroudsburg, PA, USA. Association for Computational Linguistics.

Naushad UzZaman, Hector Llorens, Leon Derczynski, Marc Verhagen, James F. Allen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 75–80, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 57–62, Stroudsburg, PA, USA. Association for Computational Linguistics.

Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly Identifying Temporal Relations with Markov Logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL '09, pages 405–413, Stroudsburg, PA, USA. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, pages 2335–2344, Dublin, Ireland.