From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations

Egoitz Laparra*, Dongfang Xu*, Steven Bethard
School of Information, University of Arizona, Tucson, AZ
{laparra,dongfangxu9,bethard}@email.arizona.edu
*These two authors contributed equally.

Abstract

This paper presents the first model for time normalization trained on the SCATE corpus. In the SCATE schema, time expressions are annotated as a semantic composition of time entities. This novel schema favors machine learning approaches, as it can be viewed as a semantic parsing task. In this work, we propose a character-level multi-output neural network that outperforms the previous state of the art built on the TimeML schema. To compare predictions of systems that follow both SCATE and TimeML, we present a new scoring metric for time intervals. We also apply this new metric to carry out a comparative analysis of the annotations of both schemes in the same corpus.

1 Introduction

Time normalization is the task of translating natural language expressions of time to computer-readable forms. For example, the expression three days ago could be normalized to the formal representation 2017-08-28 in the ISO 8601 standard. As time normalization allows entities and events to be placed along a timeline, it is a crucial step for many information extraction tasks. Since the first shared tasks on time normalization (Verhagen et al., 2007), interest in the problem and the variety of applications have been growing. For example, Lin et al. (2015) use normalized timestamps from electronic medical records to contribute to patient monitoring and detect potential causes of disease. Vossen et al. (2016) identify multilingual occurrences of the same events in the news by, among other steps, normalizing time expressions in different languages with the same ISO standard. Fischer and Strötgen (2015) extract and normalize time expressions from a large corpus of German fiction as the starting point of a deep study on trends and patterns of the use of dates in literature.

A key consideration for time normalization systems is what formal representation the time expressions should be normalized to. The most popular scheme for annotating normalized time expressions is ISO-TimeML (Pustejovsky et al., 2003a; Pustejovsky et al., 2010), but it is unable to represent several important types of time expressions (e.g., a bounded set of intervals, like Saturdays since March 6) and it is not easily amenable to machine learning (the rule-based HeidelTime (Strötgen et al., 2013) still yields state-of-the-art performance). Bethard and Parker (2016) proposed an alternate scheme, Semantically Compositional Annotation of Time Expressions (SCATE), in which times are annotated as compositional time entities (Figure 1), and suggested that this should be more amenable to machine learning. However, while they constructed an annotated corpus, they did not train any automatic models on it.

We present the first machine-learning models trained on the SCATE corpus of time normalizations. We make several contributions in the process:

• We introduce a new evaluation metric for time normalization that can compare normalized times from different annotation schemes by measuring overlap of intervals on the timeline.

• We use the new metric to compare SCATE and TimeML annotations on the same corpus, and confirm that SCATE covers a wider variety of
time expressions.

• We develop a recurrent neural network for learning SCATE-style time normalization, and show that our model outperforms the state-of-the-art HeidelTime (Strötgen et al., 2013).

• We show that our character-based multi-output neural network architecture outperforms both word-based and single-output models.

Figure 1: Annotation of the expression Saturdays since March 6 following the SCATE schema.

2 Background

ISO-TimeML (Pustejovsky et al., 2003a; Pustejovsky et al., 2010) is the most popular scheme for annotating time expressions. It annotates time expressions as phrases, and assigns an ISO 8601 normalization (e.g., 1990-08-15T13:37 or PT24H) as the VALUE attribute of the normalized form. ISO-TimeML is used in several corpora, including the TimeBank (Pustejovsky et al., 2003b), WikiWars (Mazur and Dale, 2010), TIMEN (Llorens et al., 2012), and the TempEval shared tasks (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013).

However, the ISO-TimeML schema has a few drawbacks. First, times that align to more than a single calendar unit (day, week, month, etc.), such as Saturdays since March 6 (where multiple Saturdays are involved), cannot be described in the ISO 8601 format, since they do not correspond to any prefix of YYYY-MM-DDTHH:MM:SS. Second, because each time expression receives a single VALUE regardless of the word span, the compositional semantics of the expression are not represented. For example, in the expressions since last week and since March 6, the semantics of since are identical: find the interval between the anchor time (last week or March 6) and now. But ISO-TimeML would have to annotate these two phrases independently, with no way to indicate the shared portion of their semantics. These drawbacks of ISO-TimeML, especially the lack of compositionality, make the development of machine learning models difficult. Thus, most prior work has taken a rule-based approach, looking up each token of a time expression in a normalization lexicon and then mapping this sequence of lexical entries to the normalized form (Strötgen and Gertz, 2013; Bethard, 2013; Lee et al., 2014; Strötgen and Gertz, 2015).

As an alternative to TimeML, and inspired by previous work (Schilder, 2004; Han and Lavie, 2004), Bethard and Parker (2016) proposed Semantically Compositional Annotation of Time Expressions (SCATE). In the SCATE schema, each time expression is annotated in terms of compositional time entities over intervals on the timeline. An example is shown in Figure 1, with every annotation corresponding to a formally defined time entity. For instance, the annotation on top of since corresponds to a BETWEEN operator that identifies an interval starting at the most recent March 6 and ending at the document creation time (DCT). The BETWEEN operator is formally defined as: BETWEEN([t1, t2): INTERVAL, [t3, t4): INTERVAL): INTERVAL = [t2, t3). The SCATE schema can represent a wide variety of time expressions, and provides a formal definition of the semantics of each annotation.
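To make these formal semantics concrete, the following minimal Python sketch (our own illustration, not the official timenorm interpreter; the helper names day, last and between are ours) composes the entities of since March 6 into an interval, using the BETWEEN definition above:

```python
# A minimal sketch of SCATE-style compositional interpretation.
# Intervals are half-open (start, end) pairs; all helper names are ours.
from datetime import datetime, timedelta

def day(y, m, d):
    """A DAY-OF-MONTH resolved to a day-long interval [start, end)."""
    start = datetime(y, m, d)
    return (start, start + timedelta(days=1))

def last(dct, month, dom):
    """LAST over a MONTH-OF-YEAR with a DAY-OF-MONTH SUB-INTERVAL:
    the most recent month/day occurrence before the DCT."""
    y = dct.year if (month, dom) < (dct.month, dct.day) else dct.year - 1
    return day(y, month, dom)

def between(interval1, interval2):
    """BETWEEN([t1, t2), [t3, t4)) = [t2, t3), per the definition above."""
    return (interval1[1], interval2[0])

dct = datetime(1998, 3, 1, 14, 11)            # document creation time
march_6 = last(dct, 3, 6)                     # LAST(March 6) = [1997-03-06, 1997-03-07)
since_march_6 = between(march_6, (dct, dct))  # [1997-03-07, 1998-03-01T14:11)
```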
Unlike TimeML, SCATE uses a graph structure to capture compositional semantics and can represent time expressions that are not expressed with contiguous phrases. The schema also has the advantage that it can be viewed as a semantic parsing task and, consequently, is more suitable for machine-learning approaches. However, Bethard and Parker (2016) present only a corpus; they do not present any models for semantic parsing.

3 An interval-based evaluation metric for normalized times

Before attempting to construct machine-learned models from the SCATE corpus, we were interested in evaluating Bethard and Parker (2016)'s claim that the SCATE schema is able to represent a wider variety of time expressions than TimeML. To do so, we propose a new evaluation metric to compare time normalizations annotated in both the ISO 8601 format of TimeML and the time entity format of SCATE. This new evaluation interprets normalized annotations as intervals along the timeline and measures the overlap of the intervals.

TimeML TIMEX3 (time expression) annotations are converted to intervals following the ISO 8601 semantics of their VALUE attribute. So, for example, 1989-03-05 is converted to the interval [1989-03-05T00:00:00, 1989-03-06T00:00:00), that is, the 24-hour period starting at the first second of the day on 1989-03-05 and ending just before the first second of the day on 1989-03-06. SCATE annotations are converted to intervals following the formal semantics of each entity, using the library provided by Bethard and Parker (2016). So, for example, Next(Year(1985), SimplePeriod(YEARS, 3)), the 3 years following 1985, is converted to [1986-01-01T00:00, 1989-01-01T00:00). Note that there may be more than one interval associated with a single annotation, as in the Saturdays since March 6 example. Once all annotations have been converted into intervals along the timeline, we can measure how much the intervals of different annotations overlap.

Given two sets of intervals, we define the interval precision, $P_{int}$, as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the first set. Interval recall, $R_{int}$, is defined as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the second set. Formally:

$$I_S \cap I_H = \{\, i \cap j : i \in I_S \wedge j \in I_H \,\}$$

$$P_{int}(I_S, I_H) = \frac{\sum_{i \in \mathrm{COMPACT}(I_S \cap I_H)} |i|}{\sum_{i \in I_S} |i|}$$

$$R_{int}(I_S, I_H) = \frac{\sum_{i \in \mathrm{COMPACT}(I_S \cap I_H)} |i|}{\sum_{i \in I_H} |i|}$$

where $I_S$ and $I_H$ are sets of intervals, $i \cap j$ is the (possibly empty) interval in common between the intervals $i$ and $j$, $|i|$ is the length of the interval $i$, and COMPACT takes a set of intervals and merges any overlapping intervals.

Given two sets of annotations (e.g., one each from two time normalization systems), we define the overall precision, $P$, as the average of interval precisions where each annotation from the first set is paired with all annotations that textually overlap it in the second set. Overall recall is defined as the average of interval recalls where each annotation from the second set is paired with all annotations that textually overlap it in the first set. Formally:

$$OI_a(B) = \bigcup_{b \in B : \mathrm{OVERLAPS}(a, b)} \mathrm{INTERVALS}(b)$$

$$P(S, H) = \frac{1}{|S|} \sum_{s \in S} P_{int}(\mathrm{INTERVALS}(s), OI_s(H))$$

$$R(S, H) = \frac{1}{|H|} \sum_{h \in H} R_{int}(\mathrm{INTERVALS}(h), OI_h(S))$$

where $S$ and $H$ are sets of annotations, INTERVALS(x) gives the time intervals associated with the annotation $x$, and OVERLAPS(a, b) decides whether the annotations $a$ and $b$ share at least one character of text in common.
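As a concrete reference, here is a minimal Python sketch of $P_{int}$ and $R_{int}$ under the definitions above. It is our own illustration, not the released evaluation code; intervals are (start, end) datetime pairs and lengths are measured in seconds:

```python
# A minimal sketch of the interval-based metric defined above.
from datetime import datetime

def compact(intervals):
    """Merge overlapping (or touching) intervals, as COMPACT does."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def length(intervals):
    return sum((end - start).total_seconds() for start, end in intervals)

def pairwise_common(i_s, i_h):
    """The set {i ∩ j}, keeping only non-empty intersections."""
    return [(max(s1, s2), min(e1, e2))
            for s1, e1 in i_s for s2, e2 in i_h
            if max(s1, s2) < min(e1, e2)]

def interval_precision(i_s, i_h):
    """P_int: compacted common length over the total length of the first set."""
    return length(compact(pairwise_common(i_s, i_h))) / length(i_s)

def interval_recall(i_s, i_h):
    """R_int: compacted common length over the total length of the second set."""
    return length(compact(pairwise_common(i_s, i_h))) / length(i_h)

# "Since 1985" with DCT 1998-03-01T14:11 (the Section 4 example):
timeml = [(datetime(1985, 1, 1), datetime(1986, 1, 1))]
scate = [(datetime(1986, 1, 1), datetime(1998, 3, 1, 14, 11))]
print(interval_precision(timeml, scate))  # 0.0: no overlap on the timeline
```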
It is important to note that these metrics can be applied only to time expressions that yield bounded intervals. Time expressions that refer to intervals with undefined boundaries are out of scope, as in "it takes just a minute" or "I work every Saturday".

4 Data analysis

4.1 TimeML vs. SCATE

Both TimeML and SCATE annotations are available on a subset of the TempEval 2013 corpus (UzZaman et al., 2013) that contains a collection of news articles from different sources, such as Wall Street Journal, New York Times, Cable News Network, and Voices of America. Table 1 shows the statistics of the data. Documents from the AQUAINT and TimeBank form the training and development dataset. The SCATE corpus contains 2604 time entities (individual components of a time expression, such as every, month, last, Monday, etc.) annotated in the train+dev set (i.e., AQUAINT+TimeBank). These entities compose a total of 1038 time expressions (every month, last Monday, etc.), of which 580 yield bounded intervals, i.e., intervals with a specified start and end (last Monday is bounded, while every month is not).

                 AQUAINT  TimeBank  Test
Documents             10        68    20
Sentences            251      1429   339
TimeML TIMEX3         61       499   158
SCATE entities       333      1810   461
SCATE time exp.      114       715   209
SCATE bounded         67       403    93

Table 1: Number of documents, TimeML TIMEX3 annotations and SCATE annotations for the subset of the TempEval 2013 corpus annotated with both schemas.

                AQUAINT             TimeBank
             P      R     F1      P      R     F1
Body text  92.2   92.2   92.2   82.4   83.0   82.7
All text   92.2   67.1   77.7   82.4   71.2   76.4

Table 2: Comparison of TimeML and SCATE annotations.

We apply the interval-based evaluation metric introduced in Section 3 to the AQUAINT and TimeBank datasets, treating the TimeML annotations as the system (S) annotator and the SCATE annotations as the human (H) annotator. Table 2 shows that the SCATE annotations cover different time intervals than the TimeML annotations. In the first row, we see that TimeML has a recall of only 92% of the time intervals identified by SCATE in the AQUAINT corpus, and of only 83% in the TimeBank corpus. We manually analyzed all places where the TimeML and SCATE annotations differed and found that the SCATE interpretation was always the correct one.

For example, a common case where TimeML and SCATE annotations overlap, but are not identical, is time expressions preceded by a preposition like "since". The TimeML annotation for "Since 1985" (with a DCT of 1998-03-01T14:11) only covers the year, "1985", resulting in the time interval [1985-01-01T00:00, 1986-01-01T00:00). The SCATE annotation represents the full expression and, consequently, produces the correct time interval [1986-01-01T00:00, 1998-03-01T14:11).

Another common case of disagreement is where TimeML failed to compose all pieces of a complex expression. The TimeML annotation for "10:35 a.m. (0735 GMT) Friday" annotates two separate intervals, the time and the day (and ignores "0735 GMT" entirely). The SCATE annotation recognizes this as a description of a single time interval, [1998-08-07T10:35, 1998-08-07T10:36).

TimeML and SCATE annotations also differ in how references to particular past periods are interpreted. For example, TimeML assumes that "last year" and "a year ago" have identical semantics, referring to the most recent calendar year, e.g., if the DCT is 1998-03-04, then they both refer to the interval [1997-01-01T00:00, 1998-01-01T00:00).
SCATE has the same semantics for "last year", but recognizes that "a year ago" has different semantics: a period centered at one year prior to the DCT. Under SCATE, "a year ago" refers to the interval [1996-09-03T00:00, 1997-09-03T00:00).

Beyond these differences in interpretation, we also observed that, while the SCATE corpus annotates time expressions anywhere in the document (including in metadata), the TimeBank TIMEX3 annotations are restricted to the main text of the documents. The second row of Table 2 shows the evaluation when comparing all text in the document, not just the body text. Unsurprisingly, TimeML has a lower recall of the time intervals from the SCATE annotations under this evaluation.

4.2 Types of SCATE annotations

Studying the training and development portion of the dataset, we noticed that the SCATE annotations can be usefully divided into three categories: non-operators, explicit operators, and implicit operators. We define non-operators as NUMBERs, PERIODs (e.g., three months), explicit intervals (e.g., YEARs like 1989), and repeating intervals (DAY-OF-WEEKs like Friday, MONTH-OF-YEARs like January, etc.). Non-operators are basically atomic; they can be interpreted without having to refer to other annotations. Operators are not atomic; they can only be interpreted with respect to other annotations they link to. For example, the THIS operator in Figure 1 can only be interpreted by first interpreting the DAY-OF-WEEK non-operator and the BETWEEN operator that it links to.

We split operators into two types: explicit and implicit. We define an operator as explicit if it does not overlap with any other annotation. This occurs, for example, when the time connective since evokes the BETWEEN operator in Figure 1. An operator is considered to be implicit if it overlaps with another annotation. This occurs, for example, with the LAST operator in Figure 1, where March implies last March, but there is no explicit signal in the text, and it must be inferred from context.

We study how these annotation groups are distributed in the AQUAINT and TimeBank documents. Table 3 shows that non-operators are much more frequent than operators (both explicit and implicit).

         Non-Op  Exp-Op  Imp-Op  Total
Count      1497     305     219   2021
Percent     74%     15%     11%   100%

Table 3: Distribution of time entity annotations in AQUAINT+TimeBank.

5 Models

We decompose the normalization of time expressions into two subtasks: a) time entity identification, which detects the spans of characters that belong to each time expression and labels them with their corresponding time entity; and b) time entity composition, which links relevant entities together while respecting the entity type constraints imposed by the SCATE schema. These two tasks are run sequentially, using the output of the former as input to the latter. Once the identification and composition steps are completed, we can feed the final product, i.e., the semantic composition of time entities, to the SCATE interpreter (https://github.com/clulab/timenorm) to encode time intervals.

5.1 Time entity identification

Time entity identification is a type of sequence tagging task where each piece of a time expression is assigned a label that identifies the time entity that it evokes. We express such labels using the BIO tagging system, where B stands for the beginning of an annotation, I for the inside, and O for outside any annotation.
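For instance, using the running example from Figure 2 below, the character-by-character input May 25 receives two parallel BIO streams, one for non-operators and one for operators; the exact encoding shown here is our own sketch of that scheme:

```python
# Illustrative character-level BIO labels for "May 25" (cf. Figure 2):
# "May" evokes a MONTH-OF-YEAR plus an implicit LAST over the same span,
# and "25" evokes a DAY-OF-MONTH. Label names follow the figure.
text = "May 25"
non_operators = ["B-MONTH", "I-MONTH", "I-MONTH", "O", "B-DAY", "I-DAY"]
operators     = ["B-LAST",  "I-LAST",  "I-LAST",  "O", "O",     "O"]
assert len(text) == len(non_operators) == len(operators)
```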
Differing somewhat from standard sequence tagging tasks, the SCATE schema allows multiple annotations over the same span of text (e.g., "Saturdays" in Figure 1 is both a DAY-OF-WEEK and a THIS), so entity identification models must be able to handle such multi-label classification.

5.1.1 Neural architectures

Recurrent neural networks (RNNs) are the state of the art on sequence tagging tasks (Lample et al., 2016a; Graves et al., 2013; Plank et al., 2016), thanks to their ability to maintain a memory of the sequence as they read it and make predictions conditioned on long-distance features, so we also adopt them here. We introduce three RNN architectures that share a similar internal structure, but differ in how they represent the output. They convert the input into features that feed an embedding layer. The embedded feature vectors are then fed into two stacked bidirectional Gated Recurrent Units (GRUs), and the second GRU, followed by an activation function, outputs one BIO tag for each input. We select GRUs for our models as they can outperform another popular recurrent unit, the LSTM (Long Short-Term Memory), in terms of parameter updates and convergence in CPU time with the same number of parameters (Chung et al., 2014).

Our 1-Sigmoid model (Figure 2) approaches the task as a multi-label classification problem, with a set of sigmoids for each output that allow zero or more BIO labels to be predicted simultaneously. This is the standard way of encoding multi-label classification problems for neural networks, but in our experiments, we found that these models perform poorly since they can overproduce labels for each input, e.g., 03 could be labeled with both DAY-OF-MONTH and MONTH-OF-YEAR at the same time.

Figure 2: Architecture of the 1-Sigmoid model. The input is May 25. In SCATE-style annotation, May is a MONTH-OF-YEAR (a non-operator), with an implicit LAST (an operator) over the same span, and 25 is a DAY-OF-MONTH. At the feature layer, M is an uppercase letter (Lu), a and y are lowercase letters (Ll), space is a separator (Zs), and May is a proper noun (NNP).

Figure 3: Architecture of the 2-Softmax model. The input is May. The SCATE annotations and features are the same as in Figure 2.

Our 2-Softmax model (Figure 3) splits the output space of labels into two sets: non-operators and operators (as defined in Section 4.2). It is very unlikely that any piece of text will be annotated with more than one non-operator or with more than one operator (in the training data, only 4 of 1217 non-operators overlap with another non-operator, and only 6 of 406 operators overlap with another operator; for example, in "a NYT said in an editorial on Saturday, April 25", Saturday is labeled as [DAY-OF-WEEK, LAST, INTERSECTION], where the last two labels are operators), though it is common for text to be annotated with one non-operator and one operator (see Figure 1). As a result, we can use two softmaxes, one for non-operators and one for operators, and the 2-Softmax model can thus produce 0, 1, or 2 labels per input.
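To make the multi-output design concrete, the following Keras-style sketch (ours; vocabulary and label-set sizes are illustrative, while embedding and GRU sizes follow the hyperparameters reported in Section 6) builds a character-level model with shared inputs and embeddings and one Bi-GRU stack plus softmax per output group, here with the three groups of the 3-Softmax model introduced below:

```python
# A minimal sketch of the multi-output character-level tagger: shared inputs
# and embeddings, then a separate stack of bidirectional GRUs and a softmax
# per output group. Sizes marked "illustrative" are not from the paper.
from tensorflow.keras import layers, models

n_chars, n_cats, n_pos = 100, 30, 45          # vocabulary sizes (illustrative)
char_in = layers.Input(shape=(None,), name="chars")
cat_in = layers.Input(shape=(None,), name="unicode_categories")
pos_in = layers.Input(shape=(None,), name="pos_tags")

embedded = layers.Concatenate()([
    layers.Embedding(n_chars, 128)(char_in),  # embedding sizes per Section 6
    layers.Embedding(n_cats, 64)(cat_in),
    layers.Embedding(n_pos, 32)(pos_in),
])

outputs = []
for name, n_labels in [("non_operators", 41), ("explicit_operators", 25),
                       ("implicit_operators", 25)]:  # label counts illustrative
    h = layers.Bidirectional(layers.GRU(256, return_sequences=True))(embedded)
    h = layers.Bidirectional(layers.GRU(150, return_sequences=True))(h)
    outputs.append(layers.Dense(n_labels, activation="softmax", name=name)(h))

model = models.Model([char_in, cat_in, pos_in], outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```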
We share input and embedding layers, but associate a separate set of stacked Bi-GRUs with each output category, as shown in Figure 3. (In preliminary experiments, we tried sharing the GRU layers as well, but this generally resulted in worse performance.)

Our 3-Softmax model further splits operators into explicit operators and implicit operators (again, as defined in Section 4.2). We expect this to help the model, since the learning task is very different for these two cases: with explicit operators, the model just has to memorize which phrases evoke which operators, while with implicit operators, the model has to learn to infer an operator from context (verb tense, etc.). We use three softmaxes, one each for non-operators, explicit operators, and implicit operators, and, as with 2-Softmax, we share input and embedding layers, but associate a separate set of stacked Bi-GRUs with each output category. The model looks similar to Figure 3, but with three output groups instead of two.

We feed three features as input to the RNNs:

Text: The input word itself for the word-by-word model, or the single input character for the character-by-character model.

Unicode character categories: The category of each character as defined by the Unicode standard (see http://unicode.org/notes/tn36/). This encodes information like the presence of uppercase (Lu) or lowercase (Ll) letters, punctuation (Po), digits (Nd), etc. For the word-by-word model, we concatenate the character categories of all characters in the word (e.g., May becomes LuLlLl).

Part-of-speech: The part-of-speech as determined by the Stanford POS tagger (Toutanova et al., 2003). We expect this to be useful for, e.g., finding verb tense to help distinguish between implicit LAST and NEXT operators. For the character-by-character model, we repeat the word-level part-of-speech tag for each character in the word, and characters with no part-of-speech (e.g., spaces) get no tag.

5.1.2 Input: words vs. characters

Identifying SCATE-style time entities is a sequence tagging task, similar to named entity recognition (NER), so we take inspiration from recent work in neural architectures for NER. The first neural NER models followed the prior (non-neural) work in approaching NER as a word classification problem, applying architectures such as sliding-window feedforward neural networks (Qi et al., 2009), convolutional neural networks (CNNs) with conditional random field (CRF) layers (Collobert et al., 2011), and LSTMs with CRF layers and hand-crafted features (Huang et al., 2015). More recently, character-level neural networks have also been proposed for NER, including several which combine a CNN or LSTM for learning character-based representations of words with an LSTM or LSTM-CRF for word-by-word labeling (Chiu and Nichols, 2016; Lample et al., 2016b; Ma and Hovy, 2016), as well as character-by-character sequence-to-sequence networks (Gillick et al., 2016; Kuru et al., 2016).

Based on these works, we consider two forms of input processing for our RNNs: word-by-word vs. character-by-character. Several aspects of the time normalization problem make the character-based approach especially appealing.
First, many time phrases involve numbers that must be interpreted semantically (e.g., a good model should learn that a month cannot be a number higher than 12), and digit-by-digit processing of numbers allows such interpretations, while treating each number as a word would result in a sparse, intractable learning problem. Second, word-based models assume that we know how to tokenize the text into words, but time expressions at times present challenging formats such as overnight, where over evokes a LAST operator and night is a PART-OF-DAY. Finally, character-based models can ameliorate out-of-vocabulary (OOV) words, which are a common problem when training on sparse datasets. (Hybrid word-character models, such as the LSTM-CNNs-CRF (Ma and Hovy, 2016), can address this last problem, but not the previous two.)

For our word-based model, we apply the NLTK tokenizer (Bird et al., 2009) to each sentence. We further tokenize with the regular expression "\d+|[^\d\W]+|\S" to break apart alphanumeric expressions like 1620EDT. However, the tokenizer is unable to break apart expressions such as 19980206 and overnight. For our character-based model, no tokenization is applied and every character (including whitespace characters) is fed as input.

5.2 Time entity composition

Once the entities of the time expressions are identified, they must be composed in order to obtain their semantic interpretation. This step of the analysis consists of two parts: linking the entities that make up a time expression together, and completing the entities' properties with the proper values. For both cases, we define a simple set of rules that follow the constraints imposed by the SCATE schema (https://github.com/bethard/anafora-annotations/blob/master/.schema/timenorm-schema.xml).

5.2.1 Time entity linking

Algorithm 1 shows the process followed to obtain the links between the time entities. First, we define an empty stack that will store the entities belonging to the same time expression. Then, we iterate over the list of entities of a document, sorted by their starting character offsets (SORTBYSTART). For each of these entities (entity1) and for each entity in the stack (entity2), we check whether the guidelines specify a possible link (LINKISVALID) between the types of entity1 and entity2. If such a link is possible, and it has not already been filled by another annotation, we greedily make the link (CREATELINK). When the distance in characters between the entity and the end of the stack is larger than 10, we assume that the entities do not belong to the same time expression, and we empty the stack. (The distance threshold of 10 was selected based on performance on the development dataset.)

Algorithm 1 Linking time entities
  stack = ∅
  for entity1 in SORTBYSTART(entities) do
    if START(entity1) - END(stack) > 10 then
      stack = ∅
    end if
    for entity2 in stack do
      if LINKISVALID(entity1, entity2) then
        CREATELINK(entity1, entity2)
      end if
    end for
    PUSH(stack, entity1)
  end for
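A direct Python transcription of Algorithm 1 might look as follows. This is our own sketch: we read END(stack) as the end offset of the most recently pushed entity, and link_is_valid / create_link stand in for the schema lookup and for greedy link creation (which is assumed to skip links whose slot is already filled):

```python
# A minimal sketch of Algorithm 1. Entities are assumed to carry character
# offsets (.start, .end); the two callbacks encapsulate the SCATE schema.
def link_entities(entities, link_is_valid, create_link, max_gap=10):
    stack = []
    for entity1 in sorted(entities, key=lambda e: e.start):
        # A gap larger than max_gap characters starts a new time expression.
        if stack and entity1.start - stack[-1].end > max_gap:
            stack = []
        for entity2 in stack:
            if link_is_valid(entity1, entity2):
                create_link(entity1, entity2)  # greedy; fill-once enforced here
        stack.append(entity1)
```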
For example, our time entity identification model gets the YEAR, MONTH-OF-YEAR and DAY-OF-MONTH entities for the time expression 1992-12-23. Our time entity composition algorithm then iterates over these entities. At the beginning the stack is empty, so it just pushes the entity 1992 (YEAR) onto the stack. For the entity 12 (MONTH-OF-YEAR), it checks whether the guidelines define a possible link between this entity type and the one currently in the stack (YEAR). In this case, the guidelines establish that a YEAR can have a SUB-INTERVAL link to a SEASON-OF-YEAR, a MONTH-OF-YEAR or a WEEK-OF-YEAR. Thus, the algorithm creates a SUB-INTERVAL link between 1992 and 12. The entity 12 is then pushed onto the stack. This process is repeated for the entity 23 (DAY-OF-MONTH), checking whether there is a possible link to the entities in the stack (1992, 12). The guidelines define a possible SUB-INTERVAL link between MONTH-OF-YEAR and DAY-OF-MONTH, so a link is created here as well. Now, suppose that the following time entity in the list is several words ahead of 23, so that the character distance between the two entities is larger than 10. If that is the case, the stack is emptied and the process starts again to compose a new time expression.

5.2.2 Property completion

The last step is to associate each time entity of a time expression with a set of properties that include information needed for its interpretation. Our system decides the value of these properties as follows:

TYPE: The SCATE schema defines that some entities can only have specific values. For example, a SEASON-OF-YEAR can only be SPRING, SUMMER, FALL or WINTER; a MONTH-OF-YEAR can only be JANUARY, FEBRUARY, MARCH, etc. To complete this property, we take the text span of the time entity and normalize it to the values accepted in the schema. For example, if the span of a MONTH-OF-YEAR entity was the numeric value 01, we would normalize it to JANUARY; if its span was Sep., we would normalize it to SEPTEMBER; and so on.

VALUE: This property contains the value of a numerical entity, like DAY-OF-MONTH or HOUR-OF-DAY. To complete it, we just take the text span of the entity and convert it to an integer. If it is written in words instead of digits (e.g., nineteen instead of 19), we apply a simple grammar (https://github.com/ghewgill/text2num) to convert it to an integer.

SEMANTICS: In news-style texts, it is common that expressions like last Friday, when the DCT is a Friday, refer to the same day as the DCT instead of the previous occurrence (as it would in more standard usage of last). SCATE indicates this with the SEMANTICS property, where the value INTERVAL-INCLUDED indicates that the current interval is included when calculating the last or next occurrence. For the rest of the cases, the value INTERVAL-NOT-INCLUDED is used. In our system, when a LAST operator is found, if it is linked to a DAY-OF-WEEK (e.g., Friday) that matches the DCT, we set the value of this property to INTERVAL-INCLUDED.

INTERVAL-TYPE: Operators like NEXT or LAST need an interval as a reference in order to be interpreted. Normally, this reference is the DCT. For example, next week refers to the week following the DCT, and in such a case the value of the property INTERVAL-TYPE for the operator NEXT would be DOCTIME. However, sometimes the operator is linked to an interval that serves as a reference by itself, for example, "by the year 2000". In such cases the value of the INTERVAL-TYPE is LINK. Our system sets the value of this property to LINK if the operator is linked to a YEAR, and to DOCTIME otherwise. This is a very coarse heuristic; finding the proper anchor for a time expression is a challenging open problem for which future research is needed.

5.3 Automatically generated training data

Every document in the dataset starts with a document creation time. These time expressions are quite particular; they occur in isolation, not within the context of a sentence, and they always yield a bounded interval.
Thus their identification is a critical factor in an interval-based evaluation metric. However, document times appear in many different formats: "Monday, July-24, 2017", "07/24/17 09:52 AM", "08-15-17 1337 PM", etc. Many of these formats are not covered in the training data, which is drawn from a small number of news sources, each of which uses only a single format. We therefore designed a time generator to randomly generate an extra 800 isolated training examples covering a wide variety of such expression formats. The generator covers 33 different formats (the common formats available in office suites, specifically LibreOffice), which include variants covering abbreviations, with/without delimiters, mixtures of digits and strings, and different orderings of time units.

6 Experiments

We train and evaluate our models on the SCATE corpus described in Section 4. As a development dataset, 14 documents are taken as a random stratified sample from the TempEval 2013 (TimeBank + AQUAINT) portion shown in Table 1, including broadcast news documents (1 ABC, 1 CNN, 1 PRI, 1 VOA) and newswire documents (5 AP, 1 NYT, 4 WSJ). We use the interval-based evaluation metric described in Section 3, but also report more traditional information extraction metrics (precision, recall, and F1) for the time entity identification and composition steps. Let $S$ be the set of items predicted by the system and $H$ the set of items produced by the humans; precision ($P$), recall ($R$), and F1 are defined as:

$$P(S, H) = \frac{|S \cap H|}{|S|} \qquad R(S, H) = \frac{|S \cap H|}{|H|}$$

$$F_1(S, H) = \frac{2 \cdot P(S, H) \cdot R(S, H)}{P(S, H) + R(S, H)}$$

For these calculations, each item is an annotation, and one annotation is considered equal to another if it has the same character span (offsets), type, and properties (with the definition applying recursively for properties that point to other annotations).

To make the experiments with different neural architectures comparable, we tuned the parameters of all models to achieve the best performance on the development data. Due to space constraints, we only list here the hyper-parameters for our best model, Char 3-Softmax: the embedding sizes of the character-level text, word-level text, POS tag, and unicode character category features are 128, 300, 32 and 64, respectively. To avoid overfitting, we used dropout with probabilities 0.25, 0.15 and 0.15 for the three features, respectively; the sizes of the first and second layer GRU units are set to 256 and 150. We trained the model with RMSProp optimization on mini-batches of size 120, and followed standard recommendations to leave the optimizer hyperparameter settings at their default values. Each model is trained for at most 800 epochs; the longest training time, for the Char 3-Softmax model, is around 22 hours using 2x NVIDIA Kepler K20X GPUs.

6.1 Model selection

We compare the different time entity identification models described in Section 5.1, training them on the training data and evaluating them on the development data. Among the epochs of each model, we select the epoch based on the output(s) that the model is good at predicting, since in our preliminary experiments selecting based on the model's weaker outputs yielded unstable results. For example, for 3-Softmax models, our selection relies on the performance on non-operators and implicit operators. Table 4 shows the results of the development phase.
First, we find that the character-based models outperform the word-based models. For example, the best character-based model achieves an F1 of 81.7 (Char 3-Softmax), which is significantly better than the best word-based model, which achieves an F1 of only 66.6 (p=0; here and throughout, significance is computed with a paired bootstrap resampling test). (We briefly explored using pre-trained word embeddings to try to improve the performance of the Word 1-Sigmoid model, but it yielded a performance that was still worse than the character-based model, so we did not explore it further.) Second, we find that Softmax models outperform Sigmoid models. For example, the Char 3-Softmax model achieves an F1 of 81.7, significantly better than the 56.4 F1 of the Char 1-Sigmoid model (p=0). Third, for both character- and word-based models, we find that 3-Softmax significantly outperforms 2-Softmax: the Char 3-Softmax F1 of 81.7 is better than the Char 2-Softmax F1 of 73.6 (p=0), and the Word 3-Softmax F1 of 66.6 is better than the Word 2-Softmax F1 of 61.2 (p=0.0254). Additionally, we find that all models are better at identifying non-operators than operators, and that the explicit operators are the hardest to identify. For example, the Char 3-Softmax model gets 92.4 F1 for non-operators, 36.1 F1 for explicit operators and 79.1 F1 for implicit operators. Finally, we also train the best model, Char 3-Softmax, using the generated annotations described in Section 5.3 and achieve 76.8 F1 (Char 3-Softmax extra), i.e., the model performs better without the extra data (p=0). This is probably a result of overfitting due to the small variety of time formats in the training and development data.

Model                  P     R     F1
Word 1-Sigmoid        60.2  52.0  55.8
Char 1-Sigmoid        54.0  59.0  56.4
Word 2-Softmax        58.7  63.9  61.2
Char 2-Softmax        74.8  72.4  73.6
Word 3-Softmax        68.3  64.9  66.6
Char 3-Softmax        88.2  76.1  81.7
Char 3-Softmax extra  80.6  73.4  76.8

Table 4: Precision (P), recall (R), and F1 for the different neural network architectures on time entity identification on the development data.

From this analysis on the development set, we select two variants of the Char 3-Softmax architecture for evaluation on the test set: Char 3-Softmax and Char 3-Softmax extra. These models were then coupled with the rule-based linking system described in Section 5.2 to produce a complete SCATE-style parsing system.

6.2 Model evaluation

We evaluate both Char 3-Softmax and Char 3-Softmax extra on the test set for the identification and composition tasks. Table 5 shows the results.

          Char 3-Softmax      Char 3-Softmax extra
           P     R     F1      P     R     F1
Non-Op   79.2  63.2  70.3    87.4  63.2  73.4
Exp-Op   52.6  36.6  43.2    39.8  38.7  39.3
Imp-Op   53.3  47.1  50.0    65.4  50.0  56.7
Ident    70.0  54.5  61.3    69.4  55.3  61.5
Comp     59.7  46.5  52.3    57.7  46.0  51.2

Table 5: Results on the test set for the time entity identification (Ident) and time entity composition (Comp) steps. For the former, we report the performance for each entity set: non-operators (Non-Op), explicit operators (Exp-Op) and implicit operators (Imp-Op).

Model                  P     R     F1
HeidelTime            70.9  76.8  73.7
Char 3-Softmax        73.8  62.4  67.6
Char 3-Softmax extra  82.7  71.0  76.4

Table 6: Precision (P), recall (R), and F1 of our models on the test data producing bounded time intervals. For comparison, we include the results obtained by HeidelTime.

On the identification task, Char 3-Softmax extra is no worse than using the original dataset, with an overall F1 of 61.5 vs.
61.3 (p=0.5899), and with the extra generated data the model is better at predicting non-operators and implicit operators, with higher precisions (p=0.0096), which is key to producing correct bounded time intervals.

To compare our approach with the state of the art, we run HeidelTime on the test documents and make use of the metric described in Section 3. This way, we can compare the intervals produced by both systems regardless of the annotation schema. Table 6 shows that our model with additional randomly generated training data outperforms HeidelTime in terms of precision, with a significant difference of 12.6 percentage points (p=0.011), while HeidelTime obtains a non-significant better performance in terms of recall (p=0.1826). Overall, our model gets 3.3 more percentage points than HeidelTime in terms of F1 (p=0.2485). Notice that, although the model trained without extra annotations is better in time entity composition (see Table 5), it performs much worse at producing final intervals. This is caused by the fact that this model fails to identify the non-operators that compose dates in unseen formats (see Section 5.3).

Model                  P     R     F1
HeidelTime            70.7  80.2  75.1
Char 3-Softmax        74.3  64.2  68.9
Char 3-Softmax extra  83.3  74.1  78.4

Table 7: Precision (P), recall (R), and F1 on bounded intervals on the TimeML/SCATE perfectly overlapping test data.

However, evaluating HeidelTime on the SCATE annotations may not be totally fair. HeidelTime was developed following the TimeML schema and, as we show in Section 4, SCATE covers a wider set of time expressions. For this reason, we perform an additional evaluation. First, we compare the annotations in the test set using our interval-based metric, similar to the comparison reported in Table 2, and select those cases where TimeML and SCATE match perfectly. Then, we remove the rest of the cases from the test set. Consequently, we also remove the predictions given by the systems, both ours and HeidelTime, for those instances. Finally, we run the interval scorer using the new configuration. As can be seen in Table 7, all the models improve their performance. However, our model still performs better when it is trained with the extra annotations.

The SCATE interpreter that encodes the time intervals needs the compositional graph of a time expression to have all its elements correct. Thus, failing in the identification of any entity of a time expression results in totally uninterpretable graphs. For example, in the expression next year, if our model identifies year as a PERIOD instead of an INTERVAL, it cannot be linked to next because that violates the SCATE schema. The model can also fail in the recognition of some time entities, like summer in the expression last summer. These identification errors are caused mainly by the sparse training data. As graphs containing these errors produce unsolvable logical formulae, the interpreter cannot produce intervals and hence the recall decreases. Within those intervals that are ultimately generated, the most common mistake is to confuse the LAST and NEXT operators, which results in an incorrectly placed interval even with correctly identified non-operators. For example, if an October with an implicit NEXT operator is instead given a LAST operator, instead of referring to [2013-10-01T00:00, 2013-11-01T00:00), it will refer to [2012-10-01T00:00, 2012-11-01T00:00). Missing implicit operators is also the main source of errors for HeidelTime, which fails with complex compositional graphs.
For example, that January day in 2011 is annotated by HeidelTime as two different intervals, corresponding respectively to January and 2011. As a consequence, HeidelTime predicts not one but two incorrect intervals, affecting its precision.

7 Discussion

As for the time entity identification task, the performance differences between the development and test datasets could be attributed to the annotation distributions of the datasets. For example, there are 10 SEASON-OF-YEAR annotations in the test set while there are no such annotations in the development dataset; the relative frequencies of the annotations MINUTE-OF-HOUR, HOUR-OF-DAY, TWO-DIGIT-YEAR and TIME-ZONE in the test set are much lower, and our models are good at predicting such annotations. Explicit operators are very lexically dependent, e.g., LAST corresponds to one word from the set {last, latest, previously, recently, past, over, recent, earlier, the past, before}, and the majority of them appear once or twice in the training and development sets.

Our experiments verify the advantages of character-based models in predicting SCATE annotations, in agreement with our explanations in Section 5.1.2: word-based models tend to fail to distinguish numbers from digit-based time expressions. It is difficult for word-based models to catch some patterns of time expressions, such as 24th and 25th, August and Aug., etc., while character-based models are robust to such variance. We ran an experiment to see whether these benefits were unique to compositional annotations like those of SCATE, or applied more generally to simply recognizing time expressions. We used the TimeML annotations from AQUAINT and TimeBank (see Table 1) to train two multi-class classifiers to identify TIMEX3 annotations. The models were similar to our Char 3-Softmax and Word 3-Softmax models, using the same parameter settings, but with a single softmax output layer to predict the four types of TIMEX3: DATE, TIME, DURATION, and SET. As shown in Table 8,
Another source of errors is the distance of the 10 characters we use to decide if the time entities belong to the same time expression. This condition prevents the creation of some links, for example, the expression “Later” at the beginning of a sentence typically refers to another time interval in a previous sentence, so the distance between them is much longer. 8 Conclusion We have presented the first model for time normaliza- tion trained on SCATE-style annotations. The model outperforms the rule-based state-of-the-art, proving that describing time expressions in terms of compo- sitional time entities is suitable for machine learn- ing approaches. This broadens the research in time normalization beyond the more restricted TimeML schema. We have shown that a character-based neural network architecture has advantages for the task over a word-based system, and that a multi-output net- work performs better than producing a single output. Furthermore, we have defined a new interval-based evaluation metric that allows us to perform a com- parison between annotations based on both SCATE and TimeML schema, and found that SCATE pro- vides a wider variety of time expressions. Finally, we have seen that the sparse training set available induces model overfitting and that the largest number of errors are committed in those cases that appear less frequently in the annotations. This is more significant in the case of explicit operators because they are very dependent on the lexicon. Improving performance on these cases is our main goal for future work. Accord- ing to the results presented in this work, it seems that a solution would be to obtain a wider training set, so a promising research line is to extend our approach to automatically generate new annotations. 9 Software The code for the SCATE-style time normalization models introduced in this paper is available at https://github.com/clulab/timenorm. 10 Acknowledgements We thank the anonymous reviewers as well as the action editor, Mona Diab, for helpful comments on an earlier draft of this paper. The work was funded by the THYME project (R01LM010090) from the National Library Of Medicine, and used computing resources supported by the National Science Founda- tion under Grant No. 1228509. The content is solely the responsibility of the authors and does not nec- essarily represent the official views of the National Library Of Medicine, National Institutes of Health, or National Science Foundation. References [Bethard and Parker2016] Steven Bethard and Jonathan Parker. 2016. A semantically compositional anno- tation scheme for time normalization. In Proceedings 354 of the Tenth International Conference on Language Re- sources and Evaluation (LREC 2016), Paris, France, 5. European Language Resources Association (ELRA). [Bethard2013] Steven Bethard. 2013. A synchronous con- text free grammar for time normalization. In Proceed- ings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 821–826, Seattle, Washington, USA, 10. Association for Computational Linguistics. [Bird et al.2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. [Chiu and Nichols2016] Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistic, 4:357–370. [Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. 
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555v1.

[Collobert et al. 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, November.

[Fischer and Strötgen 2015] Frank Fischer and Jannik Strötgen. 2015. When does (German) literature take place? On the analysis of temporal expressions in large corpora. In Proceedings of DH 2015: Annual Conference of the Alliance of Digital Humanities Organizations, volume 6, Sydney, Australia.

[Gillick et al. 2016] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1296–1306. Association for Computational Linguistics.

[Graves et al. 2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.

[Han and Lavie 2004] Benjamin Han and Alon Lavie. 2004. A framework for resolution of time in natural language. ACM Transactions on Asian Language Information Processing (TALIP), 3(1):11–32, March.

[Huang et al. 2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

[Kuru et al. 2016] Onur Kuru, Ozan Arkan Can, and Deniz Yuret. 2016. CharNER: Character-level named entity recognition. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 911–921.

[Lample et al. 2016a] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016a. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

[Lample et al. 2016b] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016b. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 260–270.

[Lee et al. 2014] Kenton Lee, Yoav Artzi, Jesse Dodge, and Luke Zettlemoyer. 2014. Context-dependent semantic parsing for time expressions. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Baltimore, Maryland, June. Association for Computational Linguistics.

[Lin et al. 2015] Chen Lin, Elizabeth W. Karlson, Dmitriy Dligach, Monica P. Ramirez, Timothy A. Miller, Huan Mo, Natalie S. Braggs, Andrew Cagan, Vivian S. Gainer, Joshua C. Denny, and Guergana K. Savova. 2015. Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. Journal of the American Medical Informatics Association, 22(e1):e151–e161.

[Llorens et al. 2012] Hector Llorens, Leon Derczynski, Robert J. Gaizauskas, and Estela Saquete.
2012. TIMEN: An open temporal expression normalisation resource. In Language Resources and Evaluation Conference, pages 3044–3051. European Language Resources Association (ELRA).

[Ma and Hovy 2016] Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), volume 1. Association for Computational Linguistics.

[Mazur and Dale 2010] Paweł Mazur and Robert Dale. 2010. WikiWars: A new corpus for research on temporal expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 913–922, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Plank et al. 2016] Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418, Berlin, Germany, August. Association for Computational Linguistics.

[Pustejovsky et al. 2003a] James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003a. TimeML: Robust specification of event and temporal expressions in text. In IWCS-5, Fifth International Workshop on Computational Semantics.

[Pustejovsky et al. 2003b] James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003b. The TimeBank corpus. In Proceedings of Corpus Linguistics 2003, Lancaster.

[Pustejovsky et al. 2010] James Pustejovsky, Kiyong Lee, Harry Bunt, and Laurent Romary. 2010. ISO-TimeML: An international standard for semantic annotation. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

[Qi et al. 2009] Yanjun Qi, Koray Kavukcuoglu, Ronan Collobert, Jason Weston, and Pavel P. Kuksa. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1737–1740. ACM.

[Schilder 2004] Frank Schilder. 2004. Extracting meaning from temporal nouns and temporal prepositions. ACM Transactions on Asian Language Information Processing (TALIP), Special Issue on Temporal Information Processing, 3(1):33–50, March.

[Strötgen and Gertz 2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.

[Strötgen and Gertz 2015] Jannik Strötgen and Michael Gertz. 2015. A baseline temporal tagger for all languages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 541–547, Lisbon, Portugal, September. Association for Computational Linguistics.

[Strötgen et al. 2013] Jannik Strötgen, Julian Zell, and Michael Gertz. 2013. HeidelTime: Tuning English and developing Spanish resources for TempEval-3. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval '13, pages 15–19. Association for Computational Linguistics.

[Toutanova et al. 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network.
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

[UzZaman et al. 2013] Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

[Verhagen et al. 2007] Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 75–80, Prague, Czech Republic.

[Verhagen et al. 2010] Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62, Uppsala, Sweden, July. Association for Computational Linguistics.

[Vossen et al. 2016] Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, Marco Rospocher, and Roxane Segers. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Knowledge-Based Systems, Special Issue, Elsevier.