Authenticating the Writings of Julius Caesar

Mike Kestemont, University of Antwerp, Prinsstraat 13, B-2000 Antwerp, Belgium. Corresponding author: mike.kestemont@uantwerp.be; 0032/477.91.86.68
Justin Stover, University of Oxford, All Souls College, Oxford OX1 4AL, United Kingdom. justin.stover@classics.ox.ac.uk
Moshe Koppel, Bar-Ilan University, 52900 Ramat-Gan, Israel. moishk@gmail.com
Folgert Karsdorp, Center for Language Studies, Radboud University, P.O. Box 9103, NL-6500 HD, Nijmegen, The Netherlands. fbkarsdorp@fastmail.nl
Walter Daelemans, University of Antwerp, Prinsstraat 13, B-2000 Antwerp, Belgium. walter.daelemans@uantwerp.be

Abstract

In this paper, we shed new light on the authenticity of the Corpus Caesarianum, a group of five commentaries describing the campaigns of Julius Caesar (100–44 BC), the founder of the Roman Empire. While Caesar himself authored at least part of these commentaries, the authorship of the rest of the texts remains a puzzle that has persisted for nineteen centuries. In particular, the role of Caesar's general Aulus Hirtius, who claimed a role in shaping the corpus, has remained in contention. Determining the authorship of documents is an increasingly important authentication problem in information and computer science, with valuable applications ranging from the domain of art history to counter-terrorism research. We describe two state-of-the-art authorship verification systems and benchmark them on 6 present-day evaluation corpora, as well as a Latin benchmark dataset. Regarding Caesar's writings, our analyses allow us to establish that Hirtius's claims to part of the corpus must be considered legitimate. We thus demonstrate how computational methods constitute a valuable methodological complement to traditional, expert-based approaches to document authentication.

Keywords: Authentication, Authorship Verification, Stylometry, Julius Caesar

1. Introduction

Throughout the twentieth century, influential post-structuralist thinkers, such as Foucault or Barthes, have fiercely argued against the importance of the notion of 'authorship' (Barthes, 1968; Foucault, 1969). Across many fields in the Humanities, for instance, this famously led to a temporary devaluation of the importance attached to the relationship between texts and their original producers (Love, 2002). However, numerous examples demonstrate that the public interest in authorship currently shows few signs of abating. The highly mediatized discovery of a pseudonymously published novel by the acclaimed Harry Potter novelist J.K. Rowling is a good example in this respect (Juola, 2015, 2013). In recent years, much other authorship-related research, such as the Shakespeare controversy (Burrows, 2012), has continued to make frequent headlines in the popular media. In academia too, the much debated application of bibliometrics (Cronin, 2001) or well-known cases of plagiarism (Maurer et al., 2006) hardly suggest that the notion of authorship has suffered a major loss of public interest. Unsurprisingly, automated authorship analysis (Juola, 2006; Koppel et al., 2009; Stamatatos, 2009b) currently receives increasing attention in Computer and Information Sciences too, as a form of document authentication with promising practical applications across various domains, such as plagiarism detection (Stein et al., 2011) or even in forensic sciences (Chaski, 2005; Juola, 2015).
Most computational authorship studies in computer science are still restricted to present-day document collections. In this paper, we illustrate the broader applicability of computational authorship verification by reporting a high-profile case study from Classical Antiquity (Koppel & Seidman, 2013; Stover et al., 2016). The 'War Commentaries' by Julius Caesar (Corpus Caesarianum) refers to a group of Latin prose commentaries, describing the military campaigns of the world-renowned statesman Julius Caesar (100–44 BC), the founder of the Roman Empire. While Caesar must have authored a significant portion of these commentaries himself, the exact delineation of his contribution to this important corpus remains a controversial matter. Most notably, Aulus Hirtius – one of Caesar's most trusted generals – is sometimes believed to have contributed significantly to the corpus. Thus, the authenticity and authorship of the Caesarian corpus is a philological puzzle that has persisted for nineteen centuries. In this paper, we use computational authorship verification to shed new light on the matter.

Below, we will first situate our work in the field of stylistic authentication studies, focusing on the style versus content debate, as well as the difference between open set and closed set attribution. We go on to discuss our implementation of two verification systems, a first-order and a second-order approach, which represent the state of the art in the field, given the results of the latest relevant competitions on authorship verification. We first benchmark both systems on 6 present-day data sets, before testing them on an evaluation set of Latin documents from Antiquity. Finally, we analyse the Corpus Caesarianum, offering a detailed discussion of the historical implications of our results.

2. Style vs Content

Traditionally, scholars have long employed a pragmatic distinction between the 'style' and 'content' of written documents (Stamatatos et al., 2000), the former encapsulating all aspects of an individual author's language use at the textual level (Hermann et al., 2015). In authorship studies, there is nowadays a general consensus that features related to style are more useful (Juola, 2006; Koppel et al., 2009; Stamatatos, 2009b), since topical, content-related features vary much more strongly across the documents authored by a single individual. Much research nowadays therefore concerns ways to effectively extract stylistic characteristics from documents that are not affected by a text's specific content or genre (Argamon & Levitan, 2005; Kestemont et al., 2012; Efstathios, 2013; Sapkota et al., 2015; Seroussi et al., 2014; Sapkota et al., 2014). This has not always been the case: practitioners in earlier centuries commonly based attributions on a much more loosely defined set of linguistic criteria, including, for instance, the use of conspicuous, rare words (Love, 2002; Kestemont, 2014). Naturally, an expert reader's subjective intuitions (Gelehrtenintuition, connoisseurship) would play a much larger role in such studies than would nowadays be acceptable. The focus on striking characteristics, especially, would turn out to be problematic. Importantly, low-frequency features are typically tied to fairly specific topics, and thus do not scale well to new texts.
More importantly, these whimsical items also appeal to imitators and followers: in the case of malignant forgeries or benign epigones, the authentication of documents will fail if it is restricted to easy-to-copy, low-frequency characteristics (Love, 2002).

The pioneering work by Mosteller and Wallace on the pseudonymously published Federalist papers has marked a turning point in this respect (Mosteller & Wallace, 1964). Mosteller and Wallace proposed to rigidly restrict analyses to high-frequency characteristics and only considered an author's use of function words, the small and closed set of grammatical items in a language which – as opposed to content words such as nouns or verbs – do not carry a straightforward semantics when used in isolation (e.g. the article 'the' or the preposition 'of') (Aronoff & Fudeman, 2005). For authorship studies, function words are extremely attractive: they are frequent and well-distributed variables across documents, and consequently, they are not specifically linked to a single topic or genre. Importantly, psycholinguistic research suggests that grammatical morphemes are less consciously controlled in human language processing, since they do not actively attract cognitive attention (Stamatatos, 2009b; Binongo, 2003; Argamon & Levitan, 2005; Peng et al., 2003). This suggests that function words are relatively resistant to stylistic imitation or forgery.

With respect to function words, a number of recent developments are relevant. Ever since the Federalist papers, research into English-language documents has dominated authorship studies. In English, many functional morphemes are realised as individual words which can be easily identified in running text (Aronoff & Fudeman, 2005). In recent decades, attention to other, low-resource languages has increased, including languages that display a much higher level of word inflection (e.g. the Finno-Ugric family) (Rybicki & Eder, 2011). Until fairly recently, other types of style markers (e.g. syntactic ones) rarely outperformed simple, word-level style markers (Holmes, 1994, 1998; Halteren et al., 2005). Later, character n-grams were introduced as a powerful alternative to function words (Kjell, 1994; Daelemans, 2013). This representation from Information Retrieval (originally used for automatic language identification) models texts at the sub-word level and segments them into a series of consecutive, partially overlapping groups of n characters; under a third-order trigram model (n = 3), for instance, the word 'trigrams' would yield the n-grams {'tri', 'rig', 'igr', 'gra', 'ram', 'ams'}.

Multiple studies have demonstrated the excellent performance of character n-grams for modelling authorship, especially when it comes to more highly inflected languages such as Latin (Sidorov et al., 2014; Efstathios, 2013). This modelling strategy has the advantage that it can also capture morphemic information at the subword level, and is thus potentially sensitive to functional morphemes that are not realised as individual words (e.g. word endings) (Kestemont, 2014; Sapkota et al., 2015; Stamatatos, 2009b). Similarly, certain n-grams also pick up word stems, and research increasingly demonstrates that text representations based on function words can be supplemented with information from lower-frequency strata in languages (Burrows, 2007), such as word stems (Koppel et al., 2009). Naturally, such approaches need to carefully avoid overfitting on the content of a specific document collection.
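To make the segmentation explicit, the following minimal Python sketch (our own illustration, not part of the systems described below; the function name char_ngrams is ours) reproduces the 'trigrams' example:

```python
# Minimal sketch: extracting consecutive, partially overlapping
# character n-grams from a string.
def char_ngrams(text, n=3):
    """Return the character n-grams of `text` in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("trigrams", n=3))
# ['tri', 'rig', 'igr', 'gra', 'ram', 'ams']
```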
Recent research demonstrated that predominantly functional character n-grams (including punctuation (Grieve, 2007)) are powerful authorship predictors (Sapkota et al., 2015). This helps explain why this family of features proves more robust with respect to cross-genre problems (Efstathios, 2013; Sapkota et al., 2014). Other recent studies have successfully applied Bayesian topic models to automatically separate style from content (Seroussi et al., 2014).

This paper will not dwell on feature selection, although we recognise the substantial efforts and advances which have been made on the topic of feature engineering in authorship studies. We limit the stylistic properties studied below to two commonly used feature types: word unigrams and character n-grams. These feature types have the advantage that they can be easily extracted from corpora, without requiring the application of preprocessing tools, such as part-of-speech taggers or parsers, which might not be available for all languages. Their relevance has moreover been clearly motivated in the existing literature (Daelemans, 2013; Kestemont, 2014; Sapkota et al., 2015; Stamatatos, 2009b). While many studies have indeed reported the successful application of other feature types (Stamatatos, 2009b), it is clear from comparative experiments that word unigrams and character n-grams represent state-of-the-art feature types in authorship studies.

3. Methods

A number of different experimental procedures should be distinguished in present-day authorship studies (Stamatatos, 2009b). A first important distinction is that between authorship attribution and authorship verification (also known as open-set attribution). In the simple attribution scenario, the task is to attribute an anonymous text to a known author, through selecting the correct author from a set of candidate authors. In this closed-set scenario, the algorithm can safely assume that the correct target author is present in the set of available candidate authors, a scenario resembling a police line-up. It has been shown that the difficulty of this task increases as the number of candidate authors grows, and the length and/or number of the available texts decreases (Daelemans & Van den Bosch, 2005). While the attribution setup is not completely unrealistic, it has been noted that in many real-world applications, it cannot be guaranteed that a text's true author is present among the candidates. This is why the verification scenario was introduced, in which the task is to decide whether or not an anonymous text was written by a given candidate author (hence, verification). The verification setup is known to be a more generic, yet also more difficult setup. Recent research has explored interesting ways of combining both attribution and verification in a single system (Puig et al., 2016), although both setups are usually treated separately. The Caesarian corpus under scrutiny is a textbook example of a problem in authorship verification, since we do not have any guarantees as to the identity of the authors involved. For this paper, we will therefore use generic implementations of two verification methods which represent the state of the art in the field, especially when looking at the results of the latest PAN competitions.
Both systems have proven to be successful approaches to authorship verification, and many of the top-performing contestants in these competitions have integrated variations of them.

Authorship verification is a problem which has been studied for a number of years in the annual PAN competition. The design and evaluation of our analyses closely adhere to this competition's conventions to increase the comparability of our results (Stamatatos et al., 2014). Each dataset in the PAN competition consists of a set of 'problems', in which at least one, but possibly more 'known' documents are available, which were all written by the same target author. Additionally, each problem defines an 'unknown' text, for which it has to be determined whether or not it was written by the author of the 'known' texts, through assigning a score between 0 (definitely not the same author) and 1 (definitely the same author), with a threshold at .5. Systems are allowed to leave a selection of difficult problems unanswered by assigning a score of exactly .5. The problems in each dataset fall into two non-overlapping sets: one development set of problems, on which systems could be calibrated, and a roughly equal-sized set of test problems, on which the calibrated systems were evaluated. The performance of the submitted systems is evaluated on the basis of two metrics: the AUC score (area under the curve, a well-known scalar evaluation score for binary classifiers) and the more recently proposed c@1 score (Peñas & Rodrigo, 2011). Unlike the AUC score, c@1 extends the traditional accuracy score (i.e. the ratio of correct answers) by rewarding careful systems that choose to leave unanswered those problems which they consider too difficult. The final performance of systems is reported as the product of the AUC and c@1 metrics. Following the conventions used at the PAN competition, we statistically compare the accuracy of classifiers using approximate randomisation: this non-parametric test is valuable because it does not make assumptions about the (potentially highly complex) distributions of the compared system outputs.

3.1. Verification Systems

The first verification system (termed O1 here) was seminally introduced by Kešelj et al. (2003) and was subsequently refined (Potha & Stamatatos, 2014; Kestemont et al., 2011; Stamatatos, 2009a). O1 resorts to the direct (or 'first order') calculation of a distance metric between a target author's stylistic profile in a given problem and the unknown text. Following (Potha & Stamatatos, 2014; Koppel & Seidman, 2013), we define an author's profile here as the mean centroid of the known document vectors for that author (i.e. we average an author's score for a particular term across all training texts). Originally, O1 was introduced with a specific distance metric, called 'common n-grams' (cng). Let A and B be the vectors representing an author's centroid and the unknown document respectively, consisting of n character n-gram values in some fixed order.
Let a_i and b_i represent the value of the i-th feature in both vectors respectively:

\[ \mathrm{cng}(A, B) = \sum_{i=1}^{n} \left( \frac{2(a_i - b_i)}{a_i + b_i} \right)^2 \qquad (1) \]

Studies vary in their exact implementation of this method: the earliest papers would calculate this distance function only for character n-grams which were present in both the profile and the unknown document (hence 'common' n-grams), but subsequent research showed that it is beneficial to apply the distance function only to the items which are present in the unknown document (Stamatatos, 2007), which is the implementation we use. To verify whether the unknown document was written by the target author in the problem, O1 uses thresholding: unknown documents resulting in a distance below a given threshold are attributed to the target author, while all others are not. To normalize the resulting distance scores to probability scores in the 0–1 range, they are scaled using the set of all non-zero pairwise scores which can be obtained between the known documents in a problem set, before their positive complement is taken (Potha & Stamatatos, 2014). While O1 has so far primarily been used with the cng metric, it can also be used with the other distance metrics introduced below.

The second verification system (termed O2 here) is a generic implementation of the General Imposters (GI) framework (Koppel & Winter, 2014). The general intuition behind the GI is not to assess whether two documents are simply similar in writing style, given a static feature vocabulary, but rather to assess whether two documents are significantly more similar to one another than to other documents, across a variety of stochastically impaired feature spaces (Stamatatos, 2006; Eder, 2012), and compared to random selections of so-called distractor authors (Juola, 2015), also called 'imposters'. O1 relies on the calculation of a direct, first-order distance measure between two documents to assess whether they are similar enough to be attributed to the same individual. The GI, however, resorts to the calculation of a 'second-order' metric (see Alg. 1, SI). Let x be the vector representing an anonymous document which is compared to T = {t_1, ..., t_n}, a set of documents by the target author. The task is to determine whether the documents in T were or were not written by the same author as x. Additionally, the GI procedure has access to I = {i_1, ..., i_n}, a set of distractor documents by so-called imposter authors. The GI then starts a bootstrapped procedure: during k iterations, it randomly samples a subset of the available features, as well as a random subset of imposters from I as I′. In each iteration, we determine whether x is closer to any of the documents in T than to any of the documents in I′, given the impaired feature space and a distance function. Instead of returning a first-order distance, the GI returns a second-order metric, indicating the proportion of iterations in which x was closer to an item in T than to any item in I′. As a proportion, the second-order score produced by O2 will automatically lie between 0 and 1 (higher scores indicate a higher attribution confidence). A similar thresholding procedure is therefore applied as with O1. O2 too can be used with a variety of distance metrics, including the cng metric used in O1. Note that O2 is an example of an 'extrinsic' verification method (Juola & Stamatatos, 2013): as opposed to the 'intrinsic' setup of O1, O2 also uses known documents from other authors in a particular problem set.
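As a concrete illustration of both verifiers, the sketch below gives a minimal Python implementation of the cng distance of Eq. (1) and of a bare-bones GI-style bootstrap. It is our own simplification under stated assumptions: the function names, the default number of iterations and the sampling proportions are illustrative, `target_docs` is assumed to be a list of feature vectors, `imposters` a 2-D numpy array of imposter (profile) vectors with at least `n_imp` rows, and the score normalisation and calibration steps described above are omitted.

```python
# Sketch of the O1 and O2 verifiers (our own simplification, not the
# authors' released implementation).
import numpy as np

def cng(a, b):
    """'Common n-grams' dissimilarity of Eq. (1), restricted to the
    features that are present in the unknown document b (Stamatatos, 2007)."""
    mask = b > 0
    a, b = a[mask], b[mask]
    return np.sum((2.0 * (a - b) / (a + b)) ** 2)

def verify_first_order(profile, unknown, threshold, dist=cng):
    """O1: attribute `unknown` to the profile's author iff the first-order
    distance falls below a (pre-calibrated) threshold."""
    return dist(profile, unknown) < threshold

def verify_second_order(unknown, target_docs, imposters, dist=cng,
                        k=100, feat_frac=0.5, n_imp=10, seed=1):
    """O2 / General Imposters: proportion of k bootstrap iterations in
    which `unknown` lies closer to some target document than to every
    sampled imposter, under a randomly impaired feature space."""
    rng = np.random.default_rng(seed)
    n_feats = unknown.shape[0]
    hits = 0
    for _ in range(k):
        # impair the feature space and sample a random subset of imposters
        feats = rng.choice(n_feats, size=int(feat_frac * n_feats), replace=False)
        imps = imposters[rng.choice(len(imposters), size=n_imp, replace=False)]
        d_target = min(dist(t[feats], unknown[feats]) for t in target_docs)
        d_imposter = min(dist(i[feats], unknown[feats]) for i in imps)
        hits += int(d_target < d_imposter)
    return hits / k  # second-order score in the 0-1 range
```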
In this paper, we sample imposter authors from the known documents that are available for other authors in a particular problem set. To ensure the comparability of O1 and O2, we sample author profiles (i.e. mean centroids) from the imposter pool, instead of individual documents. Previous studies have automatically crawled the web for useful imposter documents, which yields results that might be difficult to reproduce exactly. Additionally, there is the inherent danger that one might obtain imposter documents that were indeed written by the target author, which would compromise the proper working of O2. Naturally, this problem is even more real in the case of the Latin data sets used below, because of the relatively sparse online availability of Latin documents from Classical Antiquity.

3.2. Vector space models

In technical terms, a collection of texts in authorship studies is typically represented using a vector space model (VSM), as is common in text classification research (Sebastiani, 2002; Stamatatos et al., 2000). Both O1 and O2 are applied to such a VSM, yielding a matrix-like representation of a text collection, in which each document is assigned an equal-sized vector, which numerically represents a selection of its stylistic and linguistic properties, also called features, such as word unigram frequencies (Salton & Buckley, 1988; Manning et al., 2008). This process of vectorization typically operates under a 'bag-of-words' assumption, which models the occurrence of items in a text, but is in many cases insensitive to their relative order or exact position in a document. A number of different VSMs are currently dominant, the choice between which clearly reflects the style vs content assumptions outlined above.

The simplest vectorization model is the term-frequency model (tf), which records the relative frequency of the individual terms (e.g. words or n-grams) in a document in some fixed order. In authorship studies, it is not uncommon to aggressively truncate such VSMs to the most frequent items in the document collection (sometimes as few as 30 items (Burrows, 2002)). This truncation is a simple yet efficient strategy to combat vector sparsity and automatically causes models to focus on functional morphemes, since grammatical items are typically the most frequent ones in corpora (Stamatatos, 2009b). When allowing larger vectors, the tf-model has the disadvantage that it quickly comes to suffer from sparsity artefacts. Additionally, tf assigns equal weights to stylistic properties across the frequency spectrum of a language; therefore, it does not provide any form of feature weighting.

Another commonly used VSM is the tf-idf model from Information Retrieval (Manning et al., 2008). The tf-idf model extends the plain tf-model by weighting a word with its inverse document frequency (idf) in the collection. Thus, rare words that are present in only a few documents will receive more weight. In many ways, this model can be contrasted with the assumption that low-frequency items are bad predictors of authorial style (Binongo, 2003). Nevertheless, a few studies suggest that it might be useful (Koppel & Winter, 2014). Arguably, this model captures the intuition that if a highly rare feature is present in two documents, this increases the likelihood that the two documents were authored by the same individual.
While the method might therefore be sensitive to overfitting on low-frequency properties, this might be an attractive characteristic in certain (e.g. single-domain) authorship problems.

Thirdly, there is the std-model, which weights the tf-model by scaling term frequencies by their standard deviation across the documents in the corpus. The model was initially suggested by Burrows (2002) as part of a memory-based learning system for authorship attribution and was later theoretically simplified (Argamon, 2008). A similar approach has been proposed in (Kešelj et al., 2003). This model captures the inverse intuition of the tf-idf model, since it will boost the importance of very common items in a document collection, which will have a relatively low standard deviation in tf. This is highly uncommon in other applications in the Information Sciences (e.g. document retrieval), although the model has been shown to work surprisingly well for authorship attribution in many studies (Stamatatos, 2009b).

3.3. Distance metrics

Both O1 and O2 crucially depend on distance metrics which can be applied to two vectors, in this case a vector representing an author's profile and a vector representing an unknown document. In authorship studies, it is a well-known fact that the choice of a particular distance metric has a clear effect on the performance of systems (Evert et al., 2015), which is why distance metrics have continued to attract a lot of attention in the field (Kešelj et al., 2003; Hoover, 2004; Stamatatos, 2007; Smith & Aldridge, 2011; Luyckx & Daelemans, 2011; Jockers et al., 2008; Evert et al., 2015). Previous studies have amply shown that specific metrics might behave and perform rather differently in different problem setups, stressing the fundamentally ad hoc nature of many authorship problems (Juola, 2006; Evert et al., 2015). While many variations have been proposed, only a small set of metrics (or slight variations thereof) seem to have yielded consistent and good performance across studies. The traditional 'Manhattan' city block distance is a popular choice, which defines the difference between two documents as the sum of the absolute differences between all features. The city block distance predominantly works well for small and dense VSMs, with very limited vocabularies, such as small sets of function word frequencies. Cosine-based metrics are known to scale better to larger, sparse vectors, and they are therefore more common in the Information Sciences (Manning et al., 2008). The cosine distance, for instance, is a pseudo-distance measure based on the complement (in positive space) of the angular cosine similarity between two document vectors.

In this paper, we will also compare these more established metrics to the still fairly novel minmax measure (Koppel & Winter, 2014), originally introduced in geobotanics by M. Ružička (Ružička, 1958). While the metric has re-emerged a number of times in different disciplines (e.g. as the 'Jaccardized Czekanowski index' (Schubert & Telcs, 2014)), it is only a recent addition to authorship studies. In mathematical notation, the minmax measure was originally formulated as the following similarity measure (Cha, 2007). Let a and b represent two document vectors, consisting of n features in some fixed order. Let a_i and b_i represent the value of the i-th feature in both documents respectively (e.g. the relative frequencies of a particular word in both documents, in the case of the simple tf-model):
\[ \mathrm{minmax}(a, b) = \frac{\sum_{i=1}^{n} \min(a_i, b_i)}{\sum_{i=1}^{n} \max(a_i, b_i)} \qquad (2) \]

We turn this similarity metric into a true distance measure by taking its complement in positive space (Schubert & Telcs, 2014): 1 − minmax(a, b). So far, minmax has only been studied in the context of larger verification systems (Koppel & Seidman, 2013; Koppel & Winter, 2014; Seidman, 2013; Khonji & Iraqi, 2014), so that its individual contribution has not been clearly studied yet. More importantly, its performance has not yet been rigorously compared to other distance measures, under different experimental setups or in combination with different VSMs. In this paper, we will therefore elucidate the interplay of this distance metric and the VSMs described. In the context of the tf-idf model, for instance, the minmax metric will naturally boost the importance of features with larger values (i.e. those that are highly document-specific), whereas the opposite will happen in the std-model. We will empirically investigate the effect of this additional feature weighting.

4. Benchmark results

4.1. PAN data

To demonstrate the overall validity of our approach, we first benchmark O1 and O2 on 6 publicly available benchmark corpora which have been used in the 2014 edition of the PAN competition on authorship verification (Stamatatos et al., 2014) (pan.webis.de). In this yearly competition, teams can participate in a number of challenges involving forensic text analysis, such as plagiarism detection or authorship classification tasks. The organizers release training data on which teams can independently develop systems, before submitting their software. The organizers then run the software on new, unseen test data and rank the submitting teams according to their performance. We focus on the authorship verification track, which has been organised for a number of years now. The PAN 2014 verification datasets (see SI) only concern present-day writing samples, and vary strongly in nature, size and difficulty, so that they provide a solid point of reference. The availability of the results reported by competitors on a fixed test set moreover makes it easy to compare our results to the best performing systems which were entered into the competition. We report our full results in the SI and limit the discussion in the main text to a sample of illustrative examples. First, we calibrate O1 and O2 on the development problems and then apply both systems to the test problems, reporting the AUC · c@1 for the test problems. In the SI, we report results for each combination of a VSM and distance metric, for the following feature types: word unigrams, character trigrams, and character tetragrams. For each feature type, we used VSMs that represent full vocabularies. To assess whether O1 and O2 produce significantly different results, we have applied an approximate randomisation test to each pair of scores from O1 and O2. Table 1 gives a representative list of results in terms of AUC · c@1, namely the test results for using word unigrams in each corpus, for O1 and O2. For each problem set, we also list the performance of the best-performing individual system in that task, as well as the meta-classifier trained on all submitted systems (which often, but not always, yields the strongest overall result) (Stamatatos et al., 2014).
Table 1: A representative list of the main verification results on the PAN corpora in terms of AUC · c@1, namely the test results for using word unigrams in each corpus, for O1 and O2. For each problem set, we also list the performance of the best-performing individual system in that task, as well as the meta-classifier trained on all submitted systems (which often, but not always, yields the strongest overall result) (Stamatatos et al., 2014).

Combination               Dutch     Dutch     English   English   Greek     Spanish
                          essays    reviews   essays    novels    articles  articles
O1
  cng - tf-std            76.89     31.95     23.94     22.26     28.79     59.54
  cng - tf-idf            76.81     32.62     24.38     22.04     28.89     60.50
  cng - tf                75.85     31.32     23.40     21.83     28.21     61.95
  cosine - tf-std         67.51     27.91     23.61     50.22     42.65     57.20
  cosine - tf-idf         61.41     27.50     16.58     33.68     44.33     50.71
  cosine - tf             48.11     36.79     28.77     27.60     41.13     47.50
  minmax - tf-std         71.32     34.66     25.46     45.14     54.98     45.67
  minmax - tf-idf         76.95     42.09     24.22     52.69     59.33     45.63
  minmax - tf             70.16     40.32     27.12     47.45     67.47     76.98
  manhattan - tf-std      61.93     35.06     23.64     21.31     27.37     40.44
  manhattan - tf-idf      71.44     37.08     24.33     43.44     33.52     68.51
  manhattan - tf          76.35     34.59     23.92     40.61     42.80     67.83
O2
  cng - tf-std            83.80     35.91     26.49     34.72     48.26     74.25
  cng - tf-idf            81.70     36.55     27.95     35.47     48.83     73.10
  cng - tf                80.26     35.99     27.25     35.80     50.90     80.54
  cosine - tf-std         87.50     33.58     29.12     37.35     50.20     63.41
  cosine - tf-idf         90.96     36.82     18.16     33.80     41.26     64.54
  cosine - tf             76.59     36.58     24.95     30.55     48.63     69.52
  minmax - tf-std         89.52     38.78     35.13     36.48     57.33     71.61
  minmax - tf-idf         93.70     48.42     30.40     40.66     67.32     73.03
  minmax - tf             87.44     38.90     30.50     36.93     67.57     83.77
  manhattan - tf-std      47.96     33.05     25.31     22.65     27.99     32.94
  manhattan - tf-idf      74.22     35.37     27.36     37.61     35.59     50.69
  manhattan - tf          84.58     37.38     28.06     37.43     56.68     63.69
2014 Meta-classifier      86.70     42.80     53.10     50.80     72.00     70.90
2014 Best single system   82.30     52.50     51.30     47.60     63.50     69.80

A number of high-level trends emerge. The results immediately illustrate the large differences in overall difficulty which exist between the various data sets, ranging from the good scores which can be obtained for the relatively easy corpus of Dutch-language essays, to the more difficult corpus of English essays. Overall, O2 typically yields a higher performance than O1, although O1 produces the single highest score for the English novels, where the documents are considerably longer than elsewhere. In two problem sets, the Dutch essays and Spanish articles, O2 and O1 respectively yield surprisingly strong results, even outperforming the meta-classifier and the top-performing system in the PAN competition. In the Dutch reviews and Greek articles, the performance of O2 can be characterised as very decent, lying between that of the meta-classifier and that of the best performing individual system. Interestingly, both O1 and O2 perform relatively poorly for the following two data sets: the English essays and English novels (where text length clearly affects performance). With respect to the former corpus, we hypothesise that this loss in performance for O2 is due to the fact that we did not crawl the web for suitable imposters (as other studies have done), but limited our distractor pool to the other known documents in the problem set (because of our focus on Latin documents below). In these particular corpora, the algorithm might suffer from sampling documents that are too similar in content to the unknown document to act as a useful comparand.
As to the other feature types, the results show that manhattan only yields acceptable results for the character trigram features, which is an expected outcome, because character trigrams lead to a much denser corpus representation. For sparser representations, the minmax and cosine distance offer a much better fit. Especially in the case of word unigrams – which produce the strongest results across corpora – the novel minmax metric offers surprisingly strong results in comparison to the established metrics (it is part of every winning combination under O2). Interestingly, the effect of VSMs is much less pronounced than that of distance metrics: the minmax and cosine metrics are generally least affected by a change in VSM.

4.2. Latin data

We now proceed to benchmarking our systems on a corpus of historic Latin authors. For this study we have collected a representative reference corpus, containing works by some of the main Latin prose authors from Classical Antiquity, such as Cicero, Seneca or Suetonius. These predominantly include historiographical texts (e.g. Livy's Ab Urbe Condita) which are sufficiently similar to Caesar's War Commentaries. All original texts were cut up into non-overlapping slices of 1000 words; while this constitutes a challengingly limited document size, this procedure allows us to obtain a sufficiently fine-grained analysis of the Caesarian corpus. For modern documents, promising results are increasingly obtained with small document sizes (Koppel et al., 2013; Koppel & Winter, 2014), such as the PAN data used above.

To create a set of development and test problems, we proceed as follows. We split the available oeuvres at the author level into two equal-sized sets. For each set we create a balanced set of same-author and different-author problems: for each true document-author pair, we also include a false document-author pair, whereby we randomly assign a different target author to the test document in question. This ensures that there is no overlap between the development and test problems created: we can therefore parametrize the systems on the development set and evaluate them on the test set, in a fashion entirely parallel to the PAN data.

Figure 1: Precision-recall curves for each metric-VSM combination on the Latin benchmark data (test problems), using the O1 'first-order' verification system. The c@1 score is listed in the legend. The cosine and minmax metrics consistently yield higher results than cng and manhattan.

Figure 2: Precision-recall curves for each metric-VSM combination on the Latin benchmark data (test problems), using the O2 'second-order' verification system. The c@1 score is listed in the legend. Only the manhattan distance now yields inferior results: the bootstrapping greatly reduces the variation between the different metric-VSM combinations.

In Figs. 1 and 2 we graphically show the results for O1 and O2 on the Latin benchmark corpus, again using untruncated vocabularies: for each combination of a VSM and a distance metric, we plot a precision-recall curve; the c@1 score is listed in the legend (see SI for detailed results). The following trends clearly emerge: O2 consistently (in most cases significantly) outperforms O1 on the Latin data. O1 shows wildly diverging results, especially across different distance metrics, whereas the effect of VSMs is much less pronounced. In O2, both the cosine distance and the minmax distance yield results that are clearly superior to cng and cityblock.
Overall, O2 yields much more stable results: for most combinations, the curves can no longer even be visibly distinguished. Unsurprisingly, cityblock is the only metric which yields visibly inferior results for O2. In O2 too, the minmax and cosine distance overall yield the highest c@1, which is invariably in the upper nineties. Our evaluation shows that the recently introduced minmax metric yields a surprisingly good and consistent performance in comparison to more established metrics. While it is not consistently the best performing metric, it produced highly stable results for the PAN data (and to a lesser extent for the Latin data). Overall, we hypothesize that the formulation of the minmax metric has a regularizing effect in the context of authorship studies. Due to its specific formulation, the minmax metric will automatically produce distances in the 0–1 range, in contrast to the more extreme distances which can be produced by e.g. Manhattan. Perhaps because of this, the minmax metric interacts well with both the std and the tf-idf model, although these VSMs capture inverse intuitions. Like cosine, which also naturally scales distances, minmax is relatively insensitive to the dimensionality of the VSM under which the metric is applied.

5. Caesar's writings

After benchmarking our verification systems, we now proceed to apply them to the Caesarian corpus (Corpus Caesarianum), focusing on O2 because it produced the most stable results for the benchmark data (i.e. on average, it produced the highest results across different metric-vector space combinations). The Caesarian corpus is composed of five commentaries describing Caesar's military campaigns (Mayer, 2011; Gaertner & Hausburg, 2013):

Gallic War (Bellum Gallicum): conquest of Gaul, 58–50 BC;
Civil War (Bellum civile): civil war with Pompey, 49–48 BC;
Alexandrian War (Bellum Alexandrinum): Middle East campaigns, 48–47 BC;
African War (Bellum Africum): war in North Africa, 47–46 BC;
Spanish War (Bellum Hispaniense): rebellion in Spain, 46–45 BC.

The first two commentaries are mainly by Caesar himself, the only exception being the final part of the Gallic War (Book 8), which is commonly attributed to Caesar's general Aulus Hirtius (c. 90–43 BC). Caesar's primary authorship of these two works, except for Book 8, is guaranteed by the ancient testimonia of Cicero, Hirtius, Suetonius, and Priscian, as well as the unanimous evidence of the manuscript tradition. Caesar's ancient biographer Suetonius, writing a century and a half after his death, suggests that either Hirtius or another general, named Oppius, authored the remaining works: '[Caesar] also left commentarii of his deeds during the Gallic War and the Civil War with Pompey. For the author of the Bellum Alexandrinum, Africum, and Hispaniense is uncertain. Some think it is Oppius, others Hirtius, who supplemented the last, incomplete book of the Bellum Gallicum' (Appendix I). We also have a letter of Hirtius to Cornelius Balbus, a fellow supporter of Caesar, which is transmitted in the manuscripts preceding the Hirtian 8th book of the Gallic War. In this letter, Hirtius lays out his project: 'I have continued the accounts of our Caesar on his deeds in Gaul, since his earlier and later writings did not fit together, and I have also finished the most recent and incomplete account, extending it from the deeds in Alexandria down to the end, not admittedly of civil discord, of which we see no end, but of Caesar's life' (Gaertner & Hausburg, 2013).
Despite occasional doubts, the most recent analysis has shown that there is no reason at all for doubting the authenticity of the letter (Gaertner & Hausburg, 2013). Hence, a puzzle that has persisted for nineteen centuries: what are the relationships of the different war commentaries to one another, to Hirtius, and to Caesar (Mayer, 2011)? Current scholarship has focused primarily on the authorship of the Alexandrian War. J. Gaertner and B. Hausburg (Gaertner & Hausburg, 2013) concluded that Hirtius knit together disparate sources to complete the text, including a Caesarian core of material in chapters 1–21. He also exercised a role in the formation of the whole corpus, though with a much less firm editorial hand. Their analysis was based on a painstaking account of all sorts of evidence, including statistical analysis of usage and language. Their account represents the pinnacle of what can possibly be achieved by manual analytical methods, and offers a ripe target for re-analysis with automated computational methods. We are not the first to do so: in 2002 M. Trauth proposed a computer-assisted analysis of the Corpus which failed to reach any definitive conclusions on the authorship of the Bellum Alexandrinum, based on an automated tabulation of the most frequent words (Trauth, 2002). More than a decade of advances in computational philology allow us to go beyond his inconclusive analysis.

To shed new light on the authenticity of the Caesarian corpus, we proceed as follows. To obtain documents of a similar size, we have divided all original commentaries into consecutive, non-overlapping slices of 1000 words and treat these slices as individual documents. We label these documents according to the assumption that the Gallic and Civil Wars were written by CAESAR, with the exception of the 8th book of the former commentary, which we ascribe to HIRTIUS. To label the disputed authors of the Alexandrian, African and Spanish War, we use the provisional labels X, Y and Z respectively. Fig. 3 offers an initial inspection of the stylistic structure in this corpus, in the spirit of the first-order distance calculations of O1. We generated a square distance table by applying the minmax distance metric to every document pair in the Caesarian collection, and we scaled the distances to the 0–1 range. Next, we plotted a heat map of the distance matrix, and ran a conventional cluster analysis on top of the rows and columns. For generating the hierarchical dendrograms next to the heatmap, we used the default agglomerative clustering routine in the Seaborn library (https://web.stanford.edu/~mwaskom/software/seaborn/), which is based on the pairwise Euclidean distance between entries and the average linkage method. The labels indicate the most plausible authorial provenance of a document (if known), given the annotation labels we just described.

This rather naive approach demonstrates a clear-cut distinction: a significant portion of the Bellum Alexandrinum (X) clusters with Hirtius's contribution to the Gallic Wars, under a clade that is clearly separate from Caesar's accepted writings. Thus, Hirtius's writings are distinguished from Caesar's own core contributions; Hirtius's samples are compellingly close in style to X. Samples from the Alexandrian War appear to be stylistically close to Hirtius's contribution to the Gallic Wars in Book 8 – which is itself surprisingly distinct from the other books of that commentary.
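For concreteness, the following Python sketch shows one way such a visualisation could be produced. It is our own reconstruction under stated assumptions (a `docs` array of relative term frequencies and a matching `labels` list are presumed to exist), not the authors' actual script.

```python
# Sketch of the heat map / cluster analysis described above (our own
# reconstruction). Assumes `docs` is an (n_documents x n_features) array
# of relative frequencies and `labels` holds the corresponding authorial
# labels (CAESAR, HIRTIUS, X, Y, Z).
import numpy as np
import pandas as pd
import seaborn as sns

def minmax_distance(a, b):
    """Complement in positive space of the minmax similarity of Eq. (2)."""
    return 1.0 - np.minimum(a, b).sum() / np.maximum(a, b).sum()

def distance_table(docs, labels):
    """Square table of pairwise minmax distances, rescaled to the 0-1 range."""
    n = len(docs)
    d = np.array([[minmax_distance(docs[i], docs[j]) for j in range(n)]
                  for i in range(n)])
    d = d / d.max()  # rescale so the largest distance equals 1
    return pd.DataFrame(d, index=labels, columns=labels)

def plot_clustermap(docs, labels):
    """Heat map plus row/column dendrograms; seaborn's clustermap defaults
    to Euclidean distances and average linkage, matching the settings
    reported above."""
    return sns.clustermap(distance_table(docs, labels),
                          method="average", metric="euclidean")
```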
The more fundamental question now is how close these texts should truly be in order to proceed to an actual attribution. We therefore turn to a more advanced analysis using O2. As with the problems in the benchmark experiments, each sample in the commentary collection was individually paired with the profile of each of the five Caesarian 'authors' available (including X, Y and Z): using the bootstrapped procedure from O2, we calculate a second-order similarity score by assessing in which proportion of a series of iterations the document would be attributed to a particular Caesarian author's profile, instead of to a distractor author in the background corpus. This procedure thus yields, per document, 5 second-order scores, reflecting the probability that the sample must be attributed to a Caesarian author's profile, rather than to an imposter. Following the outcome of the benchmark results, we perform this analysis for the five top-scoring metric-VSM combinations. Afterwards, we average the results over these five simulations and graphically present the results in Fig. 4 (the full results are included in the SI). Note that in this setup we are especially interested in attribution leakage from one potential author to another: the fact that a text is attributed to the profile based on the other samples from its own text is an expected result; the attribution to another Caesarian 'author', however, is not.

Our O2 analyses divide the Caesarian corpus into two branches at the top level, which might be called 'Caesarian' and 'non-Caesarian'. As we would expect, the Caesarian branch includes both the Civil War and the Gallic War, books 1–7. However, it also includes the first three samples from the Alexandrian War, providing dramatic confirmation of the theory of a Caesarian core in the first 21 chapters of the work. The other branch includes Gallic War, book 8, the rest of the Alexandrian War, the African War, and the Spanish War. The first two are closely affiliated with one another, indicating shared authorship. Stylistically there is no good reason for rejecting Hirtius's authorship of the Alexandrian War, once we remove the Caesarian chapters 1–21. Gaertner and Hausburg (Gaertner & Hausburg, 2013) argue strongly against Hirtius's authorship of the Alexandrian War, instead assigning him an amorphous role as editor of the corpus. It is true that the Alexandrian War shows far greater heterogeneity than the Spanish War, for example, but it clearly clusters with the Gallic War, book 8, in a way the other texts do not, and displays no greater stylistic heterogeneity than Caesar's own commentaries.

The African War and the Spanish War are the most internally consistent of the texts, perhaps an indication of separate authorship. They do, however, cluster with one another and with Hirtius, and the non-Caesarian texts all show a greater similarity with each other than with the Caesarian texts. While they are not stylistically homogeneous enough to allow us to posit single authorship in a naive sense, they display no greater stylistic heterogeneity than is present in the Caesarian texts. On both branches, we find the stylistic range we ought to expect in the genre of war commentaries, where commanders drawing up the official account of their campaigns would draw upon the dispatches of their legates and subordinates, sometimes integrating them into their own style, other times incorporating their texts with few changes.
Importantly, Fig. 4 has an additional feature: whereas other X samples could be found scattered across Caesar's authentic writings in the non-bootstrapped verification, O2 adds a distinct clade for these and a small set of other samples. This is a strong indication that the bootstrapped O2 system is not only able to distinguish authentic Caesarian material from non-authentic writings, but that it can even differentiate a pure Caesarian style from the impure style resulting from collaborative authorship or the use of source texts. Hence, our analyses broadly support the following conclusions:

1. Caesar himself wrote, in addition to books 1–7 of the Gallic Wars and the Civil War, the first 21 chapters of the Alexandrian War.
2. Hirtius wrote Book 8 of the Gallic Wars and the remainder of the Alexandrian War.
3. At least one other author wrote the African War and the Spanish War; the African War and the Spanish War were probably written by two different authors.
4. Our results do not invalidate Hirtius's own claim that he himself compiled and edited the corpus of the non-Caesarian commentaries.
5. The significant stylistic heterogeneity we have detected in parts of the Gallic War and the Civil War likely represents Caesar's compositional practice of relying on, and sometimes incorporating, the briefs written for him by his legates.

These findings are entirely consistent with a natural interpretation of Hirtius's own words in his letter to Balbus: that he composed Gallic War, book 8 as a bridge between the preceding 7 books and the Civil War, that he completed the Alexandrian War, and that he added the two other commentaries to make the whole group a continuous narrative of Caesar's campaigns. Chronologically the corpus thus ends in March 45 BC with the Battle of Munda in Spain, but since we know that the end of the Spanish War is missing, there is no reason why we cannot assume that it originally continued with a brief epilogue bringing the narrative up to conclude with Caesar's assassination in 44 BC.

6. Acknowledgements

The authors would like to thank [anonymized] for their valuable feedback on earlier drafts of this article. Moshe Koppel acknowledges the support of the Intel Collaboration Research Institute for Computational Intelligence. The work of Folgert Karsdorp has been supported by the Computational Humanities Programme of the Royal Netherlands Academy of Arts and Sciences, under the auspices of the Tunes & Tales project.

7. References

Argamon, S. (2008). Interpreting Burrows's Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing, 23, 131–147.
Argamon, S., & Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution. In Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (2005).
Aronoff, M., & Fudeman, K. (2005). What is Morphology? Blackwell.
Barthes, R. (1968). La mort de l'auteur. Manteia, 5, 12–17.
Binongo, J. (2003). Who wrote the 15th Book of Oz? An application of multivariate analysis to authorship attribution. Chance, pp. 9–17.
Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17, 267–287.
Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22, 27–47.
Burrows, J. (2012). A second opinion on Shakespeare and authorship studies in the twenty-first century. Shakespeare Quarterly, 63, 355–392.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1, 300–307.
Chaski, C. E. (2005). Who's at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4, 1–13.
Cronin, B. (2001). Hyperauthorship: A postmodern perversion or evidence of a structural shift in scholarly communication practices? Journal of the American Society for Information Science and Technology, 52, 558–569.
Daelemans, W. (2013). Explanation in computational stylometry. In Computational Linguistics and Intelligent Text Processing (pp. 451–462). Springer.
Daelemans, W., & Van den Bosch, A. (2005). Memory-Based Language Processing. Studies in Natural Language Processing. Cambridge University Press.
Eder, M. (2012). Computational stylistics and biblical translation: How reliable can a dendrogram be? In T. Piotrowski & Ł. Grabowski (Eds.), The Translator and the Computer (pp. 155–170). Wrocław: WSF Press.
Efstathios, S. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21, 421–439.
Evert, S., Proisl, T., Schöch, C., Jannidis, F., Pielström, S., & Vitt, T. (2015). Explaining Delta, or: How do distance measures for authorship attribution work? doi:10.5281/zenodo.18308.
Foucault, M. (1969). Qu'est-ce qu'un auteur? Bulletin de la Société française de philosophie, 3, 73–104.
Gaertner, J., & Hausburg, B. (2013). Caesar and the Bellum Alexandrinum: An Analysis of Style, Narrative Technique, and the Reception of Greek Historiography. Vandenhoeck & Ruprecht.
Grieve, J. (2007). Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22, 251–270.
Halteren, H. V., Baayen, H., Tweedie, F., Haverkort, M., & Neijt, A. (2005). New machine learning methods demonstrate the existence of a human stylome. Journal of Quantitative Linguistics, 12, 65–77.
Hermann, J., van Dalen-Oskam, K., & Schöch, C. (2015). Revisiting style, a key concept in literary studies. Journal of Literary Theory, 9, 25–52.
Holmes, D. (1994). Authorship attribution. Computers and the Humanities, 28, 87–106.
Holmes, D. (1998). The evolution of stylometry in Humanities scholarship. Literary and Linguistic Computing, 13, 111–117.
Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19, 453–475.
Jockers, M. L., Witten, D. M., & Criddle, C. S. (2008). Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification. Literary and Linguistic Computing, 23, 465–491.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1, 233–334.
Juola, P. (2013). Rowling and Galbraith: An authorial analysis. http://languagelog.ldc.upenn.edu/nll/?p=5315.
Juola, P. (2015). The Rowling case: A proposed standard analytic protocol for authorship questions. Digital Scholarship in the Humanities. doi:10.1093/llc/fqv040.
Juola, P., & Stamatatos, E. (2013). Overview of the author identification task at PAN 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23–26, 2013.
Kešelj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING'03 (pp. 255–264). Dalhousie University, Halifax, Nova Scotia, Canada.
Kestemont, M. (2014). Function words in authorship attribution: From black magic to theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (pp. 59–66). Association for Computational Linguistics.
Kestemont, M., Luyckx, K., & Daelemans, W. (2011). Intrinsic plagiarism detection using character trigram distance scores – Notebook for PAN at CLEF 2011. In CLEF 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands.
Kestemont, M., Luyckx, K., Daelemans, W., & Crombez, T. (2012). Cross-genre authorship verification using unmasking. English Studies, 93, 340–356.
Khonji, M., & Iraqi, Y. (2014). A slightly-modified GI-based author-verifier with lots of features (ASGALF). In Working Notes for CLEF 2014 Conference, Sheffield, UK (pp. 977–983).
Kjell, B. (1994). Discrimination of authorship using visualization. Information Processing and Management, 30, 141–150.
Koppel, M., Schler, J., & Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60, 9–26.
Koppel, M., Schler, J., & Argamon, S. (2013). Authorship attribution: What's easy and what's hard? Journal of Law & Policy, 21, 317–331.
Koppel, M., & Seidman, S. (2013). Automatically identifying pseudepigraphic texts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1449–1454). Association for Computational Linguistics.
Koppel, M., & Winter, Y. (2014). Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology, 65, 178–187.
Love, H. (2002). Attributing Authorship: An Introduction. Cambridge: Cambridge University Press.
Luyckx, K., & Daelemans, W. (2011). The effect of author set size and data size in authorship attribution. Literary and Linguistic Computing, 26, 35–55.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
Maurer, H., Kappe, F., & Zaka, B. (2006). Plagiarism - A Survey. Journal of Universal Computer Science, 12, 1050–1084.
Mayer, M. (2011). Caesar and the Corpus Caesarianum. In G. Marasco (Ed.), Political Autobiographies and Memoirs in Antiquity: A Brill Companion (pp. 189–232). Brill.
Mosteller, F., & Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.
Peñas, A., & Rodrigo, A. (2011). A simple measure to assess non-response. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11 (pp. 1415–1424). Stroudsburg, PA, USA: Association for Computational Linguistics. URL: http://dl.acm.org/citation.cfm?id=2002472.2002646.
Peng, F., Schuurmans, D., Wang, S., & Keselj, V. (2003). Language independent authorship attribution using character level language models. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics - Volume 1, EACL ’03 (pp. 267–274). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1067807.1067843.
Potha, N., & Stamatatos, E. (2014). A profile-based method for authorship verification. In A. Likas, K. Blekas, & D. Kalles (Eds.), Artificial Intelligence: Methods and Applications (pp. 313–326). Lecture Notes in Computer Science, vol. 8445. Springer International Publishing. doi:10.1007/978-3-319-07064-3_25.
Puig, X., Font, M., & Ginebra, J. (2016). A unified approach to authorship attribution and verification. The American Statistician, advance access. doi:10.1080/00031305.2016.1148630.
Ružička, M. (1958). Anwendung mathematisch-statistischer Methoden in der Geobotanik (synthetische Bearbeitung von Aufnahmen). Biológia (Bratislava), 13, 647–661.
Rybicki, J., & Eder, M. (2011). Deeper Delta across genres and languages: Do we really need the most frequent words? Literary and Linguistic Computing, 26, 315–321.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24, 513–523.
Sapkota, U., Bethard, S., Montes, M., & Solorio, T. (2015). Not all character n-grams are created equal: A study in authorship attribution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 93–102). Denver, Colorado: Association for Computational Linguistics. URL: http://www.aclweb.org/anthology/N15-1010.
Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., & Rosso, P. (2014). Cross-topic authorship attribution: Will out-of-topic data help? In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23–29, 2014, Dublin, Ireland (pp. 1228–1237). URL: http://aclweb.org/anthology/C/C14/C14-1116.pdf.
Schubert, A., & Telcs, A. (2014). A note on the Jaccardized Czekanowski similarity index. Scientometrics, 98, 1397–1399. doi:10.1007/s11192-013-1044-2.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1–47.
Seidman, S. (2013). Authorship verification using the impostors method. In CLEF 2013 Evaluation Labs and Workshop - Online Working Notes.
Seroussi, Y., Zukerman, I., & Bohnert, F. (2014). Authorship attribution with topic models. Computational Linguistics, 40, 269–310. doi:10.1162/COLI_a_00173.
Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., & Chanona-Hernández, L. (2014). Syntactic n-grams as machine learning features for natural language processing. Expert Systems with Applications, 41, 853–860. doi:10.1016/j.eswa.2013.08.015.
Smith, P. W. H., & Aldridge, W. (2011). Improving authorship attribution: Optimizing Burrows’ Delta method. Journal of Quantitative Linguistics, 18, 63–88.
Stamatatos, E. (2006). Authorship attribution based on feature set subspacing ensembles. International Journal on Artificial Intelligence Tools, 15, 823–838.
Stamatatos, E. (2007). Author identification using imbalanced and limited training texts. In Proceedings of the 18th International Conference on Database and Expert Systems Applications, DEXA ’07 (pp. 237–241). Washington, DC, USA: IEEE Computer Society. doi:10.1109/DEXA.2007.41.
Stamatatos, E. (2009a). Intrinsic plagiarism detection using character n-gram profiles. In Third PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse (pp. 38–46).
Stamatatos, E. (2009b). A survey of modern authorship attribution methods. Journal of the Association for Information Science and Technology, 60, 538–556.
Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M. A., & Barrón-Cedeño, A. (2014). Overview of the author identification task at PAN 2014. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15–18, 2014 (pp. 877–897).
Stamatatos, E., Kokkinakis, G., & Fakotakis, N. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26, 471–495.
Stein, B., Lipka, N., & Prettenhofer, P. (2011). Intrinsic plagiarism analysis. Language Resources and Evaluation, 45, 63–82.
Stover, J., Winter, Y., Koppel, M., & Kestemont, M. (2016). Computational authorship verification method attributes a new work to a major 2nd century African author. Journal of the American Society for Information Science and Technology, 67, 239–242.
Trauth, M. (2002). Caesar incertus auctor. Ein quantifizierendes Wort zur Kritik von Verfassersfragen in lateinischen Texten. In J. Jährling, U. Meves, & E. Timm (Eds.), Röllwagenbüchlein. Festschrift Walter Röll (pp. 313–334). Niemeyer.

Figure 3: Naive heatmap visualisation of the stylistic structure in the Corpus Caesarianum, based on the scaled, pairwise distance matrix computed with the first-order minmax distance metric and the tf VSM (full vocabulary). Conventional clustering was run on the rows and columns, which represent non-overlapping 1,000-word samples from the texts. A significant portion of the Bellum Alexandrinum (labeled X) clusters with Hirtius’s contribution to the Gallic Wars, in a clade that is separate from Caesar’s accepted writings.

Figure 4: Cluster and heatmap visualisation of the results of the O2 verification procedure on the Caesarian corpus. Cell values represent the average probability of a sample being attributed to one of the five profiles distinguished. Five independent analyses were run with the five top-performing metric-VSM combinations from the benchmark section. O2 seems able not only to distinguish authentic Caesarian material from non-authentic writings, but arguably also to differentiate between a ‘pure’ Caesarian style and the mixed style that results from, for instance, the general’s dependence on pre-existing briefs by legates.

Highlights
1. We shed new light on the authenticity of the writings of Julius Caesar.
2. Hirtius, one of Caesar’s generals, must have contributed to Caesar’s writings.
3. We benchmark two authorship verification systems on publicly available data sets.
4. We test on both modern data sets and Latin texts from Antiquity.
5. We show how computational methods inform traditional authentication studies.
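For readers who wish to experiment with the kind of computation summarised in Figure 3, the snippet below is a minimal, illustrative sketch: tf vectors over non-overlapping word samples, pairwise minmax (Ružička) distances, and a clustered heatmap. It is not the authors’ exact pipeline; the synthetic demo texts, the sample size, and the scikit-learn/seaborn tooling are assumptions made purely for illustration, and the Latin texts of the corpus would have to be substituted for the placeholder input.

```python
# Minimal sketch (not the authors' pipeline): pairwise minmax (Ruzicka) distances
# between tf vectors of non-overlapping word samples, shown as a clustered heatmap.
# The demo texts below are synthetic placeholders; real texts would be loaded instead.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns  # clustermap = heatmap with clustering on rows and columns


def split_samples(text, size=1000):
    """Cut a text into consecutive, non-overlapping samples of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - size + 1, size)]


def minmax_distances(X):
    """Pairwise minmax (Ruzicka) distances for a samples-by-vocabulary tf matrix."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = np.minimum(X[i], X[j]).sum() / np.maximum(X[i], X[j]).sum()
            D[i, j] = D[j, i] = 1.0 - sim
    return D


rng = np.random.default_rng(0)
vocabulary = ["gallia", "est", "omnis", "divisa", "in", "partes", "tres", "bellum"]
texts = {name: " ".join(rng.choice(vocabulary, 3000))      # placeholder demo texts
         for name in ("text_A", "text_B")}

samples, labels = [], []
for title, text in texts.items():
    chunks = split_samples(text)
    samples.extend(chunks)
    labels.extend(f"{title}_{k}" for k in range(len(chunks)))

counts = CountVectorizer().fit_transform(samples).toarray().astype(float)
tf = counts / counts.sum(axis=1, keepdims=True)            # plain tf model, full vocabulary
sns.clustermap(minmax_distances(tf), xticklabels=labels, yticklabels=labels)
```

The minmax similarity divides the summed element-wise minima of two tf vectors by their summed element-wise maxima, so identical profiles receive distance 0 and samples with disjoint vocabularies receive distance 1; hierarchical clustering of the resulting matrix is what produces the clade structure visible in the figure.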