MANUSCRIPTS AND MACHINES: THE AUTOMATIC REPLACEMENT OF SPELLING VARIANTS IN A PORTUGUESE HISTORICAL CORPUS

RITA MARQUILHAS AND IRIS HENDRICKX

Abstract

The CARDS-FLY project aims to collect and transcribe, in a digital format, a diverse sample of historical personal letters from the 16th to the 20th century, in order to create a linguistic resource for the historical study of the Portuguese language and society. The letters were written by people from all layers of society, and their historical, social and pragmatic contexts are documented in the digital format. Here we study one particular aspect of this collection, namely its spelling variation. On the basis of this analysis, we improved a statistical spelling normalisation tool that we aim to use to normalise automatically the spelling in the full collection of digitised letters.

Keywords: historical linguistics, spelling variation, automatic normalization, Portuguese

1. Introduction [1]

Personal letters can have a twofold importance for historians. First, they play a supporting role as documents that contain first-person testimonies (with all their flaws of accuracy, to be sure), ready for interpretation alongside all other available sources on whatever topic the historian is studying.
In this light, they often become ‘a providential manna to feed biographies, the sketch of everyday life, the taste for intimacy and confidential matters’.[2] Secondly, historians can also find personal letters to be important for their own sake, if the context is that of a history of written culture.[3] Here they play the leading role, given their status as social practices, ‘traces of a complex reality that absorbs countless other practices and registers’.[4] For example, they enclose literate (and half-literate) discourses on the practice of writing itself. Also, they are samples of intimate interactions, whose participants were conscious of the spatial-temporal discontinuity of their speech acts. They constituted either polite or impolite behaviour, either orderly or disorderly conduct, depending on the observance of conventions valid for the historical communities in question. For this second approach, nevertheless, rich collections of letters are mandatory, because cultural interpretations have to be tested against a large quantity of data that represents the norm followed by social actors, and a thin quantity of exceptions that constituted possible marginal behaviours.

International Journal of Humanities and Arts Computing 8.1 (2014): 65–80. DOI: 10.3366/ijhac.2014.0120. © Edinburgh University Press 2014. www.euppublishing.com/ijhac

At the Linguistics Centre of the University of Lisbon (CLUL), such a large collection is being assembled, the CARDS-FLY corpus, in order to attend both to the needs of cultural historians and to the needs of historical linguists. Historical linguistics is the study of language change through time, and original, non-literary sources are the preferred data for the description and interpretation of such change. Spontaneous oral utterances would be the ideal data, but since their retrieval is impossible for language as spoken in past centuries, the personal letter discourse is the next best candidate.
It offers the linguist the recording of a behaviour carried out by interactive speakers with a more informal attitude than the one adopted by writers of literary or institutional texts.

The CARDS corpus (Cartas Desconhecidas – Unknown Letters) is a collection of 2,000 personal Portuguese letters written between the 16th and the 19th century. The ones dating from 1500 to 1800 were mainly seized by a religious court (the Portuguese Inquisition) as instrumental proof to prosecute individuals accused of heretical beliefs. As for the 19th century ones, they were mainly seized by a Crown court (the Casa da Suplicação) as instrumental proof exhibited either by the prosecution or by the defence of individuals accused of anti-social or anti-political behaviour. The project ran from 2007 to 2010 and was carried out by a mixed team of historians and linguists. The role of the linguists was to decipher and publish the manuscripts with philological care in order to preserve their relevance as sources for the history of language variation and change. The role of the historians was to contextualise the letters' discourse as social events. The whole set of transcriptions, accompanied by a context summary, was given a machine-readable format, which allowed for the assemblage of an online Portuguese historical corpus of the Early Modern Age.

Following CARDS, the FLY project (Forgotten Letters, Years 1900–1974) was launched in 2010 by the same core team, now accompanied by modern history experts, as well as sociologists. The aim was to enlarge the former corpus with data from the 20th century.
Since collecting personal papers from contemporary times is a delicate task, given the need to protect private data from public scrutiny, the letters of the FLY project come mostly from donations by families willing to contribute to the preservation of Portuguese collective memory having to do with wars (World War I and the 1961–1974 colonial war), emigration, political prison, and exile. These were also favourable contexts for a high production of written correspondence with family and friends, because in such circumstances strong emotions such as fear, longing, and loneliness were bound to arise.

The CARDS-FLY corpus is thus a linguistic resource prepared for the historical study of Portuguese language and society. Its strength lies in its broad social representativeness, being entirely composed of documents whose texts belong to the letter genre, the personal domain, and the informal linguistic register.[5] The final goal is to have a total of 4,000 letters. By May 2013, the team had already transcribed a total of 3,809 letters involving 2,286 different participants (82 per cent men, 18 per cent women) and around 1.1 million words.

The digital encoding of the letters follows a set of guidelines prepared by the Flemish project DALF: Digital Archive of Letters in Flanders, based on the TEI P4 Guidelines.[6] This encoding offers a machine-readable file format that allows for the philological care that critical editions demand. The mark-up language is XML, and the label contents are the ones fixed by DALF for the letters' idiosyncrasies and by TEI for primary sources.[7] The letter manuscripts were transcribed in a conservative way, and features such as unreadable parts, scratched-out parts or perforations in the letters are encoded explicitly in the XML mark-up.
Also, the spelling of the original document is maintained, as this is relevant for the history of language change, a prospect that is always compromised when spelling normalisations are practised by editors.

On the other hand, the lack of normalisation for spelling creates a problem when the letters are seen as a target for corpus linguistics operations: morphological annotation, parsing, semantic annotation, concordancing, word lists, and keywords. Such a level of processing demands a corpus in standard spelling, a resource that is also invaluable for historians focusing on the discursive features that manifest themselves through keywords and semantic fields present in the corpus.[8] As we intend to use the corpus for this purpose, we are in need of a normalised version. Manual spelling correction is a laborious and time-consuming effort, and therefore we decided to explore the possibilities for automatic normalisation. We have already carried out some exploratory experiments along that path.[9]

Here we first give a detailed analysis of how spelling varied and changed over time in our corpus, based on a statistical analysis of a sample taken from the CARDS-FLY corpus. Next we present some practical results of automatic spelling normalisation. We close with a discussion of the benefits and limits of using statistical methods for spelling normalisation, and conclude that the benefits of the procedure are indeed remarkable.

2. Setting a standard for written Portuguese

In the history of Portugal the standard norm for written language came late, only in 1911, one year after the establishment of the Republic.
The adoption of a standard had been persistently proposed since the 18th century, following foreign examples, but there was never a favourable occasion for the Royal Academy of Sciences of Lisbon (Academia Real das Ciências de Lisboa) to produce a written model, either in the 1700s or in the 1800s.[10] When a Portuguese orthography could finally be decided, there were two possible paradigms that would serve as alternative models: the shallow orthography, such as the Spanish and the Italian, which preserved phoneme-grapheme correspondences, and the deep orthography, followed in the French and the English spelling standards. The deep paradigm, more etymological, is a type of spelling where morphology, rather than phonology, is recoverable by literate people.[11] The authors of the 1911 Portuguese spelling reform decided openly for the shallow paradigm. They justified their choice as a way of creating the proper instrument for a rapid rise in literacy rates in Portuguese society:

What are the bases for the Portuguese orthography that our Commission proposes? There were, from the beginning of the works, two systems that could be followed. One of them was the French orthography, which, more or less coherently, has been imitated in Portugal for some time now. The other system is the one of the Spanish and Italian orthographies, much simpler, more rational, logical and easy to learn, much more adapted to the natural and even literary evolution of those languages, which is also similar to the evolution of Portuguese. What radically differentiates the orthography of those two official languages [Spanish and Italian] is the modification of the Latin spelling of innumerable Romanised Greek words to other spellings, much more similar to the value of the letters of such words in modern times.
In order to make the teaching of reading and writing an easier task, the Commission found that the time had come to banish once and for all from the Portuguese writing, as they were banished from the Spanish and the Italian for a long time, [...] the symbols ph, th, rh, and y [...].

Translated from the Portuguese Bases da Reforma de 1911.[12]

The 1911 reform put an end to a long search for a Portuguese standard for spelling. But it raised a diplomatic misunderstanding between Portugal and Brazil, a problem that took a further 100 years to be solved. In 1990, all the Portuguese-speaking countries signed an agreement on a decisive spelling reform. In 2011 that reform was finally adopted by the Portuguese education system.[13]

3. Automatic spelling normalisation

3.1. Related Work

Here we first give some examples of recent related studies that handle spelling variation in historical corpora in general, and then focus on studies for the Portuguese language.

The VARiant Detector (VARD) tools aimed to detect spelling variation in Early Modern English and were created for corpus linguistic research.[14] The first version of the tool was based on a list of manually created mappings between historical variants and their modern versions. The latest version combined several different modules, such as a list of letter replacement rules, a phonetic matching algorithm and an edit distance search method, to detect spelling variation. We discuss a Portuguese version of VARD in Section 3.3. Craig and Whipp have also worked on a tool for automatic spelling variation detection for Early Modern English, but from the perspective of authorship attribution.[15] For the corpus of Early Modern German, a spelling variation detection tool is currently under development.[16]
For the Spanish diachronic corpus, a study of the effect of automatic spelling normalisation has been conducted.[17] The authors compared two different strategies, namely to first automatically normalise the data before using an NLP tool, or to adapt the NLP tool itself to handle spelling variation. For their purpose of part-of-speech tagging, they argued that tool adaptation is better, as the original spelling is kept.

As for Portuguese, most of the available studies concerning spelling change throughout Early Modern and Modern times have a cultural historical perspective, which means that what they analyse are the discourses of contemporary élite writers, mostly grammar authors and dictionary authors. Such discourses were either bitter criticisms of the lack of a spelling standard for the language, or concrete proposals for a solution to that void.[18] As for quantitative corpus-based approaches to the same spelling change, they had to wait for the assemblage of large Portuguese historical corpora covering the Early Modern and Modern era, a work that is being mostly undertaken in Brazil. The Tycho Brahe team, of Campinas University, was the first to present statistical measurements of the spelling change phenomenon in order to solve the processing problems it raised,[19] followed by the Historical Dictionary of Brazilian Portuguese team (Dicionário Histórico do Português do Brasil).[20] This dictionary is constructed on the basis of a historical Portuguese corpus (16th to 19th century) of approximately 5 million tokens. As they needed a normalised corpus to produce reliable frequency counts for the dictionary, they developed a rule-based method to automatically cluster spelling variants together.
They clustered spelling variants around one common word form, which is not always a modern word form but the most central word form in the cluster of related variants, leading to a spelling variants dictionary.[21]

A resource very similar to the CARDS-FLY corpus is the Shared Diachronic Corpus: Personal Brazilian Letters (Corpus Compartilhado Diacrônico: cartas pessoais brasileiras), which consists of a Brazilian collection of historical personal letters from the 18th to the 20th century.[22] The aim is to provide the academic community with a resource for the sociolinguistic history of Rio de Janeiro’s society over 300 years. The documents in this collection have also been normalised for spelling, but all normalisation was done manually, with the help of a user-friendly tool, namely E-Dictor, offered by the above-mentioned Tycho Brahe project.[23]

3.2. DICER

Similarly to the Brazilian experiments, our study also uses a statistical corpus-based approach to get a better insight into Portuguese spelling variation over the 16th–20th century time span. The main novelty of our study is that we deal with an ultra-varied corpus, entirely made up of text from original letter manuscripts, written either by common people, or by élite people in common moments of their lives.

We extracted a random sample of 200 letters from the CARDS-FLY corpus. These letters were manually normalised to the modern spelling by a linguist. Each word in the documents that was labelled as a spelling variant was paired with its modern spelling counterpart. This sample was intended both for a manual inspection and analysis of the spelling variation present in the data, and for the development of an automatic tool for spelling normalisation. For the latter purpose, we split the sample into two parts. We used a hundred letters for training and tuning the automatic normalisation tool for this specific genre.
The other hundred letters are used for evaluation of the tool, as we can compare the manual normalisation against the automatic normalisation produced by the tool. We set apart the evaluation set and excluded it from any manual analysis. Tuning an automatic tool to the errors in the evaluation set would lead to a tool that performs very well on this one set, but it might lead to an overly optimistic estimation of the true performance of the tool on other, unseen material.

DICER (Discovery and Investigation of Character Edit Rules) is a statistical tool that creates a list of edit rules on the basis of a corpus labelled with spelling variants and their modern counterparts.[24] The tool uses these pairs to detect which character(s) differ between the variant and the modern word, and it produces simple edit rules that capture the steps to rewrite the old word form to the modern form. The edit rules express what characters are being changed, what type of operation (deletion, insertion or substitution) is applied, and at which location in the word (start, second, middle, penultimate or end). Rewriting a spelling variant into its modern form may require several rewrite rules. For example, apezare is a variant in our historical data for the modern form apesar ‘despite’, and the transformation requires two edit rules: ‘substitute < z > with < s >’, and ‘delete < e >’. DICER creates a new rule for every edit that it encounters in the corpus and therefore gives a full statistical and systematic overview of the spelling changes that are present in the corpus.

Below we show a detail of the DICER results summary, after the processing of the CARDS-FLY corpus sample of a hundred letters. The summary shows the operations involving word types (not tokens). The table captures the ten top edit rules on the modernisation of those types.
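The kind of rule extraction described above can be illustrated with a short sketch. It is not DICER's implementation: it is a minimal approximation in Python that assumes the character alignment produced by the standard library's `difflib.SequenceMatcher` is good enough, and the function names (`position_label`, `edit_rules`, `count_rules`) and the exact boundaries of the five position classes are our own assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

# Names for the three edit operations mentioned in the text.
OPERATIONS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def position_label(index: int, length: int) -> str:
    """Map a character index in the variant to one of five coarse positions.

    The boundaries chosen here are an assumption made for illustration.
    """
    if index <= 0:
        return "start"
    if index == 1:
        return "second"
    if index >= length - 1:
        return "end"
    if index == length - 2:
        return "penultimate"
    return "middle"


def edit_rules(variant: str, modern: str) -> list:
    """Return one (operation, old, new, position) tuple per differing span."""
    matcher = SequenceMatcher(None, variant, modern, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretches produce no rule
        rules.append((OPERATIONS[tag], variant[i1:i2], modern[j1:j2],
                      position_label(i1, len(variant))))
    return rules


def count_rules(pairs) -> Counter:
    """Count how often each rule occurs over (variant, modern) pairs."""
    counts = Counter()
    for variant, modern in pairs:
        counts.update(edit_rules(variant, modern))
    return counts


print(edit_rules("apezare", "apesar"))
# [('substitution', 'z', 's', 'middle'), ('deletion', 'e', '', 'end')]
```

Counting the tuples returned by `count_rules` over all variant/modern pairs of a sample would give a frequency table of the same general shape as the summary discussed here, although the alignment heuristic may group multi-character edits differently from DICER.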
We can see that the substitution of < z > by < s >, especially when the < z > letter appears in the middle or in the penultimate position, is the edit rule that has been applied most frequently, namely 193 times, as shown in the column labelled ‘Total’ (see Table 1).

Since DICER finds all the edit rules involved in the modernisation process, it follows that a close examination of column ‘Variant’ versus column ‘Standard’, combined with the number of different word types that changed (column ‘Total’), will give us a good snapshot of the variation problems we have to face when dealing with the CARDS-FLY corpus. The letter authors were either following old spelling traditions, later abandoned, or, in the case of half-illiterate authors, also struggling with the rationale of the general spelling usage of their time, either old or modern.

A computation of the spelling behaviour of those authors, as compared to modern Portuguese orthography, tells us that a total of 718 edit rules were needed in order to modernise the sample of 100 letters, and that these rules affected, one or more times, a sum of 3,450 different word types. When summing all operations of the 718 edit rules, we counted 4,225 different operations, which means that several of these word types had to be standardised step by step by multiple edit rules.

In order to have a manageable, humanly observable sample of this large population of data, we only examined the rules that were applied at least three times, leaving aside the less frequent ones. The resulting sample had a large lexical representativeness (3,590 operations) but a feasible number of edit rules (only 171). In the following two tables we show an interpretation of how the 171 top edit rules of the DICER tool could be distributed in terms of rule contents.
The most frequent changes involved the spelling of phonological features (67 per cent), and, within these, the spelling of coronal fricatives was the most critical problem presented by our corpus variation (see Table 2 and Table 3).

Table 1. The DICER standardizing edit rules on the CARDS-FLY corpus (detail). The last five columns give the position of the edit in the word.

| # | ID | Operation | Variant | Standard | Total | Start | Second | Middle | Penultimate | End |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8 | Substitution | Z | S | 193 | 0 | 4 | 132 | 46 | 11 |
| 2 | 20 | Substitution | S | SS | 164 | 2 | 20 | 89 | 53 | 0 |
| 3 | 149 | Substitution | M | N | 137 | 0 | 50 | 86 | 1 | 0 |
| 4 | 76 | Insertion | | - | 123 | 0 | 1 | 117 | 5 | 0 |
| 5 | 40 | Substitution | ÃO | AM | 121 | 0 | 2 | 0 | 0 | 119 |
| 6 | 10 | Substitution | S | C | 118 | 23 | 10 | 78 | 7 | 0 |
| 7 | 45 | Substitution | I | E | 117 | 33 | 41 | 37 | 3 | 3 |
| 8 | 22 | Substitution | I | Í | 107 | 2 | 11 | 90 | 2 | 2 |
| 9 | 68 | Substitution | E | I | 106 | 23 | 35 | 40 | 8 | 0 |
| 10 | 6 | Substitution | A | Á | 92 | 5 | 8 | 36 | 9 | 34 |

Table 2. Causes for spelling variation in the CARDS-FLY corpus.

| General cause | Specific cause | Word types to standardise |
|---|---|---|
| Phonology | coronal fricatives | 860 |
| Phonology | unstressed oral vowels written with < i >, < e >, < u >, < o > | 456 |
| Phonology | nasal vowels and diphthongs | 426 |
| Phonology | stressed oral vowels | 408 |
| Mixed | mixed | 308 |
| Graphic Tradition | abbreviations | 267 |
| Graphic Tradition | learned consonant groups, digraphs, and double consonants: < ct >, < pt >, < ph >, < pf >, < pp >, < ff >, etc. | 233 |
| Syntax | enclisis: hyphenated verbal forms, with or without sandhi, followed by clitic pronoun vs. non hyphenated verbal forms | 154 |
| Graphic Tradition | etymological vs. non etymological initial < h > | 136 |
| Phonology | non standard phonology (dialectal variation) | 132 |
| Graphic Tradition | archaic letters: < y > vs. < i >, < u > vs. < v >, < i > vs. < j > | 95 |
| Phonology | liquids /l, r, R/ | 63 |
| Phonology | labialised velar stops /kw, gw/ vs. velar stop /k, g/ (*) | 52 |
| TOTAL | | 3590 |

(*) We follow here Maria Helena Mateus and Ernesto d’Andrade, who present a case for the existence of segment /kw/ in the phonology of Portuguese: M. H. Mateus and E. d’Andrade, The Phonology of Portuguese (Oxford, 2000).

The fact that the CARDS-FLY corpus is composed of original manuscripts, instead of printed texts, together with the large variety of their authors’ social status, accounts for such a distribution of spelling variants. This means that much of the correspondence was written in a close-to-spoken manner, without the opportunity of being revised by a more literate copywriter.

The above results also reveal the most important stumbling block in the Portuguese modern spelling system when the researcher wants to modernise historical written matter. That stumbling block is the lack of corresponding letters for the distribution of voiced and voiceless coronal fricatives.

Table 3. Summary of spelling variation in the CARDS-FLY corpus.

| General cause for variation | Frequency of word types to standardise | Rate of word types to standardise |
|---|---|---|
| Phonology | 2397 | 66.7% |
| Graphic Tradition | 731 | 20.4% |
| Mixed | 308 | 8.6% |
| Syntax | 154 | 4.3% |
| Totals | 3590 | 100% |

In the Middle Ages, Southern Portuguese dialects were already experiencing seseo (the merger of the dental-alveolar affricates /ts, dz/ and the dental-alveolar fricatives /s, z/).[25] Today only the archaic variety of the North-Eastern area keeps a distinction between four segments, articulating different fricatives in the middle of passo ‘step’, paço ‘palace’, coser ‘sew’, and cozer ‘bake, steam’.
Also, but later, from the 17th century on, the voiceless palatal affricate (traditionally written < ch >) merged with the voiceless palatal fricative (traditionally written < x >) in Southern and Central dialects, so that the phonological difference between words like chá ‘tea’ and xá ‘shah’ was lost.[26] All affricates disappeared in the innovative dialects, but since their traditional spelling was always kept by learned writers, including the ones that established the 20th century Portuguese orthography, it became a major source of variation in texts by poor writers over the centuries.

Nevertheless, if we split our data into chronological segments, it is clear that the major problem for 20th century uneducated letter writers is not the spelling of coronal fricatives. That problem is specific to earlier writers, especially the ones of the 18th and the 19th century. The major problem with standardizing the spellings of 20th century poor writers resides in the system of stressed vowels, which they normally write without the phonographic diacritics prescribed by the standard rules.

The two other most important sets of rules applied by the DICER tool have to do with the spelling of unstressed vowels and the spelling of nasal vowels and diphthongs, two phonological categories that are insufficiently mirrored by the Portuguese standard spelling. Neither the Spanish nor the Italian language, the overt examples that guided the creators of the Portuguese standard spelling in 1911, compares to Portuguese in what concerns the phonology of unstressed vowels and of nasal vowels and diphthongs. So here the Portuguese spelling system became more etymological, less shallow, a feature that triggers several problems when it comes to standardizing historical data with many spelling variations.

3.3. VARD2

As a next step in our study we used the edit rules automatically generated by DICER to further improve the VARD2 tool for automatic spelling normalisation of historical Portuguese.[27] We had already experimented with the tool VARD2 in a previous study, and here we show how DICER can contribute to a better performance.

VARD2 was initially developed for Early Modern English, but we converted it to Portuguese. The system uses a modern lexicon to detect possible spelling variants in a historical input text. Words that do not occur in the modern lexicon are marked as possible candidates. The system checks for each candidate whether it occurs in a variant dictionary, which lists frequent spelling variants and their normalised equivalents. If the variant is listed, it is recognised as a true spelling variant and is replaced automatically by its modern equivalent. Otherwise, both rules based on phonological information and character rewrite rules are used to generate possible modern equivalents for the variant, together with associated confidence weights. One of the parameters of VARD2 is a confidence threshold: the variant is replaced by the highest-weighted modern equivalent, provided that its weight exceeds this minimum threshold. If no likely candidates are found, the variant is kept.

To convert VARD2 to the Portuguese language we replaced the English modules by Portuguese ones.[28] As modern lexicon we used the Multifunctional Computational Lexicon of Contemporary Portuguese.[29] We created the list of spelling variants and their modern equivalents on the basis of an existing spelling variants dictionary extracted from the Historical Corpus of Brazilian Portuguese mentioned above.[30]

We made several small improvements to the Portuguese modules in VARD2. When inspecting the modern lexicon, we noticed that even though it was extracted from a contemporary dictionary it still contained several archaic word forms.
We attempted to filter out these word forms on the basis of a list of archaic word forms from the Houaiss dictionary.[31] We also used the list of spelling variants from the training sample of a hundred letters to filter the lexicon by deleting the variants and adding the modern word forms. Furthermore, a manual check of the most frequent items in the spelling variant list was needed, as we had already noticed that some variants were not mapped to a modern word form but to another, more frequent archaic word form. For example, in our previous experiments the variant list contained the archaic form fforão ‘(they) were/went’ matched with the equivalent forão instead of the correct modern counterpart foram.

VARD2 uses a set of rewrite rules to generate the modern word form candidates. In our first approach we manually constructed such a list of rewrite rules based on our own intuitions and on the rule set described by Giusti et al.

Table 4. VARD2 scores on the development set with different thresholds for the rule set.

| Threshold | Accuracy | Recall | Precision | F-score |
|---|---|---|---|---|
| 5 | 93.0 | 74.3 | 98.5 | 84.7 |
| 10 | 93.0 | 73.9 | 98.5 | 84.5 |
| 25 | 92.8 | 73.0 | 98.6 | 83.9 |
| 50 | 92.2 | 70.6 | 98.7 | 82.3 |

Here we intend to investigate to what extent the rewrite rules automatically generated by the DICER tool can help improve the performance of VARD2. Our analysis and interpretation of the generated rule set presented above showed that DICER was able to produce edit rules that capture a broad and diverse set of spelling changes.

As DICER generates a large rule list and some of the rules are based on evidence of only one occurrence, we decided to search for an optimal minimum frequency threshold for the rule set.[32] To get an indication of a suitable cut-off point, we ran experiments on the training set to see the effect of using rules that occurred at least 5, 10, 25 and 50 times. The higher the cut-off threshold, the smaller the rule set would be.
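The interplay between the frequency cut-off on the rule set and the normalisation steps described for VARD2 (lexicon check, variant dictionary lookup, rule-generated candidates, confidence threshold) can be sketched as follows. This is a deliberately simplified illustration in Python, not VARD2's code: it applies a single rewrite rule at a time, omits the phonological matching module, and uses the relative frequency of a rule as a stand-in confidence weight. All function names, the toy counts and the weighting scheme are assumptions made for the example; only the word forms come from the text.

```python
from collections import Counter


def frequent_rules(rule_counts: Counter, min_count: int) -> dict:
    """Keep only the (old, new) rules observed at least `min_count` times."""
    return {rule: n for rule, n in rule_counts.items() if n >= min_count}


def candidates(word: str, rules: dict, lexicon: set) -> dict:
    """Apply one rule at one site at a time; keep results found in the lexicon.

    The weight is the rule's share of all retained rule counts, a toy
    stand-in for a real confidence score.
    """
    total = sum(rules.values())
    scored = {}
    for (old, new), n in rules.items():
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            if candidate in lexicon:
                scored[candidate] = max(scored.get(candidate, 0.0), n / total)
            start = word.find(old, start + 1)
    return scored


def normalise(word: str, lexicon: set, variant_dict: dict,
              rules: dict, threshold: float) -> str:
    """Return the modern form of `word`, or `word` itself if none is found."""
    if word in lexicon:            # known modern form: not flagged as a variant
        return word
    if word in variant_dict:       # listed variant: replaced directly
        return variant_dict[word]
    scored = candidates(word, rules, lexicon)
    if scored:
        best, weight = max(scored.items(), key=lambda item: item[1])
        if weight >= threshold:    # only confident replacements are made
            return best
    return word                    # no confident candidate: keep the variant


# Toy data: the counts are invented; the word forms come from the text.
counts = Counter({("z", "s"): 60, ("s", "ç"): 40, ("y", "i"): 2})
rules = frequent_rules(counts, min_count=5)   # drops the rare ("y", "i") rule
lexicon = {"apesar", "lançar", "foram"}
variant_dict = {"fforão": "foram"}
print(normalise("lansar", lexicon, variant_dict, rules, threshold=0.3))
# lançar
```

Raising `min_count` shrinks the rule set, so fewer candidates are generated; raising `threshold` makes the sketch more conservative, leaving more variants untouched.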
The rule set with cut-off threshold 5 has 99 rules, while a cut-off of 50 leaves only 14 rules. We split the training sample into a part of 80 letters for training and 20 letters as a development set to determine the optimal rule set. We ran experiments with the different thresholds on the development set.

To evaluate the performance of the tool, we compute accuracy, recall, precision and F-score for the words (excluding punctuation marks) in the held-out evaluation data. Recall expresses the number of cases in which there was a spelling variant in the text and the modern variant was correctly predicted by the tool, divided by the number of true spelling variants in the data (errors because the tool missed some cases). Precision, on the other hand, focuses on the number of correct predictions divided by the total number of predictions (errors because the tool predicted too many cases).

In Table 4 we show the effect of varying the threshold on the development set. We do not observe huge differences between the different thresholds, but as the threshold of 5 had a slightly higher score, we decided to use this cut-off threshold for the experiments on the test set.

Table 5. A comparison on the test set of two versions of the VARD2 tool, one with the DICER rule set and one with handcrafted rules.

| Rule set | Accuracy | Recall | Precision | F-score |
|---|---|---|---|---|
| handcrafted | 92.7 | 64.9 | 98.4 | 78.3 |
| DICER | 94.2 | 73.4 | 97.0 | 83.6 |

As we aim to study the effect of DICER edit rules on the VARD2 system, we made a comparison between the DICER edit rules and the set of rules that we had manually created for our previous experiments. The manual rule set contains 62 different rules, while the DICER rule set with threshold 5 contains 99 rules. When we compare the two rule sets, we notice only a few overlaps in rules. Both sets contain the rules to remove the double consonants < ll >, < nn >, < tt >, the substitution of < y > with < i > and some accent changes.
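The evaluation measures can be written out explicitly. The sketch below assumes the standard definitions: precision is the number of correct replacements divided by all replacements the tool made, recall is the number of correct replacements divided by all true spelling variants in the data, and the F-score is the harmonic mean of the two. The function names and the counts in the last example are hypothetical; the percentages passed to `f_score` are the ones reported in Tables 4 and 5.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(correct: int, predicted: int, actual: int) -> tuple:
    """Compute precision, recall and F-score from raw counts.

    correct   -- variants replaced by the right modern form
    predicted -- all replacements made by the tool
    actual    -- all true spelling variants in the evaluation data
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall, f_score(precision, recall)


# Recomputing F-scores from the reported precision and recall percentages.
print(round(f_score(97.0, 73.4), 1))   # DICER rule set, test set
print(round(f_score(98.5, 74.3), 1))   # cut-off threshold 5, development set

# Hypothetical counts, for illustration only.
print(precision_recall_f(correct=80, predicted=100, actual=160))
```

The harmonic mean penalises a large gap between the two measures, which is why a gain in recall can raise the F-score even when precision drops slightly.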
The manual rule set contains many specific rules that cover multiple character strings, such as ‘substitute < zente > with < sente > at the End position’. The DICER tool, however, has more general rules that capture the same event; for example, the rule < z > - < s > is a generalisation of the ‘substitute < zente >’ rule.

In Table 5 we show the results of comparing VARD2 with the handcrafted rule set against a version of VARD2 trained with the DICER rule set with threshold 5, on the held-out test sample of a hundred letters. Overall, we observe that VARD2 has a very high precision. The automatically generated rule set leads to a higher overall performance, with an F-score of 84 per cent and an accuracy of 94 per cent; as shown in the table, this gain is due to an increase in recall. The DICER rule set enables the VARD2 tool to create a larger list of possible modern candidates, thereby reducing the number of missed variants. For example, the variant lansar was not corrected by VARD2 trained with the handcrafted rule set, but it was correctly changed to lançar ‘to launch’ by the version trained with the DICER rule set, as it included the edit ‘substitute < s > with < ç >’.

In general, the limitation of VARD2 to detecting only non-word errors causes a major part of the errors. To give an example, the noun circunstancia was not detected as a spelling variant because it is listed in the modern lexicon, where it represents a conjugated form of the verb circunstanciar ‘to state in detail’. However, the modern equivalent of the noun has an accent: circunstância ‘circumstance’. The information about the grammatical function of a word in the sentence is not available, and therefore the system cannot detect this variant. In other cases VARD2 will choose the most likely and closest modern variant, and this may not be the best option in a given context.
The form frea, for instance, can be either an abbreviation of freguesia ‘parish’ or a variant of fria ‘cold’, so the closest modern variant is not necessarily the right one. A context-sensitive tool is needed to solve this type of problem, but this is a line of future research, as there are currently not many context-sensitive spelling normalisation tools available, certainly not for historical texts.33

4. conclusions

We have presented an analysis of the main types of spelling variation that we encountered in the CARDS-FLY corpus, a corpus of Portuguese historical personal letters that lacks standardisation because it corresponds to extremely varying sources, which were transcribed in a semi-palaeographic way. The systematic account of all spelling changes in the corpus sample, as generated by the DICER tool, shows the mixed nature of modern Portuguese orthography, which is not as shallow as its inventors wanted it to be. This mixed nature of the modern standard clashes both with etymological spellings within the corpus and with phonological ones. As spelling variation can be a hindrance for certain types of research and for automatic search in the corpus, we presented a series of experimental results with the VARD2 statistical normalisation tool. This tool can automatically normalise variants with an F-score of 84 per cent and a precision of 97 per cent. A high precision means that when VARD2 makes a correction, this is in general correct. The errors that it makes are caused by missing a spelling variant. This score is more than sufficient to be useful for automatic correction of the corpus, as it is preferable to have a conservative tool that makes only those corrections it is certain about. We have shown that a systematic statistical analysis of spelling variation is a powerful way both to consolidate known changes in the spelling conventions and to discover new insights into the way people wrote in earlier times.
We also showed that both diachronic linguists and historians wanting to subject historical Portuguese sources to processing operations can have them modernised automatically. They need not wait for years, nor exhaust large human resources, to modernise the variant spellings of such texts manually, even when these were written by the poor-writer type of author. Additionally, the same procedure can be adapted to new languages, since the tools we worked with were originally designed for English historical texts.

end notes

1 Acknowledgements: This research is funded by the Portuguese Foundation of Science and Technology (FCT), under the project FLY (PTDC/CLE-LIN/098393/2008), and the FCT program Ciência 2007/2008.
2 Translated from C. Dauphin, ‘Pour une histoire de la correspondance familiale’, Romantisme 90 (1995), 89–99. Cited here at 89.
3 A. Petrucci, Public lettering: script, power, and culture (Chicago, 1993).
4 Translated from Dauphin, ‘Pour une histoire de la correspondance familiale’, 89.
5 D. Y. W. Lee, ‘Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle’, Language Learning & Technology 5, 3 (2001), 37–72. Cited here at 46 and 50.
6 DALF, Guidelines for the description and encoding of Modern correspondence material, Version 1.0, 2003, http://ctb.kantl.be/project/dalf/.
7 TEI, Text Encoding Initiative, P5 guidelines, http://www.tei-c.org/index.xml, last accessed 24 May 2013.
8 Recent examples are D. Archer and J. Culpeper, ‘Identifying key sociophilological usage in plays and trial proceedings (1640–1760): An empirical approach via corpus annotation’, Journal of Historical Pragmatics 10, 2 (2009), 286–309, and D. Z. Mohd, G. Knowles and Ch. K. Fatt, ‘Nationhood and Malaysian identity: a corpus-based approach’, Text & Talk – An Interdisciplinary Journal of Language, Discourse & Communication Studies 30, 3 (2010), 267–287.
9 I. Hendrickx and R. Marquilhas, ‘From old texts to modern spellings: an experiment in automatic normalisation’, Journal for Language Technology and Computational Linguistics 26, 2 (2011), 65–76.
10 M. F. Gonçalves, As ideias ortográficas em Portugal: de Madureira Feijó a Gonçalves Viana (1734–1911) (Lisboa, 2003), 779–786.
11 F. Coulmas, The Blackwell encyclopedia of writing systems (Oxford & Cambridge, Mass., 1996), 380.
12 Reprinted by I. Castro, I. Duarte and I. Leiria, eds, A demanda da ortografia portuguesa (Lisboa, 1987), 152.
13 Presidência do Conselho de Ministros, ‘Resolução do Conselho de Ministros n.º 8/2011’, Diário da República, 1.ª Série, n.º 17, January 25, 2011.
14 P. Rayson, D. Archer and N. Smith, ‘VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora’, Proceedings of the corpus linguistics conference (Birmingham, 2005).
15 H. Craig and R. Whipp, ‘Old spellings, new methods: automated procedures for indeterminate linguistic data’, Literary and Linguistic Computing 25, 1 (2010), 37–52.
16 S. Scheible, R. J. Whitt, M. Durrell and P. Bennett, ‘A Gold Standard Corpus of Early Modern German’, Proceedings of the 5th linguistic annotation workshop (Portland, Oregon, 2011), 124–128.
17 C. Sánchez-Marco, G. Boleda, J. M. Fontana and J. Domingo, ‘Annotation and representation of a diachronic corpus of Spanish’, Proceedings of the seventh conference on international language resources and evaluation (Malta, 2010), 2713–2718.
18 Gonçalves, As ideias ortográficas em Portugal; M. L. C. Buescu, Gramáticos portugueses do século XVI (Lisboa, 1978); R. Marquilhas, ‘O acento, o hífen e as consoantes mudas nas Ortografias antigas portuguesas’, in I. Castro, I. Duarte and I. Leiria, eds, A demanda da ortografia portuguesa (Lisboa, 1987), 103–116; M. H. Paiva, ‘Variação e evolução da palavra gráfica: o testemunho dos textos metalinguísticos do século XVI’, in Actas do XII encontro nacional da Associação Portuguesa de Linguística, 2 (Coimbra, 1997), 233–252.
19 T. A. Menegatti, Regras lingüísticas para o tratamento computacional da variação de grafia e abreviaturas do corpus Tycho Brahe (Campinas, 2002).
20 R. Giusti, et al., ‘Automatic detection of spelling variation in historical corpus: An application to build a Brazilian Portuguese spelling variants dictionary’, in Proceedings of the corpus linguistics conference CL2007 (Birmingham, 2007).
21 BP spelling variants dictionary is available at: http://www.nilc.icmc.usp.br/nilc/projects/hpc/, last accessed 24 May 2013.
22 The Corpus Compartilhado Diacrônico was created by the Laboratório de História do Português Brasileiro from the Universidade Federal do Rio de Janeiro in Brazil. More information can be found at http://www.letras.ufrj.br/laborhistorico/, last accessed 24 May 2013.
23 M. C. Paixão de Sousa, F. N. Kepler and P. P. F. Faria, ‘E-Dictor: novas perspectivas na codificação e edição de corpora de textos históricos’, in Caminhos da linguística de corpus (Campinas, 2010).
24 DICER is described in chapter 4 of the following thesis: A. Baron, ‘Dealing with spelling variation in Early Modern English texts’, PhD dissertation (Lancaster University, 2011).
25 L. F. Lindley Cintra, ‘Observations sur l’orthographe et la langue de quelques textes non littéraires galicien-portugais de la seconde moitié du XIIIe siècle’, Revue de Linguistique Romane 27 (1963), 59–77.
26 P. Teyssier, História da língua portuguesa (Lisboa, 1982); I. Castro, Introdução à História do Português (Lisboa, 2006).
27 A. Baron and P. Rayson, ‘VARD 2: A tool for dealing with spelling variation in historical corpora’, in Proceedings of the postgraduate conference in corpus linguistics (Birmingham, UK, 2008).
28 For a detailed description of the Portuguese modules in our version of the VARD2 tool, we refer to the following paper: Hendrickx and Marquilhas, ‘From old texts to modern spellings’, sec 4.
29 This Lexicon is available for download at: Multifunctional Computational Lexicon of Contemporary Portuguese, 2010, http://www.clul.ul.pt/en/resources/88-project-multifunctional-computational-lexicon-of-contemporary-portuguese-r.
30 Giusti, et al., ‘Automatic detection of spelling variation in historical corpus’, sec 9.
31 A. Houaiss, et al., Dicionário Houaiss da língua portuguesa (Rio de Janeiro, 2001). We wish to thank Mauro Villar for kindly granting us access to the digital form of the Houaiss dictionary’s archaic lexicon.
32 The DICER rules were manually converted to the VARD format, and some rules were adapted, as very general rules such as ‘insert e anywhere’ slow down and ultimately crash the VARD program because they generate too many possibilities. To alleviate this problem, such general rules were converted to more specific rules.
33 Baron, ‘Dealing with spelling variation in Early Modern English texts’, sec 6.4, and sec 7.