Multilingual Projection for Parsing Truly Low-Resource Languages

Željko Agić♥ Anders Johannsen♥ Barbara Plank♥♣ Héctor Martínez Alonso♥♠ Natalie Schluter♥♦ Anders Søgaard♥
♥ Center for Language Technology, University of Copenhagen, Denmark
♣ Center for Language and Cognition, University of Groningen, The Netherlands
♠ Univ. Paris Diderot, Sorbonne Paris Cité – Alpage, INRIA, France
♦ MobilePay, Copenhagen, Denmark
{zeljko.agic,soegaard}@hum.ku.dk

Abstract

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.

1 Introduction

State-of-the-art approaches to inducing part-of-speech (POS) taggers and dependency parsers only scale to a small fraction of the world's ∼6,900 languages. The major bottleneck is the lack of manually annotated resources for the vast majority of these languages, including languages spoken by millions, such as Marathi (73m), Hausa (50m), and Kurdish (30m). Cross-lingual transfer learning—or simply cross-lingual learning—refers to work on using annotated resources in other (source) languages to induce models for such low-resource (target) languages. Even simple cross-lingual learning techniques outperform unsupervised grammar induction by a large margin.

Most work in cross-lingual learning, however, makes assumptions about the availability of linguistic resources that do not hold for the majority of low-resource languages. The best cross-lingual dependency parsing results reported to date were presented by Rasooli and Collins (2015). They use the intersection of languages covered in the Google dependency treebanks project and those contained in the Europarl corpus. Consequently, they only consider closely related Indo-European languages for which high-quality tokenization can be obtained with simple heuristics.

In other words, we argue that recent approaches to cross-lingual POS tagging and dependency parsing are biased toward Indo-European languages, in particular the Germanic and Romance families. The bias is not hard to explain: treebanks, as well as large volumes of parallel data, are readily available for many Germanic and Romance languages. Several factors make cross-lingual learning between these languages easier: (i) We have large volumes of relatively representative, translated texts available for all language pairs; (ii) It is relatively easy to segment and tokenize Germanic and Romance texts; (iii) These languages all have very similar word order, making the alignments much more reliable. Therefore, it is more straightforward to train and evaluate cross-lingual transfer models for these languages. However, this bias means that we possibly overestimate the potential of cross-lingual learning for truly low-resource languages, i.e., languages with no supporting tools or resources for segmentation, POS tagging, or dependency parsing.

The aim of this work is to experiment with cross-lingual learning via annotation projection, making minimal assumptions about the available linguistic resources.
We only want to assume what we can in fact assume for truly low-resource languages. Thus, for the target languages, we do not assume the availability of any labeled data, tag dictionaries, typological information, etc. For annotation projection, we need a parallel corpus, and we therefore have to rely on resources such as the Bible (parts of which are available in 1,646 languages), and publications from the Watchtower Society (up to 583 languages). These texts have the advantage of being translated both conservatively and into hundreds of languages (massively multi-parallel). However, the Bible and the Watchtower are religious texts and are more biased than the corpora that have been assumed to be available in most previous work.

In order to induce high-quality cross-lingual transfer models from noisy and very limited data, we exploit the fact that the available resources are massively multi-parallel. We also present a novel multilingual approach to the projection of dependency structures, projecting edge weights (rather than edges) via word alignments from multiple sources (rather than a single source). Our approach enables us to project more information than previous approaches: (i) by postponing dependency tree decoding to after the projection, and (ii) by exploiting multiple information sources.

Our contributions are as follows:

(i) We present the first results on cross-lingual learning of POS taggers and dependency parsers, assuming only linguistic resources that are available for most of the world's written languages, specifically, Bible excerpts and translations of the Watchtower.

(ii) We extend annotation projection of syntactic dependencies across parallel text to the multi-source scenario, introducing a new, heuristics-free projection algorithm that projects weight matrices from multiple sources, rather than dependency trees or individual dependencies from a single source.

(iii) We show that our approach performs significantly better than commonly used heuristics for annotation projection, as well as delexicalized transfer baselines. Moreover, in comparison to these systems, our approach performs particularly well on truly low-resource non-Indo-European languages.

All code and data are made freely available for general use (https://bitbucket.org/lowlands/release).

2 Weighted annotation projection

Motivation  Our approach is based on the general idea of annotation projection (Yarowsky et al., 2001) using parallel sentences. The goal is to augment an unannotated target sentence with syntactic annotations projected from one or more source sentences through word alignments. The principle is illustrated in Figure 1, where the source languages are German and Croatian, and the target is English.

The simplest case is projecting POS labels, which are observed in the source sentences but unknown in the target language. In order to induce the grammatical category of the target word "beginning", we project POS from the aligned words "Anfang" and "početku", both of which are correctly annotated as NOUN. Projected POS labels from several sources might disagree for various reasons, e.g., erroneous source annotations, incorrect word alignments, or legitimate differences in POS between translation equivalents.
We resolve such cases by taking a majority vote, weighted by the alignment confidences. By letting several languages vote on the correct tag of each word, our projections become more robust, less sensitive to the noise in our source-side predictions and word alignments.

We can also project syntactic dependencies across word alignments. If (u_s, v_s) is a dependency edge in a source sentence, say the ingoing dependency from "das" to "Wort", u_s ("Wort") is aligned to u_t ("word"), and v_s ("das") is aligned to v_t ("the"), we can project the dependency such that (u_t, v_t) becomes a dependency edge in the target sentence, making "the" a dependent of "word". Obviously, dependency annotation projection is more challenging than projecting POS, as there is a structural constraint: the projected edges must form a dependency tree on the target side.

Hwa et al. (2005) were the first to consider this problem, applying heuristics to ensure well-formed trees on the target side. The heuristics were not perfect, as they have been shown to result in excessive non-projectivity and the introduction of spurious relations and tokens (Tiedemann et al., 2014; Tiedemann, 2014). These design choices all lead to diminished parsing quality.

Figure 1: An outline of dependency annotation projection, voting, and decoding in our method, using two sources i (German) and j (Croatian) and a target t (English). Part 1 represents the multi-parallel corpus preprocessing, while parts 2 and 3 relate to our projection method. The graphs are represented as adjacency matrices with column indices encoding dependency heads. We highlight how the weight of target edge (u_t = "was", v_t = "beginning") is computed from the two contributing sources.

We introduce a heuristics-free projection algorithm. The key difference from most previous work is that we project the whole set of potential syntactic relations with associated weights—rather than binary dependency edges—from a large number of multiple sources. Instead of decoding the best tree on the source side—or for a single source-target sentence pair—we project weights prior to decoding, only decoding the aggregated multi-source weight matrix after the individual projections are done. This means that we do not lose potentially relevant information, but rather project dense information about all candidate edges.

2.1 Multi-source sentence graph

We assume the existence of n source languages and a target language t. For each tuple of translations in our multi-parallel corpus, our algorithm projects syntactic annotations from the n source sentences to the target sentence. Projection happens at the sentence level, taking a tuple of n annotated sentences and an unannotated sentence as input. We formalize the projection step as label propagation in a graph structure where the words of the target and source sentences are vertices, while edges represent dependency edge candidates between words within a sentence (a parse), as well as similarity relations between words of sentences in different languages (word alignments).

Formally, a projection graph is a graph G = (V, E). All edges are weighted by the function w_e : E → R. The vertices can be decomposed into sets V = V_0 ∪ ··· ∪ V_n, where V_i is the set of words in sentence i. We often need to identify the target sentence V_t = V_0 and the source sentences V_s = V_1 ∪ ··· ∪ V_n separately. Edges between V_s and V_t are the result of word alignments.
The alignment subgraph is the bipartite graph A = (V_s, V_t, E_A), i.e., the subgraph of G induced by all (alignment) edges, E_A, connecting V_s and V_t. The subgraph induced by the set of vertices V_i, written as G[V_i], represents the dependency edge candidates between the words of the sentence i. In general these subgraphs are dense, i.e., they encode weight matrices of edge scores and not just the single best parse. For the source sentences, we assume that the weights are provided by a parser, while the weights for the syntactic relations of the target sentence are unknown.

With the above definitions, the dependency projection problem amounts to assigning weights to the edges of G[V_t] by transferring the syntactic parse graphs G[V_1], ..., G[V_n] from the source languages through the alignments A.

2.2 Part-of-speech projection

Our annotation projection for POS tagging is similar to the one proposed by Agić et al. (2015). The algorithm is presented in Algorithm 1. We first introduce a conditional probability distribution p(l|v) over POS tags l ∈ L for each vertex v in the graph. For all source vertices, the probability distributions are obtained by tagging the corresponding sentences in our multilingual corpus with POS taggers, assigning a probability of one to the best tag for each word, and zero for all other tags. For each target token, i.e., each vertex v, the projection works by gathering evidence for each tag from all source tokens aligned to v, weighted by the alignment score:

    p(l | v_t) ∝ Σ_{v_s ∈ V_s} p(l | v_s) · w_a(v_s, v_t)

The projected tag for a target vertex v_t is then argmax_l p(l | v_t). When both the alignment weights and the source tag probabilities are in {0, 1}, this reduces to a simple voting scheme that assigns the most frequent POS tag among the aligned words to each target word.

Algorithm 1: Project POS tags
    Data: A projection graph G = (V_s ∪ V_t, E); a set of POS labels L; a function p(l|v) assigning probabilities to labels l for word vertices v.
    Result: A labeling of V_t
    p̃ ← empty probability table
    label ← empty label-to-vertex mapping
    for v_t ∈ V_t do
        for l ∈ L do
            p̃(l | v_t) ← Σ_{v_s ∈ V_s} p(l | v_s) · w_a(v_s, v_t)
        label(v_t) ← argmax_l p̃(l | v_t)
    return label

2.3 Dependency projection

While in POS projection we project vertex labels, in dependency projection we project edge scores. Our procedure for dependency annotation projection is given in Algorithm 2. For each source language, we parse the corresponding side of our multi-parallel corpus using a dependency parser trained on the source language treebank. However, instead of decoding to dependency trees, we extract the weights for all potential syntactic relations, importing them into G as edge weights.

Algorithm 2: Project dependencies
    Data: A projection graph G = (V_s ∪ V_t, E).
    Result: A dependency tree covering the target vertices V_t.
    if project from trees then
        for i = 1 to n do
            G[V_i] ← DMST(G[V_i])
    for (u_t, v_t) ∈ G[V_t] do
        w_e(u_t, v_t) ← −∞
        if (·, u_t) ∉ E_A or (·, v_t) ∉ E_A then continue
        w_e(u_t, v_t) ← Σ_{i=1}^{n} max_{u_s, v_s ∈ V_i} w_e(u_s, v_s) · w_a(u_s, u_t) · w_a(v_s, v_t)
    G[V_t] ← normalize(G[V_t])
    return DMST(G[V_t])

The parser we use in our experiments assigns scores w_e ∈ R to possible edges. Since the ranges and values of these scores are dependent on the training set size and the number of model updates, we standardize the scores to make them comparable across languages. Standardization centers the scores around zero with a standard deviation of one by subtracting the mean and dividing by the standard deviation. We apply this normalization per sentence.
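To make the two ingredients just described concrete, namely the weighted POS vote of Algorithm 1 and the per-sentence standardization of parser scores, here is a minimal Python sketch. The input format (per-source tag lists and (source index, target index, probability) alignment triples) and the function names are illustrative assumptions, not the released implementation.

    from collections import defaultdict
    import math

    def project_pos(source_tags, alignments):
        """Weighted POS vote for one target sentence (Algorithm 1, simplified).

        source_tags: one list of predicted POS tags per source sentence.
        alignments:  one list of (source_index, target_index, probability)
                     triples per source sentence.
        Returns a dict mapping target token index to its projected tag.
        """
        votes = defaultdict(lambda: defaultdict(float))
        for tags, align in zip(source_tags, alignments):
            for s, t, w_a in align:
                # p(l|v_s) is 1.0 for the single predicted source tag and 0 otherwise,
                # so each aligned source token simply contributes its alignment weight.
                votes[t][tags[s]] += w_a
        return {t: max(tag_scores, key=tag_scores.get)
                for t, tag_scores in votes.items()}

    def standardize(scores):
        """Per-sentence standardization of arc scores: zero mean, unit variance."""
        mean = sum(scores) / len(scores)
        std = math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores)) or 1.0
        return [(x - mean) / std for x in scores]

With unit alignment weights, project_pos reduces to the plain majority vote mentioned above; with the probabilities kept, it implements the weighted vote of Algorithm 1.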
Scores are then projected from source edges to target edges via word alignments w_a ∈ [0, 1]. Instead of voting among the incoming projections from multiple sources, we sum the projected edge scores. Because alignments vary in quality, we scale the score of the projected source edge by the corresponding alignment probability.

A target edge (u_t, v_t) ∈ G[V_t] can originate from multiple source edges even from a single source sentence, due to m:n alignments. In such cases, we only project the source edge (u_s, v_s) ∈ G[V_i] (i > 0) with the maximum score, provided the words are aligned, i.e., (u_s, u_t) and (v_s, v_t) ∈ E_A. In the case of a single source sentence pair, the target edge scores are set as follows:

    w_e(u_t, v_t) ← max_{u_s, v_s ∈ V_i} w_e(u_s, v_s) · w_a(u_s, u_t) · w_a(v_s, v_t)

where w_e(u_s, v_s) is the source edge score and the two w_a terms are the alignment weights. We note the distinction between edge weights w_e and alignment weights w_a. With multiple sources, the target edge scores w_e(u_t, v_t) are computed as a sum over the individual sources:

    w_e(u_t, v_t) ← Σ_{i=1}^{n} max_{u_s, v_s ∈ V_i} w_e(u_s, v_s) · w_a(u_s, u_t) · w_a(v_s, v_t)

After projection we have a dense set of weighted edges in the target sentence representing possible syntactic relations. This structure is equivalent to the n × n edge matrix used in ordinary first-order graph-based dependency parsing.

Before decoding, the weights are softmax-normalized to form a distribution over each possible head decision. The normalization balances out the contributions of the individual head decisions; in our development setup, we found that omitting this step resulted in a substantial (∼10%) decrease in parsing performance. We then follow McDonald et al. (2005) in using directed maximum spanning tree (DMST) decoding to identify the best dependency tree in the matrix.

We note that DMST decoding on summed projected weight matrices is similar to the idea of re-parsing with DMST decoding of the output of an ensemble of parsers (Sagae and Lavie, 2006), which we use as a baseline in our experiments.
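The following sketch mirrors the multi-source aggregation, the per-dependent softmax, and the DMST decoding step in Python. The data structures (dictionaries of standardized arc scores, and one-to-one alignment maps in which index 0 stands for the artificial root) are simplifying assumptions, and networkx's Chu-Liu/Edmonds implementation stands in for a DMST decoder; this is a sketch of the procedure, not the released pipeline.

    import math
    from collections import defaultdict
    import networkx as nx

    def project_arcs(source_matrices, alignments):
        """Sum, over sources, the best-scoring projected source edge per target arc.

        source_matrices: per source sentence, a dict (head, dependent) -> standardized score.
        alignments:      per source sentence, a dict source_index -> (target_index, probability);
                         assumed to also map the artificial root 0 -> (0, 1.0).
        Returns a dict (head, dependent) -> projected score on the target side.
        """
        target = defaultdict(float)
        for matrix, align in zip(source_matrices, alignments):
            best = {}  # per-source maximum for every target arc (handles m:n alignments)
            for (h_s, d_s), w_e in matrix.items():
                if h_s not in align or d_s not in align:
                    continue  # both endpoints must be word-aligned
                (h_t, w_h), (d_t, w_d) = align[h_s], align[d_s]
                score = w_e * w_h * w_d
                if score > best.get((h_t, d_t), float("-inf")):
                    best[(h_t, d_t)] = score
            for arc, score in best.items():  # sum the per-source maxima
                target[arc] += score
        return dict(target)

    def softmax_per_dependent(arc_scores, n_tokens):
        """Turn the competing head scores of each dependent into a distribution."""
        normed = {}
        for dep in range(1, n_tokens + 1):
            heads = {h: s for (h, d), s in arc_scores.items() if d == dep}
            if not heads:
                continue  # token with no projected head candidates
            z = sum(math.exp(s) for s in heads.values())
            for h, s in heads.items():
                normed[(h, dep)] = math.exp(s) / z
        return normed

    def decode_dmst(arc_scores, n_tokens):
        """Return the highest-scoring tree as a dependent -> head map (0 is the root)."""
        g = nx.DiGraph()
        g.add_nodes_from(range(n_tokens + 1))
        for (h, d), s in arc_scores.items():
            g.add_edge(h, d, weight=s)
        tree = nx.maximum_spanning_arborescence(g, attr="weight")
        return {d: h for h, d in tree.edges()}

Chaining the standardize helper sketched earlier (applied to each source matrix) with project_arcs, softmax_per_dependent, and decode_dmst corresponds to parts 2 and 3 of Figure 1.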
3 Data

3.1 Training and test sets

We use source treebanks from the Universal Dependencies (UD) project, version 1.2 (Nivre et al., 2015; http://hdl.handle.net/11234/1-1548). They are harmonized in terms of POS tag inventory (17 tags) and dependency annotation scheme. In our experiments, we use the canonical data splits, and disregard lemmas, morphological features and alternative POS from all treebanks.

Out of the 33 languages currently in UD 1.2, we drop languages for which the treebank does not distribute word forms (Japanese), and languages for which we have no parallel unlabeled data (Latin, Ancient Greek, Old Church Slavonic, Irish, Gothic). Languages with more than 60k tokens (in the training data) are considered source languages, while the remaining 6 smaller treebanks (Estonian, Greek, Hungarian, Latin, Romanian, Tamil) are strictly considered targets. This results in 22 treebanks for training source taggers and parsers. We use two additional test sets: Quechua and Serbian. The first one does not entirely adhere to UD, but we provide a POS tagset mapping and a few modifications and include it as a test language to deepen the robustness assessment for our approach across language families. The Serbian test set fully conforms to UD, as a fork of the closely related Croatian UD dataset (https://github.com/ffnlp/sethr). This results in a total of 30 target languages.

3.2 Multi-parallel corpora

We use two sources of massively parallel text. The first is the Edinburgh Bible Corpus (EBC) collected by Christodouloupoulos and Steedman (2014), containing 100 languages. EBC has either 30k or 10k sentences for each language, depending on whether they are made up of full Bibles or just translations of the New Testament, respectively. We also crawled and scraped the Watchtower Online Library website (http://wol.jw.org) to collect what we will refer to as the Watchtower Corpus (WTC). The data is from 2002-2016 and the final corpus contains 135 languages with sentences in the range of 26k-145k. While some EBC Bibles are written in dated language, we do not make any modifications to the corpus if the language is also present in WTC. However, as Basque is not represented in WTC, we replace the Basque Bible from 1571 with a contemporary version from 2004 (http://www.biblija.net/biblija.cgi?l=eu), to enable the use of Basque in the parsing experiments.

EBC and WTC both consist of religious texts, but they are very different in terms of style and content. If we examine Table 1, which shows the most frequent words per corpus, we observe that the English Bible—the King James Version from 1611—contains many archaic verb forms ("hath", "giveth"). In contrast, the English Watchtower is written in contemporary English, both in terms of verb inflection ("does", "says") and vocabulary ("today", "human"). WTC also deals with contemporary topics such as blood "transfusion" (36 mentions) and "computer" (42 mentions).

EBC: hath, saith, hast, spake, yea, cometh, iniquity, wilt, smote, shew, begat, doth, lo, hearken, thence, verily, neighbour, goeth, shewed, giveth, smite, didst, wherewith, knoweth, night
WTC: bible, does, however, says, today, during, show, human, later, important, really, humans, meetings, personal, states, future, fact, relationship, result, attention, someone, century, attitude, article, different

Table 1: The 25 most frequent words exclusive to the English Bible or Watchtower.

The other languages also show differences in terms of language modernity and dialectal difference between EBC and WTC. While each Bible translation has its individual history, Watchtower translations are commissioned by the same publisher, following established editorial criteria. Thus, we not only expect Watchtower to yield projected treebanks that are closer to contemporary language, but also more reliable alignments. We expect these properties to make WTC a more suitable parallel corpus for our experiments and for bootstrapping treebanks for new languages.

3.3 Preprocessing

Segmentation  For the multi-parallel corpora, we apply naive sentence splitting using the full stops, question marks and exclamation points of the alphabets in our corpora. We have collected these trigger symbols from the corpora, provided that they appeared as individual tokens at the ends of lines and belonged to the "Punctuation, Other" Unicode category. After sentence splitting, we use naive whitespace tokenization (https://github.com/bplank/multilingualtokenizer). We also remove short-vowel diacritics from all corpora written in Arabic script.

We use the same sentence splitting and tokenization for EBC and WTC. This is done regardless of Bibles being distributed in a verse-per-line format, which means verses can be split into more than one sentence. The average sentence length across languages is 18.5 tokens in EBC and 16.7 in WTC.
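For concreteness, here is a minimal sketch of this naive segmentation and tokenization. The collection criterion (single-character tokens of the "Punctuation, Other" category occurring as the last token of a line) follows the description above; the function names and input format are hypothetical simplifications, not the released tokenizer.

    import unicodedata

    def collect_triggers(lines):
        """Collect candidate sentence-final trigger symbols from a corpus."""
        triggers = set()
        for line in lines:
            tokens = line.split()
            # keep single-character, line-final tokens of category "Punctuation, Other"
            if tokens and len(tokens[-1]) == 1 and unicodedata.category(tokens[-1]) == "Po":
                triggers.add(tokens[-1])
        return triggers

    def split_and_tokenize(text, triggers):
        """Naive sentence splitting on trigger symbols, then whitespace tokenization."""
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token[-1] in triggers:  # token ends in (or is) a trigger symbol
                sentences.append(current)
                current = []
        if current:
            sentences.append(current)
        return sentences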
The UD treebank tokenization differs from the tokenization used for the multi-parallel corpora. The UD dependency annotation is based on syntactic words, and the tokenization guidelines recommend, for example, splitting clitics from verbs and undoing contractions (Spanish "del" becomes "de el"). These tokens made up of several syntactic words are called multiword tokens in the UD convention, and are included in the treebanks but are not integrated in the dependency trees, i.e., only their forming subtokens are assigned a syntactic head (http://universaldependencies.org/format.html). In order to harmonize the tokenization, we eliminate subtokens from the dependency trees, and incorporate the original multiword tokens—which are more likely to be naive raw tokens—in the trees instead. For each multiword token, we provide it with the POS and dependency label of its highest subtoken, namely the subtoken that is closest to the root. For example, in the case of a verb and its clitics, the chosen subtoken is the verb, and the multiword token is interpreted as a verb. If there are more candidates, we select one through POS ranking (https://github.com/coastalcph/ud-conversion-tools).

Alignment  We sentence- and word-align all language pairs in both our multi-parallel corpora. We use hunalign (Varga et al., 2005) to perform conservative sentence alignment (parameters: utf, bisent, cautious, realign). The selected sentence pairs then enter word alignment. Here, we use two different aligners. The first one is the IBM2 aligner fastalign by Dyer et al. (2013), where we adopt the setup of Agić et al. (2015), who observe a major advantage in using reverse-mode alignment for POS projection (4-5 accuracy points absolute); parameters: d, o, v, r. In addition, we use the IBM1 aligner efmaral by Östling (2015), also in reverse mode with default settings (https://github.com/robertostling/efmaral). The intuition behind using IBM1 is that IBM2 introduces a bias toward more closely related languages, and we confirm this intuition through our experiments. We modify both aligners so that they output the alignment probability for each aligned token pair.

Tagging and parsing  The source sides of the two multi-parallel corpora, EBC and WTC, are POS-tagged by taggers trained on the respective source languages, using TnT (Brants, 2000). We parse the corpora using TurboParser (Martins et al., 2013). The parser is used in simple arc-factored mode with pruning (parameters: basic). We alter it to output per-sentence arc weight matrices (our fork of TurboParser is available from https://github.com/andersjo/TurboParser).

4 Experiments

Outline  For each sentence in a target language corpus, we retrieve the aligned sentences in the source corpora. Then, for each of these source-target sentence pairs, we project POS tags and dependency edge scores via word alignments, aggregating the contributions of individual sources. Once all contributions are collected, we perform a per-token majority vote on POS tags and DMST decoding on the summed edge scores. This results in a POS-tagged and dependency-parsed target sentence, ready to contribute to training a tagger and parser. We remove target language sentences that contain word tokens without POS labels.
This may happen due to unaligned sentences and words. We then proceed to train models.

4.1 Setup

Each of the experiment steps involves a number of choices that we outline in this section. We also describe the baseline systems and upper bounds.

POS tagging  Below, we present results with POS taggers based on annotation projection with both IBM1 and IBM2; cf. Table 3. We train TnT with default settings on the projected annotations. Note that we use the resulting POS taggers in our dependency parsing experiments in order not to have our parsers assume the existence of POS-annotated corpora. For a more extensive assessment, we refer to the work by Agić et al. (2015), who report baselines and upper bounds. In contrast to their work, we consider two different alignment models and use the UD POS tagset (17 tags), in contrast to the 12 tags of Petrov et al. (2012). This makes our POS tagging problem slightly more challenging, but our parsing models potentially benefit from the extended tagset (for example, the AUX vs. VERB distinction from UD POS does not exist in the tagset of Petrov et al. (2012), and neither does NOUN vs. PROPN, i.e., proper noun).

Dependency parsing  We use arc-factored TurboParser for all parsing models, applying the same setup as in preprocessing. There are three sets of models: our systems, baselines, and upper bounds. Our systems are trained on the projected EBC and WTC texts, while the rest—except DCA-PROJ (see below)—are trained on the (delexicalized) source-language treebanks.

To avoid a bias toward languages with big treebanks and to make our experiments tractable, we randomly subsample all training sets to a maximum of 20k sentences. In the multi-source systems, this means a uniform sample from all sources up to 20k sentences. This means our comparison is fair, and that our systems do not have the advantage of more training data over our baselines.

Our systems  We report on four different cross-lingual systems, alternating the use of word aligners (IBM1, IBM2) and the structures we project, as they can be either (i) arc-factored weight matrices from the parser (GRAPHS) or (ii) the single-best trees provided by the parser after decoding (TREES). See the if-clause in Algorithm 2.

We tune two parameters for these four systems, confidence estimation and normalization, using English as the development set, and we report the best setups only. For the IBM1-based systems, we use the word alignment probabilities in the arc projection, but we use unit votes in POS voting. The opposite yields the best IBM2 scores: binarizing the alignment scores in dependency projection, while weight-voting the POS tags. We also evaluated a number of different normalization techniques in projection, only to arrive at standardization and softmax as by far the best choices.

Baselines and upper bounds  We compare our systems to three competitive baselines, as well as three informed upper bounds or oracles. First, we list our baselines.

DELEX-MS: This is the multi-source direct delexicalized parser transfer baseline of McDonald et al. (2011), referred to as multi-dir in the original paper.

DCA-PROJ: This is the direct correspondence assumption (DCA)-based approach to projection, i.e., the de facto standard for projecting dependencies. First introduced by Hwa et al. (2005), it was recently elucidated by Tiedemann (2014), whose implementation we follow here. In contrast to our approach, DCA projects trees on a source-target sentence pair basis, relying on heuristics and spurious nodes or edges to maintain the tree structure. In the setup, we basically plug DCA into our projection-voting pipeline instead of our own method.

REPARSE: For this baseline, we parse a target sentence using multiple single-source delexicalized parsers. Then, we collect the output trees in a graph, unit-voting the individual edge weights, and finally using DMST to compute the best dependency tree (Sagae and Lavie, 2006).

Now, we explain the three upper bounds:

DELEX-SB: This result uses the best single-source delexicalized system for a given target language, following McDonald et al. (2013). We parse a target with multiple single-source delexicalized parsers, and select the best-performing one.

SELF-TRAIN: For this result we parse the target-language EBC and WTC data, train parsers on the output predictions, and evaluate the resulting parsers on the evaluation data. Note that this result is available only for the source languages. Also, note that while we refer to this as self-training, we do not concatenate the EBC/WTC training data with the source treebank data. This upper bound tells us something about the usefulness of the parallel corpus texts.

FULL: Direct in-language supervision, only available for the source languages. We train parsers on the source treebanks, and use them to parse the source test sets.

Evaluation  All our datasets—projected, training, and test sets—contain only the following CoNLL-X features: ID, FORM, CPOSTAG, and HEAD (http://ilk.uvt.nl/conll/#dataformat). For simplicity, we do not predict dependency labels (DEPREL), and we only report unlabeled attachment scores (UAS). The POS taggers are evaluated for accuracy. We use our IBM1 taggers for all the baselines and upper bounds.

4.2 Results

Our average results are presented in Figure 2, broken down by language family and by the languages for which we had training data (Sources) versus those for which we only had test data (Targets).

                 All      Sources   Targets   IE        Non-IE
Baselines
  DELEX-MS       45.43⋆   45.64⋆    44.59⋆    49.53⋆    34.88†
  DCA-PROJ       47.87†   47.05⋆    47.19⋆    51.33†    40.66†
  REPARSE        47.79⋆   47.87⋆    47.47⋆    51.34⋆    38.67⋆
Our systems
  IBM1 GRAPHS    52.82⋆   53.01⋆    52.07⋆    55.44⋆    46.08⋆
  IBM1 TREES     53.47⋆   53.49⋆    53.38⋆    55.91⋆    47.19⋆
  IBM2 GRAPHS    46.44†   46.14⋆    44.39⋆    49.54†    38.47†
  IBM2 TREES     46.48†   46.67⋆    45.54⋆    49.58†    38.93⋆
Upper bounds
  DELEX-SB       48.52⋆   48.64⋆    48.02⋆    50.91⋆    42.35⋆
  SELF-TRAIN     —        58.38⋆    —         —         —
  FULL           —        72.55⋆    —         —         —

Table 2: Overview of the parsing experiment results for the 25 languages in EBC ∩ WTC. We report the best average UAS score per system and language subset. IE: Indo-European languages; †: EBC, ⋆: WTC.

We see that our systems are substantially better than multi-source delexicalized transfer, DCA, and reparsing based on delexicalized transfer models. Focusing on our system results, we see that projection with IBM1 leads to better models than projection with IBM2. We also note that our improvements are biggest with non-Indo-European languages. Our IBM1-based parsers top the ones using IBM2 alignment by 6 points UAS on Indo-European languages, while the difference amounts to almost 10 points UAS on non-Indo-European languages (cf. Table 2). This difference in scores exposes a systematic bias towards more closely related languages in work using even more advanced word alignment (Tiedemann and Agić, 2016).
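For reference, the parsing numbers in Tables 2 and 3 are unlabeled attachment scores and the tagging numbers are token-level accuracies; a minimal sketch of both metrics (a hypothetical helper, not the evaluation script used for the paper):

    def uas(gold_heads, predicted_heads):
        """Unlabeled attachment score: percentage of tokens with the correct head."""
        correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
        return 100.0 * correct / len(gold_heads)

    def tagging_accuracy(gold_tags, predicted_tags):
        """Token-level POS tagging accuracy."""
        correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
        return 100.0 * correct / len(gold_tags)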
The detailed results using the Watchtower Corpus are listed in Table 3, where we also list the POS tagging accuracies. Note that these are not directly comparable to Agić et al. (2015), since they use a more coarse-grained tagset, and the results listed here are using WTC. We list the detailed results with the Bible Corpus online.17 The tendencies are the same, but the results are slightly lower almost con- sistently across the board. Finally, we observe that our results are also bet- ter than those that can be obtained using a predictive model to select the best source language for delexi- 17https://bitbucket.org/lowlands/release 308 http://ilk.uvt.nl/conll/#dataformat https://bitbucket.org/lowlands/release POS MULTI-PROJ Baselines Upper bounds Sources IBM1 IBM2 IBM1: GRAPHS TREES IBM2: GRAPHS TREES DELEX-MS DCA-PROJ REPARSE DELEX-SB SELF-TRAIN FULL Arabic 54.40 44.48 39.55 39.24 28.74 29.58 21.15 32.14 26.24 32.59 pl 49.47 70.79 Bulgarian 70.02 56.02 49.79 49.02 39.15 38.54 48.37 37.18 52.09 49.32 da 53.46 74.01 Croatian 76.56 74.21 55.96 55.33 49.20 50.34 45.49 50.56 48.69 46.68 cs 54.77 68.69 Czech 79.67 72.01 52.42 53.09 42.80 43.33 47.99 44.36 49.65 47.80 sl 57.77 70.36 Danish 86.20 84.66 61.27 62.26 54.82 56.41 55.96 58.64 57.13 56.21 no 64.69 71.23 Dutch 69.51 70.10 57.75 58.93 54.82 55.28 54.35 55.03 56.46 56.64 pt 61.64 72.23 English 78.92 77.22 60.00 61.21 56.46 56.72 53.87 57.12 55.13 52.62 no 66.53 76.18 Farsi 33.66 32.86 26.98 24.49 19.27 18.79 19.48 12.26 20.83 24.65 ar 22.62 64.86 Finnish 69.63 58.29 42.00 43.19 31.91 32.04 41.52 35.60 44.91 43.20 no 51.23 59.35 French 80.36 75.67 56.64 57.79 49.71 48.76 51.53 51.47 51.85 53.49 it 59.91 75.36 German 69.97 62.48 45.73 46.54 38.73 37.88 45.79 36.70 47.21 45.12 no 50.62 67.36 Hebrew 63.01 51.78 45.40 45.59 34.02 35.46 25.02 36.34 27.68 41.71 id 51.37 60.26 Hindi 50.52 42.11 16.84 17.05 15.34 14.36 21.04 10.77 20.90 25.06 fi 44.17 82.63 Indonesian 75.49 70.75 58.18 59.58 48.05 49.76 39.67 52.29 44.80 48.43 he 65.91 73.74 Italian 85.93 82.35 65.84 66.29 60.81 61.25 58.06 63.57 60.03 61.91 es 69.37 79.21 Norwegian 85.84 83.57 66.80 67.30 63.56 64.60 60.11 64.37 62.21 61.12 da 72.42 79.08 Polish 73.84 69.08 62.62 63.46 53.51 56.55 54.87 55.40 56.37 54.31 cs 63.80 76.37 Portuguese 84.22 82.33 63.94 64.91 61.01 61.59 56.99 63.16 58.27 57.79 es 68.80 77.66 Slovene 78.36 74.40 60.69 61.51 53.61 54.15 52.53 54.80 54.15 53.43 cs 63.76 73.63 Spanish 86.39 84.05 64.24 65.39 61.34 61.09 55.87 61.90 59.30 56.25 it 70.16 75.73 Swedish 86.28 84.43 65.28 66.52 60.74 62.16 57.48 62.45 59.94 61.12 no 66.80 74.86 Targets Estonian 75.76 68.76 63.43 65.94 51.31 57.58 48.48 58.41 54.34 58.62 no — — Greek 75.04 63.57 60.86 61.69 50.19 50.06 54.90 52.95 59.29 55.27 no — — Hungarian 73.70 69.52 47.80 50.84 44.83 45.38 46.66 42.33 49.85 47.62 fi — — Quechua 19.49 15.19 26.17 25.93 23.15 22.74 21.67 27.48 22.87 24.30 pl — — Romanian 78.08 74.67 62.08 62.52 52.46 51.95 51.23 54.78 51.01 54.27 id — — Tamil 44.27 35.23 22.41 22.16 21.61 17.34 34.07 15.98 34.67 37.99 hi — — Averages All 70.56 65.18 51.88 52.51 45.23 45.69 45.34 46.22 47.62 48.43 — — Sources 73.28 68.23 53.2 53.75 46.55 47.08 46.05 47.43 48.28 49.02 58.54 72.55 Targets 61.06 54.49 47.13 48.18 40.59 40.84 42.84 41.99 45.34 46.35 — — Table 3: POS tagging accuracies and UAS parsing scores for the models built using WTC data. The results are split for source and target languages. 
All baselines and upper bounds use IBM1 POS taggers, while our MULTI-PROJ systems use their respective IBM1 or IBM2 taggers. calized transfer (Rosa and Žabokrtský, 2015); and better than what can be obtained using an oracle (DELEX-SB) to select the source language. Direct supervision (FULL) upper bound unsur- prisingly records the highest scores in the experi- ment, as it uses biased in-language and in-domain training data. We also experiment with learning curves for direct supervision, with a goal of estab- lishing the amount of manually annotated sentences needed to beat our cross-lingual systems. We find that for most languages this number falls within the range of 100-400 in-domain sentences. 5 Discussion Function words In UD, a subset of function words—tags: ADP, AUX, CONJ, SCONJ, DET, PUNCT—have to be leaves in the dependency trees, unless, e.g., they participate in multiword expres- sions. Our predictions show some violations of this constraint (less than 1% of all words with these POS), but this ratio is similar to the amount of vi- olations found in the test data. Projectivity The UD treebanks are in general largely projective. Our UD test languages have an average of 89% fully projective sentences. How- ever, with IBM1 for example, we only predict 55% of all sentences to be projective. Regardless of the differences in UAS, we observe a corpus effect in the difference of projectivity of the predictions between using EBC (65%) and WTC (55%). We attribute the higher level of projectivity of EBC-projected tree- banks to Bible sentences being shorter. The least projective predictions are Farsi (17%) and Hindi (19%), for which we also obtain the low- est UASs. This may be a consequence of our naive tokenization, yielding unreliable alignments. How- ever, projectivity correlates more with UAS (ρ = 0.56) than with POS prediction accuracy (ρ = 0.34). Dependency length We observe that the average edge length on IBM1 and WTC is of 2.95, while for EBC it is 2.67. The average gold edge length is 309 3.6—which is significantly higher at p < 0.05 (Stu- dent’s t-test). However, the variance in gold edge length is about 1.2 times the deviation of predicted edge length. In other words, gold edges are often longer and more far-reaching. This difference in- dicates our predictions have worse recall for longer dependencies such as subordinate clauses, while be- ing more accurate in local, phrasal contexts. POS errors Unlike most previous work on cross- lingual dependency parsing, and following the no- table exception of McDonald et al. (2011), we rely on POS predictions from cross-lingual transfer mod- els. One may hypothesize that there is a significant error propagation from erroneous POS projection. We observe, however, that about 40% of wrong POS predictions are nevertheless assigned the right syn- tactic head. We argue that the fairly uniform noise on the POS labels helps the parsers regularize over the POS-dependency relations. Possible improvements We treat POS and syn- tactic dependencies as two separate annotation lay- ers and project them independently in our approach. Moreover, we project edge scores for dependencies, in contrast to only the single-best source POS tags. Johannsen et al. (2016) introduce an approach to joint projection of POS and dependencies, showing that exploiting the interactions between the two lay- ers yields even better cross-lingual parsers. Their approach also accounts for transferring tag distribu- tions instead of single-best POS tags. 
All the parsers in our experiments are restricted to 20k training sentences. EBC and WTC texts offer up to 120k training instances per language. We observe limited benefits of going beyond our training set cap, indicating that a more elaborate instance selection-based approach would be more beneficial than just adding more training data.

In our dependency graph projection, we normalize the weights per sentence. For future development, we note that corpus-level normalization might achieve the same balancing effect while still preserving possibly important language-specific signals regarding structural disambiguations.

EBC and WTC constitute a (hopefully small) subset of the publicly available multilingual parallel corpora. The outdated EBC texts can be replaced by newer ones, and the EBC itself replaced or augmented by other online sources of Bible translations. Other sources include the UN Declaration of Human Rights, translated into 467 languages (http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx), and repositories of movie subtitles, software localization files, and various other parallel resources, such as OPUS (Tiedemann, 2012; http://opus.lingfil.uu.se/). Our approach is language-independent and would benefit from extension to datasets beyond EBC and WTC.

6 Related work

POS tagging  While annotation projection of POS labels goes back to Yarowsky's seminal work, Das and Petrov (2011) recently renewed interest in this problem. Das and Petrov (2011) go beyond our approach to POS annotation by combining annotation projection and unsupervised learning techniques, but they restrict themselves to Indo-European languages and a coarser tagset. Li et al. (2012) introduce an approach that leverages potentially noisy, but sizeable POS tag dictionaries in the form of Wiktionaries for 9 resource-rich languages. Garrette et al. (2013) also consider the problem of learning POS taggers for truly low-resource languages, but suggest crowdsourcing such POS tag dictionaries.

Finally, Agić et al. (2015) were the first to introduce the idea of learning models for more than a dozen truly low-resource languages in one go, and our contribution can be seen as a non-trivial extension of theirs.

Parsing  With the exception of Zeman and Resnik (2008), initial work on cross-lingual dependency parsing focused on annotation projection (Hwa et al., 2005; Spreyer et al., 2010). McDonald et al. (2011) and Søgaard (2011) simultaneously took up the idea of delexicalized transfer after Zeman and Resnik (2008), but more importantly, they also introduced the idea of multi-source cross-lingual transfer in the context of dependency parsing. McDonald et al. (2011) were the first to combine annotation projection and multi-source transfer, the approach taken in this paper.

Annotation projection has been explored in the context of cross-lingual dependency parsing since Hwa et al. (2005). Notable approaches include the soft projection of reliable dependencies by Li et al. (2014), and the work of Ma and Xia (2014), who make use of the source-side distributions through a training objective function.

Tiedemann and Agić (2016) provide a more detailed overview of model transfer and annotation projection, while introducing a competitive machine translation-based approach to synthesizing dependency treebanks.
In their work, we note the IBM4 word alignments favor more closely related lan- guages, and that building machine translation sys- tems requires parallel data in quantities that far sur- pass EBC and WTC combined. The best results reported to date were presented by Rasooli and Collins (2015). They use the inter- section of languages represented in the Google de- pendency treebanks project and the languages rep- resented in the Europarl corpus. Consequently, their approach—similar to all the other approaches listed in this section—is potentially biased toward closely related Indo-European languages. 7 Conclusions We introduced a novel, yet simple and heuristics- free, method for inducing POS taggers and depen- dency parsers for truly low-resource languages. We only assume the availability of a translation of a set of documents that have been translated into many languages. The novelty of our dependency projec- tion method consists in projecting edge scores rather than edges, and specifically in projecting these anno- tations from multiple sources rather than from only one source. While we built models for more than a hundred languages during our experiments, we eval- uated our approach across 30 languages for which we had test data. The results show that our approach is superior to commonly used transfer methods. Acknowledgements We thank the editors and the anonymous reviewers for their valuable comments. This research is funded by the ERC Starting Grant LOWLANDS (#313695). References Željko Agić, Dirk Hovy, and Anders Søgaard. 2015. If All You Have is a Bit of the Bible: Learning POS Tag- gers for Truly Low-Resource Languages. In ACL. Thorsten Brants. 2000. TnT: A Statistical Part-of- Speech Tagger. In ANLP. Christos Christodouloupoulos and Mark Steedman. 2014. A Massively Parallel Corpus: The Bible in 100 Languages. Language Resources and Evaluation, 49(2). Dipanjan Das and Slav Petrov. 2011. Unsupervised Part- of-Speech Tagging with Bilingual Graph-Based Pro- jections. In ACL. Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameteriza- tion of IBM Model 2. In ACL. Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-World Semi-Supervised Learning of POS- Taggers for Low-Resource Languages. In ACL. Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping Parsers via Syntactic Projection Across Parallel Texts. Natural Language Engineering, 11(3). Anders Johannsen, Željko Agić, and Anders Søgaard. 2016. Joint Part-of-Speech and Dependency Projec- tion from Multiple Sources. In ACL. Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly Supervised Part-of-Speech Tagging. In EMNLP. Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Soft Cross-lingual Syntax Projection for Dependency Parsing. In COLING. Xuezhe Ma and Fei Xia. 2014. Unsupervised Depen- dency Parsing with Transferring Distribution via Par- allel Guidance and Entropy Regularization. In ACL. André F. T. Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers. In ACL. Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online Large-Margin Training of Dependency Parsers. In ACL. Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-Source Transfer of Delexicalized Dependency Parsers. In EMNLP. 
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach- Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal Dependency An- notation for Multilingual Parsing. In ACL. Joakim Nivre, Željko Agić, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Balles- teros, John Bauer, Kepa Bengoetxea, Riyaz Ah- mad Bhat, Cristina Bosco, Sam Bowman, Giuseppe G. A. Celano, Miriam Connor, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Do- brovoljc, Timothy Dozat, Tomaž Erjavec, Richárd 311 Farkas, Jennifer Foster, Daniel Galbraith, Filip Gin- ter, Iakes Goenaga, Koldo Gojenola, Yoav Gold- berg, Berta Gonzales, Bruno Guillaume, Jan Hajič, Dag Haug, Radu Ion, Elena Irimia, Anders Jo- hannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Veronika Laippala, Alessandro Lenci, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martı́nez Alonso, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simon- etta Montemagni, Shunsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Prokopis Prokopidis, Sampo Pyysalo, Loganathan Ramasamy, Rudolf Rosa, Shadi Saleh, Sebastian Schuster, Wolf- gang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Kiril Simov, Aaron Smith, Jan Štěpánek, Alane Suhr, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Sumire Uematsu, Lar- raitz Uria, Viktor Varga, Veronika Vincze, Zdeněk Žabokrtský, Daniel Zeman, and Hanzhi Zhu. 2015. Universal Dependencies 1.2. Robert Östling. 2015. Word Order Typology Through Multilingual Word Alignment. In ACL. Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A Universal Part-of-Speech Tagset. In LREC. Mohammad Sadegh Rasooli and Michael Collins. 2015. Density-Driven Cross-Lingual Transfer of Depen- dency Parsers. In EMNLP. Rudolf Rosa and Zdeněk Žabokrtský. 2015. KLcpos3: A Language Similarity Measure for Delexicalized Parser Transfer. In ACL. Kenji Sagae and Alon Lavie. 2006. Parser Combination by Reparsing. In NAACL. Kathrin Spreyer, Lilja Øvrelid, and Jonas Kuhn. 2010. Training Parsers on Partial Trees: A Cross-Language Comparison. In LREC. Anders Søgaard. 2011. Data Point Selection for Cross- Language Adaptation of Dependency Parsers. In ACL. Jörg Tiedemann and Željko Agić. 2016. Synthetic Tree- banking for Cross-Lingual Dependency Parsing. Jour- nal of Artificial Intelligence Research, 55. Jörg Tiedemann, Željko Agić, and Joakim Nivre. 2014. Treebank Translation for Cross-Lingual Parser Induc- tion. In CoNLL. Jörg Tiedemann. 2012. Parallel Data, Tools and Inter- faces in OPUS. In LREC. Jörg Tiedemann. 2014. Rediscovering Annotation Pro- jection for Cross-Lingual Parser Induction. In COL- ING. Dániel Varga, László Németh, Péter Halácsy, András Ko- rnai, Viktor Trón, and Viktor Nagy. 2005. Parallel Corpora for Medium Density Languages. In RANLP. David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing Multilingual Text Analysis Tools via Robust Projection Across Aligned Corpora. In NAACL. Daniel Zeman and Philip Resnik. 2008. Cross-Language Parser Adaptation Between Related Languages. In IJCNLP Workshop on NLP for Less Privileged Lan- guages. 312