Universal Word Segmentation: Implementation and Interpretation

Yan Shao, Christian Hardmeier, Joakim Nivre
Department of Linguistics and Philology, Uppsala University
{yan.shao, christian.hardmeier, joakim.nivre}@lingfil.uu.se

Abstract

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.

1 Introduction

Word segmentation is the initial step for most higher-level natural language processing tasks, such as part-of-speech (POS) tagging, parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string.

Word segmentation can be very challenging, especially for languages without explicit word boundary delimiters, such as Chinese, Japanese and Vietnamese. Even for space-delimited languages like English or Russian, relying on white space alone generally does not result in adequate segmentation, as at least punctuation should usually be separated from the attached words. For some languages, the space-delimited units in the surface form are too coarse-grained and are therefore often further analysed, as in the cases of Arabic and Hebrew. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably well with no or minimal language-specific adaptations.

Word segmentation standards vary substantially with different definitions of the concept of a word. In this paper, we follow the terminology of Universal Dependencies (UD), where words are defined as basic syntactic units that do not always coincide with phonological or orthographic words. Some orthographic tokens, known in UD as multiword tokens, therefore need to be broken into smaller units that cannot always be obtained by splitting the input character sequence.1

1 Note that this notion of multiword token has nothing to do with the notion of multiword expression (MWE) as discussed, for example, in Sag et al. (2002).

To perform word segmentation in the UD framework, neither rule-based tokenisers that rely on white space nor the naive character-level sequence tagging model proposed previously (Xue, 2003) are ideal. In this paper, we present an enriched sequence labelling model for universal word segmentation. It is capable of segmenting languages in very diverse written forms. Furthermore, it simultaneously identifies the multiword tokens defined by the UD framework that cannot be resolved simply by splitting the input character sequence.
We adapt a regular sequence tagging model, namely bidirectional recurrent neural networks with a conditional random field (CRF) interface (Lafferty et al., 2001), as the fundamental framework (BiRNN-CRF) (Huang et al., 2015) for word segmentation.

The main contributions of this work include:

1. We propose a sequence tagging model for word segmentation, both for general purposes (mere splitting) and full UD processing (splitting plus occasional transduction).

2. We investigate the correlation between segmentation accuracy and properties of languages and writing systems, which is helpful in interpreting the gaps between segmentation accuracies across different languages as well as in selecting language-specific settings for the model.

3. Our segmentation system achieves state-of-the-art accuracy on the UD datasets and improves on previous work (Straka and Straková, 2017), especially for the most challenging languages.

4. We provide an open source implementation.2

2 https://github.com/yanshao9798/segmenter

2 Word Segmentation in UD

The UD scheme for cross-linguistically consistent morphosyntactic annotation defines words as syntactic units that have a unique part-of-speech tag and enter into syntactic relations with other words (Nivre et al., 2016). For languages that use whitespace as boundary markers, there is often a mismatch between orthographic words, called tokens in the UD terminology, and syntactic words. Typical examples are clitics, like Spanish dámelo = da me lo (1 token, 3 words), and contractions, like French du = de le (1 token, 2 words). Tokens that need to be split into multiple words are called multiword tokens and can be further subdivided into those that can be handled by simple segmentation, like English cannot = can not, and those that require a more complex transduction, like French du = de le. We call the latter non-segmental multiword tokens. In addition to multiword tokens, the UD scheme also allows multitoken words, that is, words consisting of multiple tokens, such as numerical expressions like 20 000.

3 Word Segmentation and Typological Factors

We begin with an analysis of the difficulty of word segmentation. Word segmentation is fundamentally more difficult for languages like Chinese and Japanese because there are no explicit word boundary markers in the surface form (Xue, 2003). For Vietnamese, the space-segmented units are syllables that roughly correspond to Chinese characters rather than words. To characterise the challenges of word segmentation posed by different languages, we will examine several factors that vary depending on language and writing system. We will refer to these as typological factors, although most of them are only indirectly related to the traditional notion of linguistic typology and depend more on the writing system. The factors are listed below; a sketch of how they can be computed follows the list.

• Character Set Size (CS) is the number of unique characters, which is related to how informative the characters are for word segmentation. Each character contains relatively more information if the character set size is larger.

• Lexicon Size (LS) is the number of unique word forms in a dataset, which indicates how many unique word forms have to be identified by the segmentation system. Lexicon size increases as the dataset grows in size.

• Average Word Length (AL) is calculated by dividing the total character count by the word count. It is negatively correlated with the density of word boundaries. If the average word length is smaller, there are more word boundaries to be predicted.

• Segmentation Frequency (SF) denotes how likely it is that space-delimited units are further segmented. It is calculated by dividing the word count by the space-segment count. Languages like Chinese and Japanese have much higher segmentation frequencies than space-delimited languages.

• Multiword Token Portion (MP) is the percentage of multiword tokens that are non-segmental.

• Multiword Token Set Size (MS) is the number of unique non-segmental multiword tokens.

The last two factors are specific to the UD scheme but can have a significant impact on word segmentation accuracy.
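To make these definitions concrete, the following sketch estimates the six factors from a treebank in CoNLL-U format. It is an illustration rather than the script behind Table 2: the helper typological_factors() is ours, space-delimited units are approximated by orthographic tokens, and MP is computed under one possible reading of the definition above (the share of tokens that are non-segmental multiword tokens).

```python
def typological_factors(conllu_path):
    """Rough estimates of CS, LS, AL, SF, MP and MS from one CoNLL-U treebank."""
    chars, word_forms, nonseg_mwt = set(), set(), set()
    n_words = n_chars = n_tokens = n_nonseg = 0
    mwt_range, mwt_form, mwt_parts = None, None, []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            idx, form = cols[0], cols[1]
            if "." in idx:                         # empty node: no surface form of its own
                continue
            if "-" in idx:                         # multiword token line, e.g. "3-4  du"
                start, end = (int(i) for i in idx.split("-"))
                mwt_range, mwt_form, mwt_parts = (start, end), form, []
                n_tokens += 1
                continue
            i = int(idx)
            n_words += 1
            n_chars += len(form)
            word_forms.add(form)
            chars.update(form)
            if mwt_range and mwt_range[0] <= i <= mwt_range[1]:
                mwt_parts.append(form)
                if i == mwt_range[1]:
                    # non-segmental if the words are not a plain split of the token
                    if "".join(mwt_parts).lower() != mwt_form.lower():
                        nonseg_mwt.add(mwt_form.lower())
                        n_nonseg += 1
                    mwt_range = None
            else:
                n_tokens += 1                      # every word outside an MWT is its own token
    return {"CS": len(chars), "LS": len(word_forms), "AL": n_chars / n_words,
            "SF": n_words / n_tokens,              # words per orthographic token (proxy)
            "MP": n_nonseg / n_tokens,             # one reading of the MP definition
            "MS": len(nonseg_mwt)}
```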
Figure 1: K-Means clustering (K = 6) of the UD languages. PCA is applied for dimensionality reduction.

CS      LS      AL      SF       MP       MS
0.058   0.938   0.101   -0.043   -0.060   -0.028

Table 1: Pearson product-moment correlation coefficients between dataset size and the statistical factors.

All the languages in the UD dataset are characterised and grouped by the typological factors in Figure 1. We standardise the statistics x of the proposed factors on the UD datasets with the arithmetic mean µ and the standard deviation σ as (x − µ)/σ. We use them as features and apply K-Means clustering (K = 6) to group the languages. Principal component analysis (PCA) (Abdi and Williams, 2010) is used for dimensionality reduction and visualisation.

The majority of the languages in UD are space-delimited with few or no multiword tokens, and they are grouped at the bottom left of Figure 1. They are statistically similar from the perspective of word segmentation. The Semitic languages Arabic and Hebrew, with rich non-segmental multiword tokens, are positioned at the top. In addition, languages with large character sets and high segmentation frequencies, such as Chinese, Japanese and Vietnamese, are clustered together. Korean is distanced from the other space-delimited languages as it contains white-space delimiters but has a comparatively large character set. Overall, the x-axis of Figure 1 is primarily related to character set size and segmentation frequency, while the y-axis is mostly associated with multiword tokens.
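A minimal sketch of this grouping step with scikit-learn, rather than the exact setup behind Figure 1; the per-treebank factor values are assumed to be available, for example from the typological_factors() sketch above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

FACTOR_KEYS = ("CS", "LS", "AL", "SF", "MP", "MS")

def cluster_languages(factors, k=6):
    """factors: {treebank name: {factor: value}}; needs at least k treebanks."""
    names = sorted(factors)
    X = np.array([[factors[n][key] for key in FACTOR_KEYS] for n in names], dtype=float)
    X_std = StandardScaler().fit_transform(X)            # (x - mu) / sigma per factor
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    coords = PCA(n_components=2).fit_transform(X_std)    # 2-D projection for plotting
    return {name: (int(label), tuple(xy)) for name, label, xy in zip(names, labels, coords)}
```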
Language             CS    LS       AL     SF     MP       MS
Czech                140   125,342  4.83   1.26   0.0018   9
Czech-CAC            93    66,256   5.06   1.20   0.0022   12
Czech-CLTT           96    2,774    5.30   1.14   0.0005   1
English              108   19,672   4.06   1.24   0.0      0
English-LinES        82    7,436    4.01   1.22   0.0      0
English-ParTUT       94    5,532    4.50   1.22   0.0002   6
Finnish              244   49,210   6.49   1.28   0.0      0
Finnish-FTB          95    39,717   5.94   1.14   0.0      0
French               298   42,250   4.33   1.27   0.0281   9
French-ParTUT        96    3,364    4.53   1.27   0.0344   4
French-Sequoia       108   8,452    4.48   1.29   0.0277   7
Latin                57    6,927    5.05   1.28   0.0      0
Latin-ITTB           42    12,526   5.06   1.24   0.0      0
Portuguese           114   26,653   4.15   1.32   0.0746   710
Portuguese-BR        186   29,906   4.11   1.29   0.0683   35
Russian              189   25,708   5.21   1.26   0.0      0
Russian-SynTagRus    157   107,890  5.12   1.30   0.0      0
Slovenian            99    29,390   4.63   1.23   0.0      0
Slovenian-SST        40    4,534    4.29   1.12   0.0      0
Swedish              86    12,911   4.98   1.20   0.0      0
Swedish-LinES        86    9,659    4.50   1.19   0.0      0

Table 2: The typological factors for different UD datasets of the same languages.

Dataset sizes for different languages in UD vary substantially. Table 1 shows the correlation coefficients between the dataset size in sentence number and the six typological factors. Apart from the lexicon size, all the other factors, including multiword token set size, have no strong correlations with dataset size. From Table 2, we can see that the factors, except for lexicon size, are relatively stable across different UD treebanks for the same language, which indicates that they do capture properties of these languages, although some variation inevitably occurs due to corpus properties like genre.

In this paper, we thoroughly investigate the correlations between the proposed statistical factors and segmentation accuracy. Moreover, we aim to find specific settings that can be applied to improve segmentation accuracy for each language group.

Char. On considère qu’environ 50 000 Allemands du Wartheland ont péri pendant la période.
Tags  BEXBIIIIIIIEXBIEBIIIIIEXBIIIIEXBIIIIIIIEXBEXBIIIIIIIIEXBIEXBIIEXBIIIIIEXBEXBIIIIIES

Figure 2: Tags employed for word segmentation. 50 000 is a multitoken word, while qu’environ and du are multiword tokens that should be processed differently.

4 Sequence Tagging Model

Word segmentation can be modelled as a character-level sequence labelling task (Xue, 2003; Chen et al., 2015). Characters as basic input units are passed into a sequence labelling model, and a sequence of tags that are associated with word boundaries is predicted. In this section, we introduce the boundary tags adopted in this paper.

Theoretically, binary classification is sufficient to indicate whether a character is the end of a word for segmentation. In practice, more fine-grained tagsets result in higher segmentation accuracy (Zhao et al., 2006). Following the work of Shao et al. (2017), we employ a baseline tagset consisting of four tags: B, I, E, and S, to indicate a character positioned at the beginning (B), inside (I), or at the end (E) of a word, or occurring as a single-character word (S).

The baseline tagset can be applied to word segmentation of Chinese and Japanese without further modification. For languages with space delimiters, we add an extra tag X to mark the characters, mostly spaces, that do not belong to any words/tokens. As illustrated in Figure 2, the regular spaces are marked with X while the space in a multitoken word like 50 000 is disambiguated with I.

To enable the model to simultaneously identify non-segmental multiword tokens for languages like Spanish and Arabic in the UD framework, we extend the tagset by adding four tags B′, I′, E′, S′ that correspond to B, I, E, S, to mark the corresponding positions in non-segmental multiword tokens and to indicate their occurrences. As shown in Figure 2, the multiword token qu’environ is split into qu’ and environ and therefore the corresponding tags are BIEBIIIIIE. This contrasts with du, which should be transduced into de and le. Moreover, the extra tags disambiguate whether the multiword tokens should be split or transduced according to the context. For instance, ومما (wamimma) in Arabic is occasionally split into و (wa) and مما (mimma), but is more frequently transduced into و (wa), من (min) and ما (ma). The corresponding tags are SBIE and B′I′I′E′, respectively. The transduction of the identified multiword tokens is described in detail in the following section.

The complete tagset is summarised in Table 3.

Tags                                 Applied Languages
Baseline Tags       B, I, E, S       Chinese, Japanese, ...
Boundary            X                Russian, Hindi, ...
Transduction        B′, I′, E′, S′   Spanish, Arabic, ...
Joint Sent. Seg.    T, U             All languages

Table 3: Tag set for universal word segmentation.
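For plain splitting (no transduction), the mapping from a segmented sentence to this tag sequence is mechanical. The sketch below illustrates the scheme for segmental tokens only (the transduction tags B′, I′, E′, S′ are not handled); it is an illustration of the tagset, not the preprocessing code of our system.

```python
def boundary_tags(sentence, words):
    """Tag every character of `sentence` with B/I/E/S/X, given its gold word list.

    Assumes the words occur in `sentence` in order and can be located by plain
    string matching, i.e. only segmental tokens.
    """
    tags = ["X"] * len(sentence)       # default: character belongs to no word (e.g. spaces)
    pos = 0
    for word in words:
        start = sentence.index(word, pos)
        end = start + len(word)
        if len(word) == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
        pos = end
    return "".join(tags)

# Simplified example from Figure 2: the multitoken word "50 000" keeps its
# internal space, which therefore receives I rather than X.
print(boundary_tags("On considère qu'environ 50 000 Allemands",
                    ["On", "considère", "qu'", "environ", "50 000", "Allemands"]))
# BEXBIIIIIIIEXBIEBIIIIIEXBIIIIEXBIIIIIIIE
```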
The proposed sequence model can easily be extended to perform joint sentence segmentation by adding two more tags to mark the last character of a sentence (de Lhoneux et al., 2017). T is used if the character is a single-character word and U otherwise. T and U can be used together with B, I, E, S, X for general segmentation, or with B′, I′, E′, S′ additionally for full UD processing. Joint sentence segmentation is not addressed any further in this paper.

5 Neural Networks for Segmentation

5.1 Main network

The main network for regular segmentation as well as non-segmental multiword token identification is an adaptation of the BiRNN-CRF model (Huang et al., 2015) (see Figure 3).

Figure 3: The BiRNN-CRF model for segmentation, illustrated on the Chinese input 夏天太热 ('summer', 'too', 'hot') with character representations, forward and backward GRU layers and a CRF layer. The dashed arrows indicate that dropout is applied.

The input characters can be represented as conventional character embeddings. Alternatively, we employ the concatenated 3-gram model introduced by Shao et al. (2017). In this representation (Figure 4), the pivot character in a given context is represented as the concatenation of the character vector representation along with the local bigram and trigram vectors. The concatenated n-grams encode rich local information, as the same character has different yet closely related vector representations in different contexts. For each n-gram order, we use a single vector to represent the terms that appear only once in the training set while training. These vectors are later used as the representations for unknown characters and n-grams in the development and test sets. All the embedding vectors are initialised randomly.

Figure 4: Concatenated 3-gram model. The third character is the pivot character in the given context.

The character vectors are passed to the forward and backward recurrent layers. Gated recurrent units (GRU) (Cho et al., 2014) are employed as the basic recurrent cell to capture long-term dependencies and sentence-level information. Dropout (Srivastava et al., 2014) is applied to both the inputs and the outputs of the bidirectional recurrent layers. A first-order chain CRF layer is added on top of the recurrent layers to incorporate transition information between consecutive tags, which ensures that the optimal sequence of tags over the entire sentence is obtained. The optimal sequence can be computed efficiently via the Viterbi algorithm.

5.2 Transduction

The non-segmental multiword tokens identified by the main network are transduced into their corresponding components in an additional step. Based on the statistics of the multiword tokens to be transduced on the entire UD training sets, 98.3% have only one possible transduction, which indicates that the main ambiguity of non-segmental multiword tokens comes with identification, not transduction. We therefore transduce the identified non-segmental multiword tokens in a context-free fashion. For multiword tokens with two or more valid transductions, we only adopt the most frequent one.

In most languages that have multiword tokens, the number of unique non-segmental multiword tokens is rather limited, as in Spanish, French and Italian. For these languages, we build dictionaries from the training data to look up the multiword tokens.
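A minimal sketch of this dictionary step. It assumes the training data have been read into (surface token, components) pairs, e.g. from the CoNLL-U multiword token ranges; when several analyses are attested for the same surface form, only the most frequent one is kept, matching the context-free strategy described above.

```python
from collections import Counter, defaultdict

def build_mwt_dictionary(pairs):
    """pairs: iterable of (surface_token, tuple_of_components) from the training set."""
    counts = defaultdict(Counter)
    for surface, components in pairs:
        counts[surface.lower()][tuple(components)] += 1
    # keep only the most frequent analysis per surface form (context-free transduction)
    return {surface: list(analyses.most_common(1)[0][0])
            for surface, analyses in counts.items()}

mwt_dict = build_mwt_dictionary([("du", ("de", "le")),
                                 ("du", ("de", "le")),
                                 ("au", ("à", "le"))])
print(mwt_dict["du"])   # ['de', 'le']
```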
However, in some languages like Arabic and Hebrew, multiword tokens are very productive and therefore cannot be well covered by dictionaries generated from training data. Some of the available external dictionary resources with larger coverage, for instance the MILA lexicon (Itai and Wintner, 2008), do not follow the UD standards.

In this paper, we propose a generalising approach to processing non-segmental multiword tokens. If there are more than 200 unique multiword tokens in the training set for a language, we train an attention-based encoder-decoder (Bahdanau et al., 2015) equipped with shared long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). At test time, identified non-segmental multiword tokens are first queried in the dictionary. If not found, the segmented components are generated with the encoder-decoder as a character-level transduction. Overall, we utilise rich context to identify non-segmental multiword tokens, and then apply a combination of a dictionary and a sequence-to-sequence encoder-decoder to transduce them.

5.3 Implementation

Our universal word segmenter is implemented using the TensorFlow library (Abadi et al., 2016). Sentences with similar lengths are grouped into the same bucket and padded to the same length. We construct sub-computational graphs for each bucket so that sentences of different lengths are processed more efficiently.

Character embedding size                   50
GRU/LSTM state size                        200
Optimiser                                  Adagrad
Initial learning rate (main)               0.1
Decay rate                                 0.05
Gradient clipping                          5.0
Initial learning rate (encoder-decoder)    0.3
Dropout rate                               0.5
Batch size                                 10

Table 4: Hyper-parameters for segmentation.

Table 4 shows the hyper-parameters adopted for the neural networks. We use one set of parameters for all the experiments as we aim for a simple universal model, although fine-tuning the hyper-parameters on individual languages might result in additional improvements. The encoder-decoder is trained prior to the main network. The weights of the neural networks, including the embeddings, are initialised using the scheme introduced in Glorot and Bengio (2010). The network is trained using back-propagation. All the random embeddings are fine-tuned during training by back-propagating gradients. Adagrad (Duchi et al., 2011) with mini-batches is employed for optimisation. The initial learning rate η0 is updated with a decay rate ρ.

The encoder-decoder is trained with the unique non-segmental multiword tokens extracted from the training set. 5% of the total instances are held out for validation. The model is trained for 50 epochs, and the proportion of outputs that exactly match the references is used for selecting the weights. For the main network, word-level F1-score is used to measure the performance of the model after each epoch on the development set. The network is trained for 30 epochs and the weights of the best epoch are selected.

To increase efficiency and reduce memory demand both for training and decoding, we truncate sentences longer than 300 characters. At decoding time, the truncated sentences are reassembled at the recorded cut-off points in a post-processing step.
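As an illustration of the architecture (not the released TensorFlow implementation), the main network can be sketched with the Keras API plus the CRF utilities from TensorFlow Addons; the layer sizes and dropout follow Table 4, while the vocabulary and tagset sizes below are placeholders.

```python
import tensorflow as tf
import tensorflow_addons as tfa

N_CHARS, N_TAGS, EMB, STATE = 5000, 10, 50, 200   # vocabulary/tagset sizes are placeholders

char_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(N_CHARS, EMB, mask_zero=True)(char_ids)
x = tf.keras.layers.Dropout(0.5)(x)                               # dropout on RNN inputs
x = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(STATE, return_sequences=True))(x)     # forward + backward GRU
x = tf.keras.layers.Dropout(0.5)(x)                               # dropout on RNN outputs
scores = tf.keras.layers.Dense(N_TAGS)(x)                         # per-character tag scores
tagger = tf.keras.Model(char_ids, scores)

# First-order chain CRF on top: transition weights trained via the sequence
# log-likelihood, prediction by Viterbi decoding.
transitions = tf.Variable(tf.zeros((N_TAGS, N_TAGS)))

def crf_loss(tag_ids, scores, lengths):
    log_lik, _ = tfa.text.crf_log_likelihood(scores, tag_ids, lengths, transitions)
    return -tf.reduce_mean(log_lik)

def viterbi_decode(scores, lengths):
    tags, _ = tfa.text.crf_decode(scores, transitions, lengths)
    return tags
```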
6 Experiments

6.1 Datasets and Evaluation

Datasets from Universal Dependencies 2.0 (Nivre et al., 2016) are used for all the word segmentation experiments.3 In total, there are 81 datasets in 49 languages that vary substantially in size. Training sets are available for 45 languages. We follow the standard splits of the datasets. If no development set is available, 10% of the training set is held out for this purpose.

3 We employ the version that was used in the CoNLL 2017 shared task on UD parsing.

We adopt word-level precision, recall and F1-score as the evaluation metrics. The candidate and the reference word sequences in our experiments may not share the same underlying characters due to the transduction of non-segmental multiword tokens. The alignment between the candidate words and the references then becomes unclear, and it is therefore difficult to compute the associated scores. To resolve this issue, we use the longest common subsequence algorithm to align the candidate and the reference words. The matched words are compared and the evaluation scores are computed accordingly:

R = |c ∩ r| / |r|          (1)
P = |c ∩ r| / |c|          (2)
F = 2 · R · P / (R + P)    (3)

where c and r denote the sequences of candidate words and reference words, and |c|, |r| are their lengths. |c ∩ r| is the number of candidate words that are aligned to reference words by the longest common subsequence algorithm. The word-level evaluation metrics adopted in this paper are different from the boundary-based alternatives (Palmer and Burger, 1997). We adapt the evaluation script from the CoNLL 2017 shared task (Zeman et al., 2017) to calculate the scores. In the following experiments, we only report the F1-score.

In the following sections, we thoroughly investigate correlations between several language-specific characteristics and segmentation accuracy. All the experimental results in Section 6.2 are obtained on the development sets. The test sets are reserved for the final evaluation, reported in Section 6.3.

6.2 Language-Specific Characteristics

6.2.1 Word-Internal Spaces

For Vietnamese and other languages with similar historical backgrounds, such as Zhuang and Hmongic languages (Zhou, 1991), the space-delimited syllables containing no punctuation are never segmented but joined into words with word-internal spaces instead. The space-delimited units can therefore be applied as the basic elements for tag prediction if we pre-split punctuation. Word segmentation for these languages thus becomes practically the same as for Chinese and Japanese.

Basic Unit              F1-score   Training Time (s)
Latin Character         82.79      572
Space-delimited Unit    87.62      218

Table 5: Different segmentation units employed for word segmentation on Vietnamese. Concatenated 3-grams are not used.

Table 5 shows that a substantial improvement can be achieved if we use space-delimited syllables as the basic elements for word segmentation of Vietnamese. It also drastically increases both training and decoding speed, as the sequence of tags to be predicted becomes much shorter.

6.2.2 Character Representation

Figure 5: Segmentation results with unigram character embeddings (dashed) and concatenated 3-gram vectors for character representations with different numbers of training instances N (Arabic, Catalan, Chinese, English, Japanese, Spanish).

We apply the regular character embeddings and the concatenated 3-gram vectors introduced in Section 5.1 to the input characters and test their performances respectively. First, the experiments are extensively conducted on all the languages with the full training sets. The results show that the concatenated 3-gram model is substantially better than the regular character embeddings on Chinese, Japanese and Vietnamese, but notably worse on Spanish and Catalan. For all the other languages, the differences are marginal.
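The concatenated 3-gram representation can be sketched as follows: the pivot character vector is concatenated with a bigram and a trigram vector drawn from separate embedding tables, and n-grams seen at most once in training fall back to a shared unknown vector. The n-gram windows follow Figure 4, but the sketch is otherwise illustrative rather than the exact setup of Shao et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding size per n-gram order (placeholder)

def make_table(vocab):
    # one random vector per known n-gram, plus a shared vector for unknown/rare n-grams
    table = {term: rng.normal(size=DIM) for term in vocab}
    table["<UNK>"] = rng.normal(size=DIM)
    return table

def concat_3gram(chars, i, uni, bi, tri, pad="#"):
    """Represent the pivot character chars[i] as [unigram; bigram; trigram] (cf. Figure 4)."""
    left = chars[i - 1] if i > 0 else pad
    right = chars[i + 1] if i + 1 < len(chars) else pad
    u = uni.get(chars[i], uni["<UNK>"])
    b = bi.get(left + chars[i], bi["<UNK>"])            # bigram ending at the pivot
    t = tri.get(left + chars[i] + right, tri["<UNK>"])  # trigram centred on the pivot
    return np.concatenate([u, b, t])                    # 3 * DIM dimensions

sent = "夏天太热"
uni = make_table(set(sent))
bi = make_table({sent[i - 1] + sent[i] for i in range(1, len(sent))})
tri = make_table({sent[i - 1:i + 2] for i in range(1, len(sent) - 1)})
print(concat_3gram(list(sent), 2, uni, bi, tri).shape)   # (150,)
```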
To gain more insights, we select six languages, namely Arabic, Catalan, Chinese, Japanese, English and Spanish, for a more detailed analysis via learning curve experiments. The training sets are gradually extended by 300 sentences at a time. The results are shown in Figure 5. Regardless of the amount of training data and the other typological factors, concatenated 3-grams are better on Chinese and Japanese and worse on Spanish and Catalan. We expect the concatenated 3-gram representation to outperform simple character embeddings on all languages with a large character set but no space delimiters.

Since adopting the concatenated 3-gram model drastically enlarges the embedding space, in the following experiments, including the final testing phase, concatenated 3-grams are only applied to Chinese, Japanese and Vietnamese.

6.2.3 Space Delimiters

Chinese and Japanese are not delimited by spaces. Additionally, continuous writing without spaces (scriptio continua) is evidenced in most Classical Greek and Latin manuscripts. We perform two sets of learning curve experiments to investigate the impact of white space on word segmentation. In the first set, we keep the datasets in their original forms. In the second set, we omit all white space. The experimental results are presented in Figure 6.

Figure 6: Segmentation results with (dashed) and without space delimiters with different numbers of training instances N (Arabic, Chinese, English, Korean, Russian, Spanish).

In general, there are huge discrepancies between the accuracies with and without spaces, showing that white space acts crucially as a word boundary indicator. Retaining the original forms of the space-delimited languages, very high accuracies can be achieved even with small amounts of training data, as the model quickly learns that space is a reliable word boundary indicator. Moreover, when spaces are removed, the scores on the space-delimited languages are lower than those on Chinese with comparable amounts of training data, which shows that Chinese characters are more informative for word boundary prediction, due to the large character set size.

6.2.4 Non-Segmental Multiword Tokens

The concept of multiword tokens is specific to UD. To explore how the non-segmental multiword tokens, as opposed to pure segmentation, influence segmentation accuracy, we conduct relevant experiments on selected languages. Similarly to the previous section, two sets of learning curve experiments are performed. In the second set, all the multiword tokens that require transduction are regarded as single words without being processed. The results are presented in Figure 7.

Figure 7: Segmentation results with and without (dashed) processing non-segmental multiword tokens with different numbers of training instances N (Arabic, French, Hebrew, Italian, Portuguese, Spanish).

Word segmentation with full UD processing is notably more challenging for Arabic and Hebrew. Table 6 shows the evaluation of the encoder-decoder as the transducer for non-segmental multiword tokens on Arabic and Hebrew.

Language   Training   Validation   ACC     MFS
Arabic     3,500      184          77.84   82.64
Hebrew     2,995      157          84.81   92.35

Table 6: Accuracy of the seq2seq transducer on Arabic and Hebrew. Training and Validation give the data sizes; ACC and MFS are the evaluation scores.

The evaluation metrics ACC and MF-score (MFS) are adapted from the metrics used for machine transliteration evaluation (Li et al., 2009).
ACC is exact match and MFS is based on edit distance. The transducer yields relatively higher scores on Hebrew, while Arabic is more challenging to process.

In addition, different approaches to transducing the non-segmental multiword tokens are evaluated in Table 7.

          None    Dictionary   Transducer   Mix
Arabic    94.11   96.74        96.54        97.27
Hebrew    87.17   91.33        88.46        91.85

Table 7: Segmentation accuracies on Arabic and Hebrew with different ways of transducing non-segmental multiword tokens.

In the condition None, the identified non-segmental multiword tokens remain unprocessed. In Dictionary, they are mapped via the dictionary derived from the training data if found in the dictionary. In Transducer, they are all transduced by the attention-based encoder-decoder. In Mix, in addition to utilising the mapping dictionary, the non-segmental terms not found in the dictionary are transduced with the encoder-decoder. The results show that when the encoder-decoder is applied alone, it is worse than only using the dictionaries, but additional improvements can be obtained by combining both of them.

The accuracy differences associated with non-segmental multiword tokens are nonetheless marginal for the other languages, as shown in Figure 7. Regardless of their frequent occurrence, multiword tokens are easy to process in general when the set of unique non-segmental multiword tokens is small.

6.2.5 Correlations with Accuracy

We investigate the correlations between the proposed typological factors in Section 3 and segmentation accuracy using linear regression with Huber loss (Huber, 1964). The factors are used in addition to training set size as the features to predict the segmentation accuracies in F1-score. To collect more data samples, apart from experimenting with the full training data for each set, we also use smaller sets of 500, 1,000 and 2,000 training instances to train the models, provided the training set is large enough. The features are standardised with the arithmetic mean and the standard deviation before fitting the linear regression model.

Figure 8: Correlation coefficients between segmentation accuracy and the typological factors in the linear regression model. The factors are training set size (TS), character set size (CS), lexicon size (LS), average word length (AL), segmentation frequency (SF), multiword token portion (MP) and multiword token set size (MS).

The correlation coefficients of the linear regression model are presented in Figure 8. We can see that segmentation frequency and multiword token set size are negatively correlated with segmentation accuracy. Overall, the UD datasets are strongly biased towards space-delimited languages. Training set size is therefore not a strong factor, as high accuracies can be obtained with small amounts of training data, which is consistent with the results of all the learning curve experiments. The other typological factors, such as average word length and lexicon size, are less relevant to segmentation accuracy. Referring back to Figure 1, segmentation frequency and multiword token set size, the most influential factors, are also the primary principal components that categorise the UD languages into different groups.
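A minimal sketch of this regression analysis with scikit-learn's HuberRegressor; the feature matrix (one row per training configuration, with training set size and the six factors) and the corresponding F1-scores are assumed to have been collected from the experiments described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import HuberRegressor

FEATURES = ["TS", "CS", "LS", "AL", "SF", "MP", "MS"]

def factor_correlations(X, f1_scores):
    """Fit a Huber-loss linear regression of F1 on the standardised features
    and return one coefficient per factor (cf. Figure 8)."""
    X_std = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    model = HuberRegressor().fit(X_std, np.asarray(f1_scores, dtype=float))
    return dict(zip(FEATURES, model.coef_))
```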
6.2.6 Language-Specific Settings

Our model obtains competitive results with only a minimal number of straightforward language-specific settings. Based on the previous analysis of segmentation accuracy and typological factors, and referring back to Figure 1, we apply the following settings, targeting specific language groups, to the segmentation system on the final test sets. The language-specific settings can be applied to new languages beyond the UD datasets based on an analysis of the typological factors (a sketch of such a selection rule follows the list).

1. For languages with word-internal spaces like Vietnamese, we first separate punctuation and then use space-delimited syllables for boundary prediction.

2. For languages with large character sets and no space delimiters, like Chinese and Japanese, we use concatenated 3-gram representations.

3. For languages with more than 200 unique non-segmental multiword tokens, like Arabic and Hebrew, we use the encoder-decoder model for transduction.

4. For other languages, the universal model is sufficient without any specific adaptation.
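The mapping from typological factors to these settings can be written as a simple rule. The thresholds below, apart from the 200-token limit stated in setting 3, are illustrative placeholders rather than values taken from the paper, and the word-internal-space property is assumed to be known for the language.

```python
def select_settings(factors, word_internal_spaces=False):
    """Map typological factors (as computed earlier) to the settings 1-4 above."""
    settings = []
    if word_internal_spaces:                             # e.g. Vietnamese (setting 1)
        settings.append("split punctuation, tag space-delimited syllables")
    elif factors["CS"] > 1000 and factors["SF"] > 2.0:   # illustrative thresholds (setting 2)
        settings.append("concatenated 3-gram character representations")
    if factors["MS"] > 200:                              # productive non-segmental MWTs (setting 3)
        settings.append("encoder-decoder transduction for unknown multiword tokens")
    return settings or ["universal model, no adaptation"]    # setting 4
```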
6.3 Final Results

We compare our segmentation model to UDPipe (Straka and Straková, 2017) on the test sets. UDPipe combines word segmentation, POS tagging, morphological analysis and dependency parsing models in a pipeline. The word segmentation model in UDPipe is also based on RNNs with GRUs. For efficiency, UDPipe has a smaller character embedding size and no CRF interface. It also relies heavily on white space and uses specific configurations for languages in which word-internal spaces are allowed. Automatically generated suffix rules are applied jointly with a dictionary query to handle multiword tokens. Moreover, UDPipe uses language-specific hyper-parameters for Chinese and Japanese.

We employ UDPipe 1.2 with the publicly available UD 2.0 models.4 The presegmented option is enabled, as we assume the input text to be presegmented into sentences so that only word segmentation is evaluated. In addition, the CoNLL shared task involved some test sets for which no specific training data were available. These included a number of parallel test sets of known languages, for which we apply the models trained on the standard treebanks, as well as four surprise languages, namely Buryat, Kurmanji, North Sami and Upper Sorbian, for which we use the small annotated data samples provided in addition to the test sets by the shared task to build models and evaluate on those languages.

4 http://hdl.handle.net/11234/1-2364

The main evaluation results are shown in Table 9. We also report the macro-average F1-scores. The scores of the surprise languages are excluded and presented separately, as no corresponding UDPipe models are available.

Dataset UDPipe This Paper | Dataset UDPipe This Paper | Dataset UDPipe This Paper
Ancient Greek 99.98 99.96 | Ancient Greek-PROIEL 99.99 100.0 | Arabic 93.77 97.16
Arabic-PUD 90.92 95.93 | Basque 99.97 100.0 | Bulgarian 99.96 99.93
Catalan 99.98 99.80 | Chinese 90.47 93.82 | Croatian 99.88 99.95
Czech 99.94 99.97 | Czech-CAC 99.96 99.93 | Czech-CLTT 99.58 99.64
Czech-PUD 99.34 99.62 | Danish 99.83 100.0 | Dutch 99.84 99.92
Dutch-LassySmall 99.91 99.96 | English 99.05 99.13 | English-LinES 99.90 99.95
English-PUD 99.69 99.71 | English-ParTUT 99.60 99.51 | Estonian 99.90 99.88
Finnish 99.57 99.74 | Finnish-FTB 99.95 99.99 | Finnish-PUD 99.64 99.39
French 98.81 99.39 | French-PUD 98.84 97.23 | French-ParTUT 98.97 99.32
French-Sequoia 99.11 99.48 | Galician 99.94 99.97 | Galician-TreeGal 98.66 98.07
German 99.58 99.64 | German-PUD 97.94 97.74 | Gothic 100.0 100.0
Greek 99.94 99.86 | Hebrew 85.16 91.01 | Hindi 100.0 100.0
Hindi-PUD 98.26 98.82 | Hungarian 99.79 99.93 | Indonesian 100.0 100.0
Irish 99.38 99.85 | Italian 99.83 99.54 | Italian-PUD 99.21 98.78
Japanese 92.03 93.77 | Japanese-PUD 93.67 94.17 | Kazakh 94.17 94.21
Korean 99.73 99.95 | Latin 99.99 100.0 | Latin-ITTB 99.94 100.0
Latin-PROIEL 99.90 100.0 | Latvian 99.16 99.56 | Norwegian-Bokmaal 99.83 99.89
Norwegian-Nynorsk 99.91 99.97 | Old Church Slavonic 100.0 100.0 | Persian 99.65 99.62
Polish 99.90 99.93 | Portuguese 99.59 99.10 | Portuguese-BR 99.85 99.52
Portuguese-PUD 99.40 98.98 | Romanian 99.68 99.74 | Russian 99.66 99.96
Russian-PUD 97.09 97.28 | Russian-SynTagRus 99.64 99.65 | Slovak 100.0 99.98
Slovenian 99.93 100.0 | Slovenian-SST 99.91 100.0 | Spanish 99.75 99.85
Spanish-AnCora 99.94 99.93 | Spanish-PUD 99.44 99.39 | Swedish 99.79 99.97
Swedish-LinES 99.93 99.98 | Swedish-PUD 98.36 99.26 | Turkish 98.09 97.85
Turkish-PUD 96.99 96.68 | Ukrainian 99.81 99.76 | Urdu 100.0 100.0
Uyghur 99.85 97.86 | Vietnamese 85.53 87.79 | Average 98.63 98.90

Table 9: Evaluation results on the UD test sets in F1-scores. The datasets are represented by the corresponding treebank codes. The PUD suffix indicates the parallel test data. In the original presentation, two shades of green/red are used for visualisation, with brighter colours for larger differences; green indicates that our system is better than UDPipe and red the opposite.

Our system obtains higher segmentation accuracy overall. It achieves substantially better accuracies on languages that are challenging to segment, namely Chinese, Japanese, Vietnamese, Arabic and Hebrew. The two systems yield very similar scores when these languages are excluded, as shown in Table 8, in which the two systems are also compared with two rule-based baselines, a simple space-based tokeniser and the tokenisation model for English in NLTK (Loper and Bird, 2002). The NLTK model obtains relatively high accuracy while the space-based baseline substantially underperforms, which indicates that relying on white space alone is insufficient for word segmentation in general.

Space   NLTK    UDPipe   This Paper
80.86   95.64   99.47    99.45

Table 8: Average evaluation scores on UD languages, excluding Chinese, Japanese, Vietnamese, Arabic and Hebrew.

On the majority of the space-delimited languages without productive non-segmental multiword tokens, both UDPipe and our segmentation system yield near-perfect scores in Table 9. In general, referring back to Figure 1, languages that are clustered at the bottom-left corner are relatively trivial to segment. The evaluation scores are notably lower on Semitic languages as well as on languages without word delimiters. Nonetheless, our system obtains substantially higher scores on the languages that are more challenging to process.

For Chinese, Japanese and Vietnamese, our system benefits substantially from the concatenated 3-gram character representation, as demonstrated in Section 6.2.2. Besides, we employ a more fine-grained tagset with a CRF loss instead of the binary tags adopted in UDPipe. As presented in Zhao et al. (2006), more fine-grained tagging schemes outperform binary tags, which is also supported by the experimental results on morpheme segmentation reported in Ruokolainen et al. (2013).

We further investigate the merits of the fine-grained tags over the binary tags, as well as the effectiveness of the CRF interface, in the experiments presented in Table 10, using variants of our segmentation system. The fine-grained tags denote the boundary tags introduced in Table 3.
             BT      BT+CRF   FT      FT+CRF
Chinese      90.54   90.66    90.73   91.28
Japanese     91.54   91.64    91.88   91.94
Vietnamese   87.63   87.95    87.61   87.75
Arabic       94.47   96.74    94.73   97.16
Hebrew       85.34   90.74    85.53   91.98

Table 10: Comparison between the binary tags (BT) and the fine-grained tags (FT), as well as the effectiveness of the CRF interface, on the development sets.

The binary tags include two basic tags B, I plus the corresponding tags B′, I′ for non-segmental multiword tokens. White space is marked as I instead of X. The concatenated 3-grams are not applied. In general, the experimental results confirm that the fine-grained tags are more beneficial, except for Vietnamese. The fine-grained tagset contains more structured positional information that can be exploited by the word segmentation model. Additionally, the CRF interface leads to notable improvements, especially for Arabic and Hebrew. The combination of the fine-grained tags with the CRF interface achieves substantial improvements over the basic binary tag model that is analogous to UDPipe.

             Arabic   French   German   Hebrew
UDPipe       79.34    98.91    94.21    71.87
Our model    91.35    97.50    94.21    86.17

Table 11: Percentages of correctly processed multiword tokens on the development sets.

For Arabic and Hebrew, apart from greatly benefiting from the fine-grained tagset and the CRF interface, our model is better at handling non-segmental multiword tokens (Table 11). The attention-based encoder-decoder as the transducer is much more powerful in processing the non-segmental multiword tokens that are not covered by the dictionary than the suffix rules for analysing multiword tokens in UDPipe.

UDPipe obtains higher scores on a few datasets. Our model overfits the small training data of Uyghur, as it yields a 100.0 F1-score on the development set. For a few parallel test sets, there are punctuation marks not found in the training data that cannot be correctly analysed by our system, as it is fully data-driven without any heuristic rules for unknown characters.

             Segmentation accuracy   UDPipe parser UAS   UDPipe parser LAS   Dozat et al. (2017) UAS   Dozat et al. (2017) LAS
Arabic       93.77 / 97.16           72.34 / 78.22       66.41 / 71.79       77.52 / 83.55             72.89 / 78.42
Chinese      90.47 / 93.82           63.20 / 67.91       59.07 / 63.31       71.24 / 76.33             68.20 / 73.04
Hebrew       85.16 / 91.01           62.14 / 71.18       57.82 / 66.59       67.61 / 76.39             64.02 / 72.37
Japanese     92.03 / 93.77           78.08 / 81.77       76.73 / 80.83       80.21 / 83.79             79.44 / 82.99
Vietnamese   85.53 / 87.79           47.72 / 50.87       43.10 / 46.03       50.28 / 53.78             45.54 / 48.86

Table 12: Extrinsic evaluations with dependency parsing on the test sets. Each cell shows UDPipe / This Paper. The parsing accuracies are reported in unlabelled attachment score (UAS) and labelled attachment score (LAS).

                 Space   NLTK    Sample   Transfer
Buryat           71.99   97.99   88.07    97.99 (Russian)
Kurmanji         78.97   97.37   93.37    96.71 (Spanish)
North Sami       79.07   99.20   92.82    99.81 (German)
Upper Sorbian    72.35   94.60   93.34    93.66 (Spanish)

Table 13: Evaluation on the surprise languages.

The evaluation results on the surprise languages are presented in Table 13. In addition to the segmentation models proposed in this paper, we present the evaluation scores of a space-based tokeniser as well as the NLTK model for English. As shown by the previous learning curve experiments in Section 6.2, very high accuracies can be obtained on the space-delimited languages with only small amounts of training data.
However, in case of extreme data sparseness (fewer than 20 training sentences), such as for the four surprise languages in Table 13 and Kazakh in Table 9, the segmentation results are drastically lower, even though the surprise languages are all space-delimited.

For the surprise languages, we find that applying segmentation models trained on a different language with more training data yields better results than relying on the small annotated samples of the target language. Considering that the segmentation model is fully character-based, we simply select the model of the language that shares the most characters with the surprise language as its segmentation model. No annotated data of the surprise language are used for model selection. As shown in Table 13, the transfer approach achieves segmentation accuracies comparable to NLTK. For space-delimited languages with insufficient training data, it may be beneficial to employ a well-designed rule-based word segmenter, as NLTK occasionally outperforms the data-driven approach.
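The character-overlap criterion for choosing a transfer model can be stated directly. The sketch below compares character sets of raw text, which is one simple reading of "shares the most characters" and not necessarily the exact measure used in our experiments.

```python
def select_transfer_model(surprise_text, source_texts):
    """Pick the source language whose character set overlaps most with the surprise language.

    source_texts maps a language name to (a sample of) its raw training text.
    """
    target_chars = set(surprise_text)
    overlaps = {lang: len(target_chars & set(text)) for lang, text in source_texts.items()}
    return max(overlaps, key=overlaps.get)

# e.g. Buryat (Cyrillic script) would typically be routed to the Russian model
print(select_transfer_model("Буряад хэлэн",
                            {"Russian": "русский текст", "Spanish": "texto español"}))
# Russian
```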
As a form of extrinsic evaluation, we test the segmenter in a dependency parsing setup on the datasets where we obtained substantial improvements over UDPipe. We present results for the transition-based parsing model in UDPipe 1.2 and for the graph-based parser by Dozat et al. (2017). The experimental results are shown in Table 12. We can see that word segmentation accuracy has a great impact on parsing accuracy, as the segmentation errors propagate. Having a more accurate word segmentation model is very beneficial for achieving higher parsing accuracy.

7 Related Work

The BiRNN-CRF model was proposed by Huang et al. (2015) and has been applied to a number of sequence labelling tasks, such as part-of-speech tagging, chunking and named entity recognition.

Our universal word segmenter is a major extension of the joint word segmentation and POS tagging system described by Shao et al. (2017). The original model is specifically developed for Chinese and only applicable to Chinese and Japanese. Apart from being language-independent, the model proposed in this paper employs an extended tagset and a complementary sequence transduction component to fully process the non-segmental multiword tokens that are present in a substantial number of languages, Arabic and Hebrew in particular. It is a generalised segmentation and transduction framework.

            This Paper   Shao    Che     Björkelund
Chinese     93.82        95.21   91.19   92.81
Japanese    93.77        94.79   92.95   91.68
Arabic      97.16        –       93.71   95.53
Hebrew      91.01        –       85.16   91.37

Table 14: Comparison between the universal model and the language-specific models.

Our universal model is compared with the language-specific model of Shao et al. (2017) in Table 14. With pretrained character embeddings, ensemble decoding and joint POS tag prediction as introduced in Shao et al. (2017), considerable improvements over the universal model presented in this paper can be obtained. However, the joint POS tagging system is difficult to generalise, as single characters in space-delimited languages are usually not informative for POS tagging. Additionally, compared to Chinese, sentences in space-delimited languages have a much greater number of characters on average. Combining the POS tags with the segmentation tags drastically enlarges the search space and therefore makes the model extremely inefficient both for training and tagging. The joint POS tagging model is nonetheless applicable to Japanese and Vietnamese.

Monroe et al. (2014) present a data-driven word segmentation system for Arabic based on a sequence labelling framework. An extended tagset is designed for Arabic-specific orthographic rules and applied together with hand-crafted features in a CRF framework. It obtains a 98.23 F1-score on the newswire Arabic Treebank,5 97.61 on the Broadcast News Treebank,6 and 92.10 on the Egyptian Arabic dataset.7 For Hebrew, Goldberg and Elhadad (2013) perform word segmentation jointly with syntactic disambiguation using lattice parsing. Each lattice arc corresponds to a word and its corresponding POS tag, and a path through the lattice corresponds to a specific word segmentation and POS tagging of the sentence. The proposed model is evaluated on the Hebrew Treebank (Guthmann et al., 2009). The joint word segmentation and parsing F1-score (76.95) is reported and compared against the parsing score (85.70) with gold word segmentation. The evaluation scores reported in both Monroe et al. (2014) and Goldberg and Elhadad (2013) are not directly comparable to the evaluation scores on Arabic and Hebrew in this paper, as they are obtained on different datasets.

5 LDC2010T13, LDC2011T09, LDC2010T08
6 LDC2012T07
7 LDC2012E93,98,89,99,107,125, LDC2013E12,21

For universal word segmentation, apart from UDPipe described in Section 6.3, there are several systems that are developed for specific language groups. Che et al. (2017) build a similar Bi-LSTM word segmentation model targeting languages without space delimiters like Chinese and Japanese. The proposed model incorporates rich statistics-based features gathered from large-scale unlabelled data, such as character unigram embeddings, character bigram embeddings and the point-wise mutual information of adjacent characters. Björkelund et al. (2017) use a CRF-based tagger for multiword-token-rich languages like Arabic and Hebrew. A predicted Levenshtein edit script is employed to transform the multiword tokens into their components. The evaluation scores on a selected set of languages reported in Che et al. (2017) and Björkelund et al. (2017) are included in Table 14 as well.

More et al. (2018) adapt existing morphological analysers for Arabic, Hebrew and Turkish and present ambiguous word segmentation possibilities for these languages in a lattice format (CoNLL-UL) that is compatible with UD. The CoNLL-UL datasets can be applied as external resources for processing non-segmental multiword tokens.8

8 CoNLL-UL is not evaluated in our experiments as it is very recent work.

8 Conclusion

We propose a sequence tagging model and apply it to universal word segmentation. BiRNN-CRF is adopted as the fundamental segmentation framework, complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens. We propose six typological factors to characterise the difficulty of word segmentation across different languages. The experimental results show that segmentation accuracy is primarily correlated with segmentation frequency as well as the size of the set of non-segmental multiword tokens. Using whitespace as a delimiter is crucial to word segmentation, even if the correlation between orthographic tokens and words is not perfect. For space-delimited languages, very high accuracy can be obtained even with relatively small training sets, while more training data is required for high segmentation accuracy for languages without spaces.
Based on the analysis, we apply a minimal number of language-specific settings to substantially improve the segmentation accuracy for languages that are fundamentally more difficult to process.

The segmenter is extensively evaluated on the UD datasets in various languages and compared with UDPipe. Apart from obtaining nearly perfect segmentation on most of the space-delimited languages, our system achieves high accuracies on languages without space delimiters, such as Chinese and Japanese, as well as on Semitic languages with abundant multiword tokens, like Arabic and Hebrew.

Acknowledgments

We acknowledge the computational resources provided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu). This work is supported by the Chinese Scholarship Council (CSC) (No. 201407930015). We would like to thank the TACL editors and reviewers for their valuable feedback.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283.

Hervé Abdi and Lynne J. Williams. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Anders Björkelund, Agnieszka Falenska, Xiang Yu, and Jonas Kuhn. 2017. IMS at the CoNLL 2017 UD shared task: CRFs and perceptrons meet neural networks. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 40–51.

Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, and Ting Liu. 2017. The HIT-SCIR system for end-to-end parsing of universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 52–62.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term memory neural networks for Chinese word segmentation. In Conference on Empirical Methods in Natural Language Processing, pages 1197–1206.

Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to Universal Dependencies – look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 207–217.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada, August. Association for Computational Linguistics.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks.
In International Conference on Artificial Intelligence and Statistics, pages 249–256.

Yoav Goldberg and Michael Elhadad. 2013. Word segmentation, unknown-word resolution, and morphological agreement in a Hebrew parsing system. Computational Linguistics, 39(1):121–160, March.

Noemie Guthmann, Yuval Krymolowski, Adi Milea, and Yoad Winter. 2009. Automatic annotation of morphosyntactic dependencies in a modern Hebrew. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories, pages 1–12.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Peter J. Huber. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.

Alon Itai and Shuly Wintner. 2008. Language resources for Hebrew. Language Resources and Evaluation, 42(1):75–98.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289.

Haizhou Li, A. Kumaran, Vladimir Pervouchine, and Min Zhang. 2009. Report of NEWS 2009 machine transliteration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 1–18.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70. Association for Computational Linguistics.

Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word segmentation of informal Arabic with domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 206–211.

Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoît Sagot, Djamé Seddah, Dima Taji, and Reut Tsarfaty. 2018. CoNLL-UL: Universal morphological lattices for Universal Dependency parsing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 1659–1666.

David Palmer and John Burger. 1997. Chinese word segmentation and information retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pages 175–178.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria. Association for Computational Linguistics.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15. Springer.

Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre.
2017. Character-based joint segmentation and POS tagging for Chinese using bidirectional RNN-CRF. In Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 173–183.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99.

Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, pages 29–48.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 87–94.

Youguang Zhou. 1991. The family of Chinese character-type scripts. Sino-Platonic Papers, 28.