ReliAble dependency arc recognition

Expert Systems with Applications 41 (2014) 1716–1722

Wanxiang Che, Jiang Guo, Ting Liu*
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
* Corresponding author. E-mail address: tliu@ir.hit.edu.cn (T. Liu).

Keywords: Natural language processing; Syntactic parsing; Dependency parsing; RADAR; Binary classification

Abstract

We propose a novel natural language processing task, ReliAble dependency arc recognition (RADAR), which helps high-level applications better utilize dependency parse trees. We model RADAR as a binary classification problem with imbalanced data, which classifies each dependency arc as correct or incorrect. A logistic regression classifier with appropriate features is trained to recognize reliable dependency arcs (arcs that are correct with high precision). Experimental results show that the classification method outperforms a probabilistic baseline computed from the original graph-based dependency parser.

1. Introduction

As a fundamental task of natural language processing, dependency parsing has become increasingly popular in recent years. It aims to find a dependency parse tree over the words of a sentence. Fig. 1 shows an example of a dependency parse tree, where sbj marks a subject, obj an object, and so on (Johansson & Nugues, 2007). Dependency parsing is widely used: in biomedical text mining (Kim, Ohta, Pyysalo, Kano, & Tsujii, 2009), as well as in textual entailment (Androutsopoulos & Malakasiotis, 2010), information extraction (Wu & Weld, 2010; Banko, Cafarella, Soderland, Broadhead, & Etzioni, 2007), and sentiment analysis (Meena & Prabhakar, 2007).

Fig. 1. An example of dependency parse tree.

The performance of dependency parsing has improved in recent years (Kübler, McDonald, & Nivre, 2009). However, when we migrate dependency parsing systems from laboratory demonstrations to high-level applications, even the best parsers available today still encounter serious difficulties. First, parsing performance usually degrades dramatically in real-world settings because of domain shift. Second, since every parser inevitably makes mistakes during decoding, the output of any dependency parser is fraught with a variety of errors. Thus, in high-level applications that expect correct parsing results, it is extremely important to be able to predict the reliability of auto-parsed output. If these applications use only correct parsing results and ignore incorrect ones, their performance may improve further. For instance, if an entity relation extraction system (a kind of information extraction), which depends heavily on parsing results (Zhang, Zhang, Su, & Zhou, 2006), extracts relations only from correctly parsed sentences, then it can extract more accurate relations and import fewer wrong relations from incorrect parses. Although some relations implied in incorrectly parsed sentences are missed, these missing relations may be extracted from other sentences that are parsed correctly as the data is scaled up to the whole Web.
Most large-margin training algorithms for dependency parsing output models that predict a single parse tree for the input sentence, with no additional confidence information about its correctness. Therefore, an interesting problem is how to judge whether a parsing result is correct. However, it is difficult to obtain a parse tree in which all sub-structures are parsed correctly. The CoNLL 2009 Shared Task results show that only about 40% of English and 35% of Chinese sentences can be parsed completely correctly (Hajič et al., 2009b). Some previous studies have addressed the problem of recognizing reliable parsing results (Reichart & Rappoport, 2007; Dell'Orletta & Venturi, 2011; Kawahara & Kurohashi, 2010; Ravi, Knight, & Soricut, 2008). A parsing result is reliable when it is correct with high precision. However, all of these studies focus on judging whether the parsing result of a whole sentence is reliable, which causes the following problems:

1. Reliable parsing results may still include some wrong parsing sub-structures. Different applications need different key sub-structures: backbone structures are key for semantic role labeling (Gildea & Jurafsky, 2002), and branch structures are important for multiword expressions (Sag, Baldwin, Bond, Copestake, & Flickinger, 2002). If these key sub-structures are parsed incorrectly, then even though the whole sentence is parsed with high reliability, the small errors will still harm the given application. This problem results in low precision.

2. Unreliable parsing results may include some useful sub-structures and should not be discarded entirely. For instance, extracting entity relations is possible if the parse tree path between two entities is correct, even when other parts of the sentence are parsed incorrectly. Discarding unreliably parsed sentences can result in low recall.

Therefore, we propose dependency arcs as novel objects for measuring the reliability of dependency parsing. A dependency arc is reliable when a word finds its parent and the dependency relation between them is labeled correctly with high precision. Once all reliable dependency arcs in a sentence are found, the corresponding parse paths or sub-trees can be mapped out from them. These reliable sub-structures can then be used according to the needs of different applications. By attending to the reliable parts of a sentence and ignoring the unreliable ones, the precision of applications can be improved. Meanwhile, when the number of reliable sub-structures exceeds that extracted from reliable whole sentences, higher recall can be obtained.

The problem of ReliAble Dependency Arc Recognition (RADAR) can be regarded as a binary classification problem. The positive examples are the correctly predicted arcs; the others are negative examples. Thus, the problem reduces to finding an appropriate classifier and proper features. Unlike normal binary classification problems, the data for RADAR are imbalanced.
For state-of-the-art dependency parsers, the LAS (Labeled Attachment Score) reaches about 80% on Chinese data and 90% on English data (Hajič et al., 2009b), which means that the ratio of correct to incorrect dependency arcs is about 4:1 for Chinese and 9:1 for English. Aside from learning from imbalanced data, how to evaluate RADAR is another issue. Normal accuracy-based evaluation is not suitable for this problem: recognizing an incorrect dependency arc as correct costs more than the opposite error, and classification accuracy is not a suitable metric in an imbalanced scenario anyway. Therefore, more appropriate evaluation criteria are needed.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the proposed method. Section 4 discusses the experimental setting and results. We conclude and set directions for future work in Sections 5 and 6, respectively.

2. Related work

To the best of our knowledge, Yates, Schoenmackers, and Etzioni (2006) were the first to explicitly address the parsing reliability recognition problem. They detected erroneous parses using web-based semantics. In addition, an ensemble method that trains different parsers on different data sampled from a training corpus to select high-quality parsing results has been proposed (Reichart & Rappoport, 2007). Dell'Orletta and Venturi (2011) also detected reliable dependency parses with some heuristic features. Kawahara and Uchimoto (2008) classified sentences into two classes, reliable and unreliable, with a binary classifier. Ravi et al. (2008) predicted the accuracy of a parser on sets of sentences by fitting a real accuracy curve with a linear regression algorithm (when the size of a set is 1, the accuracy of a single sentence can be predicted). However, all of these works focus on recognizing reliable parsing results for whole sentences, and thus cause the problems for applications discussed in Section 1.

Although sentence-level parsing reliability recognition can be used in active learning (Settles, 2010) or semi-/un-supervised learning (Goldwasser, Reichart, Clarke, & Roth, 2011), recognizing the reliability of sub-structures is also useful. For instance, some studies (van Noord, 2007; Chen, Kawahara, Uchimoto, & Zhang, 2008; Chen, Kazama, Uchimoto, & Torisawa, 2009) used sub-trees or word pairs extracted from a large auto-parsed corpus to help the dependency parser. However, the confidence of a sub-tree or word pair is expressed only by its count in the corpus. Their methods may therefore be biased toward frequently appearing sub-trees or word pairs, which may be incorrect, and penalize sparse but correct ones.

The studies most relevant to ours are Atserias, Attardi, Simi, and Zaragoza (2010) and Mejer and Crammer (2012). Both address a problem similar to ours. Atserias et al. (2010) show how to use the probability scores that a transition-based parser normally computes in order to assign a confidence score to parse trees. They assign such a score to each arc, and their active learning application uses the worst. Independently, Mejer and Crammer (2012) describe several methods for estimating confidence in the per-edge correctness of a predicted dependency parse.
The best method identified in their study is based on model re-sampling, which is inefficient. Our work differs in that we propose a novel supervised approach that makes use of additional information as features for learning models.

3. Method description

This section introduces the dependency parsing model and a method to estimate the probability of each dependency arc, followed by a binary classification method for recognizing reliable arcs. Besides a classifier, the classification method includes three sorts of features and a process for constructing training data.

3.1. Graph-based dependency parsing

Given an input sentence $x = w_1 \ldots w_n$, a dependency tree is denoted by $d = \{(h, m, l) : 0 \le h \le n,\ 0 < m \le n,\ l \in \mathcal{L}\}$, where $(h, m, l)$ represents a dependency arc $w_h \rightarrow w_m$ whose head word (or father) is $w_h$ and whose modifier (or child) is $w_m$, with dependency label $l$; $\mathcal{L}$ is the set of all possible dependency relation labels. An artificial node $w_0$, which always points to the root of the sentence, is used to simplify the formalization. An optimal dependency tree $\hat{d}$ is then determined for $x$:

$$\hat{d} = \operatorname*{arg\,max}_{d} \mathrm{Score}(x, d)$$

Recently, graph-based dependency parsing has gained interest due to its state-of-the-art performance (Kübler et al., 2009). Graph-based dependency parsing views the problem as finding the highest-scoring tree in a directed graph. With dynamic programming decoding, it can efficiently find an optimal tree in a huge search space. In a graph-based model, the score of a dependency tree is factored into scores of small parts (sub-trees):

$$\mathrm{Score}(x, d) = \mathbf{w} \cdot \mathbf{f}(x, d) = \sum_{p \subseteq d} \mathrm{Score}(x, p)$$

where $\mathbf{f}(x, d)$ is the feature vector, $\mathbf{w}$ is the corresponding weight vector, and $p$ is a scoring part that contains one or more dependency arcs of the dependency tree $d$.

The present paper aims to realize RADAR for the graph-based parsing algorithm. The most intuitive method is to use the probability of a dependency arc to denote its reliability. However, the graph-based parsing method is based on a discriminative model rather than a generative one, so we can only obtain the score of each arc ($\mathrm{Score}(x, p)$), not an exact probability. To overcome this difficulty, we follow the idea of Koo, Globerson, Carreras, and Collins (2007) to estimate the probability of a dependency arc. Before estimating arc probabilities, the probability of a dependency tree $d$, obtained by exponentiating and renormalizing the score $\mathrm{Score}(x, d)$, has to be computed:

$$P(d \mid x; M) = \frac{\exp(\mathrm{Score}(x, d))}{Z(x; M)}, \qquad Z(x; M) = \sum_{d' \in \mathcal{T}(x)} \exp(\mathrm{Score}(x, d'))$$

where $M$ is the dependency parsing model, $Z(x; M)$ is a normalization factor, and $\mathcal{T}(x)$ is the set of all possible parse trees for the sentence. Given the conditional distribution $P(d \mid x; M)$, the probability of a dependency arc $(h, m, l)$ is:

$$P((h, m, l) \mid x; M) = \sum_{d' \in \mathcal{T}(x) : (h, m, l) \in d'} P(d' \mid x; M)$$

Note that both probabilities require a summation over the set $\mathcal{T}(x)$, which is exponential in the sentence length $n$. For simplicity, we use the k-best dependency tree list in place of $\mathcal{T}(x)$, where k is set to 1000 in the experiments.
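To make the approximation concrete, the following Python sketch computes arc probabilities from a k-best list. The input format (a list of (score, arcs) pairs) and the function name are illustrative assumptions of this sketch, not the parser's actual API.

```python
import math
from collections import defaultdict

def arc_probabilities(kbest):
    """Estimate P((h, m, l) | x; M) from a k-best tree list.

    `kbest` is assumed to be a list of (score, arcs) pairs, where
    `arcs` is a set of (head, modifier, label) triples.
    """
    # Exponentiate and renormalize the tree scores (a softmax);
    # subtract the max score first for numerical stability.
    max_score = max(score for score, _ in kbest)
    exp_scores = [math.exp(score - max_score) for score, _ in kbest]
    z = sum(exp_scores)  # Z(x; M), restricted to the k-best set

    # An arc's probability is the total probability mass of the
    # k-best trees that contain it.
    prob = defaultdict(float)
    for exp_score, (_, arcs) in zip(exp_scores, kbest):
        for arc in arcs:
            prob[arc] += exp_score / z
    return dict(prob)
```

With k = 1000 as in the experiments, this replaces the exponential summation over $\mathcal{T}(x)$ with a tractable one over the k-best list.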
3.2. RADAR as binary classification

In a dependency parse tree, all dependency arcs are classified into two classes: positive (the modifier and head words of an arc are extracted correctly and the syntactic relation is correct; equivalently, the correct head word and dependency relation are found for the word) and negative (otherwise). RADAR can thus naturally be regarded as a binary classification problem. For this classification problem, three questions need to be answered:

1. Which classifier is used?
2. What features are extracted?
3. How is the training data constructed?

3.2.1. Classifier

The logistic regression classifier (Bishop, 2006) is used in the experiments. The main reason for selecting logistic regression is that it can estimate the probability of each class, which is later used as the criterion for RADAR. In addition, logistic regression is fast to train and effective for prediction. The present study uses L2-regularized logistic regression, which solves:

$$\operatorname*{arg\,min}_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 + C^{+} \sum_{\{i \mid y_i = +1\}} \log\left(1 + e^{-y_i \mathbf{w}^{T} x_i}\right) + C^{-} \sum_{\{i \mid y_i = -1\}} \log\left(1 + e^{-y_i \mathbf{w}^{T} x_i}\right)$$

where $(x_i, y_i)$ is the $i$th training sample and $\mathbf{w}$ is the feature weight vector. $C^{+}$ and $C^{-}$ are the penalties for constraint violations on the positive and negative classes respectively, and can be used to handle the imbalanced data problem. A larger $C^{-}$ can be set to prevent the classifier from tending to recognize incorrect arcs as correct ones (the majority class).
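As a rough illustration of how the asymmetric penalties can be realized in practice, here is a minimal sketch using scikit-learn's liblinear-backed logistic regression. The synthetic data, the class-weight encoding of $C^{+}/C^{-}$, and the concrete values (taken from the tuned settings reported in Section 4.4) are assumptions of this sketch, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the RADAR training data: feature vectors for
# dependency arcs and labels (+1 = correct arc, -1 = incorrect arc),
# with roughly the 4:1 imbalance observed on the Chinese data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.where(rng.random(1000) < 0.8, 1, -1)

clf = LogisticRegression(
    penalty="l2",
    solver="liblinear",
    C=1.0,                           # plays the role of C+
    class_weight={1: 1.0, -1: 1.4},  # scales C per class, so the
                                     # effective C- = 1.4 > C+ makes it
                                     # costlier to mislabel incorrect arcs
)
clf.fit(X, y)

# The predicted probability of the positive class is the per-arc
# reliability score used later to rank arcs.
pos_col = list(clf.classes_).index(1)
reliability = clf.predict_proba(X)[:, pos_col]
```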
3.2.2. Features

Besides the classifier algorithm, features are also important for a classification system. To predict whether a dependency arc is correct, we define three sorts of features; their details and intuitions are given below.

1. Text-based features relate only to the original text of a sentence.
   - The length of the sentence: longer sentences are harder to parse correctly than shorter ones. We group the lengths into three buckets: LS (long sentence) for length 40+, MS (middle sentence) for 16-40, and SS (short sentence) for 1-15. The length thresholds are tuned on the development data.
   - Number of unknown words: words that never appear in the training set are unknown words. Sentences containing more unknown words are less likely to be parsed correctly. Since the number of unknown words usually lies in a small interval, we did not group these numbers into buckets; each number is assigned its own bucket. Note that these two features have the same value for all words in the same sentence.
   - Unknown word property: a boolean feature representing whether the current word is an unknown word. Parser performance is worse on unknown words.
   - Current word: $w_m$ (the word at position $m$).
   - Bigrams of POS tags: $t_{m-1}t_m$ (the POS tags at positions $m-1$ and $m$) and $t_m t_{m+1}$.
   - Trigram of POS tags: $t_{m-1}t_m t_{m+1}$.

2. Parser-based features are extracted from the parsing results.
   - The length of the dependency arc: the number of words between $w_m$ and $w_h$ (the head word). As with the text-based sentence-length feature, we also attempted to group this feature into buckets, but it did not help, so each length is assigned its own bucket.
   - Word collocation: $w_m w_h$ (the current word and its head word).
   - POS tag collocation: $t_m t_h$ (the POS tags of the current word and its head word).
   - Dependency relation: $l_m$ (the dependency relation label between the current word and its head word). We also considered the direction of the dependency arc with respect to the word, but it brought no improvement.
   - Combination of POS tag and dependency relation: $t_m l_m$.
   - The InBetween POS features (POS trigrams): $t_m t_b t_h$ (the POS of the current word, of its parent, and of a word in between).
   - The Surrounding POS features (POS 4-grams): $t_{m-1} t_m t_h t_{h-1}$, $t_{m-1} t_m t_h t_{h+1}$, $t_{m+1} t_m t_h t_{h-1}$, $t_{m+1} t_m t_h t_{h+1}$.
   - Combined features: combinations of some of the features above, which can prove very helpful.

3. Comparison among parsers.
   - Agreement of different parsers: intuitively, if two parsers based on different learning schemes agree on a dependency arc, the arc is probably reliable. In this paper, we use a transition-based parser (Nivre, 2006) as the reference parser. If the reference parser produces the same analysis (including the head and dependency label) for the current word as the basic parser, this feature is set to TRUE; otherwise FALSE.

3.3. Training process

This section introduces how to train a RADAR model. Fig. 2 shows the training process flow. The core problem is how to construct a reliable-arc training corpus. Here, an n-fold validation method is used. At step 1, the traditional dependency parsing training corpus was divided into n folds. Each time, n-1 folds were selected to train a dependency parser with the graph-based model. Then, at step 2, the remaining fold was parsed with this parser, and the obtained automatic dependency arcs were split into positive (correct) and negative (incorrect) classes by comparison with the gold parse trees. The above steps were repeated n times, and the RADAR training corpus was constructed at step 3. Finally, at step 4, the reliable arc classifier was trained on this corpus.

Fig. 2. The process flow of training a reliable arc classifier.
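A minimal sketch of this n-fold (jackknifing) construction follows; `train_parser`, `parse`, and the sentence/arc attributes are hypothetical stand-ins for the graph-based parser's interface, not a real API.

```python
def build_radar_corpus(treebank, n_folds, train_parser, parse):
    """Construct the RADAR training corpus by n-fold validation.

    `treebank` is a list of gold-parsed sentences; each sentence is
    assumed to expose its words as `.words` and its gold arcs as
    `.arcs` (a set of (head, modifier, label) triples).
    """
    folds = [treebank[i::n_folds] for i in range(n_folds)]
    samples = []
    for i in range(n_folds):
        # Steps 1-2: train a parser on n-1 folds, parse the held-out fold.
        train = [sent for j, fold in enumerate(folds) if j != i
                 for sent in fold]
        model = train_parser(train)
        for gold in folds[i]:
            auto = parse(model, gold.words)
            # Step 3: label each automatic arc by comparison with gold.
            for arc in auto.arcs:
                label = +1 if arc in gold.arcs else -1
                samples.append((gold, arc, label))
    # Step 4 trains the reliable arc classifier on these samples.
    return samples
```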
4. Experiments

4.1. Data set

To evaluate the effectiveness of our approach, we conducted experiments on both English and Chinese data. For English, we used the data of the CoNLL 2009 Shared Task on syntactic dependency parsing (Hajič et al., 2009a). However, for sentences longer than 60 words, the 1000-best parsing process became extremely slow and memory-consuming, so we discarded those sentences in our experiments. For Chinese, we used the Chinese Dependency Treebank (CDT) as the experimental data set (Liu, Ma, & Li, 2006). CDT consists of 60,000 sentences from the People's Daily in the 1990s. The third and fourth columns of Table 1 show the numbers of sentences and dependency arcs (excluding punctuation) in the training, development, and test sets.

Table 1. Number of sentences and dependency arcs in the training, development, and test sets, and the corresponding parsing performance.

Lang    | Data set | #{sent} | #{arcs}   | LAS (%)
Chinese | Train    | 55,496  | 1,025,054 | 81.37
Chinese | Dev      | 1,500   | 29,091    | 81.19
Chinese | Test     | 3,000   | 56,786    | 81.43
English | Train    | 39,060  | 837,340   | 89.72
English | Dev      | 1,325   | 29,024    | 88.36
English | Test     | 2,389   | 50,291    | 90.03

4.2. Dependency parser and classifier

An open-source NLP toolkit, mate-tools (http://code.google.com/p/mate-tools/), was selected as the dependency parser; it implements a fast, state-of-the-art graph-based dependency model (Bohnet, 2010). We modified the source code of mate-tools to output the k-best results for each sentence. The probability of each dependency arc was then calculated with the method introduced in Section 3.1. The last column of Table 1 shows the performance of mate-tools on the CoNLL and CDT data sets. Note that the training data are constructed with the fourfold validation described in Section 3.3, and the LAS on the training data is the average over the folds. The parser for the development and test data was trained on the whole training set. As the LAS values indicate, the positive examples outnumber the negative examples by about 8-9 times for English and 4-5 times for Chinese, which implies an imbalance.

The maltparser (Nivre et al., 2007; http://maltparser.org/) with default parameter settings was used as the transition-based reference parser to obtain the Agreement feature. The liblinear toolkit (Fan, Chang, Hsieh, Wang, & Lin, 2008; http://www.csie.ntu.edu.tw/~cjlin/liblinear/) was used as the RADAR classifier; it implements the L2-regularized logistic regression algorithm with parameters C+ and C-.

4.3. Evaluation method

Accuracy is perhaps the most commonly used evaluation measure for a classification problem. However, for imbalanced data, accuracy can yield misleading conclusions. For example, in a binary classification problem with a 90:10 class distribution, accuracy easily reaches 90% if we simply assign all samples to the majority class. In practical applications, it is strongly desired that most of the arcs used are reliable (high precision). However, if a system only cares about the precision of reliable arcs, the number of recognized reliable arcs can be severely restricted. This leads to a low recall of reliable arcs, which is useless in practical applications. The F measure is another common evaluation method that considers precision and recall at the same time. However, it is not a proper evaluation here either: as with accuracy in the previous example, the minority class has much less impact on the F score than the majority class.

In practice, the top N most reliable dependency arcs are the important ones. This can be evaluated with the information retrieval measure P@N. The value of N differs across applications: for a search engine, N is usually set to 10 or 20, because users typically check only the top results. In the RADAR problem, N should be larger, and it is difficult to fix one N for all applications.

For imbalanced data, the ROC curve is usually regarded as a suitable evaluation metric. However, it has been suggested that PR curves give a more informative picture of the performance of a learning algorithm (Davis & Goadrich, 2006). Therefore, in the current experiments, the PR curve is used as another evaluation method. In addition, we calculate the Area Under both the ROC Curve and the PR Curve to give a quantitative comparison; larger AUC values indicate better performance. The AUCCalculator-0.2 tool (http://mark.goadrich.com/programs/AUC/) was used to calculate AUC-PR and AUC-ROC.
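For concreteness, a small sketch of these measures follows. scikit-learn is used here as a stand-in for the AUCCalculator tool, and the score/label arrays are assumed inputs (classifier probabilities and ±1 gold labels), not the paper's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

def precision_at_n(scores, labels, n):
    """P@N: precision among the n arcs ranked as most reliable."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    top = np.argsort(scores)[::-1][:n]  # indices of the n highest scores
    return float(np.mean(labels[top] == 1))

def auc_measures(scores, labels):
    """AUC-PR and AUC-ROC for per-arc reliability scores.

    `scores` are classifier probabilities of the positive class;
    `labels` are +1 (correct arc) or -1 (incorrect arc).
    """
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision), roc_auc_score(labels, scores)
```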
4.4. Contribution of features

To evaluate the contribution of the different sorts of features described in Section 3.2.2, we first build a classifier with all three sorts of features. We then remove each sort of features in turn and observe the drop in performance: the larger the drop a removed sort causes, the more important that sort of features is. Contributions are evaluated on the development sets of the English and Chinese data. Table 2 shows the results. Note that, to achieve the optimal AUC-PR, the parameters of the logistic regression classifier are tuned on the development data; the final settings are C+ = 1 and C- = 1.4. Since C- > C+, negative examples are prevented from being too easily assigned to the positive class.

Table 2. The contribution of each sort of features on the development data of CoNLL2009-en and CDT.

Lang    | Feature sort  | AUC-PR (%) | Decrease
Chinese | All features  | 94.89      | N/A
Chinese | -Text-based   | 94.81      | 0.07%
Chinese | -Parser-based | 93.06      | 1.75%
Chinese | -Comparison   | 93.90      | 0.99%
English | All features  | 97.97      | N/A
English | -Text-based   | 97.95      | 0.02%
English | -Parser-based | 97.35      | 0.62%
English | -Comparison   | 97.73      | 0.24%

From Table 2 we can see that all three sorts of features are useful for RADAR, because removing any sort decreases performance. Among the three, the parser-based features contribute the most and the text-based features are the least important. The reason is probably that text-based features do not look at the full parse tree; they only estimate the difficulty of correctly parsing a sentence or a modifier, which is far from enough to decide the reliability of an arc. Moreover, some text-based features, such as the sentence length and the unknown word count, have the same value for all words in a sentence and are consequently less discriminative. The comparison features are also helpful.

4.5. Results

Table 3 shows the AUC-PR and AUC-ROC measurements on the test data. For both English and Chinese, the classification method outperforms the probabilistic baseline system.

Table 3. AUC-PR and AUC-ROC measurements on the test data.

Lang    | Classification AUC-PR (%) | Classification AUC-ROC (%) | Probabilistic AUC-PR (%) | Probabilistic AUC-ROC (%)
Chinese | 94.65 | 81.42 | 93.59 | 79.38
English | 98.56 | 89.29 | 97.68 | 85.50

To understand the improvement better and more intuitively, we provide the P@N curves for different N in Fig. 3 and the Precision-Recall (PR) curves in Fig. 4.

Fig. 3. P@N curves on the test data set.
Fig. 4. PR curves on the test data.

The figures show that, across all values of Recall/N, the classification method is consistently better than the probabilistic baseline, especially when N is small or Recall is low. Based on our analysis, there are two reasons for this. First, the k-best approach to estimating the probability of an arc is approximate; it depends heavily on the choice of k, and although a 1000-best estimation is good, it is not ideal. Second, the classification method can exploit extra information (features) such as the parsers' agreement, as well as more powerful tools (classifiers), which help it beat the probabilistic method. We also note that k-best decoding in graph-based parsing comes at a high cost in both memory and time, especially for large k (e.g., k = 500 or 1000). The classification approach is thus much more efficient and flexible than the probabilistic approach.

5. Conclusion

This paper proposes a novel natural language processing task, RADAR (ReliAble Dependency Arc Recognition). The performance of practical applications, such as information extraction and question answering, can potentially be improved further if reliable arcs are recognized correctly, rather than only recognizing reliable parse trees of whole sentences.

We model RADAR as a binary classification problem with imbalanced data.
Various features and classifiers can then be used flexibly to achieve better performance. In this paper, we design three sorts of features to express the reliability of arcs. A logistic regression model with adjustable C parameters was used as the classifier, which can classify large-scale data with high speed and good performance.

Unlike normal binary classification problems, RADAR cannot be evaluated with accuracy or the F measure because of the data imbalance. Instead, the P@N curve and the PR curve, together with the associated areas under the curves (AUC-PR, AUC-ROC), evaluate RADAR systems more suitably.

The experimental results show that the classification method with appropriate features outperforms the arc probabilistic baseline method. In addition, we evaluated the contributions of the different sorts of features.

6. Future work

In the future, this work can be extended in the following ways:

1. Improve the performance of RADAR along two axes, classifiers and features. Besides logistic regression, more classifiers can be compared; they should not only be fast and accurate but also output probabilities. RADAR can also be regarded as a ranking problem and approached with various learning-to-rank algorithms. Beyond the classification method itself, more combinatorial and global features can be incorporated, such as the reliabilities of the arcs surrounding the current one; the model would then no longer be a simple binary classifier. In addition, global features can be counted from large-scale unlabeled data, such as the number of automatically parsed arcs, or the Google Web 1T n-gram counts.

2. Apply RADAR to applications, such as entity relation extraction over a web corpus. These applications may need to recognize larger reliable sub-structures, such as parse paths. However, the reliabilities of the corresponding arcs cannot simply be accumulated, so recognizing reliable sub-structures becomes an interesting problem in itself.

3. Apply our approach to a transition-based parser, to check whether it produces similar results.

4. Address the domain adaptation problem, i.e., how to adapt a reliability model trained on one domain to another domain or even open domains. This is also an open question for natural language processing in general.

Acknowledgment

We gratefully acknowledge the support of the National Natural Science Foundation of China (NSFC) via Grants 61133012 and 61370164, and of the National Basic Research Program (973 Program) of China via Grant 2014CB340503.

References

Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38(1), 135-187.

Atserias, J., Attardi, G., Simi, M., & Zaragoza, H. (2010). Active learning for building a corpus of questions for parsing. In LREC. European Language Resources Association.

Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07) (pp. 2670-2676). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Bishop, C. M. (2006). Pattern recognition and machine learning (Information science and statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Bohnet, B. (2010).
Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010) (pp. 89-97). Beijing, China: Coling 2010 Organizing Committee.

Chen, W., Kawahara, D., Uchimoto, K., & Zhang, Y. (2008). Dependency parsing with short dependency relations in unlabeled data. In IJCNLP 2008.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP '09) (Vol. 2, p. 570). Morristown, NJ, USA: Association for Computational Linguistics.

Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06) (pp. 233-240). New York, NY, USA: ACM.

Dell'Orletta, F., & Venturi, G. (2011). ULISSE: An unsupervised algorithm for detecting reliable dependency parses. In CoNLL 2011 (pp. 115-124).

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871-1874.

Gildea, D., & Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28, 245-288.

Goldwasser, D., Reichart, R., Clarke, J., & Roth, D. (2011). Confidence driven unsupervised semantic parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 1486-1495). Portland, Oregon, USA: Association for Computational Linguistics.

Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M. A., Màrquez, L., et al. (2009a). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task (pp. 1-18). Boulder, Colorado: Association for Computational Linguistics.

Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M. A., Màrquez, L., Meyers, A., Nivre, J., Padó, S., Štěpánek, J., Straňák, P., Surdeanu, M., Xue, N., & Zhang, Y. (2009b). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In CoNLL-2009.

Johansson, R., & Nugues, P. (2007). Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA 2007, Tartu, Estonia.

Kawahara, D., & Kurohashi, S. (2010). Acquiring reliable predicate-argument structures from raw corpora for case frame compilation. In LREC 2010 (pp. 1389-1393).

Kawahara, D., & Uchimoto, K. (2008). Learning reliability of parses for domain adaptation of dependency parsing. In IJCNLP 2008 (pp. 709-714).
Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., & Tsujii, J. (2009). Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task (BioNLP '09) (pp. 1-9). Stroudsburg, PA, USA: Association for Computational Linguistics.

Koo, T., Globerson, A., Carreras, X., & Collins, M. (2007). Structured prediction models via the matrix-tree theorem. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 141-150). Prague, Czech Republic: Association for Computational Linguistics.

Kübler, S., McDonald, R. T., & Nivre, J. (2009). Dependency parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Liu, T., Ma, J., & Li, S. (2006). Building a dependency treebank for improving Chinese parser. Journal of Chinese Language and Computing, 16, 207-224.

Meena, A., & Prabhakar, T. V. (2007). Sentence level sentiment analysis in the presence of conjuncts using linguistic analysis. In Proceedings of the 29th European Conference on IR Research (ECIR'07) (pp. 573-580). Berlin, Heidelberg: Springer-Verlag.

Mejer, A., & Crammer, K. (2012). Are you sure? Confidence in prediction of dependency tree edges. In NAACL-HLT 2012.

Nivre, J. (2006). Inductive dependency parsing (Text, Speech and Language Technology). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., et al. (2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2), 95-135.

Ravi, S., Knight, K., & Soricut, R. (2008). Automatic prediction of parser accuracy. In EMNLP 2008 (pp. 887-896). Morristown, NJ, USA: Association for Computational Linguistics.
Reichart, R., & Rappoport, A. (2007). An ensemble method for selection of high quality parses. In ACL 2007 (pp. 408-415).

Sag, I. A., Baldwin, T., Bond, F., Copestake, A. A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing '02) (pp. 1-15). London, UK: Springer-Verlag.

Settles, B. (2010). Active learning literature survey. Tech. rep., University of Wisconsin-Madison.

van Noord, G. (2007). Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT '07) (pp. 1-10). Morristown, NJ, USA: Association for Computational Linguistics.

Wu, F., & Weld, D. S. (2010). Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10) (pp. 118-127). Stroudsburg, PA, USA: Association for Computational Linguistics.

Yates, A., Schoenmackers, S., & Etzioni, O. (2006). Detecting parser errors using web-based semantic filters. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06) (p. 27). Morristown, NJ, USA: Association for Computational Linguistics.

Zhang, M., Zhang, J., Su, J., & Zhou, G. (2006). A composite kernel to extract relations between entities with both flat and structured features. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL-44) (pp. 825-832). Stroudsburg, PA, USA: Association for Computational Linguistics.