Domain Adaptation for Syntactic and Semantic Dependency Parsing Using Deep Belief Networks

Haitong Yang, Tao Zhuang and Chengqing Zong
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
{htyang, tao.zhuang, cqzong}@nlpr.ia.ac.cn

Abstract

In current systems for syntactic and semantic dependency parsing, people usually define a very high-dimensional feature space to achieve good performance. But these systems often suffer severe performance drops on out-of-domain test data due to the diversity of features across domains. This paper focuses on how to relieve this domain adaptation problem with the help of unlabeled target domain data. We propose a deep learning method to adapt both syntactic and semantic parsers. With additional unlabeled target domain data, our method can learn a latent feature representation (LFR) that is beneficial to both domains. Experiments on English data in the CoNLL 2009 shared task show that our method largely reduces the performance drop on out-of-domain test data. Moreover, we obtain a Macro F1 score that is 2.32 points higher than the best system in the CoNLL 2009 shared task in out-of-domain tests.

1 Introduction

Both syntactic and semantic dependency parsing are standard tasks in the NLP community. State-of-the-art models perform well if the test data come from the same domain as the training data. But if the test data come from a different domain, performance drops severely. The results of the CoNLL 2008 and 2009 shared tasks (Surdeanu et al., 2008; Hajič et al., 2009) also substantiate this observation. To alleviate this domain adaptation problem, in this paper we propose a deep learning method for both syntactic and semantic parsers. We focus on the situation in which, besides source domain training data and target domain test data, we also have some unlabeled target domain data.

Many syntactic and semantic parsers are developed using a supervised learning paradigm, where each data sample is represented as a vector of features, usually a very high-dimensional one. The performance degradation on target domain test data is mainly caused by the diversity of features across domains, i.e., many features in the target domain test data are never seen in the source domain training data. Previous work has shown that replacing sparse lexicalized features with word clusters (Koo et al., 2008; Turian et al., 2010) helps relieve the performance degradation on the target domain. But for syntactic and semantic parsing, people also use many syntactic features, i.e., features extracted from syntactic trees. For example, the relation path between a predicate and an argument is a syntactic feature used in semantic dependency parsing (Johansson and Nugues, 2008). Figure 1 shows an example of this relation path feature. Obviously, syntactic features like this are also very sparse and usually specific to each domain. Clustering methods fail to generalize these kinds of features.

Our method, however, is very different from clustering specific features and substituting their clusters for them. Instead, we attack the domain adaptation problem by learning a latent feature representation (LFR) for different domains, which is similar to Titov (2011). Formally, we propose a Deep Belief Network (DBN) model to represent a data sample using a vector of latent features.
This latent feature vector is inferred by our DBN model based on the data sample's original feature vector. Our DBN model is trained in an unsupervised manner on the original feature vectors of data from both domains: training data from the source domain, and unlabeled data from the target domain. So our DBN model can produce a common feature representation for data from both domains. A common feature representation can make two domains more similar and is thus very helpful for domain adaptation (Blitzer, 2006). Discriminative models using our latent features adapt better to the target domain than models using original features.

[Figure 1: A path feature example. The red edges are the path between She and visit, and thus the relation path feature between them is SBJ↑OPRD↓IM↓OBJ↓.]

Discriminative models in syntactic and semantic parsers usually use millions of features. Applying a typical DBN to learn a sensible LFR on that many original features is computationally too expensive and impractical (Raina et al., 2009). Therefore, we constrain the DBN by splitting the original features into groups. In this way, we largely reduce the computational cost and make LFR learning practical.

We carried out experiments on the English data of the CoNLL 2009 shared task. We use a basic pipelined system and compare the effectiveness of two feature representations: the original feature representation and our LFR. Using the original features, the performance drop on out-of-domain test data is 10.58 points in Macro F1 score. In contrast, using the LFR, the performance drop is only 4.97 points. And we have achieved a Macro F1 score of 80.83% on the out-of-domain test data. As far as we know, this is the best result on this data set to date.

2 Related Work

Dependency parsing and semantic role labeling are two standard tasks in the NLP community. There has been much work on these two tasks (McDonald et al., 2005; Gildea and Jurafsky, 2002; Yang and Zong, 2014; Zhuang and Zong, 2010a; Zhuang and Zong, 2010b, among others). Among them, research on domain adaptation for dependency parsing and SRL is directly related to our work. Dredze et al. (2007) show that domain adaptation is hard for dependency parsing, based on results in the CoNLL 2007 shared task (Nivre et al., 2007). Chen et al. (2008) adapted a syntactic dependency parser by learning reliable information on shorter dependencies in unlabeled target domain data. But they do not consider the task of semantic dependency parsing. Huang et al. (2010) used an HMM-based latent variable language model to adapt an SRL system. Their method is tailored for a chunking-based SRL system and can hardly be applied to our dependency-based task. Weston et al. (2008) used deep neural networks to improve an SRL system. But their tests are on in-domain data.

In terms of methodology, the work in Glorot et al. (2011) and Titov (2011) is closely related to ours. They also focus on learning LFRs for domain adaptation. However, their work deals with domain adaptation for sentiment classification, which uses far fewer features and training samples. So they do not need to worry about computational cost as much as we do.
Titov (2011) used a graphical model that has only one layer of hidden variables. In contrast, we need to use a model with two layers of hidden variables and split the first hidden layer to reduce computational cost. The model of Titov (2011) also embodies a specific classifier, whereas our model is independent of the classifier to be used. Glorot et al. (2011) used a model called Stacked Denoising Auto-Encoders, which also contains multiple hidden layers. However, they do not exploit the hierarchical structure of their model to reduce computational cost. By splitting, our model contains far fewer parameters than theirs. In fact, the models in Glorot et al. (2011) and Titov (2011) cannot be applied to our task simply because of the high computational cost.

3 Our DBN Model for LFR

In discriminative models, each data sample is represented as a vector of features. Our DBN model maps this original feature vector to a vector of latent features. And we use this latent feature vector to represent the sample, i.e., we replace the whole original feature vector by the latent feature vector. In this section, we introduce how our DBN model represents a data sample as a vector of latent features. Before introducing our DBN model, we first review a simpler model called the Restricted Boltzmann Machine (RBM) (Hinton et al., 2006). RBMs are used as basic building blocks when training a DBN model.

3.1 Restricted Boltzmann Machines

An RBM is an undirected graphical model with a layer of visible variables v = (v_1, ..., v_m) and a layer of hidden variables h = (h_1, ..., h_n). These variables are binary. Figure 2 shows a graphical representation of an RBM.

[Figure 2: Graphical representations of an RBM: (a) represents an RBM; (b) is a more compact representation.]

The parameters of an RBM are θ = (W, a, b), where W = (W_{ij})_{m×n} is a matrix with W_{ij} being the weight for the edge between v_i and h_j, and a = (a_1, ..., a_m), b = (b_1, ..., b_n) are bias vectors for v and h respectively. The probabilistic model of an RBM is:

p(\mathbf{v}, \mathbf{h} \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(\mathbf{v}, \mathbf{h}))    (1)

where

E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{ij} h_j,
\qquad
Z(\theta) = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h})).

Because the connections in an RBM are only between visible and hidden variables, the conditional distribution over a hidden or a visible variable is quite simple:

p(h_j = 1 \mid \mathbf{v}) = \sigma\left(b_j + \sum_{i=1}^{m} v_i w_{ij}\right)    (2)

p(v_i = 1 \mid \mathbf{h}) = \sigma\left(a_i + \sum_{j=1}^{n} h_j w_{ij}\right)    (3)

where σ(x) = 1/(1 + exp(−x)) is the logistic sigmoid function. An RBM can be efficiently trained on a sequence of visible vectors using the Contrastive Divergence method (Hinton, 2002).

3.2 The Problem of Large Scale

In our syntactic and semantic parsing tasks, all features are binary. So each data sample (a parser action in syntactic parsing or an argument candidate in semantic parsing) is represented as a binary feature vector. By treating a sample's feature vector as the visible vector of an RBM, and taking the hidden variables as latent features, we could get the LFR of this sample using the RBM. However, for our syntactic and semantic parsing tasks, training such an RBM is computationally impractical, for the following reasons. Let m and n denote respectively the number of visible and hidden variables in the RBM. Then there are O(mn) parameters in this RBM. If we train the RBM on d samples, the time complexity of Contrastive Divergence training is O(mnd). For syntactic or semantic parsing, there are over 1 million unique binary features, and millions of training samples. That means both m and d are on the order of 10^6. With m and d that large, n should not be chosen too small if we want a sensible LFR (Hinton, 2010). Our experience indicates that n should be at least on the order of 10^3. Now we see why the O(mnd) complexity is prohibitive for our task.
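To make equations (1)-(3) and the cost of Contrastive Divergence training concrete, below is a minimal illustrative sketch of a binary RBM trained with one-step Contrastive Divergence (CD-1). It is not the implementation used in this paper; the class name, the CD-1 variant, the dense matrix layout, and the learning rate are our own assumptions (momentum and weight decay are omitted).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary RBM with weights W (m x n) and biases a (visible), b (hidden)."""
    def __init__(self, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((m, n))
        self.a = np.zeros(m)   # visible biases
        self.b = np.zeros(n)   # hidden biases
        self.rng = rng

    def hidden_probs(self, v):
        # Equation (2): p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        # Equation (3): p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij)
        return sigmoid(h @ self.W.T + self.a)

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step on a mini-batch v0 of binary vectors (batch x m)."""
        ph0 = self.hidden_probs(v0)                          # positive phase
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.visible_probs(h0)                          # one reconstruction step
        ph1 = self.hidden_probs(v1)                          # negative phase
        batch = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch     # touches all m*n weights
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (ph0 - ph1).mean(axis=0)
```

Each call to cd1_update touches every entry of W, so a full pass over d samples costs O(mnd) time, which is exactly why a single RBM over roughly 10^6 original features is impractical here.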
3.3 Our DBN Model

A DBN is a probabilistic generative model that is composed of multiple layers of stochastic, latent variables (Hinton et al., 2006). The motivation for using a DBN is twofold. First, previous research has shown that a deep network can capture high-level correlations between visible variables better than an RBM (Bengio, 2009). Second, as shown in the preceding subsection, the large scale of our task poses a great challenge for learning an LFR. By manipulating the hierarchical structure of a DBN, we can significantly reduce the number of parameters in the DBN model. This largely reduces the computational cost of training the DBN. Without this technique, it is impractical to learn a DBN model with that many parameters on large training sets.

[Figure 3: Our DBN model. The blue nodes stand for the visible variables (v) and the blank nodes stand for the hidden variables (h1 and h2). The same symbols are used in the figures of the following subsections.]

As shown in Figure 3, our DBN model contains two layers of hidden variables, h1 and h2, and a visible vector v. The visible vector corresponds to a sample's original feature vector. The second-layer hidden variable vector h2 is used as the LFR of this sample. Suppose there are m, n1, and n2 variables in v, h1, and h2, respectively. To reduce the number of parameters in the DBN, we split its first layer (h1-v) into k groups, as we will explain in the following subsection. We confine the connections in this layer to variables within the same group. So there are only mn1/k parameters in the first layer. Without splitting, the number of parameters would be mn1, and learning that many parameters would require too much computation. By splitting, we reduce the number of parameters by a factor of k. If we choose k large enough, learning is feasible. The second layer (h2-h1) is fully connected, so that the variables in the second layer can capture the relations between variables in different groups of the first layer. There are n1n2 parameters in the second layer. Because n1 and n2 are relatively small, learning the parameters in the second layer is also feasible.

In summary, by splitting the first layer into groups, we have largely reduced the number of parameters in our DBN model. This makes learning our DBN model practical for our task. In our task, the visible variables correspond to original binary features and the second-layer hidden variables are used as the LFR of these original features. One deficiency of splitting is that the relationships between original features in different groups cannot be captured by hidden variables in the first layer. However, this deficiency is compensated for by using the second layer to capture relationships between all variables in the first layer. In this way, the second layer still captures the relationships between all original features indirectly.
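As a rough illustration of this reduction, take the layer sizes reported later for the syntactic task in Section 5.1.2 (m = 748,598, n1 = 7,486, n2 = 3,743, k = 200); the arithmetic below is ours and is meant only to show the orders of magnitude involved.

```latex
% Syntactic task (Section 5.1.2): m = 748{,}598,\; n_1 = 7{,}486,\; n_2 = 3{,}743,\; k = 200.
\text{unsplit first layer: } m\, n_1 \approx 5.6 \times 10^{9}
\qquad
\text{split first layer: } \frac{m\, n_1}{k} \approx 2.8 \times 10^{7}
\qquad
\text{second layer: } n_1\, n_2 \approx 2.8 \times 10^{7}
```

Splitting thus cuts the first-layer parameter count by two orders of magnitude and leaves both layers at roughly the same, tractable size.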
3.3.1 Splitting Features into Groups

When we split the first layer into k groups, every group except the last one contains ⌊m/k⌋ visible variables and ⌊n1/k⌋ hidden variables. The last group contains the remaining visible and hidden variables. But how should we split the visible variables, i.e., the original features, into these groups? Of course there are many ways to split the original features, and it is difficult to find a good principle for splitting. So we tried two splitting strategies in this paper.

The first strategy is very simple. We arrange all features in the order in which they appear in the training data. Suppose each group contains r original features. We just put the first r unique features of the training data into the first group, the following r unique features into the second group, and so on.

The second strategy is more sophisticated. All features can be divided into three categories: common features, source-specific features, and target-specific features. The main idea is to make each group contain the three categories of features evenly, which we think makes the distribution of features close to the 'true' distribution over domains. Let Fs and Ft denote the sets of features that appear on source and target domain data respectively. We collect Fs and Ft from our training data. The features in Fs and Ft are ordered as they appear in the training data. Let Fs∩t = Fs ∩ Ft (the common features), Fs\t = Fs \ Ft (the source-specific features), and Ft\s = Ft \ Fs (the target-specific features). To distribute the features in Fs∩t, Fs\t and Ft\s evenly, each group should consist of |Fs∩t|/k, |Fs\t|/k and |Ft\s|/k features from Fs∩t, Fs\t and Ft\s respectively. Therefore, we put the first |Fs∩t|/k features from Fs∩t, the first |Fs\t|/k features from Fs\t and the first |Ft\s|/k features from Ft\s into the first group. Similarly, we put the second |Fs∩t|/k features from Fs∩t, the second |Fs\t|/k features from Fs\t and the second |Ft\s|/k features from Ft\s into the second group, and so on. The intuition behind this strategy is to let the features in Fs∩t act as pivot features that link the features in Fs\t and Ft\s in each group. In this way, the first hidden layer might capture better relationships between features from the source and target domains.
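A minimal sketch of this second splitting strategy is shown below. The function name, the representation of features as identifiers in first-appearance order, and the handling of the remainder in the last group are our own assumptions; the paper only specifies the even distribution of the three categories.

```python
def split_into_groups(src_feats, tgt_feats, k):
    """Distribute common, source-specific and target-specific features evenly
    over k groups, preserving first-appearance order (strategy 's2')."""
    src_set, tgt_set = set(src_feats), set(tgt_feats)
    common   = [f for f in src_feats if f in tgt_set]            # F_{s∩t}
    src_only = [f for f in src_feats if f not in tgt_set]        # F_{s\t}
    tgt_only = [f for f in tgt_feats if f not in src_set]        # F_{t\s}

    groups = [[] for _ in range(k)]
    for category in (common, src_only, tgt_only):
        size = len(category) // k
        for g in range(k):
            lo = g * size
            hi = (g + 1) * size if g < k - 1 else len(category)  # last group takes the remainder
            groups[g].extend(category[lo:hi])
    return groups
```

With this partition, every group mixes pivot-like common features with source- and target-specific ones, matching the intuition described above.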
3.3.2 LFR of a Sample

Given a sample represented as a vector of original features, our DBN model will represent it as a vector of latent features. The sample's original feature vector corresponds to the visible vector v of our DBN model in Figure 3. Our DBN model uses the second-layer hidden variable vector h2 to represent this sample. Therefore, we must infer the values of the second-layer hidden variables given the visible vector. This inference can be done using the methods in Hinton et al. (2006). Given the visible vector, the values of the hidden variables in every layer can be efficiently inferred in a single, bottom-up pass.

3.4 Training Our DBN Model

Inference in a DBN is simple and fast. Nonetheless, training a DBN is more complicated. A DBN can be trained in two stages: greedy layer-wise pretraining and fine-tuning (Hinton et al., 2006).

3.4.1 Greedy Layer-wise Pretraining

In this stage, the DBN is treated as a stack of RBMs, as shown in Figure 4. The second layer is treated as a single RBM. The first layer is treated as k parallel RBMs, with each group being one RBM. These k RBMs are parallel because their visible variable vectors constitute a partition of the original feature vector. In this stage, we train these constituent RBMs in a bottom-up, layer-wise manner. To learn the parameters of the first layer, we only need to learn the parameters of each RBM in the first layer. With the original feature vector v given, these k RBMs can be trained using the Contrastive Divergence method (Hinton, 2002).

[Figure 4: Stack of RBMs in pretraining.]

After the first layer is trained, we fix the parameters in the first layer and start to train the second layer. For the RBM of the second layer, its visible variables are the hidden variables of the first layer. Given an original feature vector v, we first infer the activation probabilities of the hidden variables in the first layer using equation (2). We use these activation probabilities as the values of the visible variables in the second-layer RBM. Then we train the second-layer RBM using the Contrastive Divergence algorithm. Note that the activation probabilities are not binary values. But this is only a trick for training, because using probabilities generally produces better models (Hinton et al., 2006). This trick does not change our assumption that each variable is binary.

3.4.2 Fine Tuning

The greedy layer-wise pretraining initializes the parameters of our DBN to sensible values. But these values are not optimal, and the parameters need to be fine-tuned. For fine-tuning, we unroll the DBN to form an autoencoder as in Hinton and Salakhutdinov (2006), which is shown in Figure 5.

[Figure 5: Unrolling the DBN.]

In this autoencoder, the stochastic activities of the binary hidden variables are replaced by their activation probabilities. So the autoencoder is in essence a feed-forward neural network. We tune the parameters of our DBN model on this autoencoder using the backpropagation algorithm.
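Putting Sections 3.3, 3.3.2 and 3.4.1 together, the following sketch shows how the grouped first layer and the fully connected second layer could be pretrained and then used to infer the LFR. It reuses the illustrative RBM class from Section 3.1; the class name, the full-batch CD-1 loops, and the assumption that n1 is divisible by k are ours, and the fine-tuning stage of Section 3.4.2 is not shown.

```python
class GroupedDBN:
    def __init__(self, groups, n1, n2):
        """groups: list of k index lists partitioning the original features;
        n1, n2: sizes of the first and second hidden layers (n1 assumed divisible by k)."""
        self.groups = groups
        k = len(groups)
        # first layer: one small RBM per feature group
        self.layer1 = [RBM(len(idx), n1 // k) for idx in groups]
        # second layer: fully connected RBM over all first-layer hidden units
        self.layer2 = RBM((n1 // k) * k, n2)

    def first_layer_probs(self, data):
        # concatenate the hidden activation probabilities of every group RBM (equation (2))
        return np.hstack([rbm.hidden_probs(data[:, idx])
                          for rbm, idx in zip(self.layer1, self.groups)])

    def pretrain(self, data, epochs=30, lr=0.3):
        """Greedy layer-wise pretraining on a (samples x m) binary matrix."""
        for rbm, idx in zip(self.layer1, self.groups):   # the k parallel first-layer RBMs
            for _ in range(epochs):
                rbm.cd1_update(data[:, idx], lr=lr)
        h1 = self.first_layer_probs(data)                # probabilities, not binary samples
        for _ in range(epochs):                          # second-layer RBM trained on h1
            self.layer2.cd1_update(h1, lr=lr)

    def lfr(self, data):
        """Single bottom-up pass producing the latent feature representation h2."""
        return self.layer2.hidden_probs(self.first_layer_probs(data))
```

In the paper's setting the k group RBMs are trained in parallel and with mini-batches (Section 5.1.3); the sequential, full-batch loops above are only for readability.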
4 Domain Adaptation with Our DBN Model

In this section, we introduce how to use our DBN model to adapt a basic syntactic and semantic dependency parsing system to the target domain.

4.1 The Basic Pipelined System

We build a typical pipelined system, which first analyzes syntactic dependencies and then analyzes semantic dependencies. This basic system only serves as a platform for experimenting with different feature representations, so we just briefly introduce it in this subsection.

4.1.1 Syntactic Dependency Parsing

For syntactic dependency parsing, we use a deterministic shift-reduce method as in Nivre et al. (2006). It has four basic actions: left-arc, right-arc, shift, and reduce. A classifier is used to determine the action at each step. To decide the label for each dependency link, we extend the left-arc and right-arc actions to their corresponding multi-label actions, leading to 31 left-arc and 66 right-arc actions. Altogether, this yields a 99-class problem for parsing action classification. We add arcs to the dependency graph in an arc-eager manner as in Hall et al. (2007). We also projectivize the non-projective sequences in the training data using the transformation from Nivre and Nilsson (2005). A maximum entropy classifier is used to make decisions at each step. The features utilized are the same as those in Zhao et al. (2008).

4.1.2 Semantic Dependency Parsing

Our semantic dependency parser is similar to the one in Che et al. (2009). We first train a predicate sense classifier on the training data, using the same features as in Che et al. (2009). Again, a maximum entropy classifier is employed. Given a predicate, we need to decide its semantic dependency relation with each word in the sentence. To reduce the number of argument candidates, we adopt the pruning strategy in Zhao et al. (2009), which is adapted from the strategy in Xue and Palmer (2004). In the semantic role classification stage, we use a maximum entropy classifier to predict the probability of a candidate filling each semantic role. We train two different classifiers for verb and noun predicates, using the same features as in Che et al. (2009). We use a simple method for post-processing: if there are duplicate arguments for ARG0-ARG5, we preserve the one with the highest classification probability and remove its duplicates.

4.2 Adapting the Basic System to the Target Domain

In our basic pipelined system, both the syntactic and semantic dependency parsers are built using discriminative models. We train a syntactic parsing model and a semantic parsing model using the original feature representation. We will refer to this syntactic parsing model as OriSynModel, and to the semantic parsing model as OriSemModel. However, these two models do not adapt well to the target domain. So we use the LFR of our DBN model to train new syntactic and semantic parsing models. We will refer to the new syntactic parsing model as LatSynModel, and to the new semantic parsing model as LatSemModel. Details of using our DBN model are as follows.

4.2.1 Adapting the Syntactic Parser

The input data for training our DBN model are the original feature vectors on the training and unlabeled data. Therefore, to train our DBN model, we first need to extract the original features for syntactic parsing on these data. Features on the training data can be directly extracted using gold-standard annotations. On the unlabeled data, however, some features cannot be directly extracted. This is because our syntactic parser uses history-based features, which depend on previous actions taken when parsing a sentence. Therefore, features on the unlabeled data can only be extracted after the data are parsed. To solve this problem, we first parse the unlabeled data using the already trained OriSynModel. In this way, we can obtain the features on the unlabeled data. Because of the poor performance of the OriSynModel on the target domain, the features extracted on the unlabeled data contain some noise. However, experiments show that our DBN model can still learn a good LFR despite the noise in the extracted features. Using the LFR, we can train the syntactic parsing model LatSynModel. Then, by applying the LFR on the test and unlabeled data, we can parse those data using the LatSynModel. Experiments in later sections show that the LatSynModel adapts to the target domain much better than the OriSynModel.

4.2.2 Adapting the Semantic Parser

The situation here is similar to the adaptation of the syntactic parser. Features on the training data can be directly extracted. To extract features on the unlabeled data, we need syntactic dependency trees for these data. So we first use our LatSynModel to parse the unlabeled data, and we automatically identify predicates on the unlabeled data using a classifier as in Che et al. (2008). Then we extract the original features for semantic parsing on the unlabeled data. By feeding the original features extracted on these data to our DBN model, we learn the LFR for semantic dependency parsing. Using the LFR, we can train the semantic parsing model LatSemModel.
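The overall adaptation procedure of Sections 4.2.1 and 4.2.2 can be summarized by the schematic function below. Every callable passed in (feature extraction, parser training, DBN training, predicate identification) is a placeholder for a component described above, not an actual API, so this is only an outline of the data flow under our own naming assumptions.

```python
def adapt_pipeline(train_sents, unlabeled_sents,
                   extract_syn_feats, extract_sem_feats,
                   train_parser, train_role_classifier,
                   train_dbn, identify_predicates):
    # 4.2.1: history-based syntactic features on unlabeled data require parses,
    # so parse the unlabeled data with OriSynModel first.
    ori_syn = train_parser(extract_syn_feats(train_sents))               # OriSynModel
    auto_parsed = [ori_syn(s) for s in unlabeled_sents]                  # noisy target-domain parses
    syn_dbn = train_dbn(extract_syn_feats(train_sents) +
                        extract_syn_feats(auto_parsed))                  # syntactic LFR
    lat_syn = train_parser(syn_dbn.lfr(extract_syn_feats(train_sents)))  # LatSynModel

    # 4.2.2: semantic features on unlabeled data need dependency trees,
    # so re-parse with LatSynModel and identify predicates automatically.
    reparsed = identify_predicates([lat_syn(s) for s in unlabeled_sents])
    sem_dbn = train_dbn(extract_sem_feats(train_sents) +
                        extract_sem_feats(reparsed))                     # semantic LFR
    lat_sem = train_role_classifier(sem_dbn.lfr(extract_sem_feats(train_sents)))  # LatSemModel
    return lat_syn, lat_sem
```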
5 Experiments

5.1 Experiment Setup

5.1.1 Experiment Data

We use the English data of the CoNLL 2009 shared task for our experiments. The training data and in-domain test data are from the WSJ corpus, whereas the out-of-domain test data are from the Brown corpus. We also use unlabeled data consisting of the following sections of the Brown corpus: K, L, M, N, and P. The test data are excerpts from fiction. The unlabeled data are also excerpts from fiction or stories, which are similar to the test data. Although the unlabeled data are actually annotated in Release 3 of the Penn Treebank, we do not use any information contained in those annotations; we use only the raw text. The training, test, and unlabeled data contain 39,279, 425, and 16,407 sentences, respectively.

5.1.2 Settings of Our DBN Model

For the syntactic parsing task, there are 748,598 original features in total. We use 7,486 hidden variables in the first layer and 3,743 hidden variables in the second layer. For semantic parsing, there are 1,074,786 original features. We use 10,748 hidden variables in the first layer and 5,374 hidden variables in the second layer.

In our DBN models, we need to determine the number of groups k. Because a larger k means less computational cost, k should not be set too small. We set k empirically as follows: according to our experience, each group should contain about 5,000 original features. We have about 10^6 original features in our tasks, so we estimate k ≈ 10^6 / 5,000 = 200. We set k to 200 in the DBN models for both syntactic and semantic parsing. As for the splitting strategy, we use the more sophisticated one from subsection 3.3.1, because it should generate better results than the simple one.

5.1.3 Details of DBN Training

In greedy pretraining of the DBN, the Contrastive Divergence algorithm is configured as follows: the training data is divided into mini-batches, each containing 100 samples. The weights are updated with a learning rate of 0.3, momentum of 0.9, and weight decay of 0.0001. Each layer is trained for 30 passes (epochs) over the entire training data. In fine-tuning, the backpropagation algorithm is configured as follows: the training data is divided into mini-batches, each containing 50 samples. The weights are updated with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.0001. The fine-tuning is repeated for 50 epochs over the entire training data. We use the fast computing technique in Raina et al. (2009) to learn the LFRs. Moreover, in greedy pretraining, we train the RBMs in the first layer in parallel.
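For orientation, the settings above can be recapped as the snippet below; the dictionary layout is ours, and the per-group sizes are derived from the floor rule of Section 3.3.1, so they are our own arithmetic rather than numbers reported in the paper.

```python
DBN_SETTINGS = {
    "syntactic": {"m": 748_598, "n1": 7_486, "n2": 3_743, "k": 200},
    "semantic":  {"m": 1_074_786, "n1": 10_748, "n2": 5_374, "k": 200},
}
PRETRAIN  = {"batch_size": 100, "lr": 0.3, "momentum": 0.9, "weight_decay": 1e-4, "epochs": 30}
FINE_TUNE = {"batch_size": 50,  "lr": 0.1, "momentum": 0.9, "weight_decay": 1e-4, "epochs": 50}

for task, s in DBN_SETTINGS.items():
    # every group except the last holds floor(m/k) visible and floor(n1/k) hidden units
    print(task, "visible units per group:", s["m"] // s["k"],
          "hidden units per group:", s["n1"] // s["k"])
```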
5.2 Results and Discussion

We use the official evaluation measures of the CoNLL 2009 shared task, which consist of three different scores: (i) syntactic dependencies are scored using the labeled attachment score, (ii) semantic dependencies are evaluated using a labeled F1 score, and (iii) the overall task is scored with a macro average of the two previous scores. The three scores are denoted LAS, Sem F1, and Macro F1, respectively, in this paper.

Test data   System   LAS     Sem F1   Macro F1
WSJ         Ori      87.63   84.82    86.24
            Lat      87.30   84.25    85.80
Brown       Ori      79.72   71.57    75.67
            Lat      82.84   78.75    80.83
Table 1: Results of our basic and adapted systems.

5.2.1 Comparison with the Un-adapted System

Our basic system uses the OriSynModel for syntactic parsing and the OriSemModel for semantic parsing. Our adapted system uses the LatSynModel for syntactic parsing and the LatSemModel for semantic parsing. The results of these two systems are shown in Table 1, in which the basic and adapted systems are denoted Ori and Lat, respectively. From the results in Table 1, we can see that Lat performs slightly worse than Ori on the in-domain WSJ test data. But on the out-of-domain Brown test data, Lat performs much better than Ori, with about a 5-point improvement in Macro F1 score. This shows the effectiveness of our method for domain adaptation.

5.2.2 Different Splitting Configurations

As described in subsection 5.1.2, we empirically set the number of groups k to 200 and chose the more sophisticated splitting strategy. In this subsection, we experiment with different splitting configurations to see their effects. Under each splitting configuration, we learn the LFRs using our DBN models. Using the LFRs, we test our adapted systems on both in-domain and out-of-domain data. We therefore get many test results, each corresponding to a splitting configuration. The in-domain and out-of-domain test results are reported in Table 2 and Table 3, respectively. In these two tables, 's1' and 's2' denote the simple and the more sophisticated splitting strategies of subsection 3.3.1, respectively, and 'k' is the number of groups in our DBN models. For both syntactic and semantic parsing, we use the same k in their DBN models. The 'Time' column reports the training time, in hours, of our DBN models for both syntactic and semantic parsing.

Str   k     Time (h)   LAS     Sem F1   Macro F1
s1    100   392        85.95   82.42    84.19
s1    200   261        85.76   82.14    83.95
s1    300   218        85.48   81.68    83.58
s1    400   196        84.80   80.24    82.52
s2    100   392        86.22   83.03    84.63
s2    200   261        86.10   82.89    84.50
s2    300   218        85.72   82.24    83.98
s2    400   196        84.96   81.13    83.05
Table 2: Results of different splitting configurations on in-domain WSJ development data.

Str   k     Time (h)   LAS     Sem F1   Macro F1
s1    100   392        82.81   78.77    80.82
s1    200   261        82.73   78.49    80.63
s1    300   218        82.44   77.90    80.37
s1    400   196        81.83   76.72    79.31
s2    100   392        82.95   79.03    81.03
s2    200   261        82.84   78.75    80.83
s2    300   218        82.63   78.34    80.50
s2    400   196        81.97   76.98    79.51
Table 3: Results of different splitting configurations on out-of-domain Brown test data.

Please note that we only need to train our DBN models once. We report the training times in Table 2 and, for easy viewing, repeat them in Table 3. This does not mean we need to train new DBN models for the out-of-domain test.

From Tables 2 and 3 we make the following observations.

First, although the more sophisticated splitting strategy 's2' generates slightly better results than the simple strategy 's1', the difference is not significant. This means that the hierarchical structure of our DBN model can robustly capture the relationships between features. Even with the simple splitting strategy 's1', we still get quite good results.

Second, the 'Time' column in Table 2 shows that different splitting strategies with the same k value have the same training time. This is reasonable because the training time depends only on the number of parameters in our DBN model, and different splitting strategies do not affect the number of parameters.

Third, the number of groups k affects both the training time and the final results. When k increases, the training time decreases but the results degrade. As k gets larger, the time reduction becomes less pronounced, while the degradation of the results becomes more pronounced. For k = 100, 200, 300, there is not much difference between the results. This shows that the results of our DBN model are not sensitive to values of k within a range of 100 around our initial estimate of 200.
But when k is further away from our estimate, e.g., k = 400, the results get significantly worse.

Please note that the results in Tables 2 and 3 are not used to tune the parameter k or to choose a splitting strategy for our DBN model. As mentioned in subsection 5.1.2, we chose k = 200 and the more sophisticated splitting strategy beforehand. In this paper, we always use the results with k = 200 and the 's2' strategy as our main results, even though the results with k = 100 are better.

5.3 The Size of Unlabeled Target Domain Data

An interesting question for our method is how much unlabeled target domain data should be used. To answer this question empirically, we learn several LFRs by gradually adding more unlabeled data to the training of our DBN model. We compare the performance of these LFRs in Figure 6.

[Figure 6: Macro F1 scores on test data with respect to the size of unlabeled target domain data used in DBN training. The horizontal axis is the number of sentences of unlabeled target domain data and the vertical axis is the Macro F1 score.]

From Figure 6, we can see that by adding more unlabeled target domain data, our system adapts better to the target domain, with only a small degradation of results on the source domain. However, as more unlabeled data is used, the improvement on the target domain gradually gets smaller.

5.4 Comparison with Other Methods

In this subsection, we compare our method with several systems, described below.

Daume07. Daumé III (2007) proposed a simple and effective adaptation method based on feature augmentation. Each feature in the original problem is turned into three versions: a general version, a source-specific version, and a target-specific version. Thus, the augmented source data contains only the general and source-specific versions, while the augmented target data contains the general and target-specific versions. In this baseline system, we adopt the same technique for syntactic and semantic dependency parsing.

Chen. The participating system of Zhao et al. (2009), which reached the best result in the out-of-domain test of the CoNLL 2009 shared task.

Daumé III and Marcu (2006) presented and discussed several 'obvious' ways to attack the domain adaptation problem without developing new algorithms. Following their idea, we construct the following similar systems.

OnlySrc. This system is trained only on the data of the source domain (News).

OnlyTgt. This system is trained only on the data of the target domain (Fiction).

All. This system is trained on all data of the source domain and the target domain.

It is worth noting that training the Daume07, OnlyTgt, and All systems requires labeled data from the target domain. We use OnlySrc to parse the unlabeled data of the target domain to generate this labeled data. All comparison results are shown in Table 4, in which the 'Diff' column is the difference between the scores on in-domain and out-of-domain test data.

Score      System     WSJ     Brown   Diff
LAS        OnlySrc    87.63   79.72   7.91
           OnlyTgt    73.25   78.30   5.05
           All        87.41   80.54   6.87
           Daume07    87.47   80.46   7.01
           Chen       89.19   82.38   6.81
           Ours       87.30   82.84   4.46
Sem F1     OnlySrc    84.82   71.57   13.25
           OnlyTgt    73.74   70.34   3.40
           All        84.68   72.75   11.93
           Daume07    84.52   72.90   11.62
           Chen       86.15   74.58   11.57
           Ours       84.25   78.75   5.50
Macro F1   OnlySrc    86.24   75.67   10.57
           OnlyTgt    73.50   74.32   0.82
           All        86.04   76.65   9.40
           Daume07    86.00   76.68   9.32
           Chen       87.69   78.51   9.18
           Ours       85.80   80.83   4.97
Table 4: Comparison with other methods.

First, we compare OnlySrc, OnlyTgt, and All. We can see that OnlyTgt performs very poorly in both the source domain and the target domain. It is not hard to understand that OnlyTgt performs poorly in the source domain because of the adaptation problem. OnlyTgt also performs poorly in the target domain.
We think the main reason is that OnlyTgt is trained on the auto-parsed data, in which there are many parsing errors. But we note that All performs better than both OnlySrc and OnlyTgt on the target domain test, although its training data contains some auto-parsed data. Therefore, target domain data, labeled or unlabeled, has potential for alleviating the adaptation problem across domains. But All simply puts the auto-parsed data of the target domain into the training set, so its improvement on the target domain test data is limited. In fact, how to use target domain data, especially unlabeled data, for domain adaptation is still an open and active topic in NLP and machine learning.

Second, we compare Daume07, All, and our method. Daumé III (2007) reported improvements on target domain tests. But one point to note is that the target domain data used in their experiments is labeled, while in our case there is only unlabeled data. We can see that Daume07 has performance comparable to All, which uses no adaptation strategy other than adding more data from the target domain. We think the main reason is that there are many parsing errors in the data of the target domain. But our method performs much better than Daume07 and All, even though some faulty data are also utilized in our system. This suggests that our method successfully learns new, robust representations for different domains, even when some of the data are noisy.

Third, we compare Chen with our method. Chen reached the best result in the out-of-domain test of the CoNLL 2009 shared task. The results in Table 4 show that Chen's system performs better than ours on the in-domain test data, especially on the LAS score. Chen's system uses a sophisticated graph-based syntactic dependency parser. Graph-based parsers use substantially more features; e.g., more than 1.3 × 10^7 features are used in McDonald et al. (2005). Learning an LFR for that many features would take months using our DBN model. So at present we only use a transition-based parser. The better performance of Chen's system mainly comes from its sophisticated syntactic parsing method. To reduce the sparsity of features, Chen's system uses word cluster features as in Koo et al. (2008). On the out-of-domain tests, however, our system still performs much better than Chen's, especially on semantic parsing. To our knowledge, our system has obtained the best out-of-domain performance on this data set to date. More importantly, the performance difference between in-domain and out-of-domain tests is much smaller for our system. This shows that our system adapts much better to the target domain.

6 Conclusions

In this paper, we propose a DBN model to learn LFRs for syntactic and semantic parsers. These LFRs are common representations of the original features in both the source and target domains.
Syntactic and semantic parsers using the LFRs adapt to the target domain much better than the same parsers using the original feature representation. Our model provides a unified method that adapts both syntactic and semantic dependency parsers to a new domain. In the future, we hope to further scale up our method to adapt parsing models that use substantially more features, such as graph-based syntactic dependency parsing models. We will also search for better splitting strategies for our DBN model. Finally, although our experiments are conducted on syntactic and semantic parsing, we expect that the proposed approach can be applied to the domain adaptation of other tasks with little adaptation effort.

Acknowledgements

The research work has been partially funded by the Natural Science Foundation of China under Grant No. 61333018 and supported by the West Light Foundation of the Chinese Academy of Sciences under Grant No. LHXZ201301. We thank the three anonymous reviewers and the Action Editor for their helpful comments and suggestions.

References

Yoshua Bengio. 2009. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

John Blitzer, Ryan McDonald and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of ACL-2006.

Wanxiang Che, Zhenghua Li, Yuxuan Hu, Yongqiang Li, Bing Qin, Ting Liu and Sheng Li. 2008. A Cascaded Syntactic and Semantic Dependency Parsing System. In Proceedings of the CoNLL-2008 Shared Task.

Wanxiang Che, Zhenghua Li, Yongqiang Li, Yuhang Guo, Bing Qin and Ting Liu. 2009. Multilingual Dependency-based Syntactic and Semantic Parsing. In Proceedings of the CoNLL-2009 Shared Task.

Wenliang Chen, Youzheng Wu and Hitoshi Isahara. 2008. Learning reliable information for dependency parsing adaptation. In Proceedings of COLING-2008.

Hal Daumé III. 2007. Frustratingly Easy Domain Adaptation. In Proceedings of ACL-2007.

Hal Daumé III and Daniel Marcu. 2006. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26:101-126.

Mark Dredze, John Blitzer, Partha P. Talukdar, Kuzman Ganchev, Joao Graca and Fernando Pereira. 2007. Frustratingly Hard Domain Adaptation for Dependency Parsing. In Proceedings of EMNLP-CoNLL-2007.

Xavier Glorot, Antoine Bordes and Yoshua Bengio. 2011. Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach. In Proceedings of the International Conference on Machine Learning (ICML) 2011.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.

I. Goodfellow, Q. Le, A. Saxe and A. Ng. 2009. Measuring invariances in deep networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 2009.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue and Yi Zhang. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of CoNLL-2009.

J. Hall, J. Nilsson, J. Nivre, G. Eryiğit, B. Megyesi, M. Nilsson, and M. Saers. 2007. Single Malt or Blended? A Study in Multilingual Parser Optimization. In Proceedings of EMNLP-CoNLL-2007.

Geoffrey Hinton. 2010. A Practical Guide to Training Restricted Boltzmann Machines. Technical Report 2010-003, Machine Learning Group, University of Toronto.

Geoffrey Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800.
Geoffrey Hinton, Simon Osindero and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554.

Geoffrey Hinton and R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

Richard Johansson and Pierre Nugues. 2008. Dependency-based semantic role labeling of PropBank. In Proceedings of EMNLP-2008.

Terry Koo, Xavier Carreras and Michael Collins. 2008. Simple Semi-supervised Dependency Parsing. In Proceedings of ACL-HLT-2008.

Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski and Suzanne Stevenson. 2008. Semantic Role Labeling: An Introduction to the Special Issue. Computational Linguistics, 34(2):145-159.

Ryan McDonald, Fernando Pereira, Jan Hajič, and Kiril Ribarov. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of NAACL-HLT-2005.

J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 Shared Task on Dependency Parsing. In Proceedings of CoNLL-2007.

J. Nivre, J. Hall, J. Nilsson, G. Eryiğit and S. Marinov. 2006. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. In Proceedings of CoNLL-2006.

J. Nivre and J. Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of ACL-2005.

Rajat Raina, Anand Madhavan, and Andrew Y. Ng. 2009. Large-scale Deep Unsupervised Learning using Graphics Processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 152-164.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez and Joakim Nivre. 2008. The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In Proceedings of CoNLL-2008.

Ivan Titov. 2011. Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation. In Proceedings of ACL-2011.

Joseph Turian, Lev Ratinov and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL-2010.

J. Weston, F. Ratle, and R. Collobert. 2008. Deep Learning via Semi-Supervised Embedding. In Proceedings of the International Conference on Machine Learning (ICML).

Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In Proceedings of EMNLP-2004.

Haitong Yang and Chengqing Zong. 2014. Multi-Predicate Semantic Role Labeling. In Proceedings of EMNLP-2014.

Hai Zhao, Wenliang Chen, Chunyu Kit and Guodong Zhou. 2009. Multilingual Dependency Learning: Exploiting Rich Features for Tagging Syntactic and Semantic Dependencies. In Proceedings of the CoNLL-2009 Shared Task.

Hai Zhao and Chunyu Kit. 2008. Parsing Syntactic and Semantic Dependencies with Two Single-Stage Maximum Entropy Models. In Proceedings of CoNLL-2008.

Tao Zhuang and Chengqing Zong. 2010a. A Minimum Error Weighting Combination Strategy for Chinese Semantic Role Labeling. In Proceedings of COLING-2010.

Tao Zhuang and Chengqing Zong. 2010b. Joint Inference for Bilingual Semantic Role Labeling. In Proceedings of EMNLP-2010.