Segmentation for Efficient Supervised Language Annotation with an Explicit Cost-Utility Tradeoff

Matthias Sperber¹, Mirjam Simantzik², Graham Neubig³, Satoshi Nakamura³, Alex Waibel¹
¹Karlsruhe Institute of Technology, Institute for Anthropomatics, Germany
²Mobile Technologies GmbH, Germany
³Nara Institute of Science and Technology, AHC Laboratory, Japan
matthias.sperber@kit.edu, mirjam.simantzik@jibbigo.com, neubig@is.naist.jp, s-nakamura@is.naist.jp, waibel@kit.edu

Abstract

In this paper, we study the problem of manually correcting automatic annotations of natural language in as efficient a manner as possible. We introduce a method for automatically segmenting a corpus into chunks such that many uncertain labels are grouped into the same chunk, while human supervision can be omitted altogether for other segments. A tradeoff must be found for segment sizes. Choosing short segments allows us to reduce the number of highly confident labels that are supervised by the annotator, which is useful because these labels are often already correct and supervising correct labels is a waste of effort. In contrast, long segments reduce the cognitive effort due to context switches. Our method helps find the segmentation that optimizes supervision efficiency by defining user models to predict the cost and utility of supervising each segment and solving a constrained optimization problem balancing these contradictory objectives. A user study demonstrates noticeable gains over pre-segmented, confidence-ordered baselines on two natural language processing tasks: speech transcription and word segmentation.

1 Introduction

Many natural language processing (NLP) tasks require human supervision to be useful in practice, be it to collect suitable training material or to meet some desired output quality. Given the high cost of human intervention, how to minimize the supervision effort is an important research problem. Previous works in areas such as active learning, post editing, and interactive pattern recognition have investigated this question with notable success (Settles, 2008; Specia, 2011; González-Rubio et al., 2010).

[Figure 1: Three automatic transcripts of the sentence "It was a bright cold day in April, and the clocks were striking thirteen", each reading "It was a bright cold (they) in (apron), and (a) clocks were striking thirteen", with recognition errors in parentheses. The underlined parts are to be corrected by a human for (a) sentences, (b) words, or (c) the proposed segmentation.]

The most common framework for efficient annotation in the NLP context consists of training an NLP system on a small amount of baseline data, and then running the system on unannotated data to estimate confidence scores of the system's predictions (Settles, 2008). Sentences with the lowest confidence are then used as the data to be annotated (Figure 1 (a)). However, it has been noted that when the NLP system in question already has relatively high accuracy, annotating entire sentences can be wasteful, as most words will already be correct (Tomanek and Hahn, 2009; Neubig et al., 2011). In these cases, it is possible to achieve much higher benefit per annotated word by annotating sub-sentential units (Figure 1 (b)).
However, as Settles et al. (2008) point out, simply maximizing the benefit per annotated instance is not enough, as the real supervision effort varies greatly across instances. This is particularly important in the context of choosing segments to annotate, as human annotators heavily rely on semantics and context information to process language, and intuitively, a consecutive sequence of words can be supervised faster and more accurately than the same number of words spread out over several locations in a text. This intuition can also be seen in our empirical data in Figure 2, which shows that for the speech transcription and word segmentation tasks described later in Section 5, short segments had a longer annotation time per word. Based on this fact, we argue it would be desirable to present the annotator with a segmentation of the data into easily supervisable chunks that are both large enough to reduce the number of context switches, and small enough to prevent unnecessary annotation (Figure 1 (c)).

[Figure 2: Average annotation time per instance [sec], plotted over different segment lengths, for the transcription task and the word segmentation task. For both tasks, the effort clearly increases for short segments.]

In this paper, we introduce a new strategy for natural language supervision tasks that attempts to optimize supervision efficiency by choosing an appropriate segmentation. It relies on a user model that, given a specific segment, predicts the cost and the utility of supervising that segment. Given this user model, the goal is to find a segmentation that minimizes the total predicted cost while maximizing the utility. We balance these two criteria by defining a constrained optimization problem in which one criterion is the optimization objective, while the other criterion is used as a constraint. Doing so allows specifying practical optimization goals such as "remove as many errors as possible given a limited time budget," or "annotate data to obtain some required classifier accuracy in as little time as possible."

Solving this optimization task is computationally difficult; in fact, it is an NP-hard problem. Nevertheless, we demonstrate that by making realistic assumptions about the segment length, an optimal solution can be found using an integer linear programming formulation for mid-sized corpora, as are common for supervised annotation tasks. For larger corpora, we provide simple heuristics to obtain an approximate solution in a reasonable amount of time.

Experiments over two example scenarios demonstrate the usefulness of our method: post editing for speech transcription, and active learning for Japanese word segmentation. Our model predicts noticeable efficiency gains, which are confirmed in experiments with human annotators.

2 Problem Definition

The goal of our method is to find a segmentation over a corpus of word tokens $w_1^N$ that optimizes supervision efficiency according to some predictive user model. The user model is denoted as a set of functions $u_{l,k}(w_a^b)$ that evaluate any possible subsequence $w_a^b$ of tokens in the corpus according to criteria $l \in L$ and supervision modes $k \in K$. Let us illustrate this with an example.
Sperber et al. (2013) defined a framework for speech transcription in which an initial, erroneous transcript is created using automatic speech recognition (ASR), and an annotator corrects the transcript either by correcting the words by keyboard, by respeaking the content, or by leaving the words as is. In this case, we could define $K=\{\mathrm{TYPE}, \mathrm{RESPEAK}, \mathrm{SKIP}\}$, each constant representing one of these three supervision modes. Our method will automatically determine the appropriate supervision mode for each segment. The user model in this example might evaluate every segment according to two criteria $L$: a cost criterion (in terms of supervision time) and a utility criterion (in terms of the number of removed errors), when using each mode. Intuitively, respeaking should be assigned both lower cost (because speaking is faster than typing), but also lower utility than typing on a keyboard (because respeaking recognition errors can occur). The SKIP mode denotes the special, unsupervised mode that always returns 0 cost and 0 utility.

Other possible supervision modes include multiple input modalities (Suhm et al., 2001), several human annotators with different expertise and cost (Donmez and Carbonell, 2008), and correction vs. translation from scratch in machine translation (Specia, 2011). Similarly, cost could instead be expressed in monetary terms, or the utility function could predict the improvement of a classifier when the resulting annotation is not intended for direct human consumption, but as training data for a classifier in an active learning framework.

3 Optimization Framework

Given this setting, we are interested in simultaneously finding optimal locations and supervision modes for all segments, according to the given criteria. Each resulting segment will be assigned exactly one of these supervision modes. We denote a segmentation of the $N$ tokens of corpus $w_1^N$ into $M \le N$ segments by specifying segment boundary markers $s_1^{M+1}=(s_1=1, s_2, \dots, s_{M+1}=N+1)$. Setting a boundary marker $s_i=a$ means that we put a segment boundary before the $a$-th word token (or the end-of-corpus marker for $a=N+1$). Thus our corpus is segmented into token sequences $[(w_{s_j}, \dots, w_{s_{j+1}-1})]_{j=1}^{M}$. The supervision modes assigned to each segment are denoted by $m_j$. We favor those segmentations that minimize the cumulative value $\sum_{j=1}^{M} \big[ u_{l,m_j}(w_{s_j}^{s_{j+1}-1}) \big]$ for each criterion $l$. For any criterion where larger values are intuitively better, we flip the sign before defining $u_{l,m_j}(w_{s_j}^{s_{j+1}-1})$ to maintain consistency (e.g. negative number of errors removed).
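To make this notation concrete, the following sketch (not part of the original paper) shows one way the segmentation $s_1^{M+1}$, the modes $m_1^M$, and the cumulative criterion values could be represented in code. The criteria, modes, and the toy cost and utility numbers in `toy_model` are invented for illustration; a real system would plug in the learned user models described in Section 4.

```python
from typing import Callable, Dict, List, Tuple

# Criteria and supervision modes from the running example.
TIME, ERRORS = "time", "errors"                   # criteria l in L
TYPE, RESPEAK, SKIP = "type", "respeak", "skip"   # modes k in K

# A user model u_{l,k}(w_a^b): maps a token span to a predicted value.
UserModel = Dict[Tuple[str, str], Callable[[List[str]], float]]

def cumulative_value(tokens: List[str],
                     boundaries: List[int],   # s_1=0, ..., s_{M+1}=N (0-based)
                     modes: List[str],        # m_1, ..., m_M
                     user_model: UserModel,
                     criterion: str) -> float:
    """Sum of u_{criterion, m_j} over all segments of the segmentation."""
    total = 0.0
    for j in range(len(modes)):
        segment = tokens[boundaries[j]:boundaries[j + 1]]
        total += user_model[(criterion, modes[j])](segment)
    return total

# Toy user model: SKIP costs nothing and fixes nothing; TYPE costs ~2s/token.
toy_model: UserModel = {
    (TIME, TYPE): lambda seg: 2.0 * len(seg),
    (TIME, SKIP): lambda seg: 0.0,
    (ERRORS, TYPE): lambda seg: -0.2 * len(seg),  # negated utility (errors removed)
    (ERRORS, SKIP): lambda seg: 0.0,
}

tokens = "it was a bright cold day in april".split()
boundaries = [0, 3, 6, 8]          # three segments
modes = [SKIP, TYPE, SKIP]
print(cumulative_value(tokens, boundaries, modes, toy_model, TIME))    # 6.0
print(cumulative_value(tokens, boundaries, modes, toy_model, ERRORS))  # -0.6
```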
3.1 Multiple Criteria Optimization

In the case of a single criterion ($|L|=1$), we obtain a simple, single-objective unconstrained linear optimization problem, efficiently solvable via dynamic programming (Terzi and Tsaparas, 2006). However, in practice one usually encounters several competing criteria, such as cost and utility, and here we will focus on this more realistic setting. We balance competing criteria by using one as an optimization objective, and the others as constraints.¹ Let criterion $l_0$ be the optimization objective criterion, and let $C_l$ denote the constraining constants for the criteria $l \in L_{-l_0} = L \setminus \{l_0\}$. We state the optimization problem:

$$\min_{M;\, s_1^{M+1};\, m_1^M} \; \sum_{j=1}^{M} \Big[ u_{l_0,m_j}\big(w_{s_j}^{s_{j+1}-1}\big) \Big]$$

$$\text{s.t.} \quad \sum_{j=1}^{M} \Big[ u_{l,m_j}\big(w_{s_j}^{s_{j+1}-1}\big) \Big] \le C_l \qquad (\forall l \in L_{-l_0})$$

¹ This approach is known as the bounded objective function method in the multi-objective optimization literature (Marler and Arora, 2004). The very popular weighted sum method merges criteria into a single efficiency measure, but is problematic in our case because the number of supervised tokens is unspecified. Unless the weights are carefully chosen, the algorithm might find, e.g., the completely unsupervised or completely supervised segmentation to be most "efficient."

This constrained optimization problem is difficult to solve. In fact, the NP-hard multiple-choice knapsack problem (Pisinger, 1994) corresponds to a special case of our problem in which the number of segments is equal to the number of tokens, implying that our more general problem is NP-hard as well.

In order to overcome this problem, we reformulate the search for the optimal segmentation as a resource-constrained shortest path problem in a directed, acyclic multigraph. While still not efficiently solvable in theory, this problem is well studied in domains such as vehicle routing and crew scheduling (Irnich and Desaulniers, 2005), and it is known that in many practical situations the problem can be solved reasonably efficiently using integer linear programming relaxations (Toth and Vigo, 2001).

In our formalism, the set of nodes $V$ represents the spaces between neighboring tokens, at which the algorithm may insert segment boundaries. A node with index $i$ represents a segment break before the $i$-th token, and thus the sequence of the indices in a path directly corresponds to $s_1^{M+1}$. Edges $E$ denote the grouping of tokens between the respective nodes into one segment. Edges are always directed from left to right, and labeled with a supervision mode. In addition, each edge between nodes $i$ and $j$ is assigned $u_{l,k}(w_i^{j-1})$, the corresponding predicted value for each criterion $l \in L$ and supervision mode $k \in K$, indicating that the supervision mode of the $j$-th segment in a path directly corresponds to $m_j$.

[Figure 3: Excerpt of a segmentation graph for an example transcription task similar to Figure 1 (some edges are omitted for readability). Edges are labeled with their mode, predicted number of errors that can be removed, and necessary supervision time. A segmentation scheme might prefer solid edges over dashed ones in this example.]

Figure 3 shows an example of what the resulting graph may look like. Our original optimization problem is now equivalent to finding the shortest path between the first and last nodes according to criterion $l_0$, while obeying the given resource constraints. According to a widely used formulation for the resource-constrained shortest path problem, we can define $E_{ij}$ as the set of competing edges between $i$ and $j$, and express this optimization problem with the following integer linear program (ILP):

$$\min_{x} \; \sum_{i,j \in V} \sum_{k \in E_{ij}} x_{ijk}\, u_{l_0,k}(w_i^{j-1}) \quad (1)$$

$$\text{s.t.} \; \sum_{i,j \in V} \sum_{k \in E_{ij}} x_{ijk}\, u_{l,k}(w_i^{j-1}) \le C_l \quad (\forall l \in L_{-l_0}) \quad (2)$$

$$\sum_{i \in V} \sum_{k \in E_{ij}} x_{ijk} = \sum_{i \in V} \sum_{k \in E_{ji}} x_{jik} \quad (\forall j \in V \setminus \{1, n\}) \quad (3)$$

$$\sum_{j \in V} \sum_{k \in E_{1j}} x_{1jk} = 1 \quad (4)$$

$$\sum_{i \in V} \sum_{k \in E_{in}} x_{ink} = 1 \quad (5)$$

$$x_{ijk} \in \{0, 1\} \quad (\forall x_{ijk} \in x) \quad (6)$$

The variables $x=\{x_{ijk} \mid i, j \in V, k \in E_{ij}\}$ denote the activation of the $k$-th edge between nodes $i$ and $j$. The shortest path according to the minimization objective (1), that still meets the resource constraints for the specified criteria (2), is to be computed. The degree constraints (3,4,5) specify that all but the first and last nodes must have as many incoming as outgoing edges, while the first node must have exactly one outgoing, and the last node exactly one incoming edge. Finally, the integrality condition (6) forces all edges to be either fully activated or fully deactivated. The outlined problem formulation can be solved directly using off-the-shelf ILP solvers; here we employ GUROBI (Gurobi Optimization, 2012).
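The following is a minimal sketch of how such an ILP could be assembled and solved in practice. It is not the authors' implementation: it uses the open-source PuLP modeling library with its bundled CBC solver instead of GUROBI, restricts edges to a small maximum span, and plugs in invented per-token cost and utility functions for two modes; only the structure (binary edge variables, one resource constraint, and the degree constraints) mirrors equations (1)-(6).

```python
# pip install pulp
import pulp

def segment_ilp(n_tokens, modes, max_span, budget_seconds):
    """Pick segment boundaries and modes maximizing predicted errors removed
    (i.e. minimizing the negated utility), subject to a time budget."""
    # modes: dict name -> (cost_fn(i, j), utility_fn(i, j)) over token span [i, j)
    nodes = range(n_tokens + 1)                     # positions between tokens
    edges = [(i, j, k) for i in nodes for j in nodes
             if 0 < j - i <= max_span for k in modes]

    prob = pulp.LpProblem("segmentation", pulp.LpMinimize)
    x = {e: pulp.LpVariable(f"x_{e[0]}_{e[1]}_{e[2]}", cat="Binary") for e in edges}

    # Objective (1): negated utility (errors removed), so minimizing is correct.
    prob += pulp.lpSum(-modes[k][1](i, j) * x[(i, j, k)] for (i, j, k) in edges)
    # Resource constraint (2): predicted supervision time within the budget.
    prob += pulp.lpSum(modes[k][0](i, j) * x[(i, j, k)]
                       for (i, j, k) in edges) <= budget_seconds
    # Degree constraints (3)-(5): exactly one path from node 0 to node n_tokens.
    prob += pulp.lpSum(x[e] for e in edges if e[0] == 0) == 1
    prob += pulp.lpSum(x[e] for e in edges if e[1] == n_tokens) == 1
    for v in range(1, n_tokens):
        prob += (pulp.lpSum(x[e] for e in edges if e[1] == v)
                 == pulp.lpSum(x[e] for e in edges if e[0] == v))

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return sorted((i, j, k) for (i, j, k) in edges if x[(i, j, k)].value() > 0.5)

# Toy example: typing costs 2s/token and removes 0.2 errors/token; skipping is free.
modes = {"TYPE": (lambda i, j: 2.0 * (j - i), lambda i, j: 0.2 * (j - i)),
         "SKIP": (lambda i, j: 0.0,           lambda i, j: 0.0)}
print(segment_ilp(n_tokens=8, modes=modes, max_span=4, budget_seconds=10))
```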
3.2 Heuristics for Approximation

In general, edges are inserted for every supervision mode between every combination of two nodes. The search space can be constrained by removing some of these edges to increase efficiency. In this study, we only consider edges spanning at most 20 tokens. For cases in which larger corpora are to be annotated, or when the acceptable delay for delivering results is small, a suitable segmentation can be found approximately. The easiest way would be to partition the corpus, e.g. according to its individual documents, divide the budget constraints evenly across all partitions, and then segment each partition independently. More sophisticated methods might approximate the Pareto front for each partition, and distribute the budgets in an intelligent way.

4 User Modeling

While the proposed framework is able to optimize the segmentation with respect to each criterion, it also rests upon the assumption that we can provide user models $u_{l,k}(w_i^{j-1})$ that accurately evaluate every segment according to the specified criteria and supervision modes. In this section, we discuss our strategies for estimating three conceivable criteria: annotation cost, correction of errors, and improvement of a classifier.

4.1 Annotation Cost Modeling

Modeling cost requires solving a regression problem from features of a candidate segment to annotation cost, for example in terms of supervision time. Appropriate input features depend on the task, but should include notions of complexity (e.g. a confidence measure) and length of the segment, as both are expected to strongly influence supervision time.

We propose using Gaussian process (GP) regression for cost prediction, a state-of-the-art nonparametric Bayesian regression technique (Rasmussen and Williams, 2006).² As reported on a similar task by Cohn and Specia (2013), and confirmed by our preliminary experiments, GP regression significantly outperforms popular techniques such as support vector regression and least-squares linear regression. We also follow their settings for GP, employing GP regression with a squared exponential kernel with automatic relevance determination. Depending on the number of users and the amount of training data available for each user, models may be trained separately for each user (as we do here), or in a combined fashion via multi-task learning as proposed by Cohn and Specia (2013).

² Code available at http://www.gaussianprocess.org/gpml/

It is also crucial for the predictions to be reliable throughout the whole relevant space of segments. If the cost of certain types of segments is systematically underpredicted, the segmentation algorithm might be misled to prefer these, possibly a large number of times.³ An effective trick to prevent such underpredictions is to predict the log time instead of the actual time. In this way, errors in the critical low end are penalized more strongly, and the time can never become negative.

³ For instance, consider a model that predicts well for segments of medium size or longer, but underpredicts the supervision time of single-token segments. This may lead the segmentation algorithm to put every token into its own segment, which is clearly undesirable.
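As an illustration, a cost model along these lines might look as follows. The paper's experiments use the GPML toolbox with a squared exponential ARD kernel; this sketch expresses the same idea with scikit-learn as a stand-in, and the feature values, targets, and kernel hyperparameters are toy assumptions rather than values from the study.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy training data: one row per annotated segment with
# [segment length, audio duration (s), mean confidence]; target is time in seconds.
X = np.array([[3, 1.2, 0.95], [8, 3.0, 0.70], [15, 6.1, 0.55], [1, 0.4, 0.99]])
y_seconds = np.array([4.0, 14.0, 31.0, 2.5])

# Squared exponential (RBF) kernel with one length scale per feature (ARD),
# plus a noise term; we regress on log time so that predictions stay positive
# and underprediction of short, cheap segments is penalized more strongly.
kernel = RBF(length_scale=[1.0, 1.0, 1.0]) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, np.log(y_seconds))

new_segments = np.array([[5, 2.0, 0.80], [2, 0.6, 0.97]])
predicted_time = np.exp(gp.predict(new_segments))   # back to seconds
print(predicted_time)
```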
4.2 Error Correction Modeling

As one utility measure, we can use the number of errors corrected, a useful measure for post editing tasks over automatically produced annotations. In order to measure how many errors can be removed by supervising a particular segment, we must estimate both how many errors are in the automatic annotation, and how reliably a human can remove these for a given supervision mode.

Most machine learning techniques can estimate confidence scores in the form of posterior probabilities. To estimate the number of errors, we can sum over one minus the posterior for all tokens, which estimates the Hamming distance from the reference annotation. This measure is appropriate for tasks in which the number of tokens is fixed in advance (e.g. a part-of-speech estimation task), and a reasonable approximation for tasks in which the number of tokens is not known in advance (e.g. speech transcription, cf. Section 5.1.1).

Predicting the particular tokens at which a human will make a mistake is known to be a difficult task (Olson and Olson, 1990), but a simplifying constant human error rate can still be useful. For example, in the task from Section 2, we may suspect a certain number of errors in a transcript segment, and predict, say, 95% of those errors to be removed via typing, but only 85% via respeaking.

4.3 Classifier Improvement Modeling

Another reasonable utility measure is the accuracy of a classifier trained on the data we choose to annotate in an active learning framework. Confidence scores have been found useful for ranking particular tokens with regards to how much they will improve a classifier (Settles, 2008). Here, we may similarly score segment utility as the sum of its token confidences, although care must be taken to normalize and calibrate the token confidences to be linearly comparable before doing so. While the resulting utility score has no interpretation in absolute terms, it can still be used as an optimization objective (cf. Section 5.2.1).
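A sketch of the error-correction utility model of Section 4.2, under the simplifying assumptions above, could look as follows. The posterior values and the per-mode correction rates are invented for illustration, and the posteriors are assumed to be already calibrated (cf. Section 5.1.1).

```python
import numpy as np

def predicted_errors_removed(token_posteriors, mode, human_correction_rate):
    """Expected number of errors in the segment (sum of 1 - posterior, an
    estimated Hamming distance to the reference), scaled by the constant
    fraction of errors the human is assumed to remove in the given mode."""
    expected_errors = np.sum(1.0 - np.asarray(token_posteriors))
    return human_correction_rate[mode] * expected_errors

# Toy values: calibrated posteriors for a 5-token segment, and assumed
# per-mode correction rates (e.g. 95% for typing, 85% for respeaking).
posteriors = [0.96, 0.71, 0.88, 0.55, 0.99]
rates = {"TYPE": 0.95, "RESPEAK": 0.85, "SKIP": 0.0}
for mode in rates:
    print(mode, round(predicted_errors_removed(posteriors, mode, rates), 3))
```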
5 Experiments

In this section, we present experimental results examining the effectiveness of the proposed method over two tasks: speech transcription and Japanese word segmentation.⁴

⁴ Software and experimental data can be downloaded from http://www.msperber.com/research/tacl-segmentation/

5.1 Speech Transcription Experiments

Accurate speech transcripts are a much-demanded NLP product, useful by themselves, as training material for ASR, or as input for follow-up tasks like speech translation. With recognition accuracies plateauing, manually correcting (post editing) automatic speech transcripts has become popular. Common approaches are to identify words (Sanchez-Cortina et al., 2012) or (sub-)sentences (Sperber et al., 2013) of low confidence, and have a human editor correct these.

5.1.1 Experimental Setup

We conducted a user study in which participants post-edited speech transcripts, given a fixed goal word error rate. The transcription setup was such that the transcriber could see the ASR transcript of parts before and after the segment that he was editing, providing context if needed. When imprecise time alignment resulted in segment breaks that were slightly "off," as happened occasionally, that context helped guess what was said. The segment itself was transcribed from scratch, as opposed to editing the ASR transcript; besides being arguably more efficient when the ASR transcript contains many mistakes (Nanjo et al., 2006; Akita et al., 2009), preliminary experiments also showed that supervision time is far easier to predict this way. Figure 4 illustrates what the setup looked like.

We used a self-developed transcription tool to conduct the experiments. It presents our computed segments one by one, allows convenient input and playback via keyboard shortcuts, and logs user interactions with their time stamps. A selection of TED talks⁵ (English talks on technology, entertainment, and design) served as experimental data. While some of these talks contain jargon such as medical terms, they are presented by skilled speakers, making them comparably easy to understand. Initial transcripts were created using the Janus recognition toolkit (Soltau et al., 2001) with a standard, TED-optimized setup. We used confusion networks for decoding and obtaining confidence scores.

⁵ www.ted.com

For reasons of simplicity, and better comparability to our baseline, we restricted our experiment to two supervision modes: TYPE and SKIP. We conducted experiments with 3 participants, 1 with several years of experience in transcription, 2 with none. Each participant received an explanation of the transcription guidelines, and a short hands-on training to learn to use our tool. Next, they transcribed a balanced selection of 200 segments of varying length and quality in random order. This data was used to train the user models. Finally, each participant transcribed another 2 TED talks, with word error rate (WER) 19.96% (predicted: 22.33%). We set a target (predicted) WER of 15% as our optimization constraint,⁶ and minimized the predicted supervision time as our objective function. Both TED talks were transcribed once using the baseline strategy, and once using the proposed strategy. The order of both strategies was reversed between talks, to minimize learning bias due to transcribing each talk twice.

⁶ Depending on the level of accuracy required by our final application, this target may be set lower or higher.

The baseline strategy was adopted according to Sperber et al. (2013): We segmented the talk into natural, subsentential units, using Matusov et al. (2006)'s segmenter, which we tuned to reproduce the TED subtitle segmentation, producing a mean segment length of 8.6 words. Segments were added in order of increasing average word confidence, until the user model predicted a WER < 15%. The second segmentation strategy was the proposed method, similarly with a resource constraint of WER < 15%.

Supervision time was predicted via GP regression (cf. Section 4.1), using segment length, audio duration, and mean confidence as input features. The output variable was assumed subject to additive Gaussian noise with zero mean; a variance of 5 seconds was chosen empirically to minimize the mean squared error. Utility prediction (cf. Section 4.2) was based on posterior scores obtained from the confusion networks. We found it important to calibrate them, as the posteriors were overconfident especially in the upper range. To do so, we automatically transcribed a development set of TED data, grouped the recognized words into buckets according to their posteriors, and determined the average number of errors per word in each bucket from an alignment with the reference transcript. The mapping from average posterior to average number of errors was estimated via GP regression. The result was summed over all tokens, and multiplied by a constant human confidence, separately determined for each participant.⁷

⁷ More elaborate methods for WER estimation exist, such as by Ogawa et al. (2013), but if our method achieves improvements using simple Hamming distance, incorporating more sophisticated measures will likely achieve similar, or even better accuracy.
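A rough sketch of this calibration step is given below. The bucket count, the synthetic development data, and the use of scikit-learn's GP regressor are assumptions made for illustration; the paper only specifies bucketing by posterior, counting errors against the reference alignment, and fitting the posterior-to-error mapping with GP regression.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_posterior_calibration(posteriors, is_error, n_buckets=10):
    """Map raw ASR posteriors to empirical per-word error rates.
    posteriors: raw confusion-network posteriors for dev-set words.
    is_error:   1 if the word was wrong in an alignment with the reference."""
    posteriors, is_error = np.asarray(posteriors), np.asarray(is_error)
    order = np.argsort(posteriors)
    buckets = np.array_split(order, n_buckets)           # equal-size buckets
    mean_post = np.array([posteriors[b].mean() for b in buckets])
    mean_err = np.array([is_error[b].mean() for b in buckets])
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(mean_post.reshape(-1, 1), mean_err)
    # Return a function: posterior -> expected number of errors for that word.
    return lambda p: np.clip(gp.predict(np.asarray(p).reshape(-1, 1)), 0.0, 1.0)

# Toy dev data: overconfident posteriors (errors occur even at high posterior).
rng = np.random.default_rng(0)
post = rng.uniform(0.5, 1.0, size=2000)
err = (rng.uniform(size=2000) < (1.2 * (1.0 - post) + 0.05)).astype(int)
calibrate = fit_posterior_calibration(post, err)
print(calibrate([0.99, 0.9, 0.6]))   # expected errors per word, now calibrated
```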
5.1.2 Simulation Results

To convey a better understanding of the potential gains afforded by our method, we first present a simulated experiment. We assume a transcriber who makes no mistakes, and needs exactly the amount of time predicted by a user model trained on the data of a randomly selected participant. We compare three scenarios: a baseline simulation, in which the baseline segments are transcribed in ascending order of confidence; a simulation using the proposed method, in which we change the WER constraint in small increments; and finally an oracle simulation, which uses the proposed method, but with a utility model that knows the actual number of errors in each segment. For each supervised segment, we simply replace the ASR output with the reference, and measure the resulting WER.

[Figure 4: Result of our segmentation method (excerpt), e.g. "(3) SKIP: 'nineteen forty six until today you see the green' (4) TYPE: (5) SKIP: 'Interstate conflict' (6) TYPE: (7) SKIP: ...". TYPE segments are displayed empty and should be transcribed from scratch. For SKIP segments, the ASR transcript is displayed to provide context. When annotating a segment, the corresponding audio is played back.]

[Figure 5: Simulation of post editing on an example TED talk (resulting WER [%] over post editing time [min] for baseline, proposed, and oracle). The proposed method reduces the WER considerably faster than the baseline at first; later both converge. The much superior oracle simulation indicates room for further improvement.]

Figure 5 shows the simulation on an example TED talk, based on an initial transcript with 21.9% WER. The proposed method is able to reduce the WER faster than the baseline, up to a certain point where they converge. The oracle simulation is even faster, indicating room for improvement through better confidence scores.

5.1.3 User Study Results

Table 1 shows the results of the user study. First, we note that the WER estimation by our utility model was off by about 2.5%: While the predicted improvement in WER was from 22.33% to 15.0%, the actual improvement was from 19.96% to about 12.5%. The actual resulting WER was consistent across all users, and we observe strong, consistent reductions in supervision time for all participants.

Participant    Baseline           Proposed
               WER     Time       WER     Time
P1             12.26   44:05      12.18   33:01
P2             12.75   36:19      12.77   29:54
P3             12.70   52:42      12.50   37:57
AVG            12.57   44:22      12.48   33:37

Table 1: Transcription task results. For each user, the resulting WER [%] after supervision is shown, along with the time [min] they needed. The unsupervised WER was 19.96%.

Prediction of the necessary supervision time was accurate: averaged over participants, 45:41 minutes were predicted for the baseline, 44:22 minutes measured.
For the proposed method, 32:11 minutes were predicted, 33:37 minutes measured. On average, participants removed 6.68 errors per minute using the baseline, and 8.93 errors per minute using the proposed method, a speed-up of 25.2%. Note that predicted and measured values are not strictly comparable: In the experiments, to provide a fair comparison, participants transcribed the same talks twice (once using the baseline, once the proposed method, in alternating order), resulting in a noticeable learning effect. The user model, on the other hand, is trained to predict the case in which a transcriber conducts only one transcription pass.

As an interesting finding, without being informed about the order of baseline and proposed method, participants reported that transcribing according to the proposed segmentation seemed harder, as they found the baseline segmentation more linguistically reasonable. However, this perceived increase in difficulty did not show in the efficiency numbers.

5.2 Japanese Word Segmentation Experiments

Word segmentation is the first step in NLP for languages that are commonly written without word boundaries, such as Japanese and Chinese. We apply our method to a task in which we domain-adapt a word segmentation classifier via active learning. In this experiment, participants annotated whether or not a word boundary occurred at certain positions in a Japanese sentence. The tokens to be grouped into segments are positions between adjacent characters.

5.2.1 Experimental Setup

Neubig et al. (2011) have proposed a pointwise method for Japanese word segmentation that can be trained using partially annotated sentences, which makes it attractive in combination with active learning, as well as our segmentation method. The authors released their method as a software package "KyTea" that we employed in this user study. We used KyTea's active learning domain adaptation toolkit⁸ as a baseline.

⁸ http://www.phontron.com/kytea/active.html

For data, we used the Balanced Corpus of Contemporary Written Japanese (BCCWJ), created by Maekawa (2008), with the internet Q&A subcorpus as in-domain data, and the whitepaper subcorpus as background data, a domain adaptation scenario. Sentences were drawn from the in-domain corpus, and the manually annotated data was then used to train KyTea, along with the pre-annotated background data. The goal (objective function) was to improve KyTea's classification accuracy on an in-domain test set, given a constrained time budget of 30 minutes. There were again 2 supervision modes: ANNOTATE and SKIP. Note that this is essentially a batch active learning setup with only one iteration.

We conducted experiments with one expert with several years of experience in Japanese word segmentation annotation, and three non-expert native speakers with no prior experience. Japanese word segmentation is not a trivial task, so we provided non-experts with training, including an explanation of the segmentation standard, a supervised test with immediate feedback and explanations, and hands-on training to get used to the annotation software.

Supervision time was predicted via GP regression (cf. Section 4.1), using the segment length and mean confidence as input features. As before, the output variable was assumed subject to additive Gaussian noise with zero mean and 5 seconds variance. To obtain training data for these models, each participant annotated about 500 example instances, drawn from the adaptation corpus, grouped into segments and balanced regarding segment length and difficulty.

For utility modeling (cf. Section 4.3), we first normalized KyTea's confidence scores, which are given in terms of SVM margins, using a sigmoid function (Platt, 1999). The normalization parameter was selected so that the mean confidence on a development set corresponded to the actual classifier accuracy. We derive our measure of classifier improvement for correcting a segment by summing over one minus the calibrated confidence for each of its tokens.
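The following sketch illustrates one way such a one-parameter sigmoid normalization could be implemented; the grid search over the scale parameter, the synthetic margins, and the exact functional form are assumptions, since the paper only states that a sigmoid is used and that its parameter is chosen to match the mean development-set confidence to the classifier's accuracy.

```python
import numpy as np

def calibrate_margins(dev_margins, dev_accuracy):
    """Pick a sigmoid scale a so that mean(sigmoid(a * |margin|)) on the
    development set matches the classifier's measured accuracy, then return
    the resulting margin -> confidence mapping (a one-parameter Platt-style fit)."""
    dev_margins = np.abs(np.asarray(dev_margins))   # confidence in the predicted label
    best_a, best_gap = None, float("inf")
    for a in np.linspace(0.01, 10.0, 1000):         # coarse 1-d search is enough
        gap = abs(np.mean(1.0 / (1.0 + np.exp(-a * dev_margins))) - dev_accuracy)
        if gap < best_gap:
            best_a, best_gap = a, gap
    return lambda m: 1.0 / (1.0 + np.exp(-best_a * np.abs(np.asarray(m))))

# Toy margins and a toy development-set accuracy of 95%.
rng = np.random.default_rng(1)
margins = rng.exponential(scale=1.0, size=5000)
to_confidence = calibrate_margins(margins, dev_accuracy=0.95)

# Segment utility as in Sections 4.3 and 5.2.1: sum of (1 - calibrated confidence).
segment_margins = [0.2, 1.5, 0.05, 2.3]
print(np.sum(1.0 - to_confidence(segment_margins)))
```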
To analyze how well this measure describes the actual training utility, we trained KyTea using the background data plus disjoint groups of 100 in-domain instances with similar probabilities, and measured the achieved reduction of prediction errors. The correlation between each group's mean utility and the achieved error reduction was 0.87. Note that we ignore the decaying returns usually observed as more data is added to the training set. Also, we did not attempt to model user errors. Employing a constant base error rate, as in the transcription scenario, would change segment utilities only by a constant factor, without changing the resulting segmentation.

After creating the user models, we conducted the main experiment, in which each participant annotated data that was selected from a pool of 1000 in-domain sentences using two strategies. The first, baseline strategy was as proposed by Neubig et al. (2011). Queries are those instances with the lowest confidence scores. Each query is then extended to the left and right, until a word boundary is predicted. This strategy follows similar reasoning as was the premise to this paper: To decide whether or not a position in a text corresponds to a word boundary, the annotator has to acquire surrounding context information. This context acquisition is relatively time consuming, so he might as well label the surrounding instances with little additional effort. The second strategy was our proposed, more principled approach. Queries of both methods were shuffled to minimize bias due to learning effects. Finally, we trained KyTea using the results of both methods, and compared the achieved classifier improvement and supervision times.

5.2.2 User Study Results

Participant    Baseline           Proposed
               Time     Acc.      Time     Acc.
Expert         25:50    96.17     32:45    96.55
NonExp1        22:05    95.79     26:44    95.98
NonExp2        23:37    96.15     31:28    96.21
NonExp3        25:23    96.38     33:36    96.45

Table 2: Word segmentation task results, for our expert and 3 non-expert participants. For each participant, the resulting classifier accuracy [%] after supervision is shown, along with the time [min] they needed. The unsupervised accuracy was 95.14%.

Table 2 summarizes the results of our experiment. It shows that the annotations by each participant resulted in a better classifier for the proposed method than the baseline, but also took up considerably more time, a less clear improvement than for the transcription task. In fact, the total error for the time predictions was as high as 12.5% on average, where the baseline method tended to take less time than predicted, and the proposed method more time. This is in contrast to a much lower total error (within 1%) when cross-validating our user model training data. This is likely due to the fact that the data for training the user model was selected in a balanced manner, as opposed to selecting difficult examples, as our method is prone to do.
Thus, we may expect much better predictions when selecting user model training data that is more similar to the test case.

Plotting classifier accuracy over annotation time draws a clearer picture. Let us first analyze the results for the expert annotator. Figure 6 (E.1) shows that the proposed method resulted in consistently better results, indicating that time predictions were still effective. Note that this comparison may put the proposed method at a slight disadvantage by comparing intermediate results despite optimizing globally. For the non-experts, the improvement over the baseline is less consistent, as can be seen in Figure 6 (N.1) for one representative. According to our analysis, this can be explained by two factors: (1) The non-experts' annotation error (6.5% on average) was much higher than the expert's (2.7%), resulting in a somewhat irregular classifier learning curve. (2) The variance in annotation time per segment was consistently higher for the non-experts than the expert, indicated by an average per-segment prediction error of 71% vs. 58% relative to the mean actual value, respectively. Informally speaking, non-experts made more mistakes, and were more strongly influenced by the difficulty of a particular segment (which was higher on average with the proposed method, as indicated by a lower average confidence).⁹

[Figure 6: Classifier improvement over time (classifier accuracy over annotation time [min], proposed vs. baseline), depicted for the expert (E) and a non-expert (N). The graphs show numbers based on (1) actual annotations and user models as in Sections 4.1 and 4.3, (2) error-free annotations, (3) measured times replaced by predicted times, and (4) both reference annotations and replaced time predictions.]

⁹ Note that the non-expert in the figure annotated much faster than the expert, which explains the comparable classification result despite making more annotation errors. This is in contrast to the other non-experts, who were slower.

In Figures 6 (2-4) we present a simulation experiment in which we first pretend as if annotators made no mistakes, then as if they needed exactly as much time as predicted for each segment, and then both. This cheating experiment works in favor of the proposed method, especially for the non-expert. We may conclude that our segmentation approach is effective for the word segmentation task, but requires more accurate time predictions. Better user models will certainly help, although for the presented scenario our method may be most useful for an expert annotator.

5.3 Computational Efficiency

Since our segmentation algorithm does not guarantee polynomial runtime, computational efficiency was a concern, but did not turn out to be problematic. On a consumer laptop, the solver produced segmentations within a few seconds for a single document containing several thousand tokens, and within hours for corpora consisting of several dozen documents. Runtime increased roughly quadratically with respect to the number of segmented tokens. We feel that this is acceptable, considering that the time needed for human supervision will likely dominate the computation time, and reasonable approximations can be made as noted in Section 3.2.
6 Relation to Prior Work

Efficient supervision strategies have been studied across a variety of NLP-related research areas, and have received increasing attention in recent years. Examples include post editing for speech recognition (Sanchez-Cortina et al., 2012), interactive machine translation (González-Rubio et al., 2010), active learning for machine translation (Haffari et al., 2009; González-Rubio et al., 2011) and many other NLP tasks (Olsson, 2009), to name but a few studies.

It has also been recognized by the active learning community that correcting the most useful parts first is often not optimal in terms of efficiency, since these parts tend to be the most difficult to manually annotate (Settles et al., 2008). The authors advocate the use of a user model to predict the supervision effort, and select the instances with the best "bang-for-the-buck." This prediction of supervision effort was successful, and was further refined in other NLP-related studies (Tomanek et al., 2010; Specia, 2011; Cohn and Specia, 2013). Our approach to user modeling using GP regression is inspired by the latter.

Most studies on user models consider only supervision effort, while neglecting the accuracy of human annotations. The view of humans as a perfect oracle has been criticized (Donmez and Carbonell, 2008), since human errors are common and can negatively affect supervision utility. Research on human-computer interaction has identified the modeling of human errors as very difficult (Olson and Olson, 1990), depending on factors such as user experience, cognitive load, user interface design, and fatigue. Nevertheless, even the simple error model used in our post editing task was effective.

The active learning community has addressed the problem of balancing utility and cost in some more detail. The previously reported "bang-for-the-buck" approach is a very simple, greedy approach to combine both into one measure. A more theoretically founded scalar optimization objective is the net benefit (utility minus costs) as proposed by Vijayanarasimhan and Grauman (2009), but it is unfortunately restricted to applications where both can be expressed in terms of the same monetary unit. Vijayanarasimhan et al. (2010) and Donmez and Carbonell (2008) use a more practical approach that specifies a constrained optimization problem by allowing only a limited time budget for supervision. Our approach is a generalization thereof and allows either specifying an upper bound on the predicted cost, or a lower bound on the predicted utility.

The main novelty of our presented approach is the explicit modeling and selection of segments of various sizes, such that annotation efficiency is optimized according to the specified constraints. While some works (Sassano and Kurohashi, 2010; Neubig et al., 2011) have proposed using subsentential segments, we are not aware of any previous work that explicitly optimizes that segmentation.

7 Conclusion

We presented a method that can effectively choose a segmentation of a language corpus that optimizes supervision efficiency, considering not only the actual usefulness of each segment, but also the annotation cost. We reported noticeable improvements over strong baselines in two user studies. Future user experiments with more participants would be desirable to verify our observations, and allow further analysis of different factors such as annotator expertise. Also, future research may improve the user modeling, which will be beneficial for our method.
Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287658, Bridges Across the Language Divide (EU-BRIDGE).

References

Yuya Akita, Masato Mimura, and Tatsuya Kawahara. 2009. Automatic Transcription System for Meetings of the Japanese National Congress. In Interspeech, pages 84-87, Brighton, UK.

Trevor Cohn and Lucia Specia. 2013. Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation. In Association for Computational Linguistics Conference (ACL), Sofia, Bulgaria.

Pinar Donmez and Jaime Carbonell. 2008. Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles. In Conference on Information and Knowledge Management (CIKM), pages 619-628, Napa Valley, CA, USA.

Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2010. Balancing User Effort and Translation Error in Interactive Machine Translation Via Confidence Measures. In Association for Computational Linguistics Conference (ACL), Short Papers Track, pages 173-177, Uppsala, Sweden.

Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2011. An active learning scenario for interactive machine translation. In International Conference on Multimodal Interfaces (ICMI), pages 197-200, Alicante, Spain.

Gurobi Optimization. 2012. Gurobi Optimizer Reference Manual.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active Learning for Statistical Phrase-based Machine Translation. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies Conference (NAACL-HLT), pages 415-423, Boulder, CO, USA.

Stefan Irnich and Guy Desaulniers. 2005. Shortest Path Problems with Resource Constraints. In Column Generation, pages 33-65. Springer US.

Kikuo Maekawa. 2008. Balanced Corpus of Contemporary Written Japanese. In International Joint Conference on Natural Language Processing (IJCNLP), pages 101-102, Hyderabad, India.

R. Timothy Marler and Jasbir S. Arora. 2004. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, 26(6):369-395, April.

Evgeny Matusov, Arne Mauser, and Hermann Ney. 2006. Automatic Sentence Segmentation and Punctuation Prediction for Spoken Language Translation. In International Workshop on Spoken Language Translation (IWSLT), pages 158-165, Kyoto, Japan.

Hiroaki Nanjo, Yuya Akita, and Tatsuya Kawahara. 2006. Computer Assisted Speech Transcription System for Efficient Speech Archive. In Western Pacific Acoustics Conference (WESPAC), Seoul, Korea.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. In Association for Computational Linguistics: Human Language Technologies Conference (ACL-HLT), pages 529-533, Portland, OR, USA.

Atsunori Ogawa, Takaaki Hori, and Atsushi Nakamura. 2013. Discriminative Recognition Rate Estimation For N-Best List and Its Application To N-Best Rescoring. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6832-6836, Vancouver, Canada.

Judith Reitman Olson and Gary Olson. 1990. The Growth of Cognitive Modeling in Human-Computer Interaction Since GOMS. Human-Computer Interaction, 5(2):221-265, June.

Fredrik Olsson. 2009. A literature survey of active machine learning in the context of natural language processing. Technical report, SICS, Sweden.
David Pisinger. 1994. A Minimal Algorithm for the Multiple-Choice Knapsack Problem. European Journal of Operational Research, 83(2):394-410.

John C. Platt. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pages 61-74. MIT Press.

Carl E. Rasmussen and Christopher K.I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA.

Isaias Sanchez-Cortina, Nicolas Serrano, Alberto Sanchis, and Alfons Juan. 2012. A prototype for Interactive Speech Transcription Balancing Error and Supervision Effort. In International Conference on Intelligent User Interfaces (IUI), pages 325-326, Lisbon, Portugal.

Manabu Sassano and Sadao Kurohashi. 2010. Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing. In Association for Computational Linguistics Conference (ACL), pages 356-365, Uppsala, Sweden.

Burr Settles, Mark Craven, and Lewis Friedland. 2008. Active Learning with Real Annotation Costs. In Neural Information Processing Systems Conference (NIPS) - Workshop on Cost-Sensitive Learning, Lake Tahoe, NV, USA.

Burr Settles. 2008. An Analysis of Active Learning Strategies for Sequence Labeling Tasks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1070-1079, Honolulu, USA.

Hagen Soltau, Florian Metze, Christian Fügen, and Alex Waibel. 2001. A One-Pass Decoder Based on Polymorphic Linguistic Context Assignment. In Automatic Speech Recognition and Understanding Workshop (ASRU), pages 214-217, Madonna di Campiglio, Italy.

Lucia Specia. 2011. Exploiting Objective Annotations for Measuring Translation Post-editing Effort. In Conference of the European Association for Machine Translation (EAMT), pages 73-80, Nice, France.

Matthias Sperber, Graham Neubig, Christian Fügen, Satoshi Nakamura, and Alex Waibel. 2013. Efficient Speech Transcription Through Respeaking. In Interspeech, pages 1087-1091, Lyon, France.

Bernhard Suhm, Brad Myers, and Alex Waibel. 2001. Multimodal error correction for speech user interfaces. Transactions on Computer-Human Interaction, 8(1):60-98.

Evimaria Terzi and Panayiotis Tsaparas. 2006. Efficient algorithms for sequence segmentation. In SIAM Conference on Data Mining (SDM), Bethesda, MD, USA.

Katrin Tomanek and Udo Hahn. 2009. Semi-Supervised Active Learning for Sequence Labeling. In International Joint Conference on Natural Language Processing (IJCNLP), pages 1039-1047, Singapore.

Katrin Tomanek, Udo Hahn, and Steffen Lohmann. 2010. A Cognitive Cost Model of Annotations Based on Eye-Tracking Data. In Association for Computational Linguistics Conference (ACL), pages 1158-1167, Uppsala, Sweden.

Paolo Toth and Daniele Vigo. 2001. The Vehicle Routing Problem. Society for Industrial & Applied Mathematics (SIAM), Philadelphia.

Sudheendra Vijayanarasimhan and Kristen Grauman. 2009. What's It Going to Cost You?: Predicting Effort vs. Informativeness for Multi-Label Image Annotations. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2262-2269, Miami Beach, FL, USA.

Sudheendra Vijayanarasimhan, Prateek Jain, and Kristen Grauman. 2010. Far-sighted active learning on a budget for image and video recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3035-3042, San Francisco, CA, USA, June.