Staff-line removal with Selectional Auto-Encoders

Antonio-Javier Gallego, Jorge Calvo-Zaragoza*
Department of Software and Computing Systems, University of Alicante, Carretera San Vicente del Raspeig s/n, 03690 Alicante, Spain

*Corresponding author: Tel.: +349-65-903772; Fax: +349-65-909326. Email addresses: jgallego@dlsi.ua.es (Antonio-Javier Gallego), jcalvo@dlsi.ua.es (Jorge Calvo-Zaragoza)

This is a preprint version (March 7, 2018) of the article published in Expert Systems with Applications, 2017, 89: 138-148. doi:10.1016/j.eswa.2017.07.002

Abstract

Staff-line removal is an important preprocessing stage in most Optical Music Recognition systems. The procedures commonly employed to carry out this task are based on image processing techniques. In contrast to these traditional methods, which rely on hand-engineered transformations, the problem can also be approached from a machine learning point of view if representative examples of the task are provided. We propose to do so through a new kind of auto-encoder, which selects the appropriate features of an input feature set (Selectional Auto-Encoder). Within the context of the problem at hand, the model is trained to select those pixels of a given image that belong to a musical symbol, thus removing the lines of the staves. Our results show that the proposed technique is quite competitive and significantly outperforms the other state-of-the-art strategies considered, particularly when dealing with grayscale input images.

Keywords: Staff-line removal, Optical Music Recognition, Auto-encoders, Convolutional Networks

1. Introduction

Music is an important vehicle for cultural transmission and a key element in understanding the social, cultural and artistic trends of each period of history. A large number of musical documents have, therefore, been carefully preserved over the centuries and are scattered throughout cathedrals, libraries and historical archives (Fujinaga et al., 2014).

A significant effort has been made in recent decades to digitize these documents by means of scanners for their storage and distribution (Choudhury et al., 2000). However, in order to make the music contained in the documents truly accessible, the images must be transcribed to a structured digital format that makes it possible to encode the content of the document (notes, musical symbols, tonality, etc.). This also makes it possible to perform other interesting tasks, such as large-scale computational music analysis, search and retrieval by content, or transcription between different musical notations (Hankinson et al., 2012).

The process of converting a scanned document into a structured digital format can be carried out manually by a user. The disadvantage is that it involves costs in terms of both resources and time. In addition, this process is especially tedious—because of the burdensome software for score edition—and very prone to introducing errors.

The research field known as Optical Music Recognition (OMR), which focuses on detecting and storing the musical content of a score from a scanned image (Raphael & Wang, 2011), has therefore been postulated as an important alternative that mitigates the aforementioned disadvantages of manual transcription.
The objective of the OMR process is to import a scanned musical score and export its musical content to a machine-readable format (Fig. 1), typically MusicXML or MEI. The OMR task is similar to the recognition of text, typically known as Optical Character Recognition. However, unlike the text scenario, in which words are analyzed sequentially with a single element to identify at each time instant, musical notation is considerably more difficult to recognize. This is principally owing to the possible existence of simultaneous matching elements, as in the case of polyphonic pieces with multiple notes that sound at the same time, thus resulting in several musical symbols being placed on the same interval. But it is also because of the presence of marks of expression, dynamics, articulations or even text to be sung in works with a vocal presence, among others.

Figure 1: The task of Optical Music Recognition (OMR) is to analyze an image containing a musical score in order to export its musical content to a machine-readable format. (a) Example of input piece for an OMR system; (b) symbolic representation of the piece.

Most current systems employ segmentation and classification approaches (Rebelo et al., 2010; Wen et al., 2015). The first important obstacle that the OMR process must overcome is, therefore, the staff (or pentagram) lines: the set of five parallel lines on which musical symbols are located depending on their pitch. The staff-line removal stage is usually performed after the binarization of the document in the OMR workflow (Rebelo & Cardoso, 2013). This binarization step helps to reduce the complexity of the problem, and it is advisable to apply strategies based on histogram analysis, connected components, or morphological operators.

Despite being necessary for musical readability, staff lines complicate the automatic detection and segmentation of symbols. Some specific works have taken advantage of particular features of printed and/or ancient notation to approach the problem while maintaining the staff lines (Ramirez & Ohya, 2014; Calvo-Zaragoza et al., 2015); however, the established OMR pipeline includes their detection and removal (Rebelo et al., 2012). This process must remove the staff lines while maintaining as much of the symbol information as possible (Fig. 2).

Figure 2: Example of a perfect staff-line removal process. (a) Example of input score for an OMR system; (b) input score after staff-line removal.

This paper proposes a framework with which to remove staff lines that is based on machine learning; that is, labeled examples can be used to train a model to perform the task. This allows the same approach to be used on a wide range of scores. We make use of a new type of auto-encoder, which is trained to select only those pieces of the image that belong to musical symbols.

The remainder of the paper is structured as follows: Section 2 presents the background to staff detection and removal; Section 3 describes our approach with which to model the process; Section 4 contains the experimentation performed and the results obtained; and finally, the current work concludes in Section 5.

2. Background

This section presents the background to our approach. First, methods proposed for staff-line removal are introduced, after which the principles of auto-encoders are briefly described, given their importance as regards the present work.

2.1. Staff-line detection and removal

Staff-line removal has been an active research field for many years.
A comprehensive review and comparison of the first attempts considered for this task can be consulted in the work of Dalitz et al. (2008), who divided the staff-line removal strategies proposed until then into four categories: the Line Tracking (Randriamahefa et al., 1993; Bainbridge & Bell, 1997), Vector Field (Martin & Bellissant, 1991; Roach & Tatem, 1988), Runlength (Carter & Bacon, 1992; Fujinaga, 2005) and Skeleton (Ng, 2001) methods. However, given the interest in the task, many other methods have been proposed more recently.

Dos Santos Cardoso et al. (2009) proposed a method that considers the staff lines to be connecting paths between the two margins of the score. The score is then modeled as a graph so that staff detection is solved as a maximization problem. This strategy was later improved and extended for use on grayscale scores (Rebelo & Cardoso, 2013).

Dutta et al. (2010) developed a method that considers a staff-line segment to be a horizontal connection of vertical black runs with a uniform height. These segments are validated using neighboring properties before being removed.

Su et al. (2012) started by estimating the properties of the staves, such as height and space. An attempt is then made to predict the direction of the lines and fit an approximate staff, which is subsequently adjusted.

Géraud (2014) developed a method that entails a series of morphological operators: first, a permissive hit-or-miss with a horizontal line pattern, followed by a horizontal median filter and a dilation operation. A binary mask is then obtained with a morphological closing. Finally, a vertical median filter is applied to the largest components of the mask. The procedure is applied directly to the image, which eventually removes the staff lines.

Notwithstanding all the efforts made, the staff-removal stage is still inaccurate and often produces noise—staff lines that are not completely removed. Although it is possible to use more aggressive methods that minimize this noise, they may cause the partial or total loss of some musical symbols. The trade-off between these two aspects, in addition to the accuracy of the techniques, has hitherto led to the inevitable production of errors during this stage.

Moreover, the differences among score styles, sheet conditions and scanning processes have led researchers to develop ad-hoc methods for staff-line detection and removal. This results in methods that are not robust when applied to different types of document (from different eras, or with different notations or styles). In modern notation, a staff is composed of five black parallel lines on a white background. However, that is not always true in old music notation because (i) staff lines may appear in different ink colors, sometimes closer to the background color than to the symbol color, (ii) handwritten staff lines are not totally straight, (iii) the thickness of the lines is irregular because of quill leakage, and (iv) the staff may have fewer than five lines. In addition, lyrics in modern scores are always far enough from the staff, whereas there is much overlap between music and lyrics in old notation, which could also hinder the staff-line removal process.
Therefore, many of the assumptions made for staff-line removal in modern music are not always fulfilled in other eras, which makes it extremely difficult to develop methods that are able to work on any kind of score with an acceptable accuracy.

Here we introduce a new and more generalized framework based on machine learning that can be applied to a wide variety of musical notation styles and musical documents. The main advantage of using machine learning lies in its ability to generalize when compared with hand-crafted systems. In this respect, a machine learning strategy for staff-line removal has recently been proposed, which consists of training a classifier that discriminates between whether a given pixel belongs to a symbol or to a staff line (Calvo-Zaragoza et al., 2016). The foreground pixels of the image are queried so that those classified as staff are removed. Nevertheless, it has two important disadvantages, the first being that this strategy does not improve on the performance of other state-of-the-art algorithms for staff-line removal. The second is its high computational cost, which results from having to classify each pixel in the image.

The aim of this work is to alleviate the aforementioned issues by using specialized auto-encoders. The following section presents a brief introduction to auto-encoders and their related extensions in order to provide the reader with a better understanding of the proposed framework.

2.2. Auto-encoders

Auto-encoders are feed-forward neural networks for which the input and output must be exactly the same. The network typically consists of two stages that learn the functions f and g, which are called the encoder and decoder functions, respectively.

Formally speaking, given an input x, the network must minimize the divergence L(x, g(f(x))). The hidden layers of the encoder perform a mapping of the input—usually decreasing its dimension—until an intermediate representation is attained. The same input is then subsequently recovered by means of the hidden layers of the decoder function.

An auto-encoder might initially appear to be useless because it is trained to learn the identity function. Nevertheless, the encoder function f is typically forced to produce a representation with a lower dimensionality than the input. The encoder function therefore provides a meaningful compact representation of the input, which might be of great interest as regards feature learning or dimensionality reduction (Wang et al., 2014). The idea of auto-encoders was proposed decades ago (Hinton & Zemel, 1994), and it has since been an active research field (Deng et al., 2010; Lauly et al., 2014). As auto-encoders are feed-forward networks, they can be trained by using conventional optimization algorithms such as gradient descent.

In some applications, it is assumed that an input x̂ is received after passing through a noisy process that corrupted the actual input x. In this case, a denoising auto-encoder might be trained to minimize the divergence L(x, g(f(x̂))). The network, therefore, not only focuses on copying the input but also on removing the noise (Vincent et al., 2010; Bengio et al., 2013). In the context of staff-line removal, we could have formulated the problem by assuming that these lines are the result of noise, and a denoising auto-encoder could therefore have been trained to remove them.
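For illustration, a minimal auto-encoder of this kind could be written as follows. This is a hedged sketch using the Keras API; the layer sizes and the input dimensionality d are assumptions chosen for clarity, not the architecture used in this paper (which is described in Section 3).

```python
# Minimal auto-encoder sketch (illustrative only; sizes are assumptions).
from tensorflow.keras import layers, models

d = 784  # hypothetical input dimensionality, e.g., a flattened 28x28 patch

inp = layers.Input(shape=(d,))
code = layers.Dense(64, activation='relu')(inp)    # encoder f: compress the input
out = layers.Dense(d, activation='sigmoid')(code)  # decoder g: reconstruct it
autoencoder = models.Model(inp, out)

# Training minimizes the divergence L(x, g(f(x))) between input and output.
autoencoder.compile(optimizer='sgd', loss='binary_crossentropy')
# autoencoder.fit(x, x, ...)        # plain auto-encoder: the target equals the input
# autoencoder.fit(x_noisy, x, ...)  # denoising variant: corrupted input, clean target
```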
In this paper, however, we go one step further and propose a new type of auto-encoder especially designed for the problem at hand. In this case, we wish to learn neither the identity function nor an underlying error, but rather a codification that maintains only those input features that are relevant. That is, the output of our auto-encoder is a codification with the same dimension as the input that indicates which original features must be maintained. Note that, in the staff-line removal formulation, the relevant features will be those pixels of the image that depict symbols, while those containing staff lines must be discarded.

3. Staff-line removal with Selectional Auto-Encoders

As mentioned above, the traditional formulation of the staff-line removal task considers a binary image as input; that is, a binarization process is assumed to be performed before this step. The binary nature of modern musical scores (black ink on white paper) has, to some extent, justified this pipeline. However, it should be borne in mind that document binarization is not a trivial question—especially when dealing with ancient documents (Ntirogiannis et al., 2014). Furthermore, the staff-line removal stage is actually a process that attempts to leave only the musical symbols in the image. From this point of view, our proposal focuses on selecting the pixels of the input image that correspond to musical symbols, regardless of the nature of the input (see Fig. 3), thus leading to a more generalizable approach that can be applied to binary, grayscale or color images.

Figure 3: Comparison of traditional and proposed approaches for isolating symbols in musical score images. (a) Traditional approach: input score image, binarization, staff-line removal, symbol-only score image; (b) proposed approach: input score image, pixel selection, symbol-only score image.

This is done by making use of a new type of auto-encoder. As mentioned above, the model is trained to select which features of the input layer are relevant for the task at hand (that is, the pixels that belong to music symbols). From here on, we shall refer to this model as the Selectional Auto-Encoder (SAE). The SAE is trained to perform the function s : R^{w×h} → [0, 1]^{w×h}. In other words, it learns a binary map over a w × h image that preserves the input shape. Following the idea of auto-encoders, however, the function is further divided into encoding and decoding stages.

The topology of an SAE can be quite varied. However, we have restricted ourselves to considering convolutional models. Convolutional models have been applied with great success to the detection, segmentation and recognition of objects and regions in images, and have even come close to human performance in some of these tasks (LeCun et al., 2015). They can take advantage of local connections, shared weights, pooling and the use of many connected layers.

The hierarchy of layers of our SAE consists of a series of convolutional plus pooling layers, until an intermediate layer in which a meaningful representation of the input is attained. As these layers are applied, filters are able to relate parts of the image that were initially far apart. This is followed by a series of convolutional plus upsampling layers that reconstruct the image up to the same input size.
The last layer consists of a set of neurons with sigmoid activation that predict a value in the range [0, 1], depending on the selection level for the corresponding input feature. The scheme of our hierarchy is illustrated in Fig. 4. The specific configuration, along with a suitable size of the input/output layer, has been adjusted by means of a grid search over the network configuration hyper-parameters, as will be detailed in the next section. As a preliminary proof-of-concept test, we also carried out some research with non-convolutional models, which proved to be less suitable for the task at hand.

Figure 4: General overview of an SAE used for staff-line removal: a series of 5×5×96 convolution plus pooling layers, followed by 5×5×96 convolution plus up-sampling layers. The output layer consists of the activation level assigned to each input feature (white signifies activated).

Since an SAE is a type of feed-forward network, the training stage is carried out by means of back-propagation, considering the cross-entropy loss function between each output activation and its expected activation. Let n be the number of training examples, and d the dimensionality of the input (and output). Let us denote the activation of the jth neuron of the last layer for the example i as a_{ij} and its desired activation as y_{ij}. The loss L for the training set can therefore be computed as

$$L = -\frac{1}{nd} \sum_{i=1}^{n} \sum_{j=1}^{d} \left[ y_{ij} \ln a_{ij} + (1 - y_{ij}) \ln (1 - a_{ij}) \right]$$

The learning of the network parameters is performed by means of stochastic gradient descent (Bottou, 2010) with a mini-batch size of 8 samples, considering the adaptive learning rate proposed by Zeiler (2012). This training stage consists of providing the SAE with examples of images and their corresponding ground-truth, that is, binary maps over the pixels that belong to musical symbols (see Fig. 4).

Once the SAE has been properly trained, removing the staff lines from the image of a musical score consists of querying the image patch by patch. Each patch is passed through the SAE, which outputs the selection level assigned to each input pixel. Those pixels whose selection value exceeds a certain threshold are considered to belong to a musical symbol, whereas the others are discarded. Figure 5 illustrates this process.

Figure 5: Example of the staff-line removal task using an SAE. The input image is parsed patch by patch by means of the network. A selection value is predicted for each pixel in a patch (shown here as grayscale levels). Finally, a thresholding is applied in order to select the pixels that will eventually be maintained. (a) Score cut into patches of fixed size; (b) selection values obtained; (c) score after thresholding.
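As a reference, the following sketch shows how the SAE and the patch-wise removal procedure just described could be implemented with the Keras API. It is a minimal re-implementation under the configuration selected later in Section 4.1 (convolution-plus-pooling stages of 96 filters with 5 × 5 kernels, mirrored by convolution-plus-upsampling stages), not the authors' exact code; their reference implementation is available in the repository cited in the next section.

```python
# Sketch of the SAE and the patch-wise staff-line removal (Keras API).
# Depth, filters and kernel size follow the configuration selected in
# Section 4.1; this is a minimal re-implementation, not the authors' code.
import numpy as np
from tensorflow.keras import layers, models

def build_sae(patch=256, stages=3, filters=96, kernel=5):
    inp = layers.Input(shape=(patch, patch, 1))
    x = inp
    for _ in range(stages):  # encoder: convolution + max-pooling
        x = layers.Conv2D(filters, kernel, activation='relu', padding='same')(x)
        x = layers.MaxPooling2D(2)(x)
    for _ in range(stages):  # decoder: convolution + up-sampling
        x = layers.Conv2D(filters, kernel, activation='relu', padding='same')(x)
        x = layers.UpSampling2D(2)(x)
    # one sigmoid activation per pixel: the selection level in [0, 1]
    out = layers.Conv2D(1, kernel, activation='sigmoid', padding='same')(x)
    model = models.Model(inp, out)
    # per-pixel cross-entropy loss, Adadelta adaptive learning rate
    model.compile(optimizer='adadelta', loss='binary_crossentropy')
    return model

def remove_staff_lines(model, image, patch=256, threshold=0.3):
    """Query the score patch by patch and keep only those pixels whose
    selection value exceeds the threshold (border remainders omitted)."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = image[y:y+patch, x:x+patch].astype('float32') / 255.0
            pred = model.predict(block.reshape(1, patch, patch, 1))[0, :, :, 0]
            mask[y:y+patch, x:x+patch] = pred > threshold
    return mask  # 1 = pixel kept as musical symbol, 0 = discarded
```

Training then amounts to calling model.fit on pairs of patches and their ground-truth binary maps with a mini-batch size of 8, as specified above.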
4. Experiments

This section presents the experiments carried out to evaluate the goodness of our proposal (for the sake of reproducible research, the code of the experiments is available at http://github.com/ajgallego/staff-lines-removal under the conditions of the GNU General Public License version 3).

We took advantage of the ICDAR/GREC 2013 Competition on Music Scores (Fornés et al., 2013) staff-line removal contest by making use of the same dataset to allow reproducible research and future comparisons with our results. This corpus contains trios consisting of a grayscale image of a score and its corresponding binarized versions with and without staff lines (see Fig. 6). This dataset therefore provides readily-available data with which to train the model, along with testing data for evaluation. The corpus is organized into train and test sets, containing 4 000 and 2 000 samples, respectively.

Figure 6: Example of a sample from the GREC/ICDAR 2013 Staff-Line Removal Competition dataset. (a) Grayscale score; (b) binary score; (c) ground-truth.

In our case, the train set is further subdivided into training and validation partitions, comprising 80% and 20% of the total set of available training scores, respectively. The training data is used to optimize the network weights by means of gradient descent during a maximum of 200 epochs, with a mini-batch size of 8 samples. The validation data is used to monitor the training process and prevent over-fitting: training is stopped if the validation performance does not increase during 5 epochs.

In order to evaluate the results, the F-measure (F-m) metric will be considered, following the guidelines of the contest to select the best method. Let TP be the number of true positives (symbol pixels correctly classified), FP the number of false positives (pixels incorrectly classified as symbol) and FN the number of false negatives (symbol pixels incorrectly discarded as staff line). Therefore,

$$\text{F-m} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP}$$

We first present the results obtained with the different topologies proposed for the SAE, with a study of the influence of certain hyper-parameters. We then compare our proposal with other methods for solving the same task that participated in the latest contest. We also include further results concerning related issues, such as the robustness of the model with regard to the threshold. Before carrying out that analysis, we should state that the results shown below are obtained by assuming the best threshold for each case: 0.3 for binary images and 0.1 for grayscale. The amount of data required to train the model successfully and the representational power of the model are also analyzed.

4.1. Hyper-parameter selection

This section presents the results obtained with different topologies of the SAE, namely different numbers of layers and input sizes. In order to reduce the search space, we have restricted ourselves to: i) considering symmetric models, that is, those with the same number of encoding and decoding layers (from 1 to 3 for each stage); ii) considering only square input images with sides of 64, 128, 256, 384 and 512 pixels.

There are also a number of parameters to be tuned, such as the kernel size of the convolutions, the number of filters per layer and the batch size selected during training. We have performed comprehensive experimentation in order to tune these parameters by means of a grid search over the following values (a sketch of this search is given after the list):

• The number of filters per convolutional layer: 16, 32, 64, 96
• The kernel size of each convolutional layer: 3, 5, 7, 9, 11
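A hedged sketch of this search is shown below; build_sae is the constructor sketched in Section 3, while the data variables and the evaluate_fm helper (computing the validation F-m) are placeholders, not part of the paper's code.

```python
# Grid search over the SAE hyper-parameters (sketch; data loading and the
# F-m evaluation helper are placeholders).
from itertools import product
from tensorflow.keras.callbacks import EarlyStopping

best_config, best_fm = None, 0.0
for stages, size, filters, kernel in product(
        [1, 2, 3],                 # encoding/decoding layers per stage
        [64, 128, 256, 384, 512],  # square input window sizes
        [16, 32, 64, 96],          # filters per convolutional layer
        [3, 5, 7, 9, 11]):         # convolution kernel sizes
    model = build_sae(size, stages, filters, kernel)
    model.fit(x_train, y_train, batch_size=8, epochs=200,
              validation_data=(x_val, y_val),
              callbacks=[EarlyStopping(patience=5)])  # stop after 5 stalled epochs
    fm = evaluate_fm(model, x_val, y_val)             # hypothetical F-m helper
    if fm > best_fm:
        best_config, best_fm = (stages, size, filters, kernel), fm
```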
Since the number of configurations is huge, this first experiment focuses on finding the optimal hyper-parameters only for the binary format of the dataset. The idea of performing this hyper-parameter tuning only with binary images is twofold: on the one hand, it is the case that has traditionally been studied and, on the other hand, we do not want the grayscale results, the more complex scenario, to depend on an excessive tuning of the model. Our premise is that the goodness of our work lies in the proposed approach, and not in selecting the optimum topology for every case. In addition, claiming that a topology is the optimal one would entail a very exhaustive search, and so we believe that exploiting that avenue is not of actual interest in this work. We therefore assume that the best configuration for this binary case will also achieve competitive results in any other domain.

Table 1 shows the best result attained by each SAE configuration on the validation set. For the sake of readability, it merely reports the best configuration as regards the number of filters and the kernel size of the convolutional layers obtained for each combination of number of layers and input size.

Table 1: F-m (%) attained by the different numbers of layers considered (rows) in combination with different values of input window size (columns) for binary images. The best result in each row is marked with an asterisk.

                              Window size
# layers    64      128     256     384     512     Average
1           96.99   97.08   97.21   98.40*  96.94   97.32
2           97.73   98.70*  97.79   97.73   97.79   97.95
3           97.80   97.84   99.13*  97.81   97.61   98.01

As an initial remark, we should state that our approach is relatively robust to the different configurations, since all of the results are very accurate. With regard to the window size, the results do not yield any magic number, but the best figures appear to indicate that neither smaller nor bigger window sizes would significantly improve the results. On the contrary, the number of layers has a higher impact on the performance, leading to a variation of up to 2 points in the F-m attained.

The model with 3 layers and 256 × 256 input patches performed best in this experimentation. Specifically, the complete configuration that reported the best accuracy comprises 96 filters per layer and kernel sizes of 5. A technical description is given in Table 2. In the following, we shall assume that this configuration is representative of our proposal for both binary and grayscale images.

Table 2: Detailed description of the selected SAE architecture. Conv(f,h,w,a) stands for a convolution operator of f filters, with h × w pixel kernels and an a activation function; MaxPool(h,w) stands for the max-pooling operator with an h × w kernel and stride; UpSamp(h,w) denotes an up-sampling operator of h rows and w columns; ReLU and Sigmoid denote Rectifier Linear Unit and Sigmoid activations, respectively.

Input                 Encoding             Decoding
                      Conv(96,5,5,ReLU)    Conv(96,5,5,ReLU)
                      MaxPool(2,2)         UpSamp(2,2)
                      Conv(96,5,5,ReLU)    Conv(96,5,5,ReLU)
                      MaxPool(2,2)         UpSamp(2,2)
[0, 255]^{256×256}    Conv(96,5,5,ReLU)    Conv(96,5,5,ReLU)      Output
                      MaxPool(2,2)         UpSamp(2,2)            [0, 1]^{256×256}
                      Conv(96,5,5,ReLU)    Conv(96,5,5,ReLU)
                      MaxPool(2,2)         UpSamp(2,2)
                                           Conv(1,5,5,Sigmoid)

4.2. Comparison with state-of-the-art

Taking advantage of the aforementioned contest, we tested our method against state-of-the-art staff-line removal strategies. Since the conditions and participants are very different, the binary and grayscale formats of the contest are analyzed separately.

The test set provided is further divided into three subsets (TS1, TS2, and TS3) in order to measure the robustness of the participants with regard to the deformations applied to the scores: 3D distortions in TS1 (three types of distortion with 166, 167, and 167 samples; 500 scores in total), local noise in TS2 (two types of distortion with 250 samples each; 500 scores in total), and both 3D distortion and local noise in TS3 (6 specific distortions, equally distributed, resulting from the combination of the previous ones; 1 000 scores in total).

Table 3 shows the results obtained by the participants in the contest for the binary case. The main idea of each method was described in Sect. 2—except for TAU, which was a method specifically designed to participate in the contest.
Those readers who require more detailed information about the participants are referred to the competition report (Visaniy et al., 2013), as some of the aforementioned strategies were slightly tuned for the contest. Moreover, we include the work of Calvo-Zaragoza et al. (2016) (Pixel), since their results were obtained under the same conditions as the contest. The figures for our approach are those obtained by the selected configuration, based on the results presented in the previous section.

Table 3: F-m (%) comparison of the participants in the ICDAR/GREC 2013 staff removal contest (binary format) and our approach based on the SAE. The column Whole refers to the weighted average obtained for the three test sets.

                                           TS1     TS2     TS3     Whole
TAU (Visaniy et al., 2013)                 85.72   81.72   82.29   83.01
NUS (Su et al., 2012)                      69.85   96.25   67.43   75.24
NUASi-lin (Dalitz et al., 2008)            94.99   94.86   94.00   94.29
NUASi-skel (Dalitz et al., 2008)           94.25   93.80   92.92   93.34
LRDE (Géraud, 2014)                        97.73   96.86   96.98   97.14
INESC (Dos Santos Cardoso et al., 2009)    89.29   97.72   88.52   91.01
Pixel (Calvo-Zaragoza et al., 2016)        94.10   98.11   94.00   95.04
Our approach                               99.11   99.36   99.03   99.13

As can be seen, most participants are able to achieve good performance figures, some of them really close to the optimum. However, our approach improves on all the previously obtained results in all the cases considered. This improvement is especially remarkable in TS3, where the distortions are more aggressive. The comparison with the Pixel method is also illustrative of the goodness of our proposal, since it demonstrates that the performance is not achieved merely by using a supervised learning scheme (Pixel also does so) but because of the adequacy of the proposed model. Taking the test corpora as a whole, our approach improves by up to 2 points on the best result achieved so far.

As mentioned previously, the dataset provided in the contest also contains a grayscale version of the scores. Our approach can easily be extended to deal with grayscale images with no further effort than changing the training data. In this case, only two of the methods submitted to the contest dealt with grayscale images: LRDE and INESC. Table 4 shows the results obtained by these participants compared to those obtained by our SAE, again including the Pixel method.

It is clear that the participants' performance decreases remarkably, particularly as regards the INESC method. LRDE maintains a fair accuracy in TS1, but its performance is much worse in TS2 and TS3. The Pixel method behaves more robustly, achieving similar figures regardless of the distortions applied to the images.
Table 4: F-m (%) comparison of the participants in the ICDAR/GREC 2013 staff removal contest (grayscale format) and our approach based on the SAE. The column Whole refers to the weighted average obtained for the three test sets.

                                           TS1     TS2     TS3     Whole
LRDE (Géraud, 2014)                        92.17   79.47   79.88   82.85
INESC (Dos Santos Cardoso et al., 2009)    38.50   52.11   38.87   42.09
Pixel (Calvo-Zaragoza et al., 2016)        92.56   88.84   89.76   90.24
Our approach                               99.14   99.34   98.94   99.09

Despite there being more room for improvement in this format, our method achieves a performance far superior to that of the participants. It also undergoes a certain drop in accuracy with grayscale images, but the results are still very accurate, clearly outperforming the other methods. In this case, our approach improves the state-of-the-art figure by up to almost 10 points.

The results reported above only reflect the average performance of the methods. In order to minimize the possibility that the differences are due to chance variation, we performed a pairwise, non-parametric Wilcoxon signed-rank test (Demsar, 2006). We considered the 11 independent results (one per specific distortion applied) to perform these tests. This resulted in p-values below 0.01 in all pairwise comparisons between our approach and the other methods, in both binary and grayscale formats. Our approach therefore proves to outperform the rest of the configurations with a confidence level of 99%.
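As an illustration of this test, the following sketch applies scipy's paired Wilcoxon signed-rank test; the eleven F-m values are invented placeholders, not the paper's measurements.

```python
# Paired Wilcoxon signed-rank test over per-distortion results (sketch).
from scipy.stats import wilcoxon

# Eleven F-m values, one per specific distortion (placeholder numbers).
ours  = [99.1, 99.2, 99.0, 99.4, 99.3, 99.1, 98.9, 99.0, 99.2, 99.1, 99.0]
other = [97.7, 97.5, 97.9, 96.9, 96.8, 97.0, 96.8, 97.1, 97.2, 96.9, 97.0]

stat, p = wilcoxon(ours, other)  # non-parametric, paired
print(p < 0.01)                  # True: significant at the 99% confidence level
```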
4.3. Analysis

The objective of this section is to analyze some of the characteristics of the SAE for the task of staff-line removal.

The quantitative results obtained by our method are very close to the optimum, especially for binarized images. It is thus interesting to look into the actual impact of the remaining room for improvement. Figure 7 shows a qualitative example of the performance obtained by the SAE for a piece of an image in both binary and grayscale formats. These examples depict an F-m of 99.06% and 98.17%, respectively, and are therefore good representatives of the results reported in the previous section. As can be observed, the differences between the outputs and the ground-truth are hardly perceptible. The mistakes might be related to pixels very close to the boundaries of symbols, which in any case does not seem to make a noticeable difference to the results.

Figure 7: Example of staff-line removal using the proposed SAE configuration for both binary and grayscale formats, depicting an F-m of 99.06% and 98.17%, respectively. (a) Binary format; (b) grayscale format; (c) ground-truth; (d) selection for the binary format; (e) selection for the grayscale format.

At this point, it would be interesting to measure the impact of the accuracy of this staff-line removal on a functional OMR system, with respect to the accuracy obtained by considering previous works. Note that in musical notation it is important to be as precise as possible in this process: regions of FP (staff-line segments maintained) may involve the incorrect detection of small musical symbols such as the dot, or change the meaning of some notes (a quarter note confused with an eighth note); similarly, regions of FN (symbol information removed) may make the system mislabel a quarter note as a half note. Taking into account that music notation must fulfill strong grammatical constraints, these hypothetical isolated mistakes may cause many errors in other regions of the score. Unfortunately, there are no reproducible strategies for the complete OMR workflow, and so carrying out this experiment would imply a study beyond the scope of this paper. However, we leave this issue for future work.

Furthermore, having demonstrated the accuracy of the approach, it is important to stress its principal differences with respect to conventional methods for staff-line removal. As stated above, one interesting property is that the SAE follows a supervised learning approach, so extending it to deal with any type of input image is feasible. This applies not only in the case of a different nature of image—such as RGB images—but also in that of different types of score features, different notational styles, heterogeneous document conditions and so on. It is true, however, that the approach cannot be immediately applied to any domain, but our claim is that it is much easier to set up the SAE environment (basically, training data and some parameter tweaking) than to develop a completely new staff-line removal strategy.

The Pixel method is also based on supervised learning, but there are two important differences. The first is that our method achieves significantly better results, while the second, which is indeed the most relevant from a practical point of view, is that the Pixel method is computationally expensive, since it has to classify every single pixel in the image. In this respect, once the SAE has been trained, the computation time needed to process a score is in the order of seconds, while that of the Pixel method is in the order of hours.

In the following sections, we delve into interesting aspects of the proposed model. We first study the influence of the threshold used to convert SAE predictions into the binary selectional output. An incremental learning scenario is also considered in order to check the number of samples needed for the model to learn the task. Finally, we show the representational capabilities of the SAE by analyzing the intermediate codification of the input patch.

4.3.1. Threshold

As stated in Section 3, the SAE predicts a value for each input feature, which can be understood as the level of selection. In this respect, for the results shown above it was assumed that a threshold equal to 0.3 (binary images) and 0.1 (grayscale images) would discriminate between whether or not the feature was eventually activated. Here we report the reason why these values were chosen and the real impact of other configurations.

Figure 8 shows the F-m obtained for different threshold values—within the range [0, 1]—considering the binary and grayscale formats of the images. The values obtained by the best state-of-the-art algorithms have also been included for the sake of comparison.

Figure 8: Influence of the threshold parameter on the performance of the SAE, for (a) the binary format and (b) the grayscale format. The performance of state-of-the-art strategies (LRDE and Pixel) is also included for a better comparison.

It can be observed that, although the thresholds considered are effectively those that obtain the best performance, any other threshold would still have outperformed the state-of-the-art strategies by a fair margin. These figures reinforce the robustness of the activations predicted by the SAE: these values tend to be close to either 0 or 1, thereby decreasing the importance of the threshold.

4.3.2. Training set size

We shall now focus on assessing the impact of the amount of data used to train the model on the performance figures.
To this end, an additional experiment was performed in which the number of training scores was iteratively increased. The results of this study are shown in Fig. 9.

Figure 9: Performance of the SAE with regard to the number of scores used for training, for (a) the binary format and (b) the grayscale format. The performance of state-of-the-art strategies (LRDE and Pixel) is also included for a better comparison.

In the case of binary images, in which the variability is much lower, the SAE is able to perform very competitively with a relatively small number of scores. The model outperforms the state-of-the-art result with 50 scores. On the contrary, the case of grayscale images is more complex, and more samples are needed to reach stable results. Note, however, that the state-of-the-art is reached with a small number of training scores.

Furthermore, it should be emphasized that, in both cases, the curve shows an upward tendency when reaching the total number of training scores. It seems that a greater number of training samples could still improve the results obtained, especially in the case of grayscale.

4.3.3. Intermediate representation

As a final analysis, this section studies the type of intermediate representation that the SAE encodes. In the case of the model that we have determined to be the most appropriate for this task (see Section 4.1), the intermediate representation comprises 96 × 32 × 32 features. In other words, it codifies 96 images of 32 × 32.

A full example of an intermediate representation can be seen in Fig. 10. In our grid search, which was carried out to determine the best model parameters, the number of intermediate codes did not lead to an excessive variation in the performance figures, which is why many of the intermediate representations are quite similar: there is a high redundancy of information in the case of 96 codes, signifying that similar results could be obtained with fewer of them. The actual utility of all the codes might only arise in the most complex cases.

Figure 10: Illustration of the 96 intermediate representations for the depicted input and output patches. (a) Input patch; (b) intermediate representations.

Furthermore, in order to compare the difference between binary and grayscale input images, Table 5 shows examples of the intermediate codifications obtained from both formats (enlarged for visualization). It is interesting to note that many of these intermediate representations are also quite similar, regardless of the type of input image. This could be evidence that the SAE is actually learning a good internal representation with which to perform this task, discarding the information that refers to the specific characteristics of the input image.

Table 5: Example of the intermediate representation of the SAE (input patch, intermediate representations enlarged for visualization, and output patch).
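For reference, this intermediate codification can be inspected by truncating the trained network at the end of the encoder, as in the following Keras sketch (the layer index assumes the build_sae constructor sketched in Section 3, and the input patch is a placeholder).

```python
# Sketch: exposing the 96 x 32 x 32 intermediate codification of a patch.
from tensorflow.keras import models

sae = build_sae(patch=256, stages=3, filters=96, kernel=5)  # trained weights assumed
# With three Conv2D/MaxPooling2D pairs after the input layer, the encoder
# output is the layer at index 6 (input=0, then conv/pool alternating).
encoder = models.Model(sae.input, sae.layers[6].output)

codes = encoder.predict(patch.reshape(1, 256, 256, 1))  # `patch` is a placeholder
print(codes.shape)  # (1, 32, 32, 96): 96 intermediate images of 32 x 32
```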
4.4. Experiment with old documents

A new experiment is described in this section with the aim of verifying the adaptability of the approach to a different type of document image. It is obvious that, in this case, data-driven strategies have an advantage because they are provided with information on the specific domain to which they are going to be applied. Traditional methods for staff-line removal have not taken into account the great heterogeneity that can be found in musical documents, thus leading to solutions that are not generalizable. We want to show here that the SAE entails a more adaptable approach, and that it also behaves better than other supervised learning approaches.

For this experiment we use a set of 20 staff sections from sacred music composed during the second half of the 17th century. These compositions were handwritten in a music book by a copyist of that time. In addition, they do not depict common modern notation, but the so-called mensural notation. Since the music was intended to be sung, the sections depict both music notation and accompanying text (lyrics), which might hinder the staff-line removal algorithms. We have manually created a binary version of the staff sections considered, which are originally depicted in grayscale, as well as their corresponding ground-truth (without staff lines). An example of this corpus is illustrated in Table 6.

Table 6: Examples of the corpus created for staff-line removal on mensural notation manuscripts (source, binary version, and ground truth).

Unlike the previous case, for which there exists a large dataset with thousands of images with which to train the models, this experiment represents a more realistic scenario. In such a situation, it cannot be assumed that a large corpus can be effortlessly obtained; instead, the network must learn with a limited number of examples. However, we resort to fine-tuning to alleviate this situation. That is, a model pre-trained on other types of scores (i.e., that obtained in the previous experiment) is initially considered, and examples of the new domain are then used to re-estimate the parameters of the model. In addition, the patches extracted from the training images are not totally disjoint; overlapping is allowed in order to obtain a larger number of training examples from the same number of labeled documents. A sketch of this fine-tuning strategy is given below.
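This is a minimal sketch assuming the build_sae constructor from Section 3; the weights file and data variables are placeholders.

```python
# Fine-tuning sketch: re-estimate a pre-trained SAE with the new domain's
# (few, overlapping) patches. File and variable names are placeholders.
from tensorflow.keras.callbacks import EarlyStopping

model = build_sae(patch=256, stages=3, filters=96, kernel=5)
model.load_weights('sae_modern_scores.h5')  # weights learned on the contest data

model.fit(x_mensural, y_mensural, batch_size=8, epochs=200,
          validation_split=0.2,              # 20% of the fold monitors over-fitting
          callbacks=[EarlyStopping(patience=5)])
```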
However, it is510 appreciated that adding data from the new domain allows boosting the recog- nition. In the grayscale format, it is observed that training with domain data, even with a limited number of examples, is more beneficial than directly consider the pre-trained network with data of another type of score. The difference with the binary case is that here the model has to deal also with the background of515 26 Method Binary Grayscale Pre-trained 96.16 84.05 Trained from scratch 95.05 91.45 Pre-trained + fine-tuning 97.98 95.71 Table 7: F-m (%) comparison among the different strategies for training the SAE in a new domain (mensural notation). Results report average performance considering a 5-fold cross validation experiment in both binary and grayscale format. Values in bold represent the best average accuracy in each set. the score, which presents relevant differences with respect to the background of the previous corpus. Nevertheless, it is also reported that the best performance is achieved when combining a pre-trained network with the new data. Finally, we compare the best case obtained by our model (that is, starting with a pre-trained model and fine-tuning) with the algorithms previously pro-520 posed. In particular, we consider the 3 methods that obtained the best result in the benchmark established by the aforementioned staff-line removal contest: LRDE, INESC and Pixel. The first two were specially designed for binary im- ages, and so we only consider their most favorable case. Furthermore, Pixel is easily usable in both contexts. Thus, the results of this comparative study can525 be consulted in Table 8. It can be observed that the SAE is able to outperform the results obtained with other approaches, even with a greater difference than in the previous ex- periment, in both binary and grayscale. Although the supervised approach is clearly an advantage in this context, it is also reported that our approach no-530 ticeably improves the results obtained by Pixel. An illustrative detail of these results is given in Fig. 11. A statistical significance test is performed again, considering each complete document as an independent sample. These tests resulted in p-values below 0.01 when comparing our method against the rest of the strategies, implying535 therefore that our method significantly improves their performance with an 27 Method Binary Grayscale LRDE 92.81 — INESC 90.89 — Pixel 91.24 86.64 SAE 97.98 95.71 Table 8: F-m (%) comparison among LRDE, INESC, Pixel, and SAE methods for the staff-line removal over mensural notation manuscripts. Results report average performance considering a 5-fold cross validation experiment in both binary and grayscale format. Values in bold represent the best average accuracy in each set. A dash mark (—) is used when the method in the row is not applicable. alpha confidence level of 99%. 5. Conclusions This work has studied the removal of staff lines from musical scores by considering a machine learning approach. Our proposal consists of using a540 new type of model called the Selectional Auto-Encoder (SAE), a convolutional neural network that learns which characteristics of the input should be selected by means of coding and decoding stages. In the context of the problem to be addressed, the model is trained to select only those pixels of the input that belong to a musical symbol. The process of removing the staff lines can therefore545 be solved, once the model is trained appropriately, by iterating the image of a score patch by patch. 
An activation in the range [0, 1] is obtained for each pixel, indicating the level of selection predicted. Those pixels whose activation surpasses a certain threshold are considered to be part of a musical symbol, and are otherwise discarded.

Our comprehensive experimentation on a standard dataset has demonstrated the goodness and robustness of the approach. Regardless of the specific features of the chosen model—such as the size of the input window or the depth of the network—the results obtained were competitive. The model that obtained the best figures was specifically that which takes a window of size 256 × 256, with 3 convolutional layers per stage, 96 filters per layer and a 5 × 5 convolution kernel per filter. When compared to other algorithms proposed for the same task, the SAE performs significantly better for both binary and grayscale input images. The goodness of our approach is more evident in the latter case (up to more than 10 points of improvement in F-m).

We have also studied the behavior of our model with regard to the chosen threshold, showing that this value does not have a particular impact on the results. However, it has been determined that a value of 0.3 is the most appropriate for dealing with binary inputs, and 0.1 for grayscale. An incremental study has also been carried out, and we have observed that the model attains a competitive performance with few samples. However, in the case of grayscale images, the best values are only achieved with a large number of examples.

Finally, we have shown the intermediate representations that the SAE learns, regarding which two main observations can be made: these representations have a high redundancy of information, which could mean that the whole capacity of the model is only needed in the most complex cases; and the intermediate representations for the binary and grayscale cases are quite similar, which could indicate that the SAE is actually learning how to attain a deep representation of the task, independently of the characteristics of the input image.

As prospects for future work, our intention is twofold. On the one hand, we intend to include our staff-line removal process in a functional OMR system so as to study its impact in a goal-directed way. We are especially interested in measuring the final performance of the system as regards the staff-line removal accuracy. On the other hand, we are interested in learning the task with a limited number of labeled samples. Note that the reference corpus for the problem of staff-line removal is expensive to obtain. The idea is, therefore, to follow semi-supervised approaches in which the task can be learned in an unsupervised manner and a fine-tuning process with a small labeled corpus is carried out. It would also be of great interest to carry out studies that increase the variability of the input images. Musical scores can be highly heterogeneous, and we thus wish to obtain a model that can adapt to any type of them.
Our idea is, therefore, to develop models that can process scores of a very different style from those seen during training by means of Transfer Learning (Pan & Yang, 2010) or Domain Adaptation (Patel et al., 2015) strategies.

Acknowledgments

This work was partially supported by the Spanish Ministerio de Educación, Cultura y Deporte through a FPU fellowship (AP2012-0939) and the Spanish Ministerio de Economía y Competitividad through Project TIMuL (No. TIN2013-48152-C2-1-R, supported by UE FEDER funds).

References

Bainbridge, D., & Bell, T. C. (1997). Dealing with superimposed objects in optical music recognition. In Proceedings of the 6th International Conference on Image Processing and Its Applications (pp. 756-760).

Bengio, Y., Yao, L., Alain, G., & Vincent, P. (2013). Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems (pp. 899-907).

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010 (pp. 177-186). Springer.

Calvo-Zaragoza, J., Barbancho, I., Tardón, L. J., & Barbancho, A. M. (2015). Avoiding staff removal stage in optical music recognition: application to scores written in white mensural notation. Pattern Analysis and Applications, 18, 933-943.

Calvo-Zaragoza, J., Micó, L., & Oncina, J. (2016). Music staff removal with supervised pixel classification. International Journal on Document Analysis and Recognition, 19, 211-219.

Carter, N., & Bacon, R. (1992). Automatic recognition of printed music. In H. Baird, H. Bunke, & K. Yamamot (Eds.), Structured Document Image Analysis (pp. 454-465). Springer.

Choudhury, G. S., Droetboom, M., DiLauro, T., Fujinaga, I., & Harrington, B. (2000). Optical music recognition system within a large-scale digitization project. In ISMIR 2000, 1st International Symposium on Music Information Retrieval, Plymouth, Massachusetts, USA, October 23-25, 2000, Proceedings.

Dalitz, C., Droettboom, M., Pranzas, B., & Fujinaga, I. (2008). A comparative study of staff removal algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 753-766.

Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.

Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A., & Hinton, G. E. (2010). Binary coding of speech spectrograms using a deep auto-encoder. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010 (pp. 1692-1695).

Dos Santos Cardoso, J., Capela, A., Rebelo, A., Guedes, C., & Pinto da Costa, J. (2009). Staff detection with stable paths. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 1134-1139.

Dutta, A., Pal, U., Fornes, A., & Llados, J. (2010). An efficient staff removal approach from printed musical documents. In 2010 20th International Conference on Pattern Recognition (ICPR) (pp. 1965-1968).

Fornés, A., Kieu, V. C., Visani, M., Journet, N., & Dutta, A. (2013). The ICDAR/GREC 2013 music scores competition: Staff removal. In 10th International Workshop on Graphics Recognition, Current Trends and Challenges, GREC 2013, Bethlehem, PA, USA, August 20-21, 2013, Revised Selected Papers (pp. 207-220).

Fujinaga, I. (2005). Staff detection and removal. In S. George (Ed.), Visual Perception of Music Notation (pp. 1-39). Hershey, PA: Idea Group Inc.

Fujinaga, I., Hankinson, A., & Cumming, J. E. (2014). Introduction to SIMSSA (single interface for music score searching and analysis). In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, DLfM@JCDL 2014, London, United Kingdom, September 12, 2014 (pp. 1-3).
Introduction to SIMSSA640 (single interface for music score searching and analysis). In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, DLfM@JCDL 2014, London, United Kingdom, September 12, 2014 (pp. 1–3). Géraud, T. (2014). A Morphological Method for Music Score Staff Removal. In Proceedings of the 21st International Conference on Image Processing (ICIP)645 (pp. 2599–2603). Paris, France. Hankinson, A., Burgoyne, J. A., Vigliensoni, G., & Fujinaga, I. (2012). Creating a large-scale searchable digital collection from printed music materials. In 32 Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume) (pp. 903–908).650 Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and helmholtz free energy. In Advances in Neural Information Pro- cessing Systems (pp. 3–10). Lauly, S., Larochelle, H., Khapra, M., Ravindran, B., Raykar, V. C., & Saha, A. (2014). An autoencoder approach to learning bilingual word representations.655 In Advances in Neural Information Processing Systems (pp. 1853–1861). LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 , 436–444. Martin, P., & Bellissant, C. (1991). Low-level analysis of music drawing images. In First International Conference on Document Analysis and Recognition (pp.660 417–425). Ng, K. (2001). Music manuscript tracing. In International Workshop on Graph- ics Recognition (pp. 330–342). Springer. Ntirogiannis, K., Gatos, B., & Pratikakis, I. (2014). ICFHR2014 competition on handwritten document image binarization (H-DIBCO 2014). In 14th Inter-665 national Conference on Frontiers in Handwriting Recognition, ICFHR 2014, Crete, Greece, September 1-4, 2014 (pp. 809–813). Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on knowledge and data engineering , 22 , 1345–1359. Patel, V. M., Gopalan, R., Li, R., & Chellappa, R. (2015). Visual Domain670 Adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32 , 53–69. Ramirez, C., & Ohya, J. (2014). Automatic recognition of square notation symbols in western plainchant manuscripts. Journal of New Music Research, 43 , 390–399.675 33 Randriamahefa, R., Cocquerez, J. P., Fluhr, C., Pepin, F., & Philipp, S. (1993). Printed music recognition. In Document Analysis and Recognition, 1993., Proceedings of the Second International Conference on (pp. 898–901). IEEE. Raphael, C., & Wang, J. (2011). New approaches to optical music recognition. In Proceedings of the 12th International Society for Music Information Retrieval680 Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011 (pp. 305–310). Rebelo, A., Capela, G., & Cardoso, J. S. (2010). Optical recognition of music symbols - A comparative study. International Journal on Document Analysis and Recognition, 13 , 19–31.685 Rebelo, A., & Cardoso, J. (2013). Staff Line Detection and Removal in the Grayscale Domain. In 2013 12th International Conference on Document Anal- ysis and Recognition (ICDAR) (pp. 57–61). Rebelo, A., Fujinaga, I., Paszkiewicz, F., Marçal, A. R. S., Guedes, C., & Cardoso, J. S. (2012). Optical music recognition: state-of-the-art and open690 issues. International Journal of Multimedia Information Retrieval , 1 , 173– 190. Roach, J., & Tatem, J. (1988). Using domain knowledge in low-level visual pro- cessing to interpret handwritten music: an experiment. Pattern recognition, 21 , 33–44.695 Su, B., Lu, S., Pal, U., & Tan, C. (2012). 
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371-3408.

Visaniy, M., Kieu, V., Fornes, A., & Journet, N. (2013). ICDAR 2013 music scores competition: Staff removal. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR) (pp. 1407-1411).

Wang, W., Huang, Y., Wang, Y., & Wang, L. (2014). Generalized autoencoder: A neural network framework for dimensionality reduction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (pp. 490-497).

Wen, C., Rebelo, A., Zhang, J., & Cardoso, J. S. (2015). A new optical music recognition system based on combined neural network. Pattern Recognition Letters, 58, 1-7.

Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.