Manipulation of oral cancer speech using neural articulatory synthesis
Halpern, Bence Mark; Rebernik, Teja; Tienkamp, Thomas; van Son, Rob; van den Brekel, Michiel; Wieling, Martijn; Witjes, Max; Scharenborg, Odette
date: 2022-03-31

We present an articulatory synthesis framework for the synthesis and manipulation of oral cancer speech for clinical decision making and the alleviation of patient stress. Objective and subjective evaluations demonstrate that the framework has acceptable naturalness and is worth further investigation. A subsequent subjective vowel and consonant identification experiment showed that the articulatory synthesis system can manipulate the articulatory trajectories so that the synthesised speech reproduces problems present in the ground truth oral cancer speech.

Oral cancer is a type of cancer in which a tumour is located inside the oral cavity, most typically on the tongue or the floor of the mouth. Approximately 530,000 people are diagnosed with this condition every year worldwide [1], including around 1000 in the Netherlands [2]. To treat oral cancer, (part of) the tissues surrounding the tumour are removed during surgery, which subsequently affects the patients' speech. There is large uncertainty regarding how speech will be impacted by the surgery. This uncertainty causes significant distress to patients, affecting their quality of life [3]. A speech synthesis system that could predict how a patient's voice would sound after surgery (post-operative speech), based on a surgical plan or a biomechanical model, could help clinicians and patients make informed decisions about the surgery and alleviate the patients' stress. Such a system would enable comparing multiple surgical plans and choosing the one that provides the best speech outcome. Even if it is impossible to achieve a good speech outcome, patients could be counselled through the process using the synthesised samples. Therefore, our long-term aim is to build a pathological speech prediction system that can be connected to a biomechanical model incorporating the surgical planning [4].

Previous studies have successfully shown that voice conversion systems can synthesise pathological speech of varying severity [5, 6, 7]. A limitation of these voice conversion systems is that they cannot be connected to biomechanical models and are therefore unsuitable for our task. Articulatory synthesis could provide a way to connect biomechanical models to speech prediction. Articulatory synthesis is a way to synthesise speech either by physical simulation of the speech production process [8], or by using biosignals relevant to the articulation process, in so-called data-driven approaches. Examples of the latter include ultrasound tongue imaging (UTI) [9, 10], magnetic resonance imaging (MRI) [11], permanent magnetic articulography (PMA) [12, 13, 14] and electromagnetic articulography (EMA) [15, 16, 17]. Among these techniques, we use EMA, as it has higher temporal resolution and spatial accuracy than the other techniques.

Figure 1: Outline of the current approach fitting into a future post-operative speech prediction framework. In the current study, we test the articulatory synthesis system (red box). The biomechanical model is presented in [4], while the voice conversion is presented in [7].
Figure 1 shows how articulatory synthesis could connect the biomechanical model to the speech prediction. First, clinicians would plan the surgery with a new patient. The relevant surgical variables of this planning (i.e., which tissues to remove from the vocal tract) would then be incorporated into a biomechanical model. Tracking points in the biomechanical model could then be converted into an EMA signal, which could be used to directly synthesise speech using the articulatory synthesis model. The articulatory synthesis system needs post-operative speech for training, which is impossible to obtain for the new patient (as predicting it is the exact task). Therefore, at prediction time, the speech can only be synthesised with the vocal characteristics of a different (previously treated) oral cancer speaker. Thus, in order to achieve patient-specific synthesis, the vocal characteristics of the synthesised speech need to be converted. This can be achieved using the voice conversion technique previously presented in [7], which uses the pre-operative speech and the synthesised post-operative speech to generate the new patient's post-operative speech. Finally, different post-operative samples could be synthesised using different surgical plans, which would allow adjustment of the surgical plan (which tissues/organs to spare), leading to better speech outcomes.

In the present paper, we carried out a feasibility study of such an oral cancer articulatory speech synthesis system (AS). We tested the AS in two different setups. The first setup (Synthesis) was a standard AS system, with the goal of synthesising new sentences based on articulatory trajectories, e.g., from a biomechanical model or EMA measurements. The second setup (Manipulation) tested whether it is possible to manipulate existing sentences by slightly manipulating their articulatory trajectories, which is often sufficient for clinical purposes. Specifically, we investigated two desired properties of the AS system: 1) the synthesised speech has to sound natural; 2) changes in the biomechanical model and the EMA trajectory should induce the correct acoustic changes. To summarise, we were interested in the following:

RQ1 Is it possible to synthesise oral cancer speech using articulatory synthesis so that the synthesised speech has comparable naturalness to (a) the real (ground truth) oral cancer speech, and (b) the synthetic (predicted) healthy speech?

RQ2 Can we perform phoneme-level manipulations on the articulatory trajectory in order to synthesise samples that correspond perceptually to the intended modifications?

The synthesised speech samples can be found online.1

To answer these questions, we collected a corpus consisting of 3.3 hrs of parallel speech and articulation recordings from seven native Dutch speakers. Five of the speakers had been diagnosed with and treated for oral cancer (three males and two females); the other two speakers were healthy controls (one male and one female).2 The total duration of the recordings per speaker varied between 24 min (nki01) and 32 min (nki03). The speech was recorded in a sound-dampened booth using a Sennheiser ME66 microphone with a sampling frequency of 22,050 Hz. After the recording, the audio was downsampled to 16 kHz and mixed to mono. The speech and non-speech regions were then automatically annotated using Praat [18] and manually corrected.
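For completeness, the snippet below sketches the audio preprocessing step described above (resampling the 22,050 Hz studio recordings to 16 kHz and mixing to mono). It is a minimal illustration using librosa and soundfile; the file paths and the exact toolchain used by the authors are assumptions, not part of the paper.

```python
import librosa
import soundfile as sf

def preprocess_audio(wav_in: str, wav_out: str, target_sr: int = 16000) -> None:
    # Load the original recording, mix to mono and resample to 16 kHz.
    audio, sr = librosa.load(wav_in, sr=target_sr, mono=True)
    sf.write(wav_out, audio, target_sr)

# Hypothetical usage on one recording:
# preprocess_audio("nki01_raw.wav", "nki01_16k_mono.wav")
```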
The articulatory trajectories were recorded with the NDI VOX electromagnetic articulograph, with a sampling frequency of 400 Hz, using 10 electrode channels. The recorded articulatory trajectories have a spatial accuracy of around 0.1 mm [19]. Reference sensors (used to subtract head movements from articulator movements) were placed on both mastoids, the upper incisor, and the nasal bridge. Two movement sensors were placed on the mobile tongue, two on the vermillion border of the upper and lower lips, and one on the lower incisor or jaw. Not all movement sensors could be placed for every speaker.

The stimuli contained sentences from three sources. First, we selected sentences from the Wablieft newspaper corpus [20] that together covered all Dutch phonemes, including many sentences with plosives, as these are known to be difficult for oral cancer speakers [21, 22, 23]. Second, we included several Dutch texts that are commonly used for assessing speech impairment. The sentences in the first and second categories were produced once. Finally, we also included custom sentences in the stimuli to test the phoneme-level manipulation capability of the articulatory synthesis framework (RQ2). These custom sentences consisted of five target words embedded in a carrier phrase and were repeated by the subjects five times in random order. The custom sentences share a similar form, targeting (1) vowels (biet/baat/boet) and (2) sibilants (sok/shock) in CVC contexts. There were a total of 226 sentences used as stimuli. Table 1 shows the breakdown of the sentences. The full list of stimuli can be found online.3

1 http://slg.web.rug.nl/speech-samples/ (password: interspeech2022)
2 We wanted to include more participants; however, the data collection of this research was severely affected by the COVID-19 pandemic.

All our experiments use the articulatory synthesis framework (AS), which is a speaker-dependent articulatory synthesis network (explained in Section 3.2). As Table 1 shows, the AS was trained with two different train-test partitionings of the dataset, which we named Synthesis and Manipulation. To test the capability of the AS on the synthesis task, the Synthesis setup contained no overlapping sentences between the training and the test set. We also created a different partitioning of the dataset that we called Manipulation. The Manipulation setup included one variant of the custom sentences (and all the other sentences) in the training set and the other variants of the custom sentences in the test set: target words baat and sok were in the training set, while target words biet, boet and shock were in the test set. The Manipulation setup allowed us to test explicitly whether the AS could synthesise phoneme-level differences. For example, a possible failure case would be that the AS copies the nearest example from the training set, i.e., even though boet was said, baat would be synthesised, as that is the closest example in the training set. We carried out subjective vowel and consonant identification experiments (see Section 3.4) to test for this phoneme-level generalisation property (RQ2). Because the sentences in the test set are partially seen during training, we expected that this setup would also result in more natural speech. Note that such a setup is not unrealistic for our clinical scenario: using the Manipulation setup over the Synthesis setup has the considerable, but acceptable, disadvantage that only sentences in the training set can be manipulated.
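To make the two partitionings concrete, the sketch below shows one way the Synthesis and Manipulation splits could be constructed from a list of stimulus records. The record fields, target-word sets and held-out IDs are illustrative assumptions; only the split logic mirrors the description above.

```python
TRAIN_TARGETS = {"baat", "sok"}          # custom-sentence variants seen in training
TEST_TARGETS = {"biet", "boet", "shock"}  # variants held out for manipulation

def split_synthesis(sentences, test_ids):
    """Synthesis setup: training and test sentences are fully disjoint."""
    train = [s for s in sentences if s["id"] not in test_ids]
    test = [s for s in sentences if s["id"] in test_ids]
    return train, test

def split_manipulation(sentences):
    """Manipulation setup: hold out only the unseen target-word variants."""
    train = [s for s in sentences
             if s.get("target") is None or s["target"] in TRAIN_TARGETS]
    test = [s for s in sentences if s.get("target") in TEST_TARGETS]
    return train, test
```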
Objective naturalness evaluation experiments were performed on the test sets of the Synthesis and Manipulation setups, with an additional subjective evaluation on the test set of Synthesis to answer RQ1, as further explained in Section 3.3.

To achieve speaker-dependent articulatory synthesis, we used a neural network architecture based on best practices from several prior studies on AS [15, 24]. The neural network used the static, Δ and ΔΔ articulatory trajectories as input features, which were downsampled to 200 Hz so that the sampling rates of the input and output frames matched. The neural network was a four-layer LSTM with 128 units, with a final regression layer that matches the size of the output acoustic features. The output acoustic features were the static, Δ and ΔΔ Mel-cepstrum (MCEP), which were extracted using the WORLD vocoder [25]. The predicted speech was then synthesised from the output acoustic features as follows. First, the output acoustic features were processed using maximum likelihood parameter generation (MLPG) to obtain a more robust MCEP estimate than the one provided by the neural network [26]. The F0 and the BAP parameters required for synthesis were taken directly from the ground truth (copy synthesis), because we were primarily interested in the manipulation of the articulatory signals, and copy synthesis allowed more natural speech synthesis. Finally, the copy synthesis features were combined with the estimated MCEP features using the WORLD vocoder to obtain the final speech signal (predicted). We additionally synthesised samples using MCEP copy synthesis to measure the upper bound of naturalness (resynthesis). In all cases, the neural network was trained with the Adam optimiser [27], using a learning rate of 0.001, with early stopping and a maximum of 50 epochs. Before training on the oral cancer data, a baseline was established using the mngu0 dataset [28], on which our network obtained a Mel-cepstral distortion (MCD) of 6.4 dB (see Section 3.3 for more details on the MCD), which confirmed the correctness of our architecture implementation in a high-resource scenario. The source of the baseline is available online.4

Using the architecture described above, we trained a neural network from scratch on each speaker's data in both the Synthesis and the Manipulation setup. It was expected that the amount of training data would affect the results in both setups; therefore, data augmentation was performed by adding Gaussian noise with zero mean and four different standard deviation levels (σ ∈ {10^0, 10^−1, 10^−2, 10^−3}) to the articulatory trajectories. We repeated the data augmentation experiments with three different random seeds and performed a paired t-test (paired by random seed) to check that the observed improvements are due to the data augmentation.

The naturalness of the synthesised speech examples was evaluated with objective and subjective methods. To objectively evaluate the samples, we calculated the Mel-cepstral distortion (MCD) measure [29] on the entirety of the Synthesis and Manipulation test sets. For subjective evaluation, we ran a five-point scale mean opinion score (MOS) perceptual experiment, similar to the one carried out in [6]. Each listener rated 105 utterances: 5 random sentences from the Synthesis test set × 3 conditions (ground truth, resynthesis, predicted) × 7 speakers. We expected the naturalness to be lower for the pathological ground truth than for the healthy ground truth based on our previous studies [6, 7].
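To illustrate the model described above, the sketch below shows a minimal PyTorch version of a speaker-dependent articulatory-to-MCEP network (a four-layer, 128-unit LSTM with a linear regression layer) together with the zero-mean Gaussian-noise augmentation of the articulatory input. The feature dimensionalities, batch shapes and training-step details are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ArticulatoryToMCEP(nn.Module):
    """Four-layer LSTM mapping EMA features (static + delta + delta-delta)
    to MCEP features (static + delta + delta-delta)."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 128, layers: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.regression = nn.Linear(hidden, out_dim)

    def forward(self, ema: torch.Tensor) -> torch.Tensor:
        # ema: (batch, frames, in_dim), frames at 200 Hz
        hidden, _ = self.lstm(ema)
        return self.regression(hidden)

def augment_with_noise(ema: torch.Tensor, sigma: float) -> torch.Tensor:
    """Add zero-mean Gaussian noise to the articulatory trajectories."""
    return ema + sigma * torch.randn_like(ema)

# Illustrative training step; input/output dimensions are assumptions.
model = ArticulatoryToMCEP(in_dim=54, out_dim=180)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

ema_batch = torch.randn(8, 400, 54)     # placeholder EMA features
mcep_batch = torch.randn(8, 400, 180)   # placeholder MCEP targets
sigma = 0.1                             # one of the four noise levels; a model is trained per level
pred = model(augment_with_noise(ema_batch, sigma))
loss = criterion(pred, mcep_batch)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```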
For all subjective evaluation experiments (including Section 3.4), we used 9 native Dutch listeners. Note that the listeners were mostly from the Central Netherlands region, while the speakers were from the Northern region.

We evaluated the success of the articulatory manipulation using a vowel and a consonant identification test. In the vowel identification task, the listeners were given the sentence Hij heeft tamme X gezegd and asked to replace the X with biet/baat/boet. Alternatively, listeners could indicate a different word via an input text field. In the consonant identification experiment, the listeners were provided with the sentence Hij heeft tamme X gezegd and had to choose either shock or sok, or write a different word via a text field as described above. Homonyms were not corrected. In addition to the sentences predicted by the neural network, the stimuli for each experiment included the ground truth stimuli. Including the ground truth is essential because we expect that the listeners might already have difficulties identifying vowels and consonants in the ground truth data. In total, each listener rated 56 sentences for the consonant identification task (2 words × 2 conditions (ground truth, predicted) × 2 repetitions × 7 speakers) and 84 sentences (same, but 3 words) for the vowel identification task. Finally, to assess the impact of the manipulation statistically, a Fisher's exact test was performed [30].

Table 2: Objective (MCD in dB, lower is better) and subjective (MOS, higher is better) naturalness evaluation results. The columns contain the identifier and type of the speaker (healthy/oral cancer). (†) indicates that the effect of data augmentation is statistically significant at p < 0.05. (*) in the case of the MOS indicates that the reduction in MOS is statistically significant at p < 0.05.

In the MCD rows of Table 2, we only report the results of the best data augmentation experiments, and it is also indicated whether the data augmentation improves the models (†) and whether this improvement is significant (*). In the Synthesis case, data augmentation improved our models five out of seven times, but this improvement was never significant. In the Manipulation case, data augmentation improved the models five out of seven times, and the improvement was significant twice. Therefore, the noise-based data augmentation only mildly improved the naturalness of the predicted sentences compared to no augmentation. From the MCD rows of Table 2, we can also see that the average objective naturalness scores of the oral cancer speakers are comparable to those of the healthy speakers in both the Synthesis and the Manipulation setup. Furthermore, the results are higher (worse) than those obtained with our mngu0 baseline (6.4 dB, see Section 3.2). The only difference between our baseline and the AS models is the number of utterances: the mngu0 training set contains 1226 utterances, while our dataset contains only 226. This observation suggests that the higher MCD results are most likely due to the difference in the amount of data used. From the ground truth MOS (subjective naturalness) rows of Table 2, we can see that the oral cancer speakers achieve lower MOS than the healthy speakers, which is in line with the results of our previous studies [6, 7].
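For reference, the sketch below shows one common way to compute the MCD values reported in Table 2 from time-aligned MCEP sequences. It is a minimal illustration: the exact coefficient range, frame alignment and averaging conventions used in the paper may differ.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_pred: np.ndarray) -> float:
    """MCD in dB between two aligned MCEP sequences of shape (frames, dims).

    The 0th (energy) coefficient is excluded here, as is common practice.
    """
    diff = mcep_ref[:, 1:] - mcep_pred[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```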
Furthermore, the resynthesis MOS row shows that the vocoder impacts the naturalness of the predicted speech: with the exception of nki04 and nki06, the vocoder makes the naturalness significantly worse (Wilcoxon, p < 0.05), which points towards the use of improved (neural) vocoders as a possible future improvement direction for the AS. Regarding the naturalness of the synthesised sentences, we can see (predicted MOS row) that the synthesised results are significantly worse (Wilcoxon, p < 0.05) than the vocoder results, which means that there is room for improvement in the AS itself. The obtained MOS vary between 1.93 and 2.62. The mean MOS of the healthy speakers is slightly higher than that of the oral cancer speakers, but the difference is not statistically significant (p = 0.09). Therefore, the AS performs similarly for healthy and oral cancer speakers. We conclude that the naturalness of the predicted oral cancer stimuli is reasonable, although significantly lower than that of the ground truth oral cancer stimuli (RQ1.a). However, the naturalness of the predicted oral cancer speech is comparable to that of the predicted healthy speech (RQ1.b). Slightly higher (poorer) MCD values are observed compared to our mngu0 baseline, most likely due to the lack of training data. Furthermore, we observed lower (better) MCD values in the Manipulation setup than in the Synthesis setup; therefore, Manipulation seems to be a more promising approach than Synthesis for our AS.

Vowel and consonant misidentifications with fewer than five replies (n < 5) are not reported. Table 3 (top) shows the results of the vowel identification experiment. For the healthy control and healthy predicted speech, no misidentifications with more than five replies are observed. For the oral cancer ground truth speech, we can see that biet [bit] is commonly misidentified. The most common misidentification is buut [byt] (14.4%), followed by bit [bɪt] (7.7%). In the case of the predicted speech, we find that biet [bit] (62.2%, n=56) is often misidentified as buut (25.6%, n=23) or bit (10.0%, n=9). Additionally, we find that boet [but] (82.2%, n=74) is misidentified as biet (7.7%, n=7) or buut (6.6%, n=6). Overall, the difference between predicted and ground truth vowels is not statistically significant for baat (p = 1), but it is significant for biet (p = 0.03) and boet (p < 10^−4). The misidentifications of biet and boet suggest that duration and tongue height differences need to be modelled better by the AS.

Table 3 (bottom) shows the results of the consonant identification experiment. In the healthy case, there are no misidentifications. For the predicted case, about half of the time shock [ʃɔk] is perceived as sok [sɔk]: upon inspection of the data, it turned out that these results are mostly exclusive to nki07, and are therefore most likely a speaker-specific issue. For the oral cancer ground truth speech, shock is misidentified as sok (10%, n=10); sok is also misidentified as shock (12.5%, n=8). This shock/sok confusion is in line with existing evidence, which finds that sibilants are impacted in oral cancer speech [23, 31]. In the case of the predicted samples, shock is misclassified as sok (20%, n=16) and sok is misclassified as shock (13.8%, n=11). Overall, the differences between predicted and ground truth are not statistically significant for shock (p = 0.81) or sok (p = 0.39). Therefore, the results suggest that sibilant manipulations are modelled well with our approach.
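The p-values above come from Fisher's exact test [30]. The snippet below sketches how such a comparison could be run on a 2×2 contingency table of correct versus incorrect identifications in the ground truth and predicted conditions; the counts are placeholders rather than the study's data, and the m × n variant of the test cited in the paper may aggregate the responses differently.

```python
from scipy.stats import fisher_exact

# Placeholder counts: correct vs. incorrect identifications of one target word
# in the ground truth and predicted conditions (not the actual study data).
contingency = [
    [72, 8],    # ground truth: correct, incorrect
    [64, 16],   # predicted:    correct, incorrect
]
odds_ratio, p_value = fisher_exact(contingency)
print(f"Fisher's exact test: p = {p_value:.3f}")
```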
Overall, we find that the AS system has a promising ability to manipulate both healthy and oral cancer speech at the phoneme level: we found no significant issues with the sibilants, but the front/back, height and vowel duration aspects have to be improved (RQ2).

In this paper, we presented an articulatory synthesis framework for the synthesis and manipulation of oral cancer speech for clinical decision making and the alleviation of patient stress. Objective and subjective evaluations carried out on the articulatory synthesised speech demonstrated that the framework has reasonable naturalness on the synthesis of new sentences. The results are slightly poorer than those of our high-resource baseline, most likely due to the lack of training data, which is only mildly alleviated by noise-based data augmentation. The system achieves good objective naturalness on the manipulated sentences. Our vowel and consonant identification experiments indicate that our articulatory synthesis system is able to manipulate sibilants and to reproduce perceptual problems that are found in the ground truth oral cancer speech, such as the weak contrast between the sibilants sok and shock. Future work needs to investigate more advanced neural vocoders, improved sequence-to-sequence architectures to model vowel duration better, and better placement of electrodes to capture appropriate cues of tongue frontality and height. Speaker-independent articulatory synthesis could also be used to increase the amount of data available for training and, likely, the naturalness of the synthesised speech. Our results suggest that it is feasible to perform certain articulatory manipulations in data-driven systems, which is an important step towards enabling the connection of patient-specific biomechanical models to speech synthesis systems.

The authors would like to thank Finnian Kelly and Róbert Tóth for their constructive comments on the manuscript. This work received ethical clearance (NL76137.042.20). B.M.H. is funded through the EU's H2020 research and innovation programme under MSC grant agreement No 766287. The Department of Head and Neck Oncology and Surgery of the NCI receives a research grant from Atos Medical (Hörby, Sweden), which contributes to the existing infrastructure for quality of life research.
[1] The global incidence of lip, oral cavity, and pharyngeal cancers by subsite in 2012
[2] Overlevingscijfers van mondkanker
[3] Quality of life and oral function following radiotherapy for head and neck cancer
[4] An interactive surgical simulation tool to assess the consequences of a partial glossectomy on a biomechanical model of the tongue
[5] An objective evaluation framework for pathological speech synthesis (14th ITG Conference)
[6] Towards identity preserving normal to dysarthric voice conversion
[7] Pathological voice adaptation with autoencoder-based voice conversion
[8] ArtiSynth: A biomechanical simulation platform for the vocal tract and upper airway
[9] DNN-based ultrasound-to-speech conversion for a silent speech interface
[10] F0 estimation for DNN-based ultrasound silent speech interfaces
[11] CNN-based phoneme classifier from vocal tract MRI learns embedding consistent with articulatory topology
[12] Direct speech reconstruction from articulatory sensor data by machine learning
[13] A silent speech system based on permanent magnet articulography and direct synthesis
[14] Analysis of phonetic similarity in a silent speech interface based on permanent magnetic articulography
[15] Articulatory-to-speech conversion using bi-directional long short-term memory
[16] Articulatory controllable speech modification based on statistical inversion and production mappings
[17] Articulatory-to-acoustic conversion using BLSTM-RNNs with augmented input representation
[18] Praat: doing phonetics by computer
[19] Accuracy assessment of two electromagnetic articulographs: Northern Digital Inc. WAVE and Northern Digital Inc. VOX
[20] Wablieft: An easy-to-read newspaper corpus for Dutch
[21] Consonant intelligibility and tongue motility in patients with partial glossectomy
[22] Speech outcomes for partial glossectomy surgery: Measures of speech articulation and listener perception
[23] Detecting and analysing spontaneous oral cancer speech in the wild
[24] Articulation-to-speech synthesis using articulatory flesh point sensors' orientation information
[25] WORLD: A vocoder-based high-quality speech synthesis system for real-time applications
[26] Speech parameter generation algorithms for HMM-based speech synthesis
[27] Adam: A method for stochastic optimization
[28] Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus
[29] Mel-cepstral distance measure for objective speech quality assessment
[30] Fisher's exact test for m × n contingency tables
[31] Acoustic analysis of changes in articulation proficiency in patients with advanced head and neck cancer treated with chemoradiotherapy