Abstract
The variability of the human voice is a challenge for speaker verification systems, influenced by individual traits and environmental conditions. This research introduces a novel approach that uses dual-bandwidth spectrograms with the Fast ResNet-34 neural network architecture for speaker verification. Dual-bandwidth spectrograms are data structures similar to multi-channel images, generated by stacking spectrograms derived from the same audio segment using two different window sizes. In this study, we employed window sizes of 5 ms and 30 ms. This approach captures a wider range of voice features across multiple temporal and spectral resolutions. Our findings demonstrate a statistically significant improvement in system performance, achieving an Equal Error Rate (EER) of 1.64% ±0.13%. This represents a 26% enhancement over the previously reported benchmark EER of 2.22% ±0.05%, validating our hypothesis that dual-bandwidth spectrograms offer a more detailed and comprehensive representation of voice features for accurate speaker verification. Analysis of individual bandwidth contributions reveals that narrowband spectrograms carry more relevant features for speaker verification, while the combination with broadband spectrograms provides complementary information.
1 Introduction
Speaker verification (SV) is the task of determining whether two voice samples belong to the same speaker. Unlike speaker identification [1], which attempts to identify a speaker from a closed set of individuals, SV compares the features extracted from any pair of voices to assess their similarity. This open-set approach offers greater flexibility and applicability in biometric systems [2].
The variability of the human voice poses a significant challenge for SV systems. Individual traits such as age, accent, emotional state, intonation, and language, as well as environmental factors like background noise, reverberation, and recording devices, can influence the voice signal [3, 4]. Despite these variations, the voice possesses unique attributes that remain relatively stable, primarily due to the anatomical characteristics of the speaker’s vocal tract.
In the context of machine learning, SV is typically addressed by training a model to extract individualized voice characteristics and compare them to determine a similarity score between two voice samples. This score is then compared to a decision threshold to classify the samples as originating from the same speaker or different speakers. The Equal Error Rate (EER) is a commonly used metric for calibrating the decision threshold. It balances the rate of false positives and false negatives.
In this work, our primary contribution is the exploration of dual-bandwidth spectrograms for speaker verification. We propose adapting an existing deep neural network architecture for SV to process a combination of two spectrograms generated from the same audio segment. Our hypothesis is that the integration of narrowband and broadband spectrograms will significantly enhance the accuracy of SV tasks by leveraging the complementary nature of these two spectrogram types.
Our dual-bandwidth approach combines narrowband spectrograms, generated with longer analysis windows (approximately 30 ms), and broadband spectrograms, generated with shorter windows (around 5 ms). Narrowband spectrograms offer higher frequency resolution, allowing better visualization of harmonic structures related to the speaker’s fundamental frequency. Conversely, broadband spectrograms provide better temporal resolution, capturing rapid acoustic events and broader spectral patterns like formants.
By exploring this novel combination of representations, we aim to capture a more comprehensive set of speaker-specific features across different temporal and spectral resolutions. This dual-bandwidth approach potentially allows the model to leverage both fine-grained frequency information and broader spectral patterns, leading to more robust speaker verification performance. While combined spectrograms have shown success in related tasks such as keyword detection, voice activity detection, and emotion recognition, our work is the first to thoroughly investigate their potential in speaker verification.
We chose the model presented in [5], which achieved state-of-the-art SV performance in 2020, as the reference to validate our hypothesis. Although more recent models with better performance exist, the simplicity of this model’s construction facilitates the isolation of the input’s influence during experiments. Moreover, its relatively low computational cost enables training on available hardware resources.
The remainder of this paper is organized as follows. Section 2 presents related work, positioning our research within the current landscape of speaker verification techniques. Section 3 describes our proposed method in detail, elaborating on the dual-bandwidth spectrogram approach and the neural network model created to validate our hypothesis. Section 4 outlines the experimental setup, including the dataset, hardware configuration, and training procedure. Section 5 presents and discusses the results, covering the evaluation protocol, performance metrics, and a comparison of the proposed model with the baseline. This section also addresses the limitations of our current approach and outlines potential directions for future work. Finally, Sect. 6 concludes the paper, summarizing the main findings and their implications for the field of speaker verification.
2 Related Work
Speaker verification has seen significant advancements through various techniques including data augmentation, novel neural network architectures, and multi-bandwidth spectrogram approaches.
In terms of data augmentation for speaker recognition, [6] proposed using a voice conversion model to generate additional data. They also applied a bandwidth extension to augment narrowband speech, generating missing frequency bands from lowband information.
Regarding neural network architectures, [7] and [8] introduced different attention layers on top of the feature extractor to better encode variable-length vector representations compared to global average pooling. Further advancing this concept, [9] adapted Selective Kernel Attention (SKA) modules with a multi-scale frequency and channel module, modifying feature extractor blocks to use an attention mechanism that selects from multiple 1D convolutional kernel sizes.
The use of multiple spectrograms as input to neural network models has been explored by [10]. They combined Mel spectrograms, Gammatone spectrograms, and spectrograms extracted with continuous wavelet transform into a single 3D-channel spectrogram for a Convolutional Neural Network (CNN).
Multi-bandwidth spectrograms, which combine multiple frequency bandwidths to capture a broader range of audio features, have shown promise in various speech and audio processing tasks. [11] analyzed combined spectrograms and proposed a new representation obtained by computing the pixel-wise geometric mean of narrowband and broadband spectrograms from the same audio signal.
In the work of [12], multi-bandwidth spectrograms were applied to voiced and unvoiced sound detection. The authors used different analysis windows depending on the frequency range, resulting in performance improvements compared to classical spectrogram generation approaches.
While multi-bandwidth spectrograms focus on combining different frequency representations, other researchers have explored combining different types of features to improve speaker recognition performance. Feature fusion techniques have been explored in speaker recognition. [13] proposed fusing Mel Frequency Cepstral Coefficients (MFCCs) with new features based on temporal domain statistical indicators such as mean, median, and standard deviation. Their model, trained and evaluated on different subsets of LibriSpeech [14], outperformed the baseline models they evaluated.
These studies demonstrate the potential of multi-bandwidth spectrograms and feature fusion techniques in various speech and audio processing tasks, motivating further exploration in speaker verification applications.
Our research indicates that the use of dual-bandwidth spectrograms, as defined in this work, has not been previously applied to speaker verification tasks. While studies such as [10] and [11] have explored various combinations of spectral representations in speech processing, they have primarily focused on speech recognition or general audio processing tasks. The specific application of dual-bandwidth spectrograms to speaker verification appears to be novel.
This approach, combining narrowband and broadband spectrograms, extends the existing work on spectral analysis in speaker recognition systems. By leveraging complementary information from different spectral resolutions, our method aims to enhance the accuracy of speaker verification tasks. Our approach addresses limitations in previous works by providing a simple yet effective means of improving speaker verification performance using readily available spectral information.
3 Method
This study introduces a novel approach to SV analysis, leveraging a deep neural network model that processes combined narrowband and broadband spectrograms. Narrowband spectrograms, generated with longer analysis windows of 30 ms, capture finer harmonic details, whereas broadband spectrograms, generated with shorter windows of 5 ms, better represent rapid temporal variations and spectral envelopes. These window lengths of 30 ms for narrowband and 5 ms for broadband spectrograms follow the definitions in [15] and [16].
The baseline for this study is the Fast ResNet-34 architecture proposed by [5], which we implemented using the original code provided by the authors. This model is a streamlined version of the traditional ResNet-34 [17], characterized by a significant reduction in complexity: Fast ResNet-34 has been optimized to use only 1.4 million parameters, while the classical ResNet-34 comprises 63.5 million. This efficient design allows us to explore our dual-bandwidth hypothesis while maintaining reasonable computational requirements, even with the additional preprocessing step of generating two spectrograms per audio sample.
3.1 Adaptation for Dual-Bandwidth Spectrograms
To accommodate our dual-bandwidth spectrogram approach, we modified the baseline Fast ResNet-34 model. This adaptation primarily involved changing the initial convolutional layer to accept two input channels instead of one, corresponding to the narrowband and broadband spectrograms. Despite this modification, the number of output channels in this layer remained the same as in the original model. Consequently, the dimensions of the tensor after the first layer, and throughout the rest of the network, remain unchanged. This allows the subsequent layers to process the combined information from both spectrograms seamlessly, without requiring further architectural changes.
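The shape-preserving nature of this change can be illustrated with a deliberately naive NumPy convolution (the channel counts below are illustrative, not those of Fast ResNet-34): doubling the input channels of the first layer changes only the kernel's input dimension, so the output tensor, and hence every subsequent layer, keeps its shape.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=1):
    """Naive 2D convolution. x: (C_in, H, W), w: (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (xp.shape[1] - kh) // stride + 1
    w_out = (xp.shape[2] - kw) // stride + 1
    y = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                patch = xp[:, i*stride:i*stride + kh, j*stride:j*stride + kw]
                y[o, i, j] = np.sum(patch * w[o])
    return y

rng = np.random.default_rng(0)
# One-channel input (single spectrogram) vs. two-channel input (dual-bandwidth):
mono = conv2d(rng.normal(size=(1, 40, 100)), rng.normal(size=(8, 1, 3, 3)))
dual = conv2d(rng.normal(size=(2, 40, 100)), rng.normal(size=(8, 2, 3, 3)))
assert mono.shape == dual.shape == (8, 40, 100)
```

Only the first-layer kernel grows (by one input channel); all downstream feature maps are identical in size, which is why no further architectural changes are needed.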
Our modified model processes and combines features from both narrowband and broadband spectrograms from the very first layer, enabling it to simultaneously capture fine spectral details and broader temporal patterns. Additionally, we incorporated Self-Attentive Pooling (SAP) [7] for temporal aggregation, utilizing attention mechanisms to emphasize critical segments in the SV process. Despite these adaptations, the modified model maintains the parameter efficiency of the original Fast ResNet-34, retaining a total of 1.4 million parameters. This efficient design allows us to investigate the potential of dual-bandwidth spectrogram input for improved SV performance without significantly increasing computational requirements.
3.2 Angular Prototypical Loss Function
In this study, we employ the Angular Prototypical loss function for metric learning, as introduced by [5]. This method involves using M audio clips from each speaker per training batch (see Note 1). These M clips are divided into two sets: S clips for the support set and Q clips for the query set, with Q set to 1, following the approach in [5].
As described in [5], a class prototype, or centroid, is calculated from the support set to represent each speaker class using the formula:

\[{\textbf {c}}_j = \frac{1}{M-1}\sum _{m=1}^{M-1}{\textbf {x}}_{j,m}\]
Here, \({\textbf {x}}_{j, m}\) is the feature from the m-th audio of the j-th speaker, and \(M-1\) represents the number of clips in the support set.
The similarity between each class centroid, \(c_j\), and the query audio feature, \(x_{j, M}\), is then measured using:

\[S_{j,k} = w\cdot \cos ({\textbf {x}}_{j,M}, {\textbf {c}}_k)+b\]
where w and b are learnable parameters.
Finally, the Angular Prototypical loss is calculated by comparing these similarity scores across all classes within a batch, as defined in [5]:

\[L = -\frac{1}{N}\sum _{j=1}^{N}\log \frac{e^{S_{j,j}}}{\sum _{k=1}^{N}e^{S_{j,k}}}\]
Here, \(S_{j, j}\) is the similarity between the centroid and the query vector of the same class, as defined above. The remaining terms \(S_{j, k}\) represent comparisons with centroids of other classes.
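Under the definitions above, the loss can be sketched in a few lines of NumPy. The initial values for the learnable scale \(w\) and bias \(b\) are illustrative assumptions:

```python
import numpy as np

def angular_prototypical_loss(x, w=10.0, b=-5.0):
    """x: (N, M, D) features for N speakers with M clips each.
    The last clip of each speaker is the query; the rest form the support set."""
    support, query = x[:, :-1], x[:, -1]          # (N, M-1, D), (N, D)
    centroids = support.mean(axis=1)              # class prototypes c_j
    # Cosine similarity between every query j and every centroid k
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    S = w * (qn @ cn.T) + b                       # similarity matrix S[j, k]
    # Cross-entropy: each query should be closest to its own centroid
    logits = S - S.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When each speaker's clips cluster tightly (e.g., one-hot features per speaker), the loss is close to zero, as expected for a well-separated embedding space.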
3.3 Dual-Bandwidth Spectrogram Representation
We represent the input audio as a dual-bandwidth spectrogram, which is a data structure similar to a multi-channel image with dimensions of width, height, and number of channels, i.e., \(X \in \mathbb {R}^{C \times H \times W}\).
To do so, we combine narrowband and broadband spectrograms extracted from the same audio segment. We apply the Hamming window function [18] with a window width of 30 ms for the narrowband spectrogram and 5 ms for the broadband spectrogram. The step size between consecutive windows is set to 6.25 ms.
Considering the audio data sampling rate of 16,000 Hz, the window lengths correspond to 480 samples for the narrowband spectrogram, 80 samples for the broadband spectrogram, and a step size of 100 samples.
We then apply the mel-function to both narrowband and broadband spectrograms, resulting in 40 mel channels each. Hence, let \(X_n \in \mathbb {R}^{M \times T}\) and \(X_b \in \mathbb {R}^{M \times T}\) denote the narrowband and broadband mel-spectrograms, respectively, where M is the number of mel channels, and T is the variable length dimension determined by the audio segment duration. To construct the dual-bandwidth spectrogram, \(X_n\) and \(X_b\) are concatenated along a new channel dimension C, yielding \(X \in \mathbb {R}^{C \times M \times T}\) with \(C=2\). Figure 1 illustrates this construction process for a 2-s audio segment, resulting in a dual-bandwidth spectrogram.
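The construction above can be sketched in NumPy. The window lengths (480 and 80 samples), hop (100 samples), and 40 mel channels follow the text; the FFT size (512) and the log-compression constant are our own assumptions for illustration, as are the helper names:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced on the mel scale (simplified sketch)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mel_spectrogram(audio, win, hop=100, n_fft=512, sr=16000, n_mels=40):
    """Hamming-windowed log mel-spectrogram, shape (n_mels, T)."""
    w = np.hamming(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i*hop:i*hop + win] * w for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (T, n_fft//2 + 1)
    return np.log(mel_filterbank(n_mels, n_fft, sr) @ mag.T + 1e-6)

audio = np.random.default_rng(0).normal(size=32000)  # 2 s at 16 kHz
narrow = mel_spectrogram(audio, win=480)   # 30 ms window
broad = mel_spectrogram(audio, win=80)     #  5 ms window
T = min(narrow.shape[1], broad.shape[1])
dual = np.stack([narrow[:, :T], broad[:, :T]])       # X with shape (2, 40, T)
```

The shorter window yields a few more frames, so the two spectrograms are trimmed to a common length before stacking along the new channel dimension.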
4 Experiments
We conducted three distinct experiments to evaluate SV models using the Voxceleb2 [19] dataset’s development subset. This dataset features 5,994 unique speakers and over one million audio clips from YouTube videos, varying in quality, presenting a gender imbalance, and predominantly in English. For each experiment, the models were initialized with random weights.
The training was conducted on a setup with a single V100 GPU with 64GB of memory and 8 CPU cores, allowing for batch sizes of 240 audio samples. We used the Adam optimizer [20] with an initial learning rate of 0.001, which was decreased by a factor of 0.95 every 10 epochs over 200 epochs. We observed a significant reduction in training loss early on, leading to stabilized training progress. The average training duration was 1.8 h per epoch for single spectrogram models and 2.21 h for dual-bandwidth spectrogram models. This sums to 360 h for single spectrogram models and 442 h for dual-bandwidth models.
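The step-decay schedule described here (an initial rate of 0.001, decreased by a factor of 0.95 every 10 epochs) amounts to a one-line rule:

```python
def learning_rate(epoch, base=1e-3, decay=0.95, every=10):
    """Step decay: multiply the rate by 0.95 every 10 epochs, as in Sect. 4."""
    return base * decay ** (epoch // every)
```

For example, the rate stays at 0.001 through epoch 9, drops to 0.00095 at epoch 10, and reaches roughly 0.00038 by the final epoch of a 200-epoch run.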
We performed three experiments with models based on [5]: (1) broadband mel-spectrograms with an 80-sample window; (2) narrowband mel-spectrograms with a 480-sample window; and (3) dual-bandwidth mel-spectrograms. The core of our study is the third configuration, which combines the 80- and 480-sample windows into a single dual-bandwidth input, highlighting our proposal.
In the context of metric learning, accuracy is defined by calculating class centroids (prototypes) from the support set and classifying feature vectors inferred by the model from the query set based on the most similar centroid for each batch. At the end of the training, the model based on broadband spectrograms achieved an accuracy of 81.94%, the narrowband-based model obtained 85.19% accuracy, and the model using both spectrograms resulted in an accuracy of 88.22%.
5 Results and Discussion
This section presents the main findings of our study, including the evaluation protocol, performance metrics, and a comparison of the proposed models with the baseline. We also discuss the significance of the results and their implications for SV.
5.1 Evaluation Protocol
The trained models were evaluated using the test set and following the protocol proposed in [5], which utilizes a list of test cases developed by the authors of Voxceleb1 [21]. Each test case specifies two audio files and a label indicating if they belong to the same or different speakers. The test files are from the Voxceleb1 test set, comprising 40 classes and 4,874 files. The list encompasses 8 tests for each file, with an equal division between intra-class and inter-class comparisons.
For each audio file in the list of test cases, 10 two-second segments were extracted, evenly distributed (overlap may occur). A feature vector was extracted from each segment using the trained model, resulting in 10 vectors per audio, which were then normalized. The similarity between two audios was calculated using the average Euclidean distance among all 100 possible vector pairs.
A decision threshold L, set between 0 and 2, classifies a pair as belonging to the same speaker when the average distance is below L, and to different speakers otherwise.
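A sketch of this scoring procedure, with hypothetical helper names, is shown below. Lower scores mean more similar speakers, since the score is an average Euclidean distance between L2-normalized embeddings:

```python
import numpy as np

def pairwise_score(feats_a, feats_b):
    """Mean Euclidean distance over all pairs of L2-normalized vectors.
    With 10 embeddings per audio, this averages over 100 pairs."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (10, 10)
    return d.mean()

def segment_starts(n_samples, seg_len=32000, n_segs=10):
    """Evenly spaced start indices for 2 s segments (overlap may occur)."""
    return np.linspace(0, n_samples - seg_len, n_segs).astype(int)
```

A 3-second file (48,000 samples) thus yields start indices from 0 to 16,000, with consecutive segments overlapping heavily, matching the evenly distributed extraction described above.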
5.2 Performance Metrics
We evaluated the models using two main performance metrics: Equal Error Rate (EER) and Minimum Detection Cost Function (MinDCF) [22].
For MinDCF, the costs of both error types are set equal, and the prior probability of the target speaker is set at 5%.
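Both metrics can be computed from the test scores with a simple threshold sweep. The sketch below assumes distance-like scores (lower means same speaker), equal error costs, and the 5% target prior stated above:

```python
import numpy as np

def eer_and_mindcf(scores, labels, p_target=0.05):
    """scores: distances (lower = more similar); labels: 1 = same speaker."""
    thresholds = np.sort(np.unique(scores))
    fnr, fpr = [], []
    for t in thresholds:
        accept = scores < t                        # predicted same-speaker
        fnr.append(np.mean(~accept[labels == 1]))  # misses
        fpr.append(np.mean(accept[labels == 0]))   # false alarms
    fnr, fpr = np.array(fnr), np.array(fpr)
    i = np.argmin(np.abs(fnr - fpr))
    eer = (fnr[i] + fpr[i]) / 2
    # Detection cost with equal unit costs and a 5% target prior
    dcf = p_target * fnr + (1 - p_target) * fpr
    return eer, dcf.min()
```

The EER is read off at the threshold where the miss rate and false-alarm rate cross; MinDCF is the minimum of the prior-weighted cost over all thresholds.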
Figure 2 presents the probability distributions of similarity scores obtained for the audio test cases using the model trained with dual-bandwidth spectrograms. It demonstrates the overlap between same-speaker and different-speaker cases.
Fig. 2. Distribution of similarity scores obtained for the test cases from the list provided in [21] using the model trained with dual-bandwidth spectrograms.
Figure 3 displays the variation of True Positive Rate (TPR), True Negative Rate (TNR), and Detection Cost Function (DCF) with respect to the decision threshold for the model trained with dual-bandwidth spectrograms, as observed in our experiments.
The EER and MinDCF values monitored during the training of each model are shown in Fig. 4.
5.3 Results and Comparison
Table 1 presents the EER and MinDCF values obtained for the trained models, along with the results from the baseline [5]. Among the single-spectrogram models, the narrowband mel-spectrogram model (EER 2.01% ±0.14%) significantly outperformed the broadband mel-spectrogram model (EER 2.65% ±0.16%). The minimal overlap in their confidence intervals suggests that this difference is statistically significant. This indicates that the narrowband representation likely carries more relevant features for SV, aligning with the baseline model [5], which used a 25 ms window (EER 2.22%). The superior performance of narrowband spectrograms suggests that fine-grained frequency resolution may be more critical for speaker verification than high temporal resolution, as it better captures certain speaker-specific characteristics in speech signals.

The dual-bandwidth mel-spectrogram model achieved the lowest EER of 1.64% ±0.13%, representing a 26% relative improvement over the baseline. This further improvement demonstrates that while narrowband information is more relevant, the temporal resolution provided by broadband spectrograms contributes valuable complementary information. The statistically significant differences between these models, evidenced by the minimal overlap in confidence intervals, validate the potential of multi-resolution approaches in capturing the complex nature of speaker-specific information in speech signals.

These confidence intervals were calculated by treating the EER as a random variable following a Bernoulli distribution, approximated by a normal distribution given our large test set (37,720 instances). This statistical approach allows us to state with confidence that the observed improvements, particularly with the dual-bandwidth model, are statistically significant and unlikely to be due to random chance.
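The reported ±0.13% interval is consistent with the normal approximation to the binomial described here; a quick check:

```python
import numpy as np

def eer_confidence_interval(eer, n_trials, z=1.96):
    """95% normal-approximation CI half-width for an error rate."""
    return z * np.sqrt(eer * (1 - eer) / n_trials)

# EER of 1.64% over the 37,720 test instances
half = eer_confidence_interval(0.0164, 37720)
print(round(100 * half, 2))  # → 0.13 (percentage points)
```

The same computation applied to the single-spectrogram EERs reproduces their reported half-widths as well.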
5.4 Limitations and Future Work
While our study demonstrates the effectiveness of dual-bandwidth spectrograms in enhancing SV performance, there are areas that could benefit from further exploration to optimize and expand upon the current findings. Our investigation primarily focused on two specific bandwidth configurations (5 ms and 30 ms windows), which were selected based on their promising results. However, exploring a wider range of window sizes could reveal more effective combinations for various SV tasks. Additionally, integrating our dual-bandwidth approach with more recent SV architectures might yield further performance enhancements, as the current comparisons were primarily against a single baseline model from 2020.
Moreover, the potential of multi-bandwidth spectrograms, extending beyond two bandwidths, offers a promising direction for future research. This could involve investigating spectrograms with three or more bandwidths to assess potential performance gains and identify points of diminishing returns. Alternative techniques for combining spectrograms may also present opportunities for more optimal representations of multi-bandwidth information.
Finally, while our experiments were conducted on the widely used Voxceleb2 dataset, extending the evaluation to more diverse datasets and real-world scenarios could help establish the robustness and broader applicability of the approach. By addressing these considerations, future research can continue to advance the understanding and application of multi-bandwidth spectrogram approaches in speaker verification.
6 Conclusion
This study investigated the impact of using dual-bandwidth spectrograms on the performance of Speaker Verification (SV) models. Our research validated the hypothesis that combining narrowband and broadband spectrograms provides complementary information, leading to improved SV accuracy.
The proposed dual-bandwidth spectrograms model achieved an Equal Error Rate (EER) of 1.64% ±0.13%, outperforming the reference model from [5] which had an EER of 2.22% ±0.05%. This represents a 26% relative improvement in performance. The statistical analysis, based on 95% confidence intervals, supports the robustness of these findings, indicating that the performance differences observed are unlikely to be due to random variation.
Our analysis revealed that narrowband spectrograms (EER 2.01% ±0.14%) carry more relevant features for SV compared to broadband spectrograms (EER 2.65% ±0.16%). This suggests that when using conventional spectrograms, opting for analysis windows around 30 ms, which produce narrowband spectrograms, is preferable. However, the superior performance of the dual-bandwidth approach demonstrates that while narrowband information is more relevant, the temporal resolution provided by broadband spectrograms contributes valuable complementary information.
The limitations of our study, including the exploration of only two bandwidth configurations and the use of a single dataset, have been acknowledged. These limitations, along with the promising results obtained, pave the way for future research. These include exploring additional bandwidth configurations, integrating our approach with more recent state-of-the-art models, and investigating the potential of multi-bandwidth spectrograms beyond dual-bandwidth.
In summary, this study demonstrates the effectiveness of dual-bandwidth spectrograms in enhancing SV performance. The statistically significant improvements achieved by our approach highlight its potential to advance the state-of-the-art in SV systems.
Notes
- 1.
Here, a “batch” refers to a subset of data for one training iteration, sometimes called a “mini-batch” in other studies.
References
Reynolds, D.A.: An overview of automatic speaker recognition technology. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. IV–4072. IEEE (2002)
Jain, A.K., Flynn, P., Ross, A.A.: Handbook of Biometrics. Springer (2007)
Spazzapan, E.A., Cardoso, V.M., Fabron, E.M.G., Berti, L.C., Brasolotto, A.G., Marino, V.C.D.C.: Acoustic characteristics of healthy voices of adults: from young to middle age. In: CoDAS, vol. 30. SciELO Brasil (2018)
Geoffrey, S.M., Ewald, E., Ramos, D., González-Rodríguez, J., Lozano-Díez, A.: Statistical models in forensic voice comparison. In: Handbook of Forensic Statistics, pp. 451–497. Chapman and Hall/CRC (2020)
Chung, J.S., et al.: In defence of metric learning for speaker recognition. In: Interspeech (2020)
Yamamoto, H., Lee, K.A., Okabe, K., Koshinaka, T.: Speaker augmentation and bandwidth extension for deep speaker embedding. In: Interspeech, pp. 406–410 (2019)
Cai, W., Chen, J., Li, M.: Exploring the encoding layer and loss function in end-to-end speaker and language recognition system (2018)
Kye, S.M., Kwon, Y., Chung, J.S.: Cross attentive pooling for speaker verification. In: 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021 (2021)
Mun, S.H., Jung, J.-W., Han, M.H., Kim, N.S.: Frequency and multi-scale selective kernel attention for speaker verification. In: 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 548–554. IEEE (2023)
Arias-Vergara, T., Klumpp, P., Vasquez-Correa, J.C., Nöth, E., Orozco-Arroyave, J.R., Schuster, M.: Multi-channel spectrograms for speech processing applications using deep learning methods. Pattern Anal. Appl. 24, 423–431 (2021)
Cheung, S., Lim, J.: Combined multiresolution (wide-band/narrow-band) spectrogram. IEEE Trans. Signal Process. 40(4), 975–977 (1992)
Annabi-Elkadri, N., Hamouda, A.: Automatic silence/sonorant/non-sonorant detection based on multi-resolution spectral analysis and ANOVA method. In: International Workshop on Future Communication and Networking. Szczecin, Poland (2011)
Jahangir, R., et al.: Text-independent speaker identification through feature fusion and deep neural network. IEEE Access 8, 32187–32202 (2020)
Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
Boersma, P., Weenink, D.: PRAAT: doing phonetics by computer (version 6.1.48) (2021). http://www.praat.org
Styler, W.: Using PRAAT for linguistic research. University of Colorado at Boulder Phonetics Lab (2013)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015)
Smith, J.O.: Spectral audio signal processing. W3K (2011)
Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: deep speaker recognition. In: Interspeech 2018 (2018). http://dx.doi.org/10.21437/Interspeech.2018-1929
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2014)
Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a large-scale speaker identification dataset. Interspeech 2017 (2017). http://dx.doi.org/10.21437/Interspeech.2017-950
Sadjadi, O.: NIST 2018 speaker recognition evaluation (2018). https://www.nist.gov/itl/iad/mig/nist-2018-speaker-recognition-evaluation
Agresti, A., Coull, B.A.: Approximate is better than “exact’’ for interval estimation of binomial proportions. Am. Stat. 52(2), 119–126 (1998). https://doi.org/10.1080/00031305.1998.10480550
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Virgilli, R., Candido Junior, A., da Rosa, A.S., Oliveira, F.S., Soares, A.d.S. (2025). Dual-Bandwidth Spectrogram Analysis for Speaker Verification. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_24