1 Introduction

Considerable effort has been directed internationally towards the regulation of Artificial Intelligence, in order to promote safe and trustworthy AI applications. In March 2024, the EU Artificial Intelligence Act was approved by the European Parliament. In Brazil, Bill 2338/2023, inspired by the European AI Act, is currently under consideration and is expected to be voted on before the end of the year. The High-Level Expert Group on Artificial Intelligence, set up by the European Commission, also makes available an Assessment List for Trustworthy Artificial Intelligence (ALTAI) [2]. This list contains a series of questions to guide developers in ensuring that their AI systems adhere to the seven ethical principles proposed by the EU: human agency and oversight; technical robustness and safety; privacy and data governance; transparency; diversity, non-discrimination and fairness; societal and environmental well-being; and accountability. However, these are general guidelines that can be applied to all kinds of AI systems, and it can be challenging to know how to apply them effectively when developing and testing a specific tool.

This article, guided by the aforementioned assessment list, elaborates mainly on the ethical principle of diversity within the scope of speech processing. The idea is to bring to light some important factors that one should consider when choosing data for developing and testing speech applications. To do so, along with the discussion, we present a use case that applies such considerations to the creation of a corpus for the task of prosodic segmentation of spontaneous speech in Brazilian Portuguese (BP). As Brazil encompasses a great diversity of linguistic variants, the principle of diversity, which emphasizes the relevance of developing applications that work for every individual regardless of their differences, can be widely explored. The contributions of this study are: (i) a discussion of the application of the diversity principle in the context of corpora for speech applications, considering some relevant aspects and the process we formulated to select a diverse sample of speakers to compose our corpus; (ii) a literature review of the currently available corpora for the task of prosodic segmentation of spontaneous speech in BP, focused on the diversity of the data; (iii) MuPe-Diversidades, a publicly available spontaneous speech corpus in BP with automatic prosodic segmentation annotation. It contains 2 h 32 min 15 s of audio and is more diverse (balanced in gender and covering accents of 17 Brazilian states, various education levels and ages) than the corpora currently available with prosodic segmentation annotation. We present the discussion of some AI ethical principles and guidelines applied to this scenario in Sect. 2, the task of prosodic segmentation in Sect. 2.1, the corpora available today for prosodic segmentation of spontaneous speech in BP in Sect. 3, the corpus we generated during this study in Sect. 4, the process we formulated to make it as diverse as possible in Sect. 4.1 and its statistics in Sect. 4.2.

2 Background

One of the principles covered in ALTAI [2] is named “Diversity, non-discrimination and fairness” and alludes to the risk that AI systems could unintentionally exacerbate prejudice by making use of incomplete or biased data. Additionally, it holds that systems should be accessible to all people, regardless of their age, gender, abilities or characteristics. One of the questions proposed in the assessment list is “Did you consider diversity and representativeness of end-users and/or subjects in the data?”. In the context of speech processing, several perspectives must be accounted for in order to guarantee that inclusion. For instance, it is important to consider the diversity of segments (consonants and vowels) and of suprasegmentals, such as tone and speech rate (which can be prosodic in nature). The diversity in how these speech units can be expressed by different people justifies careful consideration when dealing with speech processing applications. Some of those units and variations are explained below.

Speech contains specific features related to the production of vowels and consonants. Vocalic segments, for example, can be classified based on their height (open, closed) and may have secondary properties, such as undergoing nasalization [11]. The proposal of [28] presents seven oral vowels: high front [i], high back [u], mid-high front [e], mid-high back [o], mid-low front [ɛ], mid-low back [ɔ], and central [a]. Pretonic vowels, that is, those occurring before the stressed syllable, represented by “e” and “o”, create a distinction between speech patterns: speech in the North/Northeast of Brazil typically exhibits an open production, while speech in the Central-West/Southeast/South regions tends to have a closed realization [15].

The consonantal system of Brazilian Portuguese can be described by place and manner of articulation [13]. One of the secondary articulations of consonants that stands out in terms of linguistic variety is the process of palatalization. In this phenomenon, the tongue moves towards the hard palate, so that the sound of [s], for example, is produced with what is known as a wheezing in some regions of the country (fe[s]ta x fe[ʃ]ta; me[z]mo x me[ʒ]mo) [6]. This change occurs in coda position, which is a final syllable constituent attached to a CV (consonant-vowel) unit [3]. Apart from [s], Brazilian Portuguese exhibits wide variety in the production of [r] in coda, with occurrences of simple tap [ɾ], vibrant [r], retroflex [ɻ], velar fricative [x] and glottal fricative [h] [16].

Additionally, regardless of language, speakers commonly produce pauses at grammatical junctures. The regularity and duration of pauses vary according to speech rate (how fast one is talking), text genre, emotional state, stylistic intention, age and experience of the speaker [31]. Speech rate and fundamental frequency (F0), that is, the rate of vocal fold vibration, which is perceived as the pitch of the voice, also vary according to the characteristics of the speaker and of the spoken text, such as text genre, age and experience.

2.1 Prosodic Segmentation

The task of prosodic segmentation consists of relying on prosodic markers to divide an audio of speech into intonational units (IUs) [21]. There is no precise definition of IUs but they necessarily consist of a grouping of words delimited by prosodic cues, which often include a well-defined pitch contour [4]. IUs can also be split considering terminal boundaries (TB) and non-terminal boundaries (NTB), which establish concluded utterances, and breaks in non-concluded utterances, respectively [25]. Computationally, prosodic segmentation has been explored through diverse approaches, which consider specific sets of features that could be of acoustic or syntactic nature. Such features include change in intensity [14, 20], pauses [4, 14, 20], F0 [14, 20], and speech rate [4, 14, 20].

2.2 Speaker’s Profiles

Each individual has a unique speaking style, which can vary in many traits. However, despite individual differences, in Brazilian Portuguese, a study that measured F0 yielded an average fundamental frequency of 105 Hz for men, 213 Hz for women, 290 Hz for children before puberty, and 440 Hz for newborns [24]. Additionally, the study by Reubold, Harrington and Kleber [22] shows a longitudinal analysis of the extent to which age affects F0 and formant frequencies. It presents the analysis of recordings of two speakers over a 50 year period. For the female speaker studied, a decrease of F0 and F1 (first formant in vowels) is observed as she gets older.

Furthermore, there are geographical and social factors that also strongly impact speech style. For example, depending on their level of education, speakers may or may not adhere to the standard language norm [26]. Deviations from the norm include the syncope of proparoxytone words (“abóbora” - “abóbra”) [5] and consonant substitution (“planta” - “pranta”) [9], which are more frequent in the speech of people with lower levels of education [26].

2.3 Regional Linguistic Variants

There are several characteristics that differentiate accents among regions in Brazil. Not only does each state have a unique accent, but a single city within a state can have its own specific accent as well. The Atlas Linguístico do Brasil (ALiB) [17] presents a set of speech characteristics that occur in the capitals of Brazil. ALiB demonstrates that there is a preference for the realization of open pretonic vowels, as in “t[é]l[é]fone” and “c[ó]p[é]rar”, in the North and Northeast regions of Brazil, while in other regions closed pretonic vowels prevail, as in “t[ê]l[ê]fone” and “c[ô]p[ê]rar”. Tonic vowels are usually nasalized throughout the country when the following segment is also nasal, but there are dialects in which pretonic vowels can be either nasalized, as in j[ã]nela, or open, as in j[a]nela [11].

Among consonantal segments, the process of palatalization of /S/ in coda position presents differences among regions. While it is alveolar in most regions, it tends to be produced with a wheezing, characteristic of palatalization, by speakers from Rio de Janeiro, Recife and Northern states [7].

This process can also occur with the consonants /T/ and /D/, which assimilate the characteristic of the following high vowel /i/, as in “tia”, which can be pronounced with a wheezing, as [ˈtʃia], or without it [12]. For these consonants, three variants can exist, with higher rates of alveopalatal affricates in Rio de Janeiro and lower rates in Recife, as well as other variations of the phenomenon, involving alveolar occlusives [t, d] and alveolar affricates [ts, ds] [1]. A review of this phenomenon can be found in [29].

In Brazil, there is also a wide variety of /R/ sounds, especially at the end of syllables. ALiB shows that in the capitals, there is more realization of the glottal fricative in the North and Northeast of the country, and a higher index of velar fricative in Rio de Janeiro and Espírito Santo. The Atlas indicates that the tap is prominent in São Paulo, Curitiba, and Porto Alegre, while the retroflex has higher indices in Mato Grosso and Mato Grosso do Sul.

Furthermore, [27] shows that there are two melodic patterns when Brazilians ask a certain type of question, dividing the country into two: the North and Northeast regions follow an ascending pattern, while the Central-West, Southeast, and South regions exhibit a different pattern. There are also descriptions related to the intonation of declarative sentences, characterizing the intonational variety of Brazilian capitals.

3 Literature Review on Prosodic Segmentation

The available corpora for the task of prosodic segmentation of spontaneous speech in Brazilian Portuguese are NURC-SP [23], NURC-Recife [18], and C-ORAL BRASIL I and II. While the annotation of NURC-SP and C-ORAL BRASIL I and II follows the linguistic theory presented in Raso, Teixeira and Barbosa’s work [20], which considers terminal and non-terminal intonational units, the annotation of NURC-Recife relied on annotators intuitively segmenting the audios into IUs. The latter presented a high annotator agreement rate (Fleiss’ kappa > 0.7), and each sample was revised thrice by other annotators [18].
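To make the agreement figure above more concrete, the following is a minimal sketch of how Fleiss' kappa is computed from its standard definition. It is not the procedure used in [18], whose details are not described here; the function name `fleiss_kappa` and the toy rating matrices are assumptions made for illustration only. Each row represents one annotated item (for instance, a candidate boundary position) and each column counts how many annotators chose a given category.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a matrix of per-item category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of annotators")

    # Proportion of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Only one category was ever used, so kappa is undefined;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 3 annotators, 2 categories (boundary, no boundary).
perfect = [[3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
```

Under this definition, values close to 1 indicate agreement well above chance, which is the sense in which a kappa above 0.7 is read as high agreement.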

In Raso, Teixeira, and Barbosa’s work [20, 30], the dataset used was composed of a sample of monological audio extracted from C-ORAL-BRASIL I and II and then annotated in terms of prosodic segmentation considering TBs and NTBs. It totals approximately 17 min of spontaneous speech, about 1 min from each of the 14 speakers. There are 4 speakers from Minas Gerais (Sete Lagoas, Rio Pomba, Rio Espera, Belo Horizonte), 2 from the city of Rio de Janeiro, 1 from Pará (Belém), 1 from Santa Catarina (Florianópolis) and 2 from São Paulo (Diadema, São Paulo). The remaining 4 are of unknown origin. All speakers are male and there is no information about their age and level of education. The samples comprise three text genres: informal speech in natural context, formal speech in natural context, and television speech. All samples were classified as having high acoustic quality.

NURC-SP comprises solely the linguistic variant of the capital of São Paulo, but contains around 44 h of audios, with revised transcription and manual annotation of prosodic segmentation with TBs and NTBs. NURC-Recife comprises solely the linguistic variant of Recife and contains around 300 h, of which some feature prepared speech. Both NURC-SP and NURC-Recife feature speakers with higher education, with age ranging from 25 to over 56 years old, comprise audios of varying acoustic quality and are equally divided between men and women [8, 10].

None of these corpora contains information about the specific speech phenomena and accent actually perceived in their speakers, or about whether the speakers migrated from their place of birth. Among them, only C-ORAL BRASIL is not publicly available.

Table 1 exhibits characteristics of existing corpora that contain spontaneous speech in BP with prosodic segmentation annotation, including the one we created during this study in an effort to embrace more Brazilian linguistic variants (MuPe-Diversidades), which is described in Sect. 4. The speech varieties of 8 states of Brazil and of Distrito Federal are not covered by any of the corpora. In particular, the majority of the states not covered are from the North region of Brazil (Amazonas, Roraima, Acre, Amapá, Tocantins), while two Northeastern states (Maranhão and Rio Grande do Norte) and two Midwestern federative units (Mato Grosso and Distrito Federal) also lack representation.

Table 1. Statistics of existing corpora of spontaneous speech in BP with prosodic segmentation annotation. In total, 17 Brazilian states are comprehended and 9 are lacking. Note: Dur: duration, Bal: balanced, unk: unknown, NE: No education, IES: Incomplete Elementary School, CES: Complete Elementary School, TE: technical education, IB: Incomplete Bachelor’s degree, CB: Complete Bachelor’s degree, M: Master’s degree.

4 MuPe-Diversidades

Project TaRSila is working on several corpora of annotated audio, one of which consists of a collection of 289 anonymized life stories shared by São Paulo’s Museu da Pessoa (MuPe) in a partnership with ICMC-USP and the Federal University of Goiás (UFG). It totals 324.09 valid hours of audio, automatically transcribed with WhisperX [19], an ASR model trained on large datasets that provides accurate automatic transcriptions using the large-v2 model of Whisper, and diarized via pyannote-audio, an open-source Python toolkit for speaker diarization. The automatic transcriptions were reviewed by ten linguistics students from the Federal University of Alagoas (UFAL) and from the Faculty of Philosophy, Languages and Human Sciences of the University of São Paulo (FFLCH-USP). This dataset is named CORAA MuPe and is not publicly available yet, but it will soon be released.

CORAA MuPe includes some information about each speaker, but in many cases some of the categories are incomplete. Education level is one of those cases: there are 39 speakers with a bachelor’s degree, and from 1 to 7 speakers in each of the following categories: incomplete middle school, complete middle school, incomplete high school, complete high school, incomplete bachelor’s degree, master’s degree, doctorate and technical education. All speakers were born between 1899 and 1991. For the majority of speakers, there is information about the state and country where they were born; 17 states of Brazil and a few other countries are mentioned. Additionally, around 46.3% (133) of speakers were labeled as women and the remaining 53.7% (154) were labeled as men. Although the dataset is fairly balanced in gender, there is great discrepancy in terms of place of origin. Figure 1 exhibits the distribution of speakers by state of birth. There are 15 speakers labeled with “-”, who were born in other countries but live in Brazil and speak Portuguese.

Fig. 1. Distribution of speakers from CORAA MuPe by state of birth.

Aiming to increase the diversity of Brazilian regional linguistic variants and speaker profiles in speech corpora, particularly for the task of prosodic segmentation, we have carefully created a selection of samples from CORAA MuPe, named MuPe-Diversidades, and are making it publicly available on GitHub. Our dataset contains a sample of approximately 10 min of speech from each state, with the respective transcriptions and prosodic segmentation annotation. In its current state, version 0, each sample consists of the audio and its revised transcription with automatic prosodic segmentation annotation considering TBs and NTBs, conducted according to the methodology described in our previous work [10]. However, during the manual review of the prosodic segmentation, we noticed that the automatic cuts at every 30 s, which were part of the automatic transcription process carried out with WhisperX, jeopardize the quality of the prosodic segmentation, as some prosodic information is lost at every cut. Thus, a second version, named version 1.0, containing the audio samples without cuts and their respective transcriptions with manually reviewed prosodic segmentation annotation, will be available soon. Furthermore, the anonymization process consists of replacing names with their initials in the transcriptions, such as replacing “Ana” with “[A]” or “Zé Luís” with “[ZL]”, and silencing the respective parts in the audio.
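The transcription side of the anonymization described above can be illustrated with a minimal sketch. It assumes that the list of names to anonymize is already known for each interview; the function names `initials_tag` and `anonymize_transcript` are hypothetical and do not correspond to the tooling actually used for the corpus. Silencing the corresponding stretches of audio is not covered here.

```python
import re
from typing import Iterable


def initials_tag(name: str) -> str:
    """Build the bracketed initials for a name, e.g. 'Zé Luís' -> '[ZL]'."""
    return "[" + "".join(part[0].upper() for part in name.split()) + "]"


def anonymize_transcript(text: str, names: Iterable[str]) -> str:
    """Replace every listed name in the transcript with its initials tag."""
    # Longer names go first so that 'Zé Luís' is handled before 'Zé'.
    for name in sorted(names, key=len, reverse=True):
        # Lookarounds keep 'Ana' from matching inside a longer word
        # such as 'Anabela'; \w is Unicode-aware for Python str patterns.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        tag = initials_tag(name)
        text = re.sub(pattern, lambda _match, tag=tag: tag, text)
    return text


example = anonymize_transcript(
    "Ana falou com Zé Luís e com Anabela.", ["Ana", "Zé Luís"]
)
```

Matching whole names only is a design choice of this sketch: it avoids altering unrelated words that merely begin with a listed name.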

4.1 Methodology for Creating a Diverse Sample, MuPe-Diversidades

Regarding the selection, it would be ideal to include a balanced number of people representing each unique combination of features (gender, age, location, education level). However, there are so many variables, and resources are often so limited, that it may only be feasible to formulate strategies to include as much diversity as possible. An interesting strategy regarding age is to create different age groups. NURC-SP creates three (I:25–35, II:35–55, III:56+) [23], which we use in Table 2 to classify our speakers by age group. In our case, we could only afford to include 2 speakers from each state, so we aimed to include the oldest and the youngest speakers of each state, one of each gender and from different cities. We could not consider education as a criterion due to the lack of data. It would be ideal to choose only speakers who did not migrate, but in many cases this was not possible. We extracted 5 min of audio from each selected speaker, except for states with only one speaker, in which case around 10 min were extracted. Additionally, considering that other speech datasets for BP, such as NURC-SP and NURC-Recife, target the linguistic variety of capitals of Brazil, here we prioritized cities other than the capitals. For each state, whenever possible, the rules to select a speaker were as follows, in this exact order:

  1. Select the speaker with the earliest year of birth, as long as their city of origin is not the capital of their state;

  2. Select the speaker with the latest year of birth, as long as they are of a different gender, from a different city and not from the capital;

  3. In case a selected speaker migrated to another state and presented some characteristics of the accent of the state to which they moved, while not presenting some characteristics of the accent of their state of origin according to ALiB, the next youngest/oldest speaker who did not migrate, migrated at an older age, and/or did present such characteristics was preferred (this substitution occurred only for RJ1, regarding the [ʃ]).
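The first two rules above can be sketched in code as follows. This is only an illustration under stated assumptions: the `Speaker` record, its field names and the `select_pair` function are hypothetical, and the fallback used when no second speaker satisfies rule 2 (returning only the first speaker) is an assumption, since the text only says the rules were applied "whenever possible". Rule 3 depends on a perceptual judgment of the accent against ALiB and is therefore left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Speaker:
    speaker_id: str
    gender: str          # "M" or "F"
    year_of_birth: int
    city: str
    from_capital: bool   # True if the city of origin is the state capital


def select_pair(state_speakers: List[Speaker]) -> List[Speaker]:
    """Apply rules 1 and 2 to the speakers of a single state."""
    candidates = [s for s in state_speakers if not s.from_capital]
    if not candidates:
        return []
    # Rule 1: earliest year of birth, not from the capital.
    oldest = min(candidates, key=lambda s: s.year_of_birth)
    # Rule 2: latest year of birth, different gender and different city.
    others = [
        s for s in candidates
        if s.gender != oldest.gender and s.city != oldest.city
    ]
    if not others:
        return [oldest]  # assumed fallback, not stated in the text
    youngest = max(others, key=lambda s: s.year_of_birth)
    return [oldest, youngest]


# Hypothetical speakers for one state.
pool = [
    Speaker("A", "M", 1920, "Caruaru", False),
    Speaker("B", "F", 1980, "Garanhuns", False),
    Speaker("C", "F", 1990, "Recife", True),
    Speaker("D", "M", 1985, "Garanhuns", False),
]
chosen = [s.speaker_id for s in select_pair(pool)]
```

In this toy pool, speaker C is excluded for being from the capital and speaker D for sharing the gender of the oldest speaker, which mirrors the priority given to non-capital cities and to gender balance.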

The steps above were elaborated considering the context of our corpus, which has a very limited number of speakers for many states and contains fewer people from age groups I and II. It would also be relevant, due to differences in F0, to include teenagers’ and children’s speech samples, but CORAA MuPe did not contain any speakers below the age of 18.

4.2 Statistics

The resulting corpus contains 14 (47%) male speakers and 16 (53%) female speakers, ranging from 20 to 91 years old, born in 17 distinct states of Brazil. Table 2 presents information about the duration of each audio sample, the gender, city, state, level of education (information which was collected by listening to the interviews) and year of birth of the respective speaker, and whether they migrated to another state and in which approximate year, if such information was available. There is one speaker who did not remember her age, as stated in her interview. In her case, the year of the interview is indicated instead. For the others, the age was estimated based on the year of their interview. Metadata indicates that at least 67% of speakers migrated from their state of origin, many of them at a young age. This means they may have lost some of the characteristics of their original accents.
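The age estimation and the age-group stratification used in Table 2 can be sketched as below. The function names are hypothetical. Two assumptions are made explicit: since the published groups I (25–35) and II (35–55) share the boundary at 35, this sketch assigns age 35 to group I, and ages below 25 are treated as falling outside the three groups.

```python
from typing import Optional


def estimate_age(year_of_birth: int, interview_year: int) -> int:
    """Approximate age at recording time from year of birth and interview year."""
    return interview_year - year_of_birth


def age_group(age: int) -> Optional[str]:
    """Map an age to the NURC-SP style groups I, II and III.

    Assumptions of this sketch: 35 is placed in group I, and ages
    under 25 belong to no group (None).
    """
    if age < 25:
        return None
    if age <= 35:
        return "I"
    if age <= 55:
        return "II"
    return "III"


# Hypothetical speaker born in 1930 and interviewed in 2010.
example_age = estimate_age(1930, 2010)
example_group = age_group(example_age)
```

Because only the years are known, the estimate can be off by one year depending on the dates of birth and of the interview.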

It is important to remember that one or two speakers are not enough to reliably represent all characteristics of the accent of a region, and that their accent may be influenced by other factors, such as migration. Also, each person presents individual speech characteristics beyond their accent, and having few speakers could make it harder for a machine learning model to generalize and correctly identify the elements of a given accent. Furthermore, we could only include 17 states of Brazil.

To provide greater transparency about which phonetic phenomena are included in our corpus, we also classified some of the phonetic occurrences explained in Sect. 2. They were observed in the audio samples with the help of Praat. Table 3 presents such characteristics, including the classification of the realization of /R/, the realization of /S/ with or without wheezing, at the end or in the middle of words, open or closed pretonic vowels, and palatalization of /T/ and /D/. For the latter, we considered only two categories, palatalized or non-palatalized /T/ and /D/, making no distinction between [ts, ds] and [tʃ, dʒ]. For /R/, we consider tap [ɾ], retroflex [ɻ], vibrant [r] and fricative, making no distinction between glottal [h] and velar [x] fricatives. The audio samples of the speakers were listened to in full, and the categories attributed to each speaker were the ones recognized at least twice; they appear in the table in order of apparent frequency, from left to right. The production of pretonic vowels was particularly hard to classify because their realization can differ depending on the word, so speakers who realized an open pretonic vowel at least once in the audio were labeled “open”.
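The attribution criteria just described can be summarized in a short sketch. The classification in the corpus was done by ear with the support of Praat, not automatically; the code below only restates the decision rules, assuming the perceived occurrences have already been written down as a list of labels. The function names and the label strings are hypothetical.

```python
from collections import Counter
from typing import List


def attribute_categories(occurrences: List[str], min_count: int = 2) -> List[str]:
    """Keep labels perceived at least min_count times, most frequent first."""
    counts = Counter(occurrences)
    # most_common() orders by decreasing count, matching the left-to-right
    # ordering by apparent frequency used in the table.
    return [label for label, count in counts.most_common() if count >= min_count]


def pretonic_label(occurrences: List[str]) -> str:
    """Label a speaker 'open' if at least one open pretonic vowel was perceived."""
    return "open" if "open" in occurrences else "closed"


# Hypothetical notes for one speaker: f = fricative, t = tap, r = retroflex.
r_variants = attribute_categories(["f", "t", "f", "r", "t", "f"])
vowel_label = pretonic_label(["closed", "closed", "open"])
```

Here the retroflex, perceived only once, is discarded, while a single open pretonic vowel is enough for the "open" label, reflecting the asymmetry between the two criteria.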

Table 2. Information about each speaker of MuPe-Diversidades, each identified by an ID. The table contains information about gender (G) (M: male, F: female), year of birth (YOB), estimated age at the time of the recording, stratified age group (S) (I:25–35, II:35–55, III:56+), city and state of origin, education level (NE: No education, IES: Incomplete Elementary School, CES: Complete Elementary School, TE: technical education, IB: Incomplete Bachelor’s degree, CB: Complete Bachelor’s degree, M: Master’s degree), approximate year and destination of migration to another state if it occurred (Y), or whether it did not (No), and the duration of the audio sample included in the corpus (Dur.). Given the nature of this data, which was collected through the interviews, some information about migration is lacking, which is marked with “unk” for “unknown”. The corpus totals 2 h 32 min 15 s.

Table 3. Speech phenomena perceived in each speaker from MuPe-Diversidades, identified by their IDs. Here, [s] indicates no wheezing and [ʃ] indicates wheezing at /S/; “mid” stands for a mid-word occurrence and “end” indicates occurrences at the end of the word. “NP” stands for “non palatal”, “P” stands for “palatal”, “f” stands for “fricative”, “r” stands for “retroflex”, “t” stands for “tap”, “v” stands for “vibrant”. The final rows indicate the frequency of the phenomena throughout the corpus, considering how many audios of the corpus contain each pronunciation style.

Fig. 2. Screenshots of productions of specific phonemes in samples extracted from MuPe-Diversidades, analyzed in Praat.

According to ALiB, open pronunciation of pretonic vowels should prevail in Northern and Northeastern states, but 10 speakers from those regions realized closed pronunciation, as opposed to 6 from the same regions who produced open pretonic vowels. No clear relation with migration can be established, as the majority of these speakers migrated, regardless of their pronunciation of pretonic vowels. The variation in the pronunciation of /S/ seems better represented, as most speakers from Pará, Rondônia, Rio de Janeiro and Pernambuco realize the wheezing, as do some speakers from other states, mainly Northeastern ones. However, it is crucial to remember that ALiB’s conclusions rely on the accent of the capitals, while this corpus features mainly speakers from other cities of each state, which could have distinct accent characteristics.

Concerning /R/, while in our corpus the tap occurs in all Southern states and in São Paulo, it also occurs in many other states, which could be explained by the migration of many speakers to São Paulo. Many speakers across states pronounce a fricative /R/, which is the most common pronunciation of /R/ in the corpus, occurring in 77% of the audios.

In Fig. 2, we include some screenshots of audio samples of certain segments observed with Praat to exemplify what we considered to be members of each category. They contain the respective oscillograms and spectrograms of the sound. As for the difference between open and closed pretonic vowels, we extracted the values of formants 1 and 2 of two samples as an example. The sample of the open pretonic vowel of the word “português” from PE2 showed F1 702.52 Hz and F2 1246.58 Hz at 10.57 s, while the sample of the closed pretonic vowel of the word “coordenador” from RS2 showed F1 467.72 Hz and F2 1067.18 Hz at 19.22 s.
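The two measurements above can be compared numerically with a small sketch. It relies on the general phonetic observation that F1 tends to rise as a vowel becomes more open, so the open pretonic vowel is expected to show the higher F1. The `FormantSample` container and the helper function are hypothetical names used only for illustration; the values are the ones reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormantSample:
    speaker_id: str
    word: str
    f1_hz: float
    f2_hz: float
    time_s: float


# Values reported in the text for the two example vowels.
open_vowel = FormantSample("PE2", "português", 702.52, 1246.58, 10.57)
closed_vowel = FormantSample("RS2", "coordenador", 467.72, 1067.18, 19.22)


def f1_difference(a: FormantSample, b: FormantSample) -> float:
    """Difference in first formant (Hz) between two samples, rounded to 2 decimals."""
    return round(a.f1_hz - b.f1_hz, 2)


difference = f1_difference(open_vowel, closed_vowel)
more_open = open_vowel if difference > 0 else closed_vowel
```

A single pair of tokens from two different speakers and words is of course only an example and not a basis for a threshold between open and closed realizations.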

5 Conclusions and Future Work

As discussed in Sects. 2.2 and 2.3, in the context of speech processing it is important to cover different speaker profiles, including level of education, age, gender, place of origin and accent, in order to comply with the AI ethical principle of diversity and thus promote technology that works equally well for all kinds of speakers.

Moreover, the transparency principle in ALTAI [2] includes the question “Did you communicate the technical limitations and potential risks of the AI system to users, such as its level of accuracy and/or error rates?”. Thus, it is also sensible to test applications on different linguistic variants and speaker profiles and to analyze their performance in each case. It is likewise reasonable to argue that reporting those characteristics of speech corpora is relevant for ensuring transparency and for allowing developers to take such limitations into consideration.

As for MuPe-Diversidades, it is far from being complete and representative of all kinds of speaker profiles and regional variants of BP. It lacks material from people under 18, who are known to present different ranges of F0, from speakers of 9 Brazilian states, from speakers with different combinations of age, gender, place of origin and education level, as well as from many other BP speakers who may produce phenomena that are not covered in this small sample. It also covers only one text genre: interviews about life stories. However, the corpus also represents the inclusion of 13 regional variants of Brazilian states that were not yet covered, and the addition of 2 h 32 min 15 s of audio with prosodic segmentation annotation to the collection of corpora available today for prosodic segmentation of spontaneous speech in BP. We hope this work encourages researchers to reflect on these issues, contribute to the creation of additional resources for BP, and prioritize the inclusion of greater diversity in future research.