title: Voice modulatory cues to structure across languages and species
authors: Matzinger, Theresa; Fitch, W. Tecumseh
date: 2021-12-20
journal: Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences
DOI: 10.1098/rstb.2020.0393

Voice modulatory cues such as variations in fundamental frequency, duration and pauses are key factors for structuring vocal signals in human speech and vocal communication in other tetrapods. Voice modulation physiology is highly similar in humans and other tetrapods due to shared ancestry and shared functional pressures for efficient communication. This has led to similarly structured vocalizations across humans and other tetrapods. Nonetheless, in their details, structural characteristics may vary across species and languages. Because data concerning voice modulation in non-human tetrapod vocal production, and especially perception, are relatively scarce compared to human vocal production and perception, this review focuses on voice modulatory cues used for speech segmentation across human languages, highlighting comparative data where available. Cues that are used similarly across many languages may help indicate which cues result from physiological or basic cognitive constraints, and which cues may be employed more flexibly and are shaped by cultural evolution. This suggests promising candidates for future investigation of cues to structure in non-human tetrapod vocalizations. This article is part of the theme issue ‘Voice modulation: from origin and mechanism to social impact (Part I)’.

Although human speech is often thought to be categorically different from non-human animal vocal communication, many aspects of human acoustic communication are directly comparable with those of other land vertebrates. These include both the vocal apparatus itself and the main voice modulatory cues involved in vocal production (see endnote 1). In this review, we will argue that voice modulatory cues are similar in the vocal communication of humans and other tetrapods because of (i) shared ancestry, resulting in a similar voice modulation physiology, and (ii) shared functional bases, i.e. similar pressures for efficient communication, resulting in similar cognitive processing due to domain-general mechanisms shared among species. Voice modulatory cues that are shared and have similar functions in human and non-human tetrapod vocalizations, as well as across languages, can be hypothesized to result from anatomical, physiological and cognitive mechanisms that are evolutionarily conserved [4-6]. These include vocal tract anatomy and respiratory constraints, along with domain-general learning constraints and/or cognitive production and perception constraints (e.g. attention and memory; [4, 7, 8]). By contrast, cues that are not paralleled in other tetrapods' vocalizations and that vary across languages may rely on less evolutionarily conserved mechanisms and therefore have greater potential to be shaped by cultural evolutionary processes. For example, the learnability and transmissibility of vocal features to future generations of signallers may be influenced not only by general mechanisms such as how easily the vocal features can be processed, but also by the social environment [9-14].
Thus, factors such as group identity, community size or prestige may lead to different conventions of voice modulatory patterns in different communities [15, 16]. In this review, we attempt to begin disentangling which voice modulatory cues are the result of physiological constraints, of domain-general cognitive mechanisms, and of species- or language-specific conventions and learning pressures, aiming to contribute to the understanding of voice modulation in general evolutionary and cognitive terms. Because this is a very large research programme, our review will cover only some specific aspects of voice modulation. In the first section, we compare different voice modulatory cues across human speech and tetrapod communication, including pauses, fundamental frequency and syllable/unit duration. We discuss similarities and differences in the physiological mechanisms underlying these cues, and then discuss how the effort of producing and perceiving them may be linked to functional pressures in the environment. In the second part of the review, we take a comparative approach across languages, asking whether the voice modulatory cues used for speech segmentation are similar or different across human languages. Especially for the many voice modulatory cues for which animal data remain scarce, comparisons between different human languages may provide valuable insights as to whether the physiological and cognitive mechanisms behind those cues are species-typical (and therefore may be evolutionarily conserved and domain-general) or more flexible and language-specific. Finally, our review will identify research gaps and suggest avenues for further work that may help more clearly reveal the physiological and cognitive mechanisms underlying the realization of different voice modulatory cues. Overall, our comparison of voice modulatory cues in tetrapod vocalizations and across various human languages will show that biological evolution can constrain cultural evolution, and that many of the structures and cues widely used in human speech rely upon basic acoustic and cognitive mechanisms that humans share with other tetrapods.

Humans and other tetrapods share many similarities in the physiological mechanisms used to produce vocal signals. Multiple similarities result from shared respiratory mechanisms, which in turn result from shared ancestry during biological evolution [17, 18]. Most tetrapods, including humans, produce vocal signals in a two-stage process: first, a source generates acoustic energy using an airflow from the lungs. This source is the larynx in most tetrapods and the syrinx in birds, and consists of vibrating tissue that creates sound by oscillating at a particular rate termed the fundamental frequency (f0 hereafter). This source signal is then filtered in the supralaryngeal vocal tract (upper respiratory tract), whose resonances (formant frequencies) act as a series of bandpass filters, attenuating or enhancing certain frequency ranges. The actual vocal output fuses these two components (source and filter), which are mostly independent, meaning that f0 can vary freely independently of formants and vice versa. This process, summarized as the source-filter theory of vocal production, is shared by humans and most other tetrapods [1, 19-21], with the exception of toothed whales [22] and certain whistle vocalizations (e.g. in rodents; [23]).
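The independence of source and filter can be illustrated with a small synthesis sketch: a periodic source at a chosen f0 is passed through a cascade of resonant (formant) filters, and either component can be changed without affecting the other. This is a minimal illustration only; the function names, formant values and bandwidths below are ours and purely illustrative, not taken from the studies cited above.

```python
import numpy as np
from scipy.signal import lfilter

def impulse_train(f0, duration, sr=16000):
    """Crude glottal/syringeal source: impulses repeating at the fundamental frequency f0 (Hz)."""
    src = np.zeros(int(duration * sr))
    src[::int(sr / f0)] = 1.0
    return src

def formant_filter(source, formants, bandwidths, sr=16000):
    """Filter the source with one second-order resonator per formant (the 'filter' stage)."""
    out = source.copy()
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / sr)                       # pole radius set by the bandwidth
        a = [1.0, -2 * r * np.cos(2 * np.pi * f / sr), r ** 2]
        out = lfilter([1.0], a, out)                       # apply the resonator
    return out

# Source and filter can be varied independently: same formants, different f0.
formants, bandwidths = [500, 1500, 2500], [80, 90, 120]    # illustrative values only
low_pitch = formant_filter(impulse_train(120, 0.5), formants, bandwidths)
high_pitch = formant_filter(impulse_train(220, 0.5), formants, bandwidths)
```

Raising f0 from 120 to 220 Hz changes the perceived pitch of the output but leaves the formant structure, and hence the vowel-like quality, unchanged, which is the source-filter independence described above.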
This shared physiological basis of vocal production leads to many similarities in both the production and the acoustic output of humans and other tetrapods. Nonetheless, while constrained by physiological production mechanisms, voice modulatory cues can to a certain extent be flexible, and dynamic modifications of particular acoustic parameters can provide structure to the vocal output. Specific voice modulatory cues and the extent to which they can vary (table 1) are reviewed below. In particular, we focus mainly on three cues that are well investigated with regard to speech segmentation across human languages and will therefore be most relevant for the later sections of this review: pauses, pitch and durational cues.

Among the most distinctive voice modulatory cues are pauses in the vocal signal, which often result from the need to breathe by alternating between exhaling and inhaling. Typically, tetrapods vocalize during exhalation, and vocalizations pause during inhalation. However, some non-human tetrapods vocalize during both exhalation and inhalation, and thus do not need to pause during vocalization (e.g. donkey braying, chimpanzee pant hoots or howler monkey howling, during which inhaling vocalizations are shorter than exhaling vocalizations, but similar in terms of structure and amplitude; [24]). Humans are also capable of ingressive vocalizations such as gasps and chuckles, but these usually do not replace respiratory pauses and are less flexible in encoding meaning than egressive vocalizations [25-27]. While pauses in tetrapods result from the same physiological mechanism, i.e. respiratory pausing, and are thus constrained by the individuals' lung capacities, they can also vary in their specific realizations. For example, pauses can differ in their duration, number and position in the vocal stream. Because of this flexibility, tetrapods, including humans, can use pauses to structure the vocal signal in many different ways [28]. For example, birdsong is structured into units commonly termed 'syllables' that are separated by short pauses during which rapid inhalations ('mini-breaths') occur [29].

The duration of phonation at the source can induce durational and rate variations in the vocal output. These durational variations can extend over different domains of the vocal output, such as individual sounds, individual syllables/units or larger stretches of vocalizations [30, 31]. This can, for example, lead to different rhythmic patterns, to differences in vocalization tempo, or to distinctive vowel sounds in human speech, where phonemic distinctions between long and short durations are frequent. The duration of one syllable can also disambiguate neighbouring phonemes, as exemplified in the American English words ladder (/æ/ longer) and latter (/æ/ shorter), which differ only in their vowel length [32]. Human speech sounds that differ in their vowel quality (determined by formants), such as the vowels in the English words feet and fit, may also have distinctive lengths. Again, physiologically, durational variations are limited by the individuals' breathing capacities, but below that limit, duration can be varied more or less flexibly to give structure to the vocal output of humans and most non-human tetrapods alike.

Vocal signals are further characterized by the vibration rate of the vibrating tissue, which determines the signals' f0, often termed pitch in the speech literature [21].
Typically, in tetrapods, f0 is influenced both by subglottal air pressure and by muscles that regulate the length and tension of the vibrating tissues, i.e. the vocal folds in non-avian tetrapods and the syringeal membranes in birds [33-35]. By modulating these two factors, pitch can be varied within and between vocal signals. To increase pitch, individuals can either increase the subglottal air pressure or the tension of the vibrating tissues. Both of these options require increased effort (see §3) and can provide diversity and structure to vocal signals. For example, at the level of syllables, an increase in pitch typically signals emphasis ('stress' in the speech literature), whereas pitch modulation on the phrase level can function as a boundary signal [36-39]. Again, the effort required for pitch modulation, and physiology such as the dimensions of the vibrating tissues, limit the pitch range that can be realized. However, within that range, pitch can be employed flexibly to structure the vocal signal in different ways, as evidenced by the different stress patterns observed in different languages [40].

Fundamentally, tetrapods share these voice modulatory cues because of their shared vocal production physiology, which in turn results from their shared ancestry. Nonetheless, the specific uses and manifestations of these cues can vary considerably across species and languages. For example, species, languages and individuals may differ in when and where they make pauses, when and where pitch rises and falls, or which segments they lengthen or shorten. One useful principle for categorizing and understanding this variation in vocal signals is based on the effort it takes to place emphasis in the vocal signal, using various voice modulatory cues. Thus, the following section will address emphasis and effort in the production of vocal signals, how they are influenced by functional pressures and how this can lead to the cultural evolution of prosodic patterns.

It seems intuitively obvious that vocal signals can carry emphasis, and that this requires effort. In particular, it takes more effort to produce emphasized or stressed (i.e. louder, longer and higher-pitched) syllables than non-emphasized or unstressed ones. However, despite a common assertion that producing certain voice modulatory cues is more 'energetically efficient' than producing others [41-45], the exact metabolic costs needed to produce and process these cues have rarely been systematically compared. In fact, several studies have shown that vocalizing is not very costly in terms of the oxygen, glucose or ATP needed [46-50]. Thus, although it is clear that tensing muscles requires energy consumption, the costs involved in contracting the tiny muscles controlling source characteristics like f0 may not be appreciable relative to an organism's overall energy budget. Respiratory muscles are larger and potentially more energy-consuming, but they need to be constantly working to serve respiratory functions, independent of vocalization. The relative cost of increased versus decreased pitch or duration during normal speech and frequently produced animal vocalizations will represent an even smaller proportion of net energy expenditure (see endnote 2). Finally, the cost of neuronal firing involved in producing or perceiving vocalizations is real, but also very difficult to quantify using current methods.
Therefore, at present, we have little choice but to adopt an intuitive definition of 'effort', which can manifest as dynamic effort, i.e. muscular effort for moving the articulators, and neural control effort, i.e. cognitive effort for planning, producing and processing voice modulatory cues. The term 'stress' is used in phonology essentially as a catch-all term, connoting effort and emphasis, but not grounded in detailed syllable-by-syllable measures of expended effort. How much effort senders will invest in emphasizing vocalizations is largely driven by the interplay of functional pressures for successful versus efficient communication [42, 51]. These pressures may also influence which parts of the signal are emphasized. Emphasis can either extend over the whole signal (e.g. louder vocalizations in noisy environments) or be specific to certain elements of the signal (e.g. stressing certain phrases or syllables); the latter should be more energetically efficient, so we may expect organisms to vary cues across a vocal stream in many cases, as humans do with speech.

One well-studied example where signals are emphasized in their entirety is the so-called Lombard effect: both humans and other tetrapods, including non-human primates, birds and whales, tend to vocalize louder and with a higher pitch, i.e. with increased effort, when there is more background noise [52-56]. When background noise in the environment is reduced, signallers return to vocalizations that need less effort and decrease their pitch and intensity. A recent example in birdsong occurred when traffic reductions during the COVID-19 pandemic resulted in lower-frequency bird vocalizations, showing that signallers can flexibly adapt their vocalizations to functional pressures in the environment [57].

Further examples of signals with emphasized elements include rhythmic vocalizations and stress or intonation patterns. This kind of emphasis requires both dynamic and cognitive effort on the part of the sender, but creates structure in the signal, which may reduce error, combat habituation or facilitate meaning encoding and processing on the part of the listener [58]. The complex interplay of pressures acting on the sender and receiver may lead to variation in vocal signals that is not fixed genetically but influenced by current properties of the environment [9, 10], and shows that once individuals begin to produce vocal cues, there is an opportunity to modulate them. Furthermore, in species that learn their vocalizations (e.g. birdsong or human speech), small production or perception biases for or against certain voice modulatory structural patterns in a certain environment may be amplified over generations of speakers [9]. This may lead to a process of cultural evolution and can result in within-species variation in the structural patterns of vocalizations, as exemplified by different human languages or different dialects in other tetrapods' vocalizations [59, 60]. Thus, overall, how exactly the different voice modulatory cues are used varies within physiological constraints and results from a balancing act between communicating successfully and communicating with low effort. This in turn depends on the functional pressures of listeners and the environment, which can vary between different species and languages, and may include factors such as cultural evolution.
How exactly different species and different linguistic communities deal with different functional pressures depends on domain-specific factors such as auditory salience, on domain-general cognitive constraints such as memory and attention, and also on more flexible constraints such as social factors. All of these factors combine to constrain the range within which the different voice modulatory cues can be realized and determine the actual vocal output seen in a language or a species.

Different realizations of voice modulatory cues have been heavily investigated in human languages, but similar investigations in non-human tetrapod vocalizations are comparatively scarce and less systematic. Over the past decades, bioacoustics has made considerable advances in the investigation of non-human tetrapod vocal production, but research on the perception of voice modulatory cues in non-human tetrapods is still in its infancy [61, 62]. It is especially difficult to reach firm conclusions about the communicative meaning of voice modulatory structures found in non-human tetrapod vocal signals, given how few cues and species have been systematically investigated. Therefore, the remaining sections of this review will mainly focus on the comparison of voice modulatory cues across human languages, and specifically the voice modulatory cues that help listeners to segment continuous speech into words. When voice modulatory cues are realized similarly across human languages, this suggests that fundamental physiological constraints or basic cognitive mechanisms may be responsible for these patterns [4-6], and that therefore, due to shared ancestry, similar cues may also be prevalent in non-human tetrapod vocalizations. We suggest that such patterns may provide starting points for investigating modulation in tetrapod vocal signals. By contrast, cues that differ across different linguistic communities may be largely influenced by different functional pressures in the environment and by cultural evolutionary processes, and are therefore more likely to also differ across tetrapod vocalizations.

Comparing voice modulation across human languages and non-human animal vocalizations, and using similarities and differences between them to draw conclusions about the evolutionary roots of vocal communication, is not new [2, 63-66]. Similar approaches have already been proposed, for example, by Morton [64, 65], who suggested that high- and low-pitched vocalizations signal similar emotions and attitudes across languages and species. Across species, a low pitch signals largeness, dominance and self-confidence, whereas a high pitch signals smallness, submissiveness and prosociality. Ohala [67] suggests that this biological grounding helps to explain prosodic patterns that are consistent across human languages, such as a final pitch decrease in declarative statements (i.e. utterances signalling dominance and self-assurance) and a final pitch increase in questions (i.e. utterances signalling insecurity, submissiveness and need). Past approaches typically either avoid detailing the specific acoustic cues [66], or treat these cues as fixed for a particular sound class (e.g. low-pitched growls and high-pitched whines). Our goal below is to call attention to how dynamics within a call can play a role in structuring acoustic signals, and to investigate the specific acoustic parameters that are varied.
Furthermore, our approach extends previous proposals by highlighting the importance of listener-associated cognitive factors, such as perceptual salience, memory, attention and the learnability of prosodic patterns, for biological and cultural evolution. Finally, our proposal captures a more diverse range of prosodic patterns than previous accounts. In contrast with Ohala [67], who explained prosodic patterns primarily by drawing on emotional communication, our account attempts to explain a more diverse set of linguistic structures and meanings.

One crucial first step in the acquisition of linguistic structure is the segmentation of fluent speech into words, before the words' meaning is known. This so-called speech segmentation problem is most acute for infants learning their first language, but also concerns second language learners. For adults, the challenge is particularly evident when they try to identify distinct words while listening to an unfamiliar foreign language [68-70]. Nevertheless, language learners eventually solve the speech segmentation problem with apparent ease. This is because they implicitly use various cues in the speech stream to identify patterns and regularities, which in turn help them to extract words. Such cues may also play a role in complex sequence learning in bird or whale song (e.g. [71]), but this possibility remains little explored. Speech segmentation is a challenge that speakers of all human languages have to face and that is therefore well suited for cross-linguistic comparisons. Over the past decades, cues used in human speech segmentation have been the subject of a large body of research in a variety of different languages such as English [72-76], German [77-80], Italian [78, 79, 81], French [74], Dutch [74], Spanish [79, 82], Portuguese [83], Basque [79], Japanese [73], Cantonese, Mandarin and Russian [84]. This makes it possible to compare the characteristics of speech segmentation cues across languages, to answer questions about the more general physiological and cognitive mechanisms that are necessary to create and process linguistic structure, and to identify functional pressures in the respective environments.

Among the cues that have been identified as very important for speech segmentation and creating linguistic structure are transitional probability cues (statistical learning) and the voice modulatory cues that are our focus (e.g. [68, 73-75, 85-92]). Transitional probability cues are based on listeners tracking the co-occurrence frequencies of syllables in vocal input ([75, 93]; see [94] for a meta-analysis). For example, when hearing the sound sequence pretty#baby, listeners can infer that pretty and baby are distinct words because the syllables pre and ty, as well as ba and by, also co-occur in other sequences such as pretty#girl or lovely#baby. By contrast, ty and ba co-occur less frequently and can therefore be assumed to span a word boundary [95]. Speakers of a wide variety of languages have been demonstrated to use such transitional probability cues for language acquisition in similar ways (English: e.g. [72-76]; German: [77-79]; Italian: [78, 79, 81]; French: [74]; Dutch: [74]; Spanish: [79, 82]; Portuguese: [83]; Basque: [79]; Japanese: [73]).
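The pretty#baby example can be made concrete with a small sketch of how transitional probabilities are computed and used for segmentation: the probability of each syllable given the preceding one is estimated from co-occurrence counts, and dips in that probability are treated as candidate word boundaries. This is a toy illustration of the general principle, not a reimplementation of any cited study; the pseudo-words, threshold and random seed are ours.

```python
import random
from collections import Counter

PSEUDO_WORDS = [["tu", "pi", "ro"], ["go", "la", "bu"], ["bi", "da", "ku"]]

def make_stream(n_words=200, seed=1):
    """Randomly concatenate pseudo-words into a continuous syllable stream."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_words):
        stream.extend(rng.choice(PSEUDO_WORDS))
    return stream

def transitional_probabilities(syllables):
    """P(next syllable | current syllable), estimated from bigram and unigram counts."""
    bigrams = Counter(zip(syllables, syllables[1:]))
    unigrams = Counter(syllables[:-1])
    return {(a, b): count / unigrams[a] for (a, b), count in bigrams.items()}

def segment(syllables, tps, threshold=0.6):
    """Posit a word boundary wherever the forward transitional probability dips below the threshold."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

stream = make_stream()
tps = transitional_probabilities(stream)
print(sorted(set(segment(stream, tps))))   # expected: ['bidaku', 'golabu', 'tupiro']
```

Within-word transitions (e.g. tu followed by pi) occur with probability 1 in this toy stream, whereas transitions across word boundaries occur with a probability of roughly one-third, so thresholding the forward probability recovers the pseudo-words, which is the logic attributed to statistical learning in the studies discussed above.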
Notably, producing different speech sounds and syllable identities is itself a form of voice modulation and is a prerequisite for creating distinct syllables, and thus for tracking transitional probabilities. Specifically, individual vowels and consonants are created by moving the articulators, which leads to different formant frequency patterns (see table 1; [96]). While different languages have different speech sound inventories [40, 97], the cross-linguistic ability to modulate the voice in a way that produces different speech sounds is crucial for the cross-linguistic use of transitional probabilities for speech segmentation.

Using transitional probabilities to infer characteristics of a signal appears to be a very general behaviour, since in basically any domain of action, including animal vocalizations, certain events are more likely to follow each other than others [98, 99]. In humans, the identification of transitional probability cues appears to be based on a domain-general cognitive mechanism, namely statistical learning [100-103]. Furthermore, statistical learning is not a uniquely human cognitive mechanism, and other species have also been demonstrated to use it to deduce signal structure [104]. Such statistical associations can even be formed across species; for example, many non-human animals form associations between heterospecific alarm calls and the presence of a predator [105, 106]. Also, vocal learning in non-human animals, most notably in birds, is suggested to be supported by statistical computations, although the precise mechanisms behind it are not yet fully understood [104]. It thus seems likely that both humans and many non-human tetrapods rely on a combination of statistical learning and acoustic modulations when learning the structure of their species-specific sound sequences.

Statistical learning is a very general and prominent perceptual and cognitive skill. However, in human languages, voice modulatory cues in the speech stream, such as pauses, or variations in fundamental frequency, syllable duration or intensity (which create word stress, speech rhythm or intonation), can be processed more easily than statistical cues and therefore have stronger effects on speech segmentation [68, 76, 80, 81, 91]. At the same time, since voice modulatory cues come in many different realizations and can have many different functions [107], their overall role in signalling linguistic structure, and the cognitive mechanisms needed for processing them, are less well understood. While some voice modulatory cues are realized and processed similarly across languages (e.g. [73]), others are subject to cross-linguistic variation (e.g. [74, 79]). This raises the question of how much the realization and processing of voice modulatory cues are determined by domain-general cognitive or physiological constraints, and how much these cues may be shaped by cultural evolution.

6. Cues to speech perception: when voice modulatory cues count more than transitional probability cues

The efficiency of different voice modulatory cues for speech segmentation has traditionally been tested in artificial language learning experiments [75]. In these experiments, participants are exposed to several minutes of a continuous stream of nonsense speech, consisting of randomly concatenated invented pseudo-words. Listeners can infer from the transition probabilities between syllables which syllable combinations are 'words' of the artificial language and can segment these items from the stream.
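To make the interplay between these two kinds of cue concrete before turning to the experimental evidence, the toy segmenter sketched earlier can be extended so that a durational cue (final lengthening) is combined with, and can even override, transitional probabilities. This is not a model of human listeners or of any cited experiment; the function, thresholds and example values are ours and purely illustrative.

```python
def segment_with_duration(syllables, durations, tps,
                          tp_threshold=0.6, lengthening_ratio=1.2):
    """Posit a boundary after a syllable if the forward TP is low or the syllable is lengthened."""
    mean_dur = sum(durations) / len(durations)
    words, current = [], [syllables[0]]
    for i, (a, b) in enumerate(zip(syllables, syllables[1:])):
        lengthened = durations[i] > lengthening_ratio * mean_dur
        if tps.get((a, b), 0.0) < tp_threshold or lengthened:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Hypothetical input: word-final syllables are lengthened (a cue converging with the TPs).
sylls = ["tu", "pi", "ro", "go", "la", "bu", "tu", "pi", "ro"]
durs = [0.20, 0.20, 0.30, 0.20, 0.20, 0.30, 0.20, 0.20, 0.30]
tps = {("tu", "pi"): 1.0, ("pi", "ro"): 1.0, ("ro", "go"): 0.3,
       ("go", "la"): 1.0, ("la", "bu"): 1.0, ("bu", "tu"): 0.3}
print(segment_with_duration(sylls, durs, tps))                 # ['tupiro', 'golabu', 'tupiro']

# With uninformative (flat) TPs, the durational cue alone still yields the same segmentation.
print(segment_with_duration(sylls, durs, {k: 0.9 for k in tps}))
```

If the lengthened syllables instead fell word-medially (a conflicting cue), the same rule would mis-segment the stream, which parallels the disruptive effect of unnaturally placed cues described below.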
To test the influence of voice modulatory cues on listeners' segmentation performance, such cues are added at different positions in the speech stream, and the resulting changes in listeners' perception of words in the stream are measured. In such artificial language learning experiments, voice modulatory cues added to continuous speech at the word level (e.g. [73, 74, 80]) and the phrase level (e.g. [108-111]) typically enhance speech segmentation compared to transitional probability cues alone. Crucially, these cues facilitate speech segmentation most effectively when they converge with the transitional probability cues in the speech stream, i.e. when the voice modulatory cues sound as 'natural' to the listeners as they do in natural speech. By contrast, when voice modulatory cues are designed to conflict with the transitional probability cues in experimental settings and sound 'unnatural' to the listeners, they disrupt speech segmentation or even override the transitional probability cues [68, 76, 80, 81, 91]. Whether voice modulatory cues at certain positions in the speech stream sound natural or unnatural with respect to the transitional probability cues depends both on language-universal cognitive predispositions such as attention, perception or preferences in pattern recognition, and on language-specific word stress patterns typical of the listeners' native languages [73, 74, 81].

Notably, many artificial language learning studies tested the influence of language-specific word stress on speech segmentation by using a combination of different voice modulatory cues [74, 78, 81]. For example, stress cues dominated transitional probability cues when they were implemented as a combination of longer duration, higher pitch and higher intensity on stressed syllables [68, 76, 91]. While using a combination of different voice modulatory cues closely simulates natural languages [70, 91, 92], it tells us nothing about the effects of the individual voice modulatory cues in isolation. However, since different voice modulatory cues have different physiological origins and may be cognitively processed and culturally transmitted differently, investigating them separately can reveal more about the functional pressures acting on linguistic structure [81, 88]. Several studies have already addressed the role of voice modulatory cues in isolation. These studies suggest that pauses and lengthening serve as language-universal signals for word-finality (e.g. [73, 74, 78, 79, 85, 88, 112]; but see [81, 113]). By contrast, pitch increase is suggested to be the main perceptual correlate of word stress and is therefore processed differently by speakers of different languages [68, 74, 78, 114]. Speech segmentation studies investigating other prosodic cues such as intensity or voice quality are comparatively rare [88, 115], which is why our review below focuses on pauses and on durational and pitch modifications.

Pause cues typically result from the physiological necessity to breathe, but pauses could in principle be expressed at different positions in a vocal signal, or differ in number and duration. Still, in practice, pauses are realized in strikingly similar ways across human languages. Language-universally, pauses are realized at the end of sentences or phrases but hardly ever occur within phrases or within words [28, 116].
This is further supported by second language learning studies finding that second language learners have hardly any problems acquiring the pause characteristics typical of their second language [117, 118]. Thus, while in principle pauses could occur anywhere within the breathing range, it is most probable that domain-general cognitive processing mechanisms constrain them to occur at specific positions in the vocal output, namely at those positions where they structure the vocal output most efficiently and with the least processing effort. This and their perceptual salience may explain why pauses are very effective for speech segmentation and outrank other cues in speech segmentation experiments [80]. In animal vocal signals, it is challenging to determine whether pauses occur between or within phrases because units and phrases in animal vocalizations are less clearly defined [119]. Still, because of their shared ancestry with humans, it can be expected that pauses manifest similarly in non-human tetrapods' vocalizations, i.e. at the end of phrases or units. This is why pauses are often used by researchers to determine units in non-human tetrapod vocalizations [120] (see the sketch below).

One reason why final lengthening may serve as a language-independent speech segmentation cue is that, language-universally, sentence-final or phrase-final elements are lengthened in everyday speech production [28, 74, 121-124]. The evolutionary origins of final lengthening are that, at sentence or phrase boundaries, speakers need to switch from exhaling to inhaling, leading to a pause, and that it takes less effort to slow the articulators down before a pause than to stop them abruptly [125-129]. Similar patterns can also be observed in movements in domains other than vocalization. For example, runners also decelerate their movements before stopping [130]. This mechanistic factor is a good candidate for one that could play a role across languages and in other species' vocal communication systems: a potential universal in vocal communication. Because kinematic articulatory constraints result in lengthened syllables before sentence or phrase boundaries, listeners may have learned to associate lengthening with boundaries and to exploit it as a cue for speech segmentation [131]. In turn, speakers may have started to intentionally use lengthening to indicate boundaries in the speech stream, even at positions where they did not pause [132]. Via cultural transmission, this may have resulted in final lengthening becoming a conventionalized but still language-universal boundary signal [133]. Because final lengthening is used as a convention for indicating boundaries cross-linguistically, it can be assumed that, besides the articulatory constraints that speakers of all languages face equally, its transmission and processing are based on domain-general cognitive constraints.

This notion is supported by the putatively language-independent Iambic/Trochaic Law (ITL; [134-138]), which states that, cross-linguistically, listeners group sounds with longer duration as sequence-final (iambic grouping). Although the ITL focuses on disyllabic words, it can also be generalized to trisyllabic words, suggesting that domain-general cognitive mechanisms may be responsible for this flexibility [73, 80].
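Pauses are used to delimit units in animal recordings, as noted above, and final lengthening can then be quantified from the resulting unit durations. A minimal sketch is given below, assuming a mono WAV file; the thresholds, frame size and file name are ours and purely illustrative (real bioacoustic workflows typically rely on annotation tools or dedicated detectors).

```python
import numpy as np
from scipy.io import wavfile

def rms_envelope(x, sr, frame=0.01):
    """Frame-wise RMS amplitude envelope with a frame length in seconds."""
    hop = int(frame * sr)
    return np.array([np.sqrt(np.mean(x[i:i + hop] ** 2) + 1e-12)
                     for i in range(0, len(x) - hop, hop)])

def segment_units(path, silence_db=-35.0, min_pause=0.05, frame=0.01):
    """Delimit vocal units at pauses (silent stretches of at least min_pause seconds)."""
    sr, x = wavfile.read(path)
    x = x.astype(float)
    if x.ndim > 1:
        x = x.mean(axis=1)                                 # mix stereo down to mono
    x = x / (np.max(np.abs(x)) or 1.0)                     # normalise peak amplitude
    voiced = 20 * np.log10(rms_envelope(x, sr, frame)) > silence_db
    units, start = [], None
    for i, v in enumerate(np.append(voiced, False)):       # trailing False closes the last unit
        if v and start is None:
            start = i
        elif not v and start is not None:
            units.append([start * frame, i * frame])
            start = None
    merged = [units[0]] if units else []
    for on, off in units[1:]:
        if on - merged[-1][1] < min_pause:                 # gap too short to count as a pause
            merged[-1][1] = off
        else:
            merged.append([on, off])
    return merged

def final_lengthening_ratio(units):
    """Duration of the final unit relative to the mean duration of the preceding units."""
    durations = [off - on for on, off in units]
    return durations[-1] / (sum(durations[:-1]) / len(durations[:-1]))

# units = segment_units("recording.wav")      # hypothetical file name
# print(final_lengthening_ratio(units))       # values clearly above 1 suggest final lengthening
```

A final_lengthening_ratio noticeably above 1 would be consistent with final lengthening in the analysed sequence, although in practice unit durations are usually compared statistically across many sequences rather than within a single recording.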
Still, there has recently also been evidence that the perceptual grouping of sequences of syllables with variable duration may be shaped more by cultural variation than previously assumed [81, 139-141]. Interestingly, the ITL applies not only to linguistic stimuli, but also to tone sequences [115, 137] and visual patterns [142]. This further supports the idea that final lengthening as a signal to linguistic structure, and thus to low-effort communication, results from general cognitive processing mechanisms that also apply to non-linguistic stimuli. Since deceleration before pauses occurs across various human movements [130] and final lengthening is perceived as a boundary signal across different sensory domains, the mechanisms behind it seem likely to be evolutionarily old. Because of their shared ancestry with humans, a similar vocal tract physiology and similar energetic constraints, final lengthening and its perception as a boundary signal are promising targets for investigation in non-human tetrapods, and there is already some evidence for final lengthening in birdsong [143, 144]. Such a cue could play an important role, for example, in structuring turn-taking exchanges between individuals [145, 146]. However, to our knowledge, there is no current evidence that non-human tetrapods use final lengthening as a boundary cue at a perceptual level, and when listening to human speech, rats do not appear to group syllables varying in duration according to the ITL [138]. Research with other tetrapods is badly needed to further examine this potential universal.

In multiple speech segmentation experiments, similar pitch modifications led to different segmentation patterns in speakers of different native languages [74, 78]. For example, word-initial pitch increase facilitated speech segmentation for native speakers of English, whereas word-final pitch increase facilitated speech segmentation for native speakers of French. These patterns are consistent with the typical stress placements of these languages [74, 147]. One explanation for why duration and pitch are used differently for speech segmentation is that, potentially, pitch is a more reliable cue for the perception of word stress than duration. In speech production, stressed syllables are characterized by a co-occurrence of higher pitch and longer duration, and interestingly, cross-linguistically, duration seems to be a more consistent marker of word stress than pitch ([81, 148]; but see [74] for French and English). Still, while lengthening is an important acoustic correlate of word stress, it also occurs at boundaries (as discussed in the previous section), and most likely this boundary-related durational increase is larger and more consistently applied than that at stressed syllables [125]. As a result, during perception, to avoid ambiguities, listeners may rely on lengthening for perceiving boundaries, but focus on pitch for perceiving word stress [74, 80]. In general, listeners may need to be more flexible in the perception and cognitive processing of pitch variations compared to durational variations. In natural speech, pitch as a signal for word stress varies more than duration as a signal for sentence or phrase finality, for example because of loan words with non-typical stress patterns [149-151].
In addition, intonation patterns are variable and depend, for example, on speaker emotions, attitudes, grammatical structure and focus [152]. Also, while sentence-final pitch decrease in declarative sentences is common across languages [110, 123, 153], listeners may equally encounter sentence-final pitch increase in yes-no questions. Therefore, overall, pitch may be a less consistent [41, 154-156] and less informative cue during speech segmentation than lengthening. This may explain why neither word-final pitch decrease [80] nor word-final pitch increase [74, 78, 157] facilitated speech segmentation in artificial language learning experiments, except for speakers of languages with word-final stress [74, 147]. According to the ITL [136, 138, 158-160], listeners group sounds with higher pitch as sequence-initial (trochaic grouping). Interestingly, rats similarly group sequences that vary in pitch as trochees [138]. However, this perceptual grouping apparently does not play a major role in speech segmentation, since, cross-linguistically, word-initial higher pitch has facilitated speech segmentation in artificial language learning experiments only inconsistently [74, 80, 81, 157]. It can therefore be inferred that the ITL for pitch does not systematically generalize from disyllabic to trisyllabic words, but that pitch is instead processed more flexibly. The apparently rather flexible processing of pitch may result in weak production, perception or learning biases amplifying pitch cues in different directions during the cultural transmission of languages. This may in turn lead to different stress patterns in different languages, making pitch a less reliable signal for speech segmentation than duration. While still originating from basic cognitive processing mechanisms, the cognitive and physiological structures responsible for pitch processing are therefore suggested to be less conserved than those responsible for duration processing. This may have constrained the cultural evolution of pitch cues to linguistic structure less than that of durational cues. Thus, functional pressures for structured signals may hold equally across languages, but how exactly this linguistic structure is achieved can vary cross-linguistically.

While lexical stress patterns vary across languages (and similar variation can be expected in other tetrapod vocalizations), utterance-final pitch decrease in declarative statements is common across many languages [39, 110, 123, 153]. One reason for this declination may be that the articulators, in this case the vibrating tissues, are slowed down before being brought to a halt, and this lower vibration rate of the tissues leads to a lower pitch [161]. A functional reason may be that pitch declination facilitates turn-taking and thus decreases communicative effort (see endnote 3). These physiological and functional constraints are shared across species, which is why pitch declination may be an interesting target for investigation in non-human tetrapod vocal signals. Indeed, there are some indications of final pitch declination and turn-taking in vervet monkeys and rhesus macaques [38]. Investigating other species for final pitch declination could further corroborate the hypothesis that shared ancestry drives similarities in pitch realization and processing in humans and non-human tetrapods.
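Final pitch declination of the kind discussed here is straightforward to quantify once an f0 contour has been extracted (e.g. with a standard pitch tracker): fit a line to the contour and check whether its slope is negative. The following is a minimal sketch under those assumptions; the reference frequency, semitone conversion and example contour are ours and purely illustrative.

```python
import numpy as np

def declination_slope(times_s, f0_hz):
    """Linear trend of an f0 contour in semitones per second (negative values indicate declination)."""
    times = np.asarray(times_s, dtype=float)
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0                                  # unvoiced frames assumed to be coded as 0
    semitones = 12 * np.log2(f0[voiced] / 100.0)     # convert Hz to semitones re 100 Hz
    slope, _intercept = np.polyfit(times[voiced], semitones, 1)
    return slope

# Example: a contour falling from about 220 Hz to about 180 Hz over one second.
t = np.linspace(0.0, 1.0, 50)
f0 = np.linspace(220.0, 180.0, 50)
print(declination_slope(t, f0))   # negative (around -3.5 semitones per second)
```

Applied to utterance-sized stretches of non-human tetrapod vocalizations, the same kind of measure could be used to test whether final pitch declination of the sort reported for vervet monkeys and rhesus macaques is widespread, as suggested above.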
Summarizing, our review of human speech modulation shows that f0, duration and pauses are typically used in systematic ways across languages to help structure the speech signal, but that there is nonetheless considerable variation across languages in the details. Voice modulation can, in many cases, provide cues to structure that are more salient and effective to listeners and learners than statistical measures over the vocal units (e.g. sequential transition probabilities), and can work together with such statistical information or in some cases override it. Thus, although such statistical cues are important (and can be readily computed in animal signals like bird or whale song), focusing on them alone obscures the importance of voice modulation as a key factor in structuring animal communication signals. How language- or species-specific cues and cross-linguistic or cross-species cues interact certainly warrants further research. In those cases where comparative information is available, it suggests that the cues used to indicate structure in the speech signal are both present in the vocalizations of other species (unsurprising given their fundamentally similar production mechanisms) and can also be used in similar ways (e.g. phrase-final lengthening in speech and birdsong). Nonetheless, there is currently far too little comparative data to allow any clear conclusions about the degree to which human-typical cues to structure are also used by other species. More research in this area, what we might term 'animal phonology', is needed to evaluate whether there are broad phylogenetic generalizations to be made, as we have hypothesized here. A rich comparative analysis of these issues could be expected to shed light not just on the evolution of communication across vertebrates, but also on the phylogenetic origins of universals in human speech production and perception.

Data accessibility. This article has no additional data.

Authors' contributions. T.M. was involved in conceptualization and writing the original draft; W.T.F. was involved in conceptualization, writing the review and editing, and supervision.

Competing interests. We declare we have no competing interests.

Funding. This work was supported by the Austrian Science Fund (FWF) DK Grant Cognition and Communication (grant no. W1262-B29) to W.T.F.

Endnotes. (1) The terms 'voice modulation' and 'prosody' essentially describe the same concept, namely all kinds of dynamic vocal modifications of acoustic parameters during production in humans and non-human tetrapods [1-3]. For the sake of consistency, we will use the term 'voice modulation' throughout this review. (2) Note that respiratory muscles may induce higher energetic costs in very loud, high or long vocalizations, such as during human singing and oratory, or mammalian roaring contests or infrasonic long-distance calls. Because subglottal pressure is an important factor determining both f0 and sound intensity, very loud and high-pitched vocalizations may require more respiratory effort than normal breathing and vocalization. In addition, very long syllables may disrupt the natural respiratory rhythm. (3) However, potential analogies between turn-taking in human and non-human animal vocalizations have to be interpreted with caution. Since it is difficult to assess the underlying meaning or the intentions behind non-human animal vocal signals, alternation of signals may not necessarily be the result of active turn-taking [146].
In such cases, the communicative benefit gained from alternating vocalizations may differ among species.

References

Voice modulation: a window into the origins of human vocal control?
Emotional and interactional prosody across animal communication systems: a comparative approach to the emergence of language
Voice pitch modulation in human mate choice
The evolution of language
Assessing the uniqueness of language: animal grammatical abilities take center stage
The language faculty that wasn't: a usage-based account of natural language recursion
The myth of language universals: language diversity and its importance for cognitive science
Monkey vocal tracts are speech-ready
Learning bias, cultural evolution of language, and the biological evolution of the language faculty
Cultural evolution: implications for understanding the human language faculty and its evolution
Introduction. Cultural transmission and the evolution of human behaviour
Vocal learning in the functionally referential food grunts of chimpanzees
Cultural evolution in animals
Three decades of cultural evolution in Savannah sparrow songs
Language learning, language use and the evolution of linguistic variation
What makes a language easy to learn? A preregistered study on how systematic structure and community size affect language learnability
Evolution of lung breathing from a lungless primitive vertebrate
Reconstructing the evolution of the respiratory apparatus in tetrapods
Principles of voice production
The contribution of source-filter theory to mammal vocal communication research
The evolution of speech: a comparative review
Anatomy of underwater sound production with a focus on ultrasonic vocalization in toothed whales including dolphins and porpoises
Laryngeal airway reconstruction indicates that rodent ultrasonic vocalizations are produced by an edge-tone mechanism
Production of loud and quiet calls in Howler Monkeys
Pulmonic ingressive phonation: diachronic and synchronic characteristics, distribution and function in animal and human sound production and in human speech
Pulmonic ingressive speech: a neglected universal?
Pulmonic ingressive speech
The prosody of speech: timing and rhythm
Airflow and pressure during canary song: direct evidence for mini-breaths
Rhythm in speech and animal vocalizations: a cross-species perspective
Between-speaker and within-speaker variation in speech tempo of American English
On vowel duration in English
Subglottal pressure, tracheal airflow, and intrinsic laryngeal muscle activity during rat ultrasound vocalization
Subglottal pressure and fundamental frequency control in contact calls of juvenile Alligator mississippiensis
Functional morphology of the sound-generating labia in the syrinx of two songbird species
In search of the acoustic correlates of stress: fundamental frequency
Fundamental frequency declination is not unique to human speech: evidence from nonhuman primates
The perception of fundamental frequency declination
The world atlas of language structures online
Frequency of use and the organization of language
Language learners restructure their input to facilitate efficient communication
How efficiency shapes human language
Evolutionary phonology: the emergence of sound patterns
Human behavior and the principle of least effort
Two experiments on oxygen consumption during speech production: vocal effort and speaking tempo
Oxygen consumption during crowing by roosters: talk is cheap
The metabolic cost of birdsong production
No cost of echolocation for bats in flight
Deep-diving pilot whales make cheap, but powerful, echolocation clicks with 50 µl of air
Logic and conversation
Environmental acoustics and the evolution of bird song
The evolution of the Lombard effect: 100 years of psychoacoustic research
Noise-induced vocal modulation in cotton-top tamarins (Saguinus oedipus)
Bird song and anthropogenic noise: vocal constraints may explain why birds sing higher-frequency songs in cities
Control of vocal intensity in budgerigars (Melopsittacus undulatus): differential reinforcement of vocal intensity and the Lombard effect
Singing in a silent spring: birds respond to a half-century soundscape reversion during the COVID-19 shutdown
On language 'utility': processing complexity and communicative efficiency
Dynamic horizontal cultural transmission of humpback whale song at the ocean basin scale
The question of animal culture
Principles of animal communication
Temporal modulation in speech, music, and animal vocal communication: evidence of conserved function
On the occurrence and significance of motivation-structural rules in some bird and mammal sounds
Grading, discreteness, redundancy, and motivational-structural rules
Humans recognize emotional arousal in vocalizations across all classes of terrestrial vertebrates: evidence for acoustic universals
Cross-language use of pitch: an ethological view
Word segmentation by 8-month-olds: when speech cues count more than statistics
Word segmentation with universal prosodic cues
Statistical learning of language: theory, validity, and predictions of a statistical learning account of language acquisition
Statistical learning in songbirds: from self-tutoring to song culture
Phonotactic and prosodic effects on word segmentation in infants
Domain-general mechanisms for speech segmentation: the role of duration information in language learning
Cross-language differences in cue use for speech segmentation
Statistical learning by 8-month-old infants
When cues collide: use of stress and statistical cues to word boundaries by 7- to 9-month-old infants
Word segmentation in German-learning infants and German-speaking adults: prosodic and statistical cues
Native language influence in the segmentation of a novel language
Cross-linguistic differences in the use of durational cues for the segmentation of a novel language
The influence of different prosodic cues on word segmentation
Transition probabilities and different levels of prominence in segmentation
Speech segmentation is facilitated by visual cues
The relative weight of statistical and prosodic cues in speech segmentation: a matter of language-(in)dependency and of signal quality
Statistical speech segmentation in tone languages: the role of lexical tones
Word segmentation: the role of distributional cues
Statistical learning of tone sequences by human infants and adults
Computation of conditional probability statistics by human infants
Rhythmic grouping biases constrain infant statistical learning
Infants use prosodically conditioned acoustic-phonetic cues to extract words from speech
Bootstrapping language: are infant statisticians up to the job?
At 11 months, prosody still outranks statistics
Testing the limits of statistical learning for word segmentation
Statistical learning and language acquisition
Quantifying infants' statistical word segmentation: a meta-analysis
Statistical language learning: mechanisms and constraints
The handbook of speech production
PHOIBLE 2.0. Jena, Germany: Max Planck Institute for the Science of Human History
Markov chains: from theory to implementation and experimentation
A mathematical theory of communication
Integrating statistical learning into cognitive science
Statistical language learning: computational, maturational, and linguistic constraints
Beyond word segmentation: a two-process account of statistical learning
Speech segmentation by statistical learning is supported by domain-general processes within working memory
Constraints on statistical learning across species
How infant vervet monkeys learn to recognize starling alarm calls: the role of experience
Hornbills can distinguish between primate alarm calls
Prosody in context: a review
Phonological phrase boundaries constrain lexical access II. Infant data
Can prosody be used to discover hierarchical structure in continuous speech?
An interaction between prosody and statistics in the segmentation of fluent speech
The use of prosodic cues in learning new words in an unfamiliar language
Are initial-consonant lengthening and final-vowel lengthening both universal word segmentation cues?
Emerging integration of sequential and suprasegmental information in preverbal speech segmentation
Infants' and adults' use of duration and intensity cues in the segmentation of tone patterns
Pauses and the temporal structure of speech
Non-native speaker pause patterns closely correspond to those of native speakers at different speech rates
The relationship between L1 fluency and L2 fluency development
Acoustic sequences in non-human animals: a tutorial review and prospectus
Segmental units in nonhuman animal vocalization as a window into meaning, structure, and the evolution of language
Linguistic uses of segmental duration in English: acoustic and perceptual evidence
The effect of position in utterance on speech segment duration in English
Language-independent prosodic features
The extent and degree of utterance-final word lengthening in spontaneous speech from 10 languages
Durational effects in final lengthening, gapping, and contrastive stress
Articulatory vowel lengthening and coordination at phrasal junctures
The articulatory kinematics of final lengthening
The origin of vowel length neutralization in final position: evidence
Gestural coordination at prosodic boundaries and its role for prosodic structure and speech planning processes
Does music performance allude to locomotion? A model of final ritardandi derived from measurements of stopping runners
Duration as a cue to the perception of a phrase boundary
Phrase-final lengthening in Russian: pre-boundary or pre-pausal?
Language evolution: consensus and controversies
Rhythm
A quantitative study of rhythm: the effect of variations in intensity, rate and duration
Metrical stress theory: principles and case studies
Perception of rhythmic grouping: testing the iambic/trochaic law
Do humans and nonhuman animals share the grouping principles of the iambic-trochaic law?
Perception of rhythmic grouping depends on auditory experience
Iambic-Trochaic law effects among native speakers of Spanish and English
Beyond the Iambic-Trochaic law: the joint influence of duration and intensity on the perception of rhythmic speech
How modality specific is the Iambic-Trochaic Law? Evidence from vision
Universal principles underlying segmental structures in parrot song and human speech
The motor origins of human and avian song structure
Taking turns: bridging the gap between human and animal communication
Interactive rhythms across species: the evolutionary biology of animal chorusing and turn-taking
Contribution of prosody to the segmentation and storage of 'words' in the acquisition of a new mini-language
Acoustic correlates of word stress: a cross-linguistic survey
The phonology of language contact
Stress adaptation in loanword phonology
On the change of word stress in the history of German. Beiträge zur Geschichte der deutschen Sprache und Literatur
Intonation
Intonation systems: a survey of twenty languages
Frequency effects in language processing
Frequency effects in language acquisition, language use, and diachronic change
The ubiquity of frequency effects in first language acquisition
The role of perceptual salience during the segmentation of connected speech
An exploration of rhythmic grouping of speech sequences by French- and German-learning infants
Acoustic markers of prominence influence infants' and adults' segmentation of speech sequences
Different phrasal prominence realizations in VO and OV languages
A primate dictionary? Decoding the function and meaning of another species' vocalizations
Acknowledgements. We would like to thank Andrey Anikin and an anonymous reviewer for helpful comments on an earlier version of this manuscript.