key: cord-0218309-lx99ilct
authors: Feng, Kexin; Zanwar, Preeti; Behzadan, Amir H.; Chaspari, Theodora
title: Exploring Speech Cues in Web-mined COVID-19 Conversational Vlogs
date: 2020-09-16
journal: nan
DOI: nan
sha: b63c86ef755548e85116bd8c458b9755fffbffac
doc_id: 218309
cord_uid: lx99ilct

The COVID-19 pandemic caused by the novel SARS-Coronavirus-2 (n-SARS-CoV-2) has impacted people's lives in unprecedented ways. During the pandemic, vloggers have used social media to actively share their opinions or experiences in quarantine. This paper collected videos from YouTube to track emotional responses in conversational vlogs and their potential associations with events related to the pandemic. In particular, vlogs uploaded from locations in New York City were analyzed, given that this was one of the first epicenters of the pandemic in the United States. We observed common patterns in vloggers' acoustic and linguistic features across the time span of the quarantine, which is indicative of changes in emotional reactivity. Additionally, we investigated fluctuations of acoustic and linguistic patterns in relation to COVID-19 events in the New York area (e.g., the number of daily new cases, number of deaths, and extension of the stay-at-home order and state of emergency). Our results indicate that acoustic features, such as zero-crossing rate, jitter, and shimmer, can be valuable for analyzing emotional reactivity in social media videos. Our findings further indicate that some of the peaks of the acoustic and linguistic indices align with COVID-19 events, such as the peak in the number of deaths and the emergency declaration.

The COVID-19 disease, caused by the novel SARS-Coronavirus-2 (n-SARS-CoV-2), emerged in late 2019 and has since infected tens of millions of people around the world [1]. The exact date of the first report of COVID-19 and the origins of n-SARS-CoV-2 are still under scientific investigation. Throughout this pandemic, governments have been encouraging people to stay home and reduce physical contact with others. This prolonged confinement and absence of face-to-face interaction, in combination with negative feelings of anxiety caused by the pandemic, is expected to result in significant emotional strain. Social media platforms, such as Weibo and Twitter, can potentially reveal individuals' emotional reactions to such impactful events and have been actively explored by researchers in social media analytics [12, 13]. In contrast to social networks and blogs, which rely mostly on written communication, conversational vlogs provide a valuable source of multimodal data for understanding subtle facets of emotion in communities and societies through the integration of spoken language and visual information. Conversational vlogs refer to a specific type of vlogging in which all or part of the shots depict a single person facing and talking to the camera [5]. The richness of multimodal information presented in conversational vlogs can potentially provide a better understanding of the vloggers' attitudes, feelings, and emotions compared to written text. Additionally, interactive cues attached to vlog videos, such as comments and the number of upvotes or downvotes, can help identify how attitudes and emotions in the vlogs are propagated to the world.
While a few previous studies have conducted public sentiment analysis over short periods of the pandemic based on written text in social media [16], to the best of our knowledge, multimodal analysis of conversational vlogs aimed at a better understanding of public sentiments during the pandemic has not yet been examined. To fill this gap, we collected 463 conversational vlogs from New York City, a major epicenter of the pandemic in the United States (U.S.). The examined vlogs were uploaded between March 13, 2020 (the date on which the U.S. government declared a national emergency) and June 1, 2020 [2, 8]. We then applied a speech pre-processing pipeline to obtain acoustic features indicative of prosodic changes on a weekly basis. We further analyzed the frequency of the words in the title and description of the YouTube videos to obtain a set of linguistic descriptors. Analysis of the acoustic and linguistic data revealed notable fluctuations across the 11-week span, some of which aligned with significant COVID-19 events, such as the peak in the number of deaths in New York City, as well as the stay-at-home order. Our pilot study provides preliminary insights into vloggers' emotional reactions during the COVID-19 pandemic and contributes to a better understanding of emotion type and propagation in social media videos.

YouTube has been widely used by researchers in computing and social science because it is a rich source of naturalistic and diverse real-life data [15, 20]. Vlogging became a social trend after 2010 [10]. Conversational vlogs are a specific type of vlogging in which an individual typically talks to a camera to share their ideas, views, or expertise about a topic of interest. This has rendered the content of vlogs a valuable source for researchers seeking a better understanding of people's behavior on social media [10]. Biel et al. used verbal and non-verbal cues of the vlogger to estimate the amount of social attention that a vlog would receive [5]. Integrating vloggers' personality scores can further increase the accuracy of this task [4]. Biel et al. further performed crowdsourcing experiments to investigate how vloggers were perceived by their audience [6]. Researchers have also performed sentiment analysis on comments posted under YouTube videos and on movie reviews [7, 11, 17, 22]. Although previous studies have performed sentiment analysis on YouTube, conversational vlogs and the tracking of vloggers' emotions remain underexplored. This motivated us to explore the possible impact of social events on vloggers. To the best of our knowledge, our study is the first to investigate vloggers' emotions during the COVID-19 period and their potential association with significant events of the pandemic. Our pilot study examines data collected by vloggers in New York, which was the first center of the pandemic in the U.S. [8]. Findings from our work could provide a better understanding of how emotion propagates in social media during large-scale emergencies and life-changing events.

In this section, we discuss the collection and processing of conversational vlogs from YouTube. Given the recency of the COVID-19 pandemic and the absence of existing datasets of vlogs recorded during this period, we collected a new dataset from YouTube.
Due to limitations of the YouTube API for collecting videos at a large scale (e.g., many retrieved vlogs are from people who do not reside in the U.S., which is not relevant to our goal), we applied Selenium WebDriver, an automation testing tool, to obtain YouTube videos [3]. Selenium can simulate human search behavior in a browser window on the YouTube website by scrolling down to the end of the search results while tracking them. In this way, the number of overseas videos is significantly reduced, since the results of such a browser-based search are influenced by the searcher's region. To maximize the number of retrieved videos, our queries combined keywords from three components related to the event (e.g., COVID-19), behavior (e.g., vlog), and location (e.g., New York), as in Table 1, resulting in a total of 18 combinations.

Table 1: Components relevant to keywords used in the Selenium tool for video mining.
Event        Behavior    Location
quarantine   vlog        New York
covid-19     vlogger     NY
pandemic     vlogging

After removing duplicate videos from the search results, we obtained 4,265 videos potentially relevant to COVID-19 vlogs in New York City. We further collected video information, including title, description, duration, date published, number of views, and number of upvotes and downvotes, which could potentially serve as cues based on previous research [5]. We then filtered out videos published before March 13th, the date on which the U.S. national emergency was declared, resulting in a total of 3,021 videos. The end date was June 1st, when we performed the data collection. We manually examined each of these videos to ensure that it satisfied the following requirements:
• Part of the video displays a conversational shot.
• The video has low or no background music or noise.
• The video is not recorded overseas.
This process resulted in a total of 463 valid videos. Due to the subjectivity of this task, 13.24% of the 3,021 videos (400 videos) were cross-examined by an additional annotator, yielding a Cohen's kappa coefficient of 0.703. The main reasons for label disagreements between annotators were subjective judgments about the level of background music and the proportion of conversational shots in each video. To restrict the set to videos from New York, we further selected the videos that included "NY," "NYC," or "New York" in the title or corresponding description, which yielded a final set of 278 videos used in the rest of the analysis.

The audio data obtained in our study may contain multiple speakers, as well as non-speech segments corresponding to background noise and music. To address this challenge, we manually labeled a 5-second reference audio of the target speaker within each video and performed speaker diarization by computing the similarity between the reference audio and each analysis window. Similarity was computed in a 256-dimensional d-vector space obtained from a deep learning model [21]. A similarity score of 0.65 was used as the threshold, and the window size was set to 125 milliseconds. By introducing the reference audio, this speaker diarization step can effectively discard non-speech segments and speech segments that do not belong to the target speaker.
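For concreteness, the web-mining step described earlier in this section can be sketched as follows. This is a minimal illustration rather than the exact crawler used in this study: the Chrome driver, the "a#video-title" selector, and the fixed number of scroll iterations are assumptions about YouTube's page layout and may need adjustment.

```python
import itertools
import time
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

EVENTS = ["quarantine", "covid-19", "pandemic"]
BEHAVIORS = ["vlog", "vlogger", "vlogging"]
LOCATIONS = ["New York", "NY"]

def collect_video_links(max_scrolls=50):
    """Collect YouTube result links for all 18 keyword combinations (3 x 3 x 2)."""
    driver = webdriver.Chrome()
    links = set()
    for event, behavior, location in itertools.product(EVENTS, BEHAVIORS, LOCATIONS):
        query = quote_plus(f"{event} {behavior} {location}")
        driver.get(f"https://www.youtube.com/results?search_query={query}")
        # Scroll repeatedly so that lazily loaded results are appended to the page.
        for _ in range(max_scrolls):
            driver.execute_script(
                "window.scrollTo(0, document.documentElement.scrollHeight);")
            time.sleep(2)
        # "a#video-title" is an assumed selector for result links on the search page.
        for elem in driver.find_elements(By.CSS_SELECTOR, "a#video-title"):
            href = elem.get_attribute("href")
            if href:
                links.add(href)  # the set removes duplicates across queries
    driver.quit()
    return links
```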
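The reference-based speaker filtering just described can be summarized by the sketch below. The `embed_dvector` function is a placeholder for a pretrained 256-dimensional d-vector speaker encoder (e.g., a GE2E-style model as in [21], for which open implementations exist); the 0.65 cosine-similarity threshold and 125 ms window follow the settings reported above.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.65   # windows below this are treated as non-target audio
WINDOW_SEC = 0.125            # 125 ms analysis windows

def embed_dvector(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Placeholder: return a 256-dim speaker embedding from a pretrained d-vector model."""
    raise NotImplementedError("plug in a pretrained GE2E/d-vector encoder here")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def keep_target_speaker(audio: np.ndarray, sample_rate: int,
                        reference: np.ndarray) -> np.ndarray:
    """Keep only the 125 ms windows of `audio` whose embedding matches the
    5-second reference clip of the target speaker."""
    ref_emb = embed_dvector(reference, sample_rate)
    hop = int(WINDOW_SEC * sample_rate)
    kept = []
    for start in range(0, len(audio) - hop + 1, hop):
        window = audio[start:start + hop]
        sim = cosine_similarity(embed_dvector(window, sample_rate), ref_emb)
        if sim >= SIMILARITY_THRESHOLD:
            kept.append(window)
    return np.concatenate(kept) if kept else np.empty(0)
```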
The acoustic features focus on capturing prosodic changes, which are indicative of emotional information [19]. We extracted four prosodic features, namely loudness, zero-crossing rate (ZCR), jitter, and shimmer, along with four statistical descriptors of each feature: mean, standard deviation, skewness, and the slope coefficient of a linear regression fit. For ZCR, we also extracted the minimum and maximum values, since these two descriptors have been used in other emotion-related tasks [18]. To reduce the influence of possible extreme values, we first segmented the speech signal into 125 ms windows and subsequently obtained the statistics of the prosodic descriptors, which were computed on 25 ms frames with a 10 ms shift using the openSMILE toolbox [9]. Finally, we averaged each descriptor across all windows, resulting in an 18-dimensional speech feature vector for each video.

As an additional source of information, we analyzed the words in the title and description of each video and measured their frequency. We considered the most frequent words of each video as linguistic measures. To explore the connections between social media reactions and the spread of COVID-19, we collected COVID-19 statistics. In particular, we used NYC Open Data to obtain the daily number of new cases, new deaths, and hospitalized patients [14]. We only included data after March 13th to match the start of the search period for the extracted videos, as visualized in Fig. 1.

Our experiment aims to determine whether there is a potential connection between people's reactions in social media vlogs and the spread of COVID-19. Since there is generally a delay between video recording and online posting, we clustered the data into weekly bins, which is likely to address this delay. More specifically, starting from March 13th, we grouped videos published in consecutive 7-day periods, resulting in 11 time periods ending on June 1st. The detailed start dates and number of videos in each time period can be found in Table 2. For each week, we calculated the average of each acoustic feature to obtain an 18-dimensional weekly prosodic representation (Fig. 2). The most frequent words in the titles and video descriptions, along with their frequencies, were also examined for each week. We list the most frequent words within each week in Table 3 and plot the largest frequency among the target words (e.g., the word 'quarantine' is the most frequent in most weeks) across weeks to explore potential connections with the spread of COVID-19, as shown in Fig. 3.
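As an illustration of the acoustic feature computation, the sketch below derives the 18-dimensional per-video representation from openSMILE low-level descriptors. It is a simplified example under assumptions: the ComParE 2016 feature set is used here as one possible source of frame-level loudness, ZCR, jitter, and shimmer, and the listed column names are illustrative and should be verified against `smile.feature_names`.

```python
import numpy as np
import opensmile
from scipy.stats import skew

FRAME_SHIFT_SEC = 0.010   # LLDs are computed on 25 ms frames with a 10 ms shift
WINDOW_SEC = 0.125        # statistics are taken over 125 ms windows
FRAMES_PER_WINDOW = int(WINDOW_SEC / FRAME_SHIFT_SEC)

def window_stats(series: np.ndarray, with_minmax: bool = False) -> np.ndarray:
    """Mean, std, skewness, and linear-regression slope of one descriptor,
    computed per 125 ms window and averaged over all windows of a video."""
    rows = []
    for i in range(0, len(series) - FRAMES_PER_WINDOW + 1, FRAMES_PER_WINDOW):
        w = series[i:i + FRAMES_PER_WINDOW]
        slope = np.polyfit(np.arange(len(w)), w, deg=1)[0]
        row = [w.mean(), w.std(), skew(w), slope]
        if with_minmax:
            row += [w.min(), w.max()]
        rows.append(row)
    return np.mean(rows, axis=0)

def video_features(wav_path: str) -> np.ndarray:
    """Return an 18-dimensional prosodic representation for one video's audio."""
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
    )
    lld = smile.process_file(wav_path)
    feats = []
    # ZCR contributes mean, std, skewness, slope, min, and max (6 dimensions).
    feats.extend(window_stats(lld["pcm_zcr_sma"].to_numpy(), with_minmax=True))
    # A loudness proxy, jitter, and shimmer contribute 4 dimensions each (12 dimensions).
    for col in ("audspec_lengthL1norm_sma", "jitterLocal_sma", "shimmerLocal_sma"):
        feats.extend(window_stats(lld[col].to_numpy()))
    return np.asarray(feats)
```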
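The weekly aggregation can similarly be sketched as follows, assuming a list of (publish date, 18-dimensional feature vector, title, description) tuples; the tokenization pattern and the number of reported top words are illustrative choices rather than the exact settings used in the study.

```python
import re
from collections import Counter
from datetime import date

import numpy as np

START = date(2020, 3, 13)  # beginning of the first weekly bin

def weekly_aggregate(videos):
    """videos: iterable of (publish_date, features, title, description) tuples."""
    weeks = {}
    for publish_date, features, title, description in videos:
        week = (publish_date - START).days // 7          # 0-based weekly bin index
        bin_ = weeks.setdefault(week, {"feats": [], "words": Counter()})
        bin_["feats"].append(features)
        tokens = re.findall(r"[a-z0-9\-']+", f"{title} {description}".lower())
        bin_["words"].update(tokens)
    return {
        week: {
            "mean_features": np.mean(bin_["feats"], axis=0),  # weekly prosodic profile
            "top_words": bin_["words"].most_common(10),
        }
        for week, bin_ in sorted(weeks.items())
    }
```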
After plotting the variation of the 18 acoustic features across all weeks, common patterns emerged in some of the extracted features. In particular, we found that the slope of the linear regression fit is the most consistent descriptor compared with the mean, skewness, or standard deviation, potentially due to its ability to capture temporal trends. Among the four acoustic features (loudness, ZCR, jitter, and shimmer), loudness did not reveal meaningful patterns. A possible reason is that loudness depends heavily on the recording conditions (e.g., microphone, distance from the microphone), making it highly variable across videos. Based on Fig. 2, three peaks at weeks 3, 5, and 8-9 can be observed in the other three acoustic features. Next, we explored the change of word frequency across those weeks. As observed in Fig. 3, the frequency of negative words relevant to COVID-19 is higher in weeks 3, 5, and 8-9. In the remaining weeks, the frequency of such words is rarely greater than 1. The word frequency analysis is thus consistent with the change observed across acoustic features.

Finally, we explored potential connections between the acoustic and linguistic trajectories and the spread of COVID-19. According to Fig. 1, the number of daily new cases in New York City peaked around week 3 (day 20), while the number of daily new deaths peaked around week 5 (day 35). During weeks 8 and 9, the Governor of New York extended the PAUSE order as well as the state of emergency for New York State, which prolonged the quarantine period [23]. Although we cannot draw direct comparisons, since the acoustic and linguistic measures might be confounded by multiple factors, we observed similar spikes in weeks 3, 5, and 8-9 in some of the acoustic data. These trajectory similarities might suggest that COVID-19 related events can influence vloggers' sentiments, but a more thorough analysis is needed to better understand the contextual factors causing such spikes in the acoustic and linguistic trends.

Even though our results indicate fluctuations in the acoustic and linguistic features which might be relevant to COVID-19 events, there are various limitations to our study. First, the data collection step relies heavily on the YouTube video search results; possible bias could occur in this process, such as more popular videos being retrieved first. Also, even though we took steps to ensure that all the videos used in our study are from New York, we cannot determine with certainty whether a video is from New York City or elsewhere in the state of New York. Finally, contextual factors that capture the content of each video have to be taken into account in a more thorough analysis.

In this paper, we explored the possibility of understanding public sentiments during the COVID-19 pandemic through the multimodal content of social media. We selected New York City because it was one of the first epicenters of the COVID-19 pandemic. We collected our own dataset from YouTube and provided a complete pipeline for pre-processing and analyzing real-life audio data. We then extracted acoustic features and observed common patterns in three of them (jitter, shimmer, and ZCR). These patterns were also consistent with a word frequency analysis performed on video titles and descriptions and can potentially be explained by the timing of major COVID-19 related events occurring in New York City during this period. As part of our future work, we plan to extend our study to additional geographical locations and explore the influence of gender, age, and other potential factors on viewers' reactions to social media content. Finally, we will analyze additional cues from the videos, such as facial expressions and linguistic cues obtained from the vloggers' speech, to obtain a better understanding of social media videos.

[2] Proclamation on Declaring a National Emergency Concerning the Novel Coronavirus Disease (COVID-19) Outbreak.
[3] Selenium WebDriver Practical Guide.
[4] You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube.
[5] VlogSense: Conversational Behavior and Social Attention in YouTube.
[6] The Good, the Bad, and the Angry: Analyzing Crowdsourced Impressions of Vloggers.
[7] Emotion Classification of YouTube Videos.
[8] New York State Now Has More Coronavirus Cases Than Any Country Outside the US.
[9] openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor.
[10] Vlogging: A Survey of Videoblogging Technology on the Web.
[11] Using YouTube Comments for Text-Based Emotion Recognition.
[12] Bennett Kleinberg, Isabelle van der Vegt, and Maximilian Mozes. 2020. Measuring Emotions in the COVID-19 Real World Worry Dataset.
[13] The Impact of COVID-19 Epidemic Declaration on Psychological Consequences: A Study on Active Weibo Users.
[14] Department of Health and Mental Hygiene (DOHMH). 2020. COVID Daily Counts of Cases, Hospitalizations, and Deaths: NYC Open Data.
[15] YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video.
[16] COVID-19 Public Sentiment Insights and Machine Learning for Tweets Classification.
[17] Emotion Classification on YouTube Comments Using Word Embedding.
[18] The INTERSPEECH 2009 Emotion Challenge.
[19] Acoustic Emotion Recognition: A Benchmark Comparison of Performances.
[20] Social Networks and the Diffusion of User-Generated Content: Evidence from YouTube.
[21] Generalized End-to-End Loss for Speaker Verification.
[22] YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context.
[23] Cuomo Extends Authority for 'PAUSE' Order, But Some Reopening Still Possible After.

This work was supported by the Texas A&M Institute of Data Science (TAMIDS) through the Data Resource Development Program. The authors would like to thank Alexandria Curtis, Texas A&M Computer Science & Engineering student, for her help in annotating the conversational vlogs.