key: cord-233012-ltbvpv8b authors: Garcia-Gasulla, Dario; Napagao, Sergio Alvarez; Li, Irene; Maruyama, Hiroshi; Kanezashi, Hiroki; Pérez-Arnal, Raquel; Miyoshi, Kunihiko; Ishii, Euma; Suzuki, Keita; Shiba, Sayaka; Kurokawa, Mariko; Kanzawa, Yuta; Nakagawa, Naomi; Hanai, Masatoshi; Li, Yixin; Li, Tianxiao title: Global Data Science Project for COVID-19 Summary Report date: 2020-06-10 journal: nan DOI: nan sha: doc_id: 233012 cord_uid: ltbvpv8b

This paper summarizes the Global Data Science Project (GDSP) for COVID-19 as of May 31, 2020. COVID-19 has strongly affected our societies, both directly and indirectly through the policy measures taken to counter the spread of the virus. We quantitatively analysed the multifaceted impacts of the COVID-19 pandemic on our societies, including people's mobility, health, and social behaviour changes. People's mobility has changed significantly due to the implementation of travel restrictions and quarantine measures. Indeed, physical distance has widened at the international (cross-border), national and regional levels. At the international level, due to the travel restrictions, the overall number of international flights plunged by around 88 percent during March. In particular, the number of flights connecting Europe dropped drastically in mid-March after the United States announced travel restrictions on Europe and the EU and participating countries agreed to close borders, an 84 percent decline compared to March 10th. Similarly, we examined the impacts of quarantine measures in three major cities: Tokyo (Japan), New York City (the United States), and Barcelona (Spain). In all three cities, we found a significant decline in traffic volume. We also identified increased concern for mental health through the analysis of posts on social networking services such as Twitter and Instagram.
Notably, at the beginning of April 2020, the number of posts with #depression on Instagram doubled, which might reflect a rise in mental health awareness among Instagram users. In addition, we identified changes in a wide range of people's social behaviors, as well as economic impacts, through the analysis of Instagram data and primary survey data.

The GDSP (Global Data Science Project) for COVID-19 consists of an international team focusing on various societal aspects including mobility, health, economics, education, and online behavior. The team consists of volunteer data scientists from various countries including the United States, Japan, Spain, France, Lithuania and China. The purpose of the GDSP is to quantitatively measure the impacts of the COVID-19 pandemic on our societies in terms of people's mobility, health, and behaviour changes, and to inform public and private decision-makers so that they can make effective and appropriate policy decisions.

a. Quantifying Physical Distancing

Physical distancing is key to avoiding or slowing down the spread of viruses. Each country has adopted different policies and actions to restrict human mobility. In this project, we investigate how policies and actions affect human mobility in certain cities and countries. By referencing our analysis of policy and secondary impacts, we hope that decision makers can take effective and appropriate actions. Furthermore, by analyzing human mobility, we also aim to develop a physical distancing risk index to monitor the risk in areas with high population densities and a high probability of contraction. Due to physical distancing and lockdown policies, people have begun relying on video conferencing tools for meetings, lectures, and conversations among friends more frequently than usual. Children are especially affected by the quarantine since many must refrain from going to their classrooms and take classes online.
By leveraging various data sources, we will analyze how daily behavior has been affected by this pandemic, and also compare behaviors among different countries and cities. We will also measure online e-commerce and consumer behavior by analyzing sites such as Amazon.

For health, we have focused on the emotion changes that people have experienced during this pandemic. Emotion changes have stemmed from various causes such as unemployment, the implementation of stay-at-home policies, fear of the virus, etc. We quantify emotion changes by using social media data, including Twitter and Instagram. Since the outbreak of COVID-19, we have seen an increase in online discussions that use hashtags such as #COVID-19 and #depression. We believe it is vital to visualize and analyze the differences in people's perceptions of . We also hope to analyze overall responses to the pandemic by sentiment: sadness, depression, isolation, happiness, etc. A further detailed analysis will also look into specific keywords and corresponding trends.

Each section in this report follows the same format: key takeaway, data description, policy changes, overall analysis, and subcategory analysis. We aim to analyse the apparent trade-off between the economy and the prevention of infection spread. Based on the calculation of a physical distance index (mobility index), economic damages, and the number of newly infected patients, we evaluate the optimal level at which both a steady decline in the number of infections and economic recovery can be achieved.

To investigate the effects of these travel restrictions on worldwide flights, we analyzed the decline of flights for continents and countries using public flight data. We found that overall international flights decreased significantly from the beginning of March, by around 24 percent.
In particular, the number of flights connecting Europe dropped drastically in mid-March after the United States announced travel restrictions on Europe and the EU and participating countries agreed to close borders, an 84 percent decline compared to March 10th. We conducted a more detailed analysis in another paper Suzumura et al. (2020).

In order to analyze real-time international flight data, we obtained a voluntarily provided flight dataset from The OpenSky Network - Free ADS-B and Mode S data for Research Schäfer, Strohmeier, Lenders, Martinovic, and Wilhelm (2014), Strohmeier (2020). The dataset contains flight records with departure and arrival times, airport codes (origin and destination), and aircraft types. The dataset includes flight information from January 1st to April 30th. The dataset for a particular month is made available at the beginning of the following month. The data covers 148 countries out of 195 countries, including 616 major airports and several small to medium size airports (Figure 1). The data was collected over the period of January 1 to March 31.

As for the methodology, we build a temporal network where a country or an airport is represented as a vertex and a connection between two countries or two airports is represented as an edge. By building such a temporal network and computing the shortest paths and their lengths between two countries or two airports, we can quantitatively measure how travel is restricted by using graph analytics. The data is analysed at the (1) global, (2) continental, and (3) country (airport) levels. The top 20 airports were based on the preliminary world airport traffic rankings released by ACI World International (2020).

Before the end of February 2020, the overall number of (departed) international flights was around 8,000. However, since March 1 the number of international flights has started to decline.
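The temporal-network methodology described above can be illustrated with a minimal sketch. It assumes a hypothetical record layout of (day, origin, destination) tuples and uses unweighted hop counts as the path length; the function names and the toy records are invented for illustration and are not taken from the project's code or from the OpenSky dataset.

```python
from collections import defaultdict, deque

def build_daily_graphs(flights):
    """Group flight records into one directed graph per day.

    flights: iterable of (day, origin, destination) tuples, where origin and
    destination are country or airport codes (hypothetical record layout).
    Returns {day: {origin: set(destinations)}}.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for day, origin, destination in flights:
        if origin != destination:  # keep only connections between distinct vertices
            graphs[day][origin].add(destination)
    return graphs

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, hops = queue.popleft()
        for neighbour in graph.get(vertex, ()):
            if neighbour == target:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None

# Toy records, invented for illustration only.
flights = [
    ("2020-03-01", "US", "FR"), ("2020-03-01", "FR", "IT"), ("2020-03-01", "US", "IT"),
    ("2020-03-20", "US", "FR"), ("2020-03-20", "FR", "IT"),
]
graphs = build_daily_graphs(flights)
before = shortest_path_length(graphs["2020-03-01"], "US", "IT")  # direct edge exists
after = shortest_path_length(graphs["2020-03-20"], "US", "IT")   # direct edge removed
```

In this toy example the direct connection disappears on the later day, so the shortest path grows from one hop to two; a pair with no remaining route yields `None`. Longer or missing shortest paths are one way to express, in a single number, how restricted travel between two vertices has become.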
This decline followed the first coronavirus death in the United States and the announcement of 'do not travel' restrictions on February 29th. Since March 11, the decline has further accelerated in response to President Trump's announcement of the travel restriction on 26 European countries on March 11 and 13. In Italy, 28,800 people were infected before February 28th, and cases in 14 other European countries remained an area of concern; the US recorded its first coronavirus death and announced 'do not travel' restrictions on February 29th, resulting in a decline of international flights from France, Switzerland, and Italy from around February 27th. Then, a significant slump in flights also occurred in these countries from around March 12th, after the WHO declared COVID-19 a pandemic.

We examined mobility in Tokyo and its relation to the national and local governments' measures to suppress COVID-19. We found that 1) different demographic groups respond differently to the governments' messages (e.g., seniors responded first, but the younger generations were willing to comply with the governments' instructions to stay home once the state of emergency was formally declared), and 2) people's behaviour is affected more by the mood of society than by the official declaration of the state of emergency. We also investigated the correlation between the daily mobility index and the growth rate of reported confirmed cases, which suggested that the mobility index may be an early indicator of the growth rate of confirmed cases, and that the number of confirmed cases may in turn affect future mobility.

We analyzed NTT DoCoMo data and accessed high-resolution hourly population data within Tokyo from Mobile Kukan Toukei docomo (2020). The data is based on hourly mobile phone locations and covers the whole Tokyo metropolis (the average daily population is around 11M) from Jan. 1st, 2020 to the current date. We also received the same data for Jan.
to Mar. 2019 for comparison. The data set divides Tokyo into 8,500 grid cells of 0.5 km x 0.5 km. The provided data is a collection of population vectors $P_t$, where $P_t[i]$ is the population of grid cell $i$ at time $t$ ($t$ is an hourly time point between 0:00 on January 1st and 23:00 on March 31st). We defined the overall mobility within Tokyo at time $t$ as $L_1(P_t - P_{t+1})$, where $L_1$ is the L1 norm. Intuitively, this metric sums the number of people who came into or left each cell during the given hour. Note that this metric underestimates actual mobility, since incoming and outgoing people within an hour cancel each other out.

The mobility index above for Tokyo includes large rural areas that may have contributed less to COVID-19 transmission. For this study, we clustered the 8,500 grid cells into 6 groups based on hour-of-day population patterns (each grid cell is represented as a 24-variable vector) using the 2019 data.

Figures 11 and 12 show the change of the mobility index by age group. We observe that the largest drop in mobility occurred around March 25th, when the media started to discuss a potential "capital lockdown". When the official state of emergency was declared on April 9th, mobility had already dropped to less than half of its normal level. We also note that the senior groups, who are supposed to be at higher risk, responded to the epidemic first, but later the younger groups were more willing to stay at home.

We also investigated the potential use of the mobility index as an early indicator of the future spread of the disease. Figures 13 and 14 show the daily mobility index and the growth rate of confirmed cases in Tokyo.
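The mobility index defined above, the L1 norm of the difference between consecutive hourly population vectors, can be sketched as follows. This is a minimal illustration on invented toy data with three grid cells; the function name `mobility_index` and the list-of-lists layout are assumptions, not the format of the Mobile Kukan Toukei data.

```python
def mobility_index(populations):
    """Return the L1 distance between consecutive hourly population vectors.

    populations: list of equal-length lists; populations[t][i] is the
    population of grid cell i at hourly time point t (assumed layout).
    """
    index = []
    for current, following in zip(populations, populations[1:]):
        if len(current) != len(following):
            raise ValueError("population vectors must cover the same grid cells")
        # Sum of absolute per-cell changes = L1 norm of (P_t - P_{t+1}).
        index.append(sum(abs(a - b) for a, b in zip(current, following)))
    return index

# Toy data: 3 grid cells over 3 hourly time points, invented for illustration.
hourly = [
    [100, 50, 10],
    [80, 60, 20],   # 20 people left cell 0; 10 entered each of cells 1 and 2
    [80, 60, 20],   # no net change in any cell
]
index = mobility_index(hourly)
```

The second transition also illustrates the underestimation noted in the text: if people swapped between cells 1 and 2 in equal numbers during that hour, the per-cell populations would be unchanged and the index would still be zero.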
In these plots (Figures 13 and 14), we noticed that the drop in mobility around March 2nd may be correlated with the drop in the growth rate on March 14th, as shown by the blue arrows, and the pickup in mobility on March 18th may be correlated with the peak of the growth rate on March 29th, as shown by the red arrows. This is not conclusive, but it may suggest that the mobility index carries some signal for predicting the future spread of the disease. Different clusters show different responses in mobility to COVID-19. Figure 17 shows how mobility changes over time depending on the cluster.

We analyzed the changes in traffic volumes and in a bicycle sharing service in New York City to examine the effect of COVID-19 and of announcements from the city government. We found that the traffic volume has decreased significantly since the beginning of March, and that thousands of people use Citi Bike during the daytime every day.

We analyzed the mobility changes in New York City through road traffic data and tracking data of shared bikes. We retrieved public historical data about the traffic volume of freeways from NYC Open Data OpenData (2020b). As for the road traffic data, we extracted the daily average travel time and speed over 20-day ranges. The x-axis of Figure 19 and

As for the bike sharing data, we track the number of bikes available at each bike station every 30 seconds from NYC Open Data OpenData (2020a). We aggregated the number of departed bikes over 938 bike stations, located in most areas of Manhattan, in Brooklyn and Queens near the East River, and in Jersey City along the Hudson River. Since we can only track the number of available bikes at each station, we estimate the number of departed bikes by computing the difference in the number of available bikes between two timestamps. We developed an interactive visualization dashboard that illustrates how bikes have been used over time since March 23rd.
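The departure estimate described above, the difference in available bikes between two timestamps, can be sketched as follows. This is a minimal illustration that assumes each snapshot is a dictionary mapping a station identifier to its available-bike count; counting only decreases as departures is an assumption of this sketch, and the station names and counts are invented.

```python
def estimate_departures(snapshots):
    """Estimate departed bikes per station from successive availability counts.

    snapshots: list of {station_id: available_bikes} dicts ordered by time
    (assumed layout). A drop in availability between two snapshots is counted
    as departures; a rise is treated as arrivals and ignored.
    """
    departures = {}
    for earlier, later in zip(snapshots, snapshots[1:]):
        for station, bikes_before in earlier.items():
            if station not in later:
                continue  # station missing from the later snapshot; skip it
            drop = bikes_before - later[station]
            if drop > 0:
                departures[station] = departures.get(station, 0) + drop
    return departures

# Toy snapshots for two hypothetical stations, invented for illustration.
snapshots = [
    {"A": 10, "B": 5},
    {"A": 7, "B": 6},   # 3 bikes left A; 1 bike arrived at B
    {"A": 8, "B": 2},   # 1 bike arrived at A; 4 bikes left B
]
departures = estimate_departures(snapshots)
total_departed = sum(departures.values())
```

As with any difference-based estimate, a departure and an arrival at the same station within one sampling interval cancel out, so the result is a lower bound; a short interval such as the 30 seconds mentioned in the text keeps that effect small.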
After the NYC municipality recommended that citizens ride bikes instead of using public transportation, there was a surge at the beginning of March in the usage of Citi Bike, a privately-owned public bicycle sharing system in New York City Kuntzman (2020). Figure 18 shows the total number of hourly Citi Bike usages from March 23rd. More than a thousand people used Citi Bike in the peak hour every day, and on some days (e.g., April 19th, 25th, and May 2nd) peak usage exceeded 6,000. From the beginning of May, more than 4,000 people used bikes in the peak hour.

We analyzed the use of the public bike system and the traffic load within the Barcelona metropolitan area to understand how the COVID-19 pandemic and the government measures were affecting public movement. We detected that mobility was only significantly altered once the harshest measures were implemented, hinting at a potential inefficiency of mild measures. Furthermore, as the lockdown went on, mobility kept decreasing, indicating increasing adherence as the understanding of the severity increased.

The mobility data in Barcelona, Spain was collected through the location signal of Facebook application users who have consented to share their location, public bike sharing data, and road traffic data. As in New York City, public bike sharing is available in Barcelona. By analyzing the availability of docking stations throughout the city, we measure the changes in population mobility. Traffic data is obtained from the open data released by the Barcelona city hall. It includes over 100 measuring points, evenly distributed throughout the city.

The first detected case of COVID-19 within Spain was on January 31st in the Canary Islands, located more than 1,000 km from peninsular Spain. By late February, imported cases were detected on the mainland, and on February 26th the first endemic case was diagnosed. On March 9th, 999 cases were diagnosed and certain regions in Spain started implementing local restriction policies. By March 13th, cases had been detected across all 50 provinces. The following day, March 14th, the Spanish Government announced a state of emergency and implemented a lockdown for the whole population. Citizens were only permitted to travel for work, and all social events were prohibited. This lockdown was reinforced on March 29th with total mobility restrictions, with essential services as the only exception.

The first restriction against COVID-19 by the regional government of Barcelona was issued on March 11th, advising citizens to avoid gatherings of more than 1,000 people. Two days later, on March 13th, with 508 confirmed cases in the region, all classes were suspended. On March 14th, a national lockdown was declared by the Spanish government. The effects on mobility are only visible from March 13th, indicating that the local population did not alter their mobility patterns in response to earlier and milder governmental restrictions.

Posts with some hashtags related to physical distancing suddenly started increasing in mid-March 2020. The hashtag #zoom, referring to the online meeting software, was frequently posted all over the world in late March, and the stock price of the software company rose along with the increasing volume of posts with the hashtag.
A post on Instagram can be summarized and categorized by its hashtags. We analysed the following hashtags, which represent physical distancing and have increased significantly since March 2020: #stayhome, #stayathome, #socialdistancing, #workfromhome, and #zoom (online meeting software). As of April 11, 2020, more than 16 million posts with #stayhome had been uploaded to Instagram, including 6 million posts in March 2020. During mid-March 2020, the number of posts with #stayhome gradually rose as states across the U.S. announced stay-at-home orders. On March 24, 2020, Instagram announced that it had launched a "Stay Home" sticker to help those practicing physical distancing connect with others. This might also have boosted the number in March 2020.

The recent outbreak of COVID-19 has affected human life to a great extent. Besides direct physical and economic threats, the pandemic also indirectly impacts people's emotional conditions, which can be overwhelming but difficult to measure. We apply natural language processing (NLP) Vaswani et al. (2017), Qi, Zhang, Zhang, Bolton, and Manning (2020) to analyze the emotions expressed in tweets and attempt to find more in-depth topics and facts about emotions related to COVID-19 Li, Li, Li, Alvarez-Napagao, and Garcia (2020).

We have seen an increase in discussions tagged with hashtags such as #COVID-19 and #depression on Twitter and Instagram. We believe it is vital to analyze differences in. We also hope to analyze overall responses to the pandemic as well as changes in behavior due to the virus, and to generate reports on global situations regarding mental health. We plan to categorize the tweets and Instagram posts that mention COVID-19 by sentiment category: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. Further detailed analysis will also look into specific keywords and corresponding trends.
We used the Twitter API to build a crawler with a list of keywords: #coronavirus, #covid19, #covid, #covid19, #confinamiento, #flu, #virus, #hantavirus, #fever, #cough, #social #distance, #lockdown, #pandemic, #epidemic, #contagious, #infection, #stayhome, #corona, #epidemie, #epidemia, 新冠肺炎, 新型冠病毒, 疫情, 新冠病毒, 感染, 新型コロナウイルス, コロナ. Each day, we are able to crawl 3 million tweets in free text format in different languages. Because of the high volume, we look at the tweets from March 24 to 26, 2020 to get language and geolocation statistics. Among these tweets, 8,148,202 have language information (the "lang" field of the "Tweet" object in the Twitter API), and 76,460 have geographic information (the "country_code" value from the "place" field, if not none).

We applied a deep learning model (BERT) Devlin, Chang, Lee, and Toutanova (2018), trained on 750 manually labeled cases, to 1 million English tweets. Fear, anger and sadness ranked highest. We then looked at the emotion trends on different topics. Using BERT, we analyzed two topics: "mask" and "lockdown". To understand why people feel fear and sadness, we calculated correlations on the tweets categorized as fear and sadness, and then kept nouns and noun phrases with the help of the Stanford Stanza tool. We used LDA (Latent Dirichlet allocation) topic modeling Blei, Ng, and Jordan (2003) to analyze the topics of people's tweets. Each "topic" learned by the model is a group of keywords, which we then manually labeled as meaningful concepts.

By applying our model, we show the emotion distribution among the 8 categories in Fig. 31. The overall distribution does not differ much from day to day, so we show the results on 1 million tweets from March 29th, 2020. Note that these tweets include many languages besides English. The top emotions are strongly negative: fear, anger and sadness.
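The emotion distribution over the eight categories can be computed from per-tweet predictions as in the sketch below. It assumes that a classifier has already assigned exactly one emotion label to each tweet; the classifier itself is not shown, the function name is hypothetical, and the label list is invented toy data rather than the project's results.

```python
from collections import Counter

# The eight sentiment categories named in the text.
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")

def emotion_distribution(predicted_labels):
    """Return the share of each emotion among the predicted labels.

    predicted_labels: iterable with one emotion label per tweet (assumed to
    be the output of a single-label classifier). Unknown labels are ignored.
    """
    counts = Counter(label for label in predicted_labels if label in EMOTIONS)
    total = sum(counts.values())
    if total == 0:
        return {emotion: 0.0 for emotion in EMOTIONS}
    return {emotion: counts[emotion] / total for emotion in EMOTIONS}

# Toy predictions for ten tweets, invented for illustration.
labels = ["fear"] * 4 + ["anger"] * 3 + ["sadness"] * 2 + ["joy"]
distribution = emotion_distribution(labels)
top_three = sorted(distribution, key=distribution.get, reverse=True)[:3]
```

Running the same function on each day's predictions and comparing the resulting dictionaries is one simple way to check whether the distribution changes from day to day.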
We selected two weeks of data (March 25, 2020 - April 7, 2020) and applied our model to predict the emotions of all the tweets we crawled (around 3 million each day) that contain the keywords "mask" and "lockdown", respectively. We found that the dominant emotions and their variation over time are closely related to the topic. In Fig. 32 and 33, we illustrate the daily emotion trend for the selected keywords. High variation (plotted with solid lines in the figures) appeared in sadness, anger and anticipation for the tweets that contain the word "mask" in Fig. 32, and in disgust and sadness for the tweets that contain the word "lockdown" in Fig. 33. For the lockdown tweets in particular, the percentage of the disgust emotion increased significantly on March 27 and dropped over the next two days, as marked with the black asterisks. To investigate further, we looked at the news on March 27: the U.S. became the first country to report 100,000 confirmed coronavirus cases, 9 in 10 Americans were staying home, and India and South Africa joined the countries imposing lockdowns. Given that the United States, India and Brazil have large groups of Twitter users, we assume that this dramatic change may have been triggered by that news.

We set the number of topics to 5 and carried out a detailed analysis of the data from April 7th, 2020. The topics are listed below:

Topic 0: Covid19 testing, deaths cases, positive cases
Topic 1: President Trump, government, federal affairs
Topic 2: lockdown, stay at home, physical distancing
Topic 3: (Spanish) pandemic, health conditions
Topic 4: the peak, serious treatment, Boris Johnson

Figure 37 shows the distribution of the topics. We chose the data from April 7th. First, we ran inference on all the data and show the ratio of each topic learned above (All). Then we ran inference on only the tweets labeled as sadness or fear (Sad and Fear). The resulting ratio of each topic is as follows.
In general, the public may be worried mainly about Topics 3 and 2, namely the pandemic and the lockdown, which are making people stressed.

At the beginning of April 2020, the number of posts with #depression on Instagram doubled, which might reflect a rise in mental health awareness among Instagram users. We collected 71,737 posts on Instagram Inc. (2020) with #depression from March 31 to April 5, 2020. During this period, the number of posts increased steadily, as shown below. Among the posts with location information, #depression was mostly posted in the U.S., the U.K. and India. In those countries, users posted the hashtag during the local daytime. During the long quarantine, people might struggle to keep their mental condition healthy, and the trend of #depression on Instagram might reflect this. The number of hourly posts with #depression doubled in a week, from March 31 to April 5. Among the posts with location information, most were uploaded in the U.S. In all of the top three countries, the U.S., the U.K. and India, the hashtag was mainly posted between the afternoon and the evening.

The number of posts with #cough started increasing in mid-March, several days before the stay-at-home orders in some countries. Some governments could have taken their initial response earlier than they actually did, based on the increase in behavioral changes visible on social networking service platforms. Other hashtags related to users' behavioral changes, including #mask, #facemask and #stayalive, also started increasing in mid-March. We analyzed basic hashtags such as #covid19 and #coronavirus; hashtags related to medical supplies, #mask and #facemask; #ClapBecauseWeCare, a daily event in which people cheer for the medical professionals working on the frontline; and hashtags that support others, such as #stayalive. We have collected 513,712 posts with #mask and 251,452 posts with #facemask since February 20, 2020.
Figures 40 and 41 show that the two hashtags have been steadily increasing since around March 11, 2020, a few weeks earlier than some lockdown announcements in Europe and stay-at-home orders in the U.S. People might have considered how to protect themselves amid the pandemic before their governments imposed strict prohibitions on residents.

People's perception of COVID-19 varies depending on the number of infected cases in the local community. Also, people's behavior carries some signal indicative of the future spread of the disease. We analyzed the data of a national survey concerning COVID-19 news and resulting behavior changes, conducted on March 7 to 9 and provided by Survey Research Center Co. Ltd. 100 respondents were randomly selected from each prefecture from a national pool of approximately 2 million panelists. We also used national data on confirmed COVID-19 cases from March 6th (the day before the survey was conducted). In densely affected areas, higher levels of concern regarding the impact of COVID-19 on everyday work (Q14-8) were observed. In areas with lower numbers of confirmed cases, commuting behavior shifted to avoid public transportation. We also considered how people's perception and behavior may impact the spread of the disease. Since there is some delay between an infection and its appearance in the official statistics, we took the growth rate of the reported confirmed cases on March 27th as the indicator of the rate of spread. Figure 46 shows the strong positive and negative correlations.

The mobility analysis of Barcelona seemed to indicate that society assimilated the importance and severity of the situation during the second week of lockdown. This may be supported by this work, which indicates that most of the layoffs took place in this same week. This was, however, before the toughest part of the lockdown (during the first two weeks, travel to work was allowed).
In this regard, the hard lockdown seemed to have little further effect on unemployment. For this analysis we used the public ERTOs data from the Generalitat of Catalunya and the public unemployment data of SEPE.

During the lockdown, the Spanish government promoted a temporary unemployment scheme (ERTE) under which companies can temporarily lay off workers. In this period, 70% of the worker's salary is paid by the government, and the company may complement the rest. The purpose of this measure is to maximize the number of jobs that are restored after the economic lockdown. The success of this measure will have a great impact on the duration of the economic side-effects of the pandemic. We analysed the effect of this measure in Catalonia, one of the most populated Autonomous Communities of Spain. Catalonia includes the city of Barcelona and close to 3.4M workers. Industrial activity represents nearly 21% of the Catalan GDP, while tourism accounts for 12%.

In Figure 47 we plot the number of temporary unemployments per day, that is, the number of affected workers on each day. For context, the Spanish Government announced a state of emergency and implemented a lockdown for the whole population on March 14th. This lockdown was reinforced on March 29th with total mobility restrictions, with the only exception of essential services.

Figure 47: Volume of workers affected by temporary unemployment on each day

As seen in Figure 47, most of the layoffs happened during the second week of the lockdown, before the total lockdown. This indicates that, in the case of Catalonia, the mild lockdown and the hard lockdown have similar economic effects in terms of unemployment. On the worst day, the 25th of March, 2% of Catalan workers were laid off. By the end of May, the total number of workers affected by temporary unemployment in Catalonia was 491,789. Next, in Figure 48, we compare the unemployment volume with the volume of ERTEs per month.
Considering the difference between the ERTE volume (over 300K on the worst day) and the growth of unemployment (less than 100K accumulated), ERTEs seem to absorb many of the layoffs, mitigating the growth of unemployment at least temporarily. The behavior of unemployment in the coming months, as ERTEs expire, will provide the real measure of the effectiveness of ERTEs.

Blei, Ng, and Jordan (2003). Latent Dirichlet allocation.
Devlin, Chang, Lee, and Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
docomo (2020). Mobile Kukan Toukei.
Instagram Inc. (2020). people-safe-informed-and-supported-on-instagram
International, A. C. (2020). Preliminary world airport traffic rankings released - ACI.
Kuntzman (2020). Boom! New Citi Bike stats show cycling surge is real - but mayor is not acting.
Li, Li, Li, Alvarez-Napagao, and Garcia (2020). What are we depressed about when we talk about COVID19: Mental health analysis on tweets using natural language processing.
OpenData (2020a). Citi Bike live station feed (JSON), NYC Open Data.
OpenData (2020b). Real-time traffic speed data - NYC Open Data.
Qi, Zhang, Zhang, Bolton, and Manning (2020). Stanza: A Python natural language processing toolkit for many human languages.
Schäfer, Strohmeier, Lenders, Martinovic, and Wilhelm (2014). Bringing up OpenSky: A large-scale ADS-B sensor network for research.
Strohmeier (2020). OpenSky COVID-19 flight dataset.
Suzumura et al. (2020). The impact of COVID-19 on flight networks.
Polosukhin, I