key: cord-328461-3r5vycnr authors: Chire Saire, J. E. title: Infoveillance based on Social Sensors to Analyze the impact of Covid19 in South American Population date: 2020-04-11 journal: nan DOI: 10.1101/2020.04.06.20055749 sha: doc_id: 328461 cord_uid: 3r5vycnr Infoveillance is an application from Infodemiology field with the aim to monitor public health and create public policies. Social sensor is the people providing thought, ideas through electronic communication channels(i.e. Internet). The actual scenario is related to tackle the covid19 impact over the world, many countries have the infrastructure, scientists to help the growth and countries took actions to decrease the impact. South American countries have a different context about Economy, Health and Research, so Infoveillance can be a useful tool to monitor and improve the decisions and be more strategical. The motivation of this work is analyze the capital of Spanish Speakers Countries in South America using a Text Mining Approach with Twitter as data source. The preliminary results helps to understand what happens two weeks ago and opens the analysis from different perspectives i.e. Economics, Social. Infodemiology 1 is a new research field, with the objective of monitoring public health 2 and support public policies based on electronic sources, i.e. Internet. Usually this data is open, textual and with no structure and comes from blogs, social networks and websites, all this data is analysed in real time. And Infoveillance is related to applications for surveillance proposals, i.e. monitor H1N1 pandemic with data source from Twitter 3 , monitor Dengue in Brazil 4 , monitor covid19 symptoms in Bogota, Colombia 5 . Besides, Social sensors is related to observe what people is doing to monitor the environment of citizens living in one city, state or country. And the connection to Internet, the access to Social Networks is open and with low control, people can share false information(fake news) 6 . A disease caused by a kind coronavirus, named Coronavirus Disease 2019 (covid19) started in Wuhan, China at the end of 2019 year. This virus had a fast growth of infections in China, Italu and many countries in Asia, Europe during January and February. Countries in America(Central, North, South) started with infections at the middle of February or beginning of March. This disease was declared a global concern at the end of January by World Health Organization(WHO) 7 . South America has different context about economics, politics and social issues than the rest of the world and share a common language: Spanish. The decisions made for each government were over the time, with different dates and actions: i.e. social isolation, close limits by air, land. But, there is no tool to monitor in real time what is happening in all the country, how the people is reacting and what action is more effective and what problems are growing. For the previous context, the motivation of this work is analyze the capitals of countries with Spanish as language official to analyze, understand and support during this big challenge that we are facing everyday. This paper follows the next organization: section 2 explains the methodology for the experiments, section 3 presents results and analysis. Section 4 states the conclusions and section 5 introduces recommendations for studies related. The present analysis is inspired on Cross Industry Standard Process for Data Mining(CRISP-DM) 8 steps, the phases are very frequent on Data Mining tasks. So, the steps for this analysis are the next: • Select the scope of the analysis and the Social Network • Find the relevant terms to search on Twitter • Build the query for Twitter and collect data • Cleaning data to eliminate words with no relevance(stopwords) • Visualization to understand the countries . CC-BY-NC 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20055749 doi: medRxiv preprint Considering the countries where Spanish is the official language, there are 9 countries in South America: Argentina, Bolivia, Chile, Colombia, Ecuador, Paraguay, Perú, Uruguay, Venezuela and every nation has a different territory size as the table Tab. 1 shows. Therefore, analyze the whole countries could take a great effort about time then the scope of this paper considers the capital1 of each country because the highest population is found there. At the same time, there are many Social Networks with like Facebook, Linkedin, Twitter, etc. with different kind of objective: Entertainment, Job Search and so on. During the last years, data privacy is an important concern and there is update on their politics, so considering the previous restriction Twitter is chosen because of the open access through Twitter API, the API will help us to collect the data for the present study. Although, the free access has a limitation of seven days, the collecting process is performed every week. Actually, there is hundreds of news around the world and dozens of papers about the coronavirus so to perform the queries is necessary to select the specific terms and consider the popular names over the population. The selected terms are: Ideally, people only uses the previous terms but, citizens does not write following this official names then special characters are found like @, #, -, _. For this reason, variations of coronavirus and covid19 are created, i.e. { '@coronavirus', #covid-19', '@covid_19' } The extraction of tweets is through Twitter API, using the next parameters: • date: 08-03-2020 to 21-03-2020, the last two weeks . CC-BY-NC 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20055749 doi: medRxiv preprint • Change format of date to year-month-day • Eliminate alphanumeric symbols • Uppercase to lowercase • Eliminate words with size less or equal than 3 • Add some exceptions to eliminate, i.e. 'https', 'rt' This step will help to answer some question to analyze what happens in every country. • How is the frequency of posts everyday? • Can we trust on all the posts? • The date of user account creation • Tweets per day to analyze the increasing number of posts • Cloud of words to analyze the most frequent terms involved per day The next graphics presents the results of the experiments and answer many questions to understand the phenomenon over the population. At beginning, a fast preview about the frequency of post per country will support us to understand how many active users are in every capital. Four things are important to highlight from Fig. 2: (1) Venezuela is a smaller country but the number of posts are pretty similar to Argentina, (2) Paraguay is almost a third from Peru territory and the number of publications are very similar, Chile is one small country but the number of publication are higher than Peru and (4) Uruguay is the smallest one with more tweets than Bolivia and Colombia even Ecuador has more. By other hand, considering data from Table 1 , there is a strong relationship between Internet, Social Media and Mobile Connections in Argentina, Venezuela with the number of tweets and but a different context for Colombia, this insight show us the level of using in Bogota and says how the Internet Users are spread in other cities on Colombia. So, a similar behavior explained previously is present over this data. . CC-BY-NC 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20055749 doi: medRxiv preprint Considering the image Fig.2 , the number of post for each country, the total number of tweets is up to five millions(5 627 710), close to half of million(401 979) per day. So, the question about veracity is important to filter and analyze what people is thinking, because the noise could be a limitation to understand what truly happens. By consequence, it is necessary to consider some criterion to filter this data. First, Argentina has the highest number of publications in last two weeks. For example, the firt dozen of the top users in Buenos Aires are: 'Portal Diario', '.', 'Clarín', 'Radio DoGo', 'Camila','El Intransigente', 'Agustina', 'Pablo', 'FrenteDeTodos', 'Ale','Lucas', 'Diario Crónica' Later, a search about the users, one natural finding is: they are related to newspapers, radio or television(mass media). But there is people with many hundreds of tweets and regular people. The next image Fig. 3 has the names of users and quantity of posts. #bolivia siete covid @pagina gobierno salud @larazon cruz ministro medidas cuarentena caso #coronavirusbo santa @rtp pais #esultimo emergencia personas @erboldigital anibal paciente confirma pide #lapaz informa confirmados nuevo oruro presidenta hospital primer @jeanineanez #elalto #urgente pacientes #deahora prevencion presidente ciudad evitar anuncia informo @sumaj warmi nacional nuevos dice sospechosos #loultimo alto prevenir centro medicos atencion china italia declara cuba tres horas #deultimo poblacion #santacruz ministerio propagacion video mundo hospitales enfrentar autoridades tras @yerkogarafulic #coronavirusmundo luis @luchoxbolivia #anibalcruz jeanine medico debido #mundo #ultimo virus gobernacion #videonoticias pandemia municipal estan primera manos policia reporta suspension #oruro @radiolider97 frente Helping the visualisation from Monday to Sunday during the last two weeks, a cloud of words is presented in Fig. 6 showing the first thirty terms per country. It is important to remember every country promote different actions on different dates. . CC-BY-NC 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.06.20055749 doi: medRxiv preprint Infodemiology and infoveillance: Tracking online health information and cyberbehavior for public health Social web mining and exploitation for serious applications: Technosocial predictive analytics and related technologies for public health, environmental and national security surveillance Pandemics in the age of twitter: content analysis of tweets during the 2009 h1n1 outbreak Building intelligent indicators to detect dengue epidemics in brazil using social networks What is the people posting about symptoms related to coronavirus in bogota, colombia? Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study WHO. Who statement regarding cluster of pneumonia cases in wuhan, china. Beijing: WHO 9 The crisp-dm model: The new blueprint for data mining Infoveillance based on Social Sensors with data coming from Twitter can help to understand the trends on the population of the capitals. Besides, it is necessary to filter the posts for processing the text and get insights about frequency, top users, most important terms. This data is useful to analyse the population from different approaches.