key: cord-0276378-pd1oonje
authors: Ambavi, Heer; Vaishnaw, Kavita; Vyas, Udit; Tiwari, Abhisht; Singh, Mayank
title: CovidExplorer: A Multi-faceted AI-based Search and Visualization Engine for COVID-19 Information
date: 2020-11-30
DOI: 10.1145/3340531.3417428
sha: 67161ba0e3576004a17660d3130006ef7ae71d6e
doc_id: 276378
cord_uid: pd1oonje

The entire world is engulfed in the fight against the COVID-19 pandemic, leading to a significant surge in research experiments, government policies, and social media discussions. A multi-modal information access and data visualization platform can play a critical role in supporting research aimed at understanding the pandemic and developing preventive measures against it. In this paper, we present a multi-faceted AI-based search and visualization engine, CovidExplorer. Our system aims to help researchers understand current state-of-the-art COVID-19 research, identify research articles relevant to their domain, and visualize real-time trends and statistics of COVID-19 cases. In contrast to other existing systems, CovidExplorer also brings in India-specific topical discussions on social media to study different aspects of COVID-19. The system, demo video, and the datasets are available at http://covidexplorer.in.

With COVID-19 infections growing exponentially, government and private organizations are spending heavily on R&D infrastructure and essential medical facilities. This has led to a surge in scientific output, ranging from proposals for innovative medical devices and vaccines to infection prediction and propagation models for COVID-19. For instance, we witness several tongue-in-cheek media headlines like 'Scientists are drowning in COVID-19 papers. Can new tools keep them afloat?' 1 . Several COVID-19 specific search and recommendation tools have been developed recently by various research groups.
2 However, to the best of our knowledge, systems that leverage graph-based interactive visualizations to navigate the vast research volume are not available. The large volume of social media discussions around COVID-19 has also exposed several natural language processing (NLP) challenges, such as distinguishing factual from opinionated messages, identifying COVID-19 specific trending topics, and detecting fake or hateful messages. Apart from COVID-specific data curation strategies [2], we find only a few works [4, 6] that address some of these NLP challenges. None of the available systems has attempted to combine social media discussions with scholarly search and recommendation.

In this paper, we propose a multi-faceted AI-based search and visualization engine, CovidExplorer. Our system aims to serve as a means for researchers to understand COVID-19 research and visualize the trends in the expanding pool of scientific articles on coronaviruses. Our system integrates three different aspects of COVID-19 into a single platform: (i) search and recommendation, (ii) statistics, and (iii) social media discussions. The system aims to provide researchers and decision-makers with the latest global research updates and social media discussions in India.

The development of CovidExplorer leverages two rich time-stamped datasets. The first dataset (hereafter 'CORD-19' dataset) comprises ∼157,000 scholarly articles, including over 75,000 full-text articles on coronaviruses, with a year range beginning in 1991. The White House and a coalition of leading research groups provide CORD-19 [7] to the global research community. We also curate about 227 million tweets relevant to the COVID-19 pandemic (hereafter 'Tweet' dataset). We periodically collect relevant tweet IDs from a publicly available COVID-19 TweetID corpus [2]. The tweet IDs are hydrated using the DocNow Hydrator tool 3 .
We only explore India-specific tweets, which we identify by using location metadata or the presence of Indian locations (states, cities, towns, and village names) in the tweet text. We also include tweets that are posted as a reply to the India-specific tweets, as well as tweets to which India-specific tweets were posted as a reply. In this paper, we present all our analyses on this subset of tweets. Both datasets are updated weekly. Table 1 presents statistics of the CORD-19 and Tweet datasets. We release the processed data with a clear and accessible data usage license 4 .

In this section, we present the detailed architecture of CovidExplorer. The proposed system comprises four interdependent modules: (i) Data Retrieval and Storage, (ii) Information Generation, (iii) Query Processing, and (iv) User Interface. Figure 1a shows the system architecture. Figure 1b shows the landing page of CovidExplorer.

(1) Data Retrieval and Storage Module: This module periodically downloads the data updates from both data sources (described in Section 2) and performs pre-processing. The pre-processed data is stored in a semi-structured format to facilitate quick query responses. The Elasticsearch engine indexes the CORD-19 dataset, and the Tweet dataset is stored as CSV dumps.

(2) Information Generation Module: This module processes the curated data. It extracts biological entities from the CORD-19 dataset and updates entity statistics (Section 4.1.2). The Elasticsearch engine also indexes this generated entity information. In the case of Tweet data, the module performs basic location-based filtering, generates timelines, performs topic modeling, and generates India-specific insights.

We use the Flask 5 framework, written in Python, for web deployment. We use Elasticsearch 6 for data storage and querying. The timelines are generated using TimelineJS 7 and plots are rendered using Plotly 8 , Dash 9 and amCharts 10 . The tweet preprocessing is done using the NLTK and Gensim libraries.
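The location-based filtering of tweets described above can be illustrated with a minimal sketch. It is not the system's actual code: the gazetteer `INDIAN_LOCATIONS`, the field names `id`, `text`, `user_location`, and `in_reply_to`, and the single-token matching are all assumptions made for illustration. A real gazetteer would be far larger and would need to handle multi-word place names.

```python
import re

# Hypothetical, tiny gazetteer; the actual list of states, cities,
# towns, and villages used by the system is not given in the paper.
INDIAN_LOCATIONS = {"india", "gujarat", "kerala", "mumbai", "delhi"}

_WORD = re.compile(r"[a-z]+")


def mentions_india(text):
    """Return True if any token of `text` is a known Indian location."""
    return any(tok in INDIAN_LOCATIONS for tok in _WORD.findall(text.lower()))


def is_india_specific(tweet):
    """Check the location metadata first, then the tweet text."""
    return (mentions_india(tweet.get("user_location") or "")
            or mentions_india(tweet.get("text") or ""))


def filter_india_tweets(tweets):
    """Keep India-specific tweets, replies to them, and the tweets they reply to."""
    seed = {t["id"] for t in tweets if is_india_specific(t)}
    # Tweets to which an India-specific tweet was posted as a reply.
    parents = {t["in_reply_to"] for t in tweets
               if t["id"] in seed and t.get("in_reply_to") is not None}
    return [t for t in tweets
            if t["id"] in seed
            or t["id"] in parents
            or t.get("in_reply_to") in seed]
```

The reply expansion is applied once, matching the two reply conditions stated in the text; it does not follow whole conversation threads.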
The flow value between Hydroxychloroquine and each entity type is the total number of co-mentions of that type.

We develop COVID-19 scholarly search facilities using the CORD-19 dataset. The search facility comprises several components described in the following sections.

CovidExplorer supports keyword-based search in three categories: authors, papers (title), and full-text papers. In addition to the text of the articles, metadata such as title, abstract, author names, publication year, and venue is used. Each search query returns a list of relevant papers, displayed along with their authors, venue, date of publication, and bio-entity mentions (described in the next section), and segregated by year of publication. Each search result is linked to the original paper source URL. The search results can be further filtered by a range of publication years or by their bio-entity mentions. Figure 2 shows an example of paper search results for the keyword 'hydroxychloroquine'.

CovidExplorer is equipped with a Named Entity Recognition (NER) system to aid navigation through the large volume of papers. The NER system uses SciBERT [1], a state-of-the-art language model for scientific and biomedical text. We fine-tune SciBERT using the JNLPBA corpus [5] and the NCBI-disease corpus [3]. Every article in the search results shows the biological entities extracted from its abstract. The current NER functionality identifies seven different types of bio-entities: DNA, RNA, proteins, cell types, cell lines, diseases, and chemical names. Any entity that is tagged with multiple types is assigned the type with which it occurs most often. We assume that the other entity types (except disease) are sub-types of the chemical name entity type. Hence, these types are given priority over the chemical name type during the assignment. We provide multiple insights about the recognized entities.
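The type-assignment rule from the NER description above can be sketched as follows. This is one possible reading, not the authors' implementation: the label strings, the helper name `resolve_entity_type`, and the alphabetical tie-break are assumptions, since the paper specifies neither the labels nor how ties are resolved.

```python
from collections import Counter

CHEMICAL = "chemical"
# Types the paper treats as sub-types of chemical names (all except disease).
# The exact label strings are assumptions.
CHEMICAL_SUBTYPES = {"DNA", "RNA", "protein", "cell_type", "cell_line"}


def resolve_entity_type(predicted_types):
    """Pick one type for an entity from the types predicted across its mentions."""
    counts = Counter(predicted_types)
    if not counts:
        raise ValueError("entity has no predicted types")
    # A sub-type takes priority over the generic chemical label,
    # so drop the chemical label whenever a sub-type is present.
    if CHEMICAL in counts and any(t in CHEMICAL_SUBTYPES for t in counts):
        del counts[CHEMICAL]
    # Choose the most frequent remaining type; ties are broken
    # alphabetically (an assumption, for determinism only).
    return min(counts, key=lambda t: (-counts[t], t))
```

For example, an entity tagged three times as a chemical and once as a protein resolves to the protein type under this reading, while an entity tagged as chemical and disease is decided by frequency alone.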
For each entity type, we display a timeline that visualizes the first mention of each entity of that type, as shown in Figure 3a. We also list the top-10 most frequently mentioned entities for each entity type. Table 2 shows the statistics along with the names of popular instances in each entity class. For each entity, we also individually show the first mention, a visualization of popular co-mentioned entities, and the year-wise distribution of mention frequencies. We also display all papers that contain these individual entities, segregated by year of publication. Figures 3b and 3c show the entity statistics and co-mentioned entities for the candidate entity 'Hydroxychloroquine'.

CovidExplorer also displays a statistics page that keeps track of the daily evolving pandemic situation in India. It updates the total cases, active cases, deaths, and recovery counts daily. It also provides state-wise cumulative data on the number of active cases.

The filtered dataset of ∼5.7 million tweets is processed further. In the following sections, we describe some of the India-specific social media insights.

4.3.1 Temporal Activity and Trends. The CovidExplorer social media page displays an interactive timeline of Twitter activity over the duration of the pandemic, as well as a state-wise geographical distribution of Twitter activity in India. It also shows the most common hashtags, mentions, tweet locations, and URLs. The shared URLs are examined to determine the most commonly tweeted domain names. To quantify the misinformation prevailing in the tweet volumes, we classify the mentioned domains into misinformation source and trusted source categories. We use the list of low-quality misinformation sources (LQMS) curated by NewsGuard 12 . Table 3 shows the statistics of the presence of LQMS in the Tweet dataset.

11 https://www.mohfw.gov.in/
12 https://www.newsguardtech.com/coronavirus-misinformation-tracking-center/

Figure 4 shows visualisations of social media interactions.
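The domain analysis of shared URLs described above can be sketched roughly as follows. The two domain sets are placeholders rather than the NewsGuard list or any real trusted-source list, and the function names are hypothetical. The sketch also assumes the URLs have already been expanded from shortened links.

```python
from collections import Counter
from urllib.parse import urlparse

# Placeholder domain lists for illustration only; the system uses the
# NewsGuard LQMS list, whose contents are not reproduced here.
LQMS_DOMAINS = {"lqms.example"}
TRUSTED_DOMAINS = {"trusted.example"}


def domain_of(url):
    """Return the lower-cased host of `url` without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_stats(urls):
    """Count shared domains and tally them by source category."""
    domains = Counter(domain_of(u) for u in urls)
    domains.pop("", None)  # discard strings that did not parse as URLs
    categories = Counter()
    for domain, n in domains.items():
        if domain in LQMS_DOMAINS:
            categories["lqms"] += n
        elif domain in TRUSTED_DOMAINS:
            categories["trusted"] += n
        else:
            categories["other"] += n
    return domains, categories
```

Calling `domains.most_common(k)` on the first return value gives the k most frequently tweeted domains, and the second return value gives the per-category totals of the kind a table of LQMS presence would report.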
Evolution. The tweets are further processed using the Latent Dirichlet Allocation (LDA) algorithm to infer their topical distribution. The LDA results are displayed interactively, with the month-wise distribution of topics and the keywords for each topic along with their probabilities.

In this work, we present CovidExplorer. CovidExplorer aids researchers in understanding current state-of-the-art COVID-19 research, identifying research articles relevant to their domain, visualizing real-time trends and statistics of COVID-19 cases, and understanding social media discussions. In the future, we aim to provide APIs that enable researchers to access the query results. We also aim to create named entity-based networks that visually depict the context in which entities occur across the resource pool. The Tweet dataset can be further processed using BERT-like models to construct sentence embeddings for unsupervised topical clustering, which can be an effective way of labeling the dataset.

SciBERT: A Pretrained Language Model for Scientific Text
COVID-19: The First Public Coronavirus Twitter Dataset
Content analysis of Persian/Farsi Tweets during COVID-19 pandemic in Iran using NLP
Revised JNLPBA Corpus: A Revised Version of Biomedical NER Corpus for Relation Extraction Task
COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter
CORD-19: The Covid-19 Open Research Dataset

This work was partially supported by the Google Cloud COVID-19 credits program.