key: cord-0912778-ms9xh1t1
authors: Zuo, Xu; Chen, Yong; Ohno-Machado, Lucila; Xu, Hua
title: How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles
date: 2020-12-02
journal: Brief Bioinform
DOI: 10.1093/bib/bbaa331
sha: 64ca113a19a5824d7b8ad133f496bc515f698da1
doc_id: 912778
cord_uid: ms9xh1t1

OBJECTIVE: This study reviews novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles and provides a quantitative analysis to answer questions related to dataset contents, accessibility and citations. METHODS: We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets. RESULTS: We found that 28.5% of 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique dataset links were mentioned in these articles. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns of COVID-19 datasets varied depending on specific datasets. CONCLUSION: PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.

The novel coronavirus disease (COVID-19) outbreak was first reported in Wuhan, China, on 31 December 2019. On 11 March 2020, the World Health Organization officially declared COVID-19 a pandemic, marking the recognition of a global crisis [1]. To fight the past seven months. Along with published articles, massive and heterogeneous datasets have been created, ranging from testing and case statistics at various locations (medical centers, cities, counties, states, countries), clinical data from studies (e.g., 'omics, imaging, assays, questionnaires) or from electronic health records, surveys for patient-reported outcomes, administrative data [e.g., ventilators, hospitalizations, intensive care unit (ICU) beds], vital statistics (e.g., obituaries, death certificates), as well as sociodemographic, environmental, economic, individual mobility and transportation data.

Efficient sharing of biomedical data is an important component of successful data-driven research on COVID-19 [3]. For example, researchers reconstructed the early evolutionary paths of COVID-19 by genetic network analysis using existing virus genome data collected across the world, providing insights into virus transmission patterns [4]. Nevertheless, it is challenging for researchers to find and identify reliable datasets for novel scientific discoveries, given the large volume of, and sometimes contradictory information (e.g. non-peer-reviewed sources) about, available datasets. Principles such as FAIR (Findable, Accessible, Interoperable and Reusable) [5] and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) [6] have been proposed for sharing digital data and digital repositories, with applications to COVID-19 datasets as well (e.g. the Virus Outbreak Data Network) [7].

Here, we propose to conduct a systematic review on COVID-19 datasets that are associated with published literature. Our study aims at identifying a comprehensive list of available COVID-19 datasets across domains and at providing insights on how researchers share datasets as they publish COVID-19 research articles. Additionally, we assess the accessibility, sustainability and impact of published datasets. More specifically, we attempt to answer the following research questions about COVID-19 datasets that are associated with publications:

Q1. Contents: What types of data are published to support different studies and where are those data collected from?

Q2. Accessibility: How can users access datasets and where are the data hosted?

Q3. Citation: How are datasets cited by others and what are the top high-impact datasets, by citation count?

Our ultimate goal is to promote data sharing and data reuse through careful analyses of current practice by researchers. Through a systematic review, we provide researchers with a comprehensive list of reliable datasets that are available to the public. Additionally, we provide insights about data sharing strategies to aid those who plan to develop and publish new COVID-19 datasets.

To identify and collect COVID-19-related articles, we leveraged LitCovid [2], a newly established literature database for tracking the latest scientific articles about COVID-19, developed by the National Library of Medicine in the United States. LitCovid provides essential bibliographic information such as PubMed ID, title, abstract and journal of publications related to COVID-19. In this review, we included all LitCovid articles published before 31 May 2020, resulting in 18 332 articles. As the recognition of associated datasets requires access to full-text articles, we further limited articles to those with full text available in PubMed Central (PMC), which is one of the most significant open access repositories of full-text biomedical articles. We then removed the errata notes of 16 articles. This reduced the number of articles to 12 324, from which we carried out our dataset collection process.
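The screening counts in this review can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the figure of 5992 LitCovid articles without PMC full text is the one quoted in the limitations discussion of this review.

```python
# Illustrative check of the article screening counts; names are hypothetical.
litcovid_articles = 18332  # LitCovid articles published before 31 May 2020
not_in_pmc = 5992          # articles without full text available in PMC
errata_notes = 16          # errata notes removed before dataset collection

with_full_text = litcovid_articles - not_in_pmc
screened = with_full_text - errata_notes

print(f"full text in PMC: {with_full_text}, screened: {screened}")
```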

We manually reviewed 100 PMC full-text articles and identified the following patterns for mentioning datasets:

(1) Dataset information is available in the Data Availability Statement section provided by PMC, which allows authors to disclose information about data availability and access and often contains URL links to data sources, or (2) when the Data Availability Statement section is missing, datasets may be mentioned in the full text as (a) external URL links to the data sources, (b) supplemental files (e.g. additional tables, sometimes in PDF) or (c) textual statements about data availability (e.g. 'available upon request').

As datasets from categories 2b and 2c often required additional effort before they could be used in calculations, we limited our data collection to categories 1 and 2a, which led to the task of identifying URL links from PMC full-text articles. External URL links in PMC articles do not always refer to datasets. Therefore, we developed a process that combines automatic extraction with manual review to identify dataset links mentioned in articles. We first downloaded the full texts of 12 324 PMC articles in XML format using E-Fetch queries [8]. All URLs tagged with the markup 'ext-link' were then automatically extracted from articles. This included URLs both in the main text and in the citations. These URLs then underwent a normalization step, in which prefixes and extensions such as 'HTTP' and 'htm' were removed, resulting in a list of 23 467 URLs in total. We then manually reviewed all of them and identified 144 links directing to actual datasets. We noticed that a single dataset can be associated with multiple links. For example, the Johns Hopkins University Dashboard [9] was cited in articles using four different URLs. After merging the different data links that directed to the same dataset, we obtained 128 unique datasets from the verified data links. The complete process of extracting COVID-19 publications from LitCovid and extracting datasets mentioned in full-text COVID-19 publications in PMC is described in Figure 1.
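The automatic extraction and normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in this study: it assumes the article XML stores each link target in an `xlink:href` attribute on `ext-link` elements, and the function names (`extract_ext_links`, `normalize_url`, `unique_normalized`), the sample document and the exact normalization rules (dropping the scheme, a trailing slash and a trailing `.htm`/`.html`) are hypothetical choices made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace-qualified name of the xlink:href attribute (assumed XML layout).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_ext_links(xml_text):
    """Return the URL of every ext-link element, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Compare the local tag name so a default namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] != "ext-link":
            continue
        # Fall back to the element text when no href attribute is present.
        href = element.get(XLINK_HREF) or (element.text or "").strip()
        if href:
            urls.append(href)
    return urls


def normalize_url(url):
    """Drop the scheme, a trailing slash and a trailing .htm/.html (assumed rules)."""
    url = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    url = url.rstrip("/")
    return re.sub(r"\.html?$", "", url, flags=re.IGNORECASE)


def unique_normalized(urls):
    """Normalize URLs and keep one entry per normalized form, in first-seen order."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen)


# Hypothetical two-link document; both links point at the same page.
sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink"><body><p>
<ext-link ext-link-type="uri" xlink:href="https://example.org/data/index.html">data</ext-link>
<ext-link ext-link-type="uri" xlink:href="HTTP://example.org/data/index.htm/">mirror</ext-link>
</p></body></article>"""

print(unique_normalized(extract_ext_links(sample)))
```

Normalization of this kind only collapses trivially different spellings of the same URL. Deciding whether a link points to a dataset, and merging distinct URLs that lead to the same dataset, still required manual review.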

For each of the 128 COVID-19 datasets, we manually reviewed its web pages. We extracted information for 10 descriptive variables: (1) type of the dataset (e.g. epidemiological or genomic data), (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) file format (e.g. CSV), (5) where the dataset was hosted, (6) whether the data were updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PMC paper describing the creation of the dataset and (10) the number of times the dataset was cited by PMC articles (either via URL links or via articles). The definitions and examples of values for the 10 variables are shown in Table 1.

Among 12 324 PMC articles screened, 249 papers included Data Availability Statement sections, and 23 papers provided valid online data sources. Of the papers without the Data Availability Statement (12 075), 3486 papers contained at least one dataset [97, 105] and medications [99] and (3) imaging datasets (N = 3; 2.3%) contained chest computed tomography (CT) images for COVID-19 patients plus others [110] [111] [112]. Mobility datasets (N = 7; 5.5%) [128] and population data [129, 131] were also found in articles investigating the effects of disease transmission and long-term impacts of the pandemic. Figure 2 illustrates the geographic regions that the datasets covered. More than half of the datasets (N = 68; 53.1%) incorporated data from around the world. Of the total, 18 (14.1%) datasets involved data from China. Multiple datasets related to epidemiology were reported in the United States [66, 69, 90, 102], United Kingdom [40, 54, 62] and India [35, 36, 57] as the coronavirus disease spread to these countries. Aside from country-level data, datasets were also created that covered the entire continents of Africa [59] and Europe [109]. There were also smaller datasets that covered only states [39, 47], counties [56] and cities [14, 16, 17, 43, 53]. Such datasets were often created by local health departments and incorporated detailed demographic breakdowns of COVID-19 patients.

Among the 128 datasets in our study, 20 did not provide clear downloading information. Users who wish to use these datasets need to contact the owners for download instructions. Therefore, we marked the accessibility of such datasets as 'request needed'. The remaining 108 (84.4%) datasets were instantly downloadable. Registration before accessing the data was required for 9 of the 108 downloadable datasets.

Of the 108 datasets that could be downloaded instantly, 19 were available in multiple formats. CSV (N = 53; 49.1%), XLSX (N = 27; 25.0%) and JSON (N = 10; 9.3%) were the three most popular formats for dataset exchange. Almost all genetic studies shared data in FASTA. RDS and RDA were two common data formats in studies that used the R programming language [28, 60, 68, 106, 108, 126]. Imaging datasets typically shared CT images as JPG files. Datasets of protein structures offered data in PDB files [92, 95]. GeoTIFF files were provided in a worldwide population dataset, allowing the data to be projected onto a geographical map [129].

As shown in Table 3, the most popular data repository was GitHub, which hosted 57 (44.5%) datasets. Six (4.7%) datasets were stored on Mendeley Data, a cloud-based repository for research data from scholarly articles. Individual webpages (N = 55; 43.0%) refer to datasets accessible only via stand-alone websites, as opposed to those deposited in established data repositories.

More than half (N = 74; 57.8%) of the datasets were being updated regularly (often daily or weekly). If data depositors did not offer any information regarding the updating frequency, we treated those datasets as not being updated on a regular basis. We recorded the date of the last update on those datasets. Figure 4 illustrates the number of datasets that stopped updating in each month. Table 4 shows the statistics for data licensing. Among the 128 datasets we collected, 39 (30.5%) clearly specified data licenses to allow permitted use of the datasets. The COVID-19 Image Data Collection [111] used multiple licenses for different subsets of data. A further 48 (37.5%) datasets stated their own terms and policies for data usage in detail online. Another 11 (8.6%) datasets required users to cite their associated papers when using the data but did not offer other information on data sharing and usage. The remaining 30 (23.4%) datasets did not release any information regarding data usage.

Of the 108 datasets that were immediately downloadable, 84 (77.8%) provided metadata in machine-readable formats. Several datasets [40, 74, 78, 125, 130] and data deposited in established data repositories (GitHub, Mendeley and Kaggle) offered application programming interfaces (APIs) to automatically retrieve metadata. Ten (9.3%) datasets provided metadata in free text, which included information such as dataset names, data owners and data descriptions. The remaining 14 (13.0%) datasets did not release any information on metadata.
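The license and metadata breakdowns reported above can be verified with a short calculation. The sketch below is illustrative: the helper name `shares` and the category labels are hypothetical, while the counts are those reported in the text (128 datasets for licensing, 108 downloadable datasets for metadata).

```python
def shares(counts):
    """Return each category's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts taken from the text; the labels are shorthand chosen for this example.
licenses = {
    "explicit license": 39,
    "own terms and policies": 48,
    "citation request only": 11,
    "no usage information": 30,
}
metadata = {"machine readable": 84, "free text": 10, "none": 14}

for name, counts in (("licenses", licenses), ("metadata", metadata)):
    print(name, sum(counts.values()), shares(counts))
```

Checking that the counts in each breakdown add up to the stated denominator is a quick way to catch transcription errors in descriptive statistics of this kind.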

Regarding dataset article availability, 41.4% (N = 53) of the datasets were described in detail in publications on PMC. Of the 53 articles describing datasets, 5 described extensively the purpose and techniques of building COVID-19 databases. The main purpose of the remaining 48 articles was to carry out modeling, prediction or other types of analysis in diverse domains, with some description of the datasets used in the study. These were often datasets that were not updated on a regular basis: the data were collected, standardized and maintained by the authors themselves for the specific studies.

Figure 5 shows the number of citations for each dataset. Typically, a dataset can be cited in two ways in articles: (1) as a URL in the full text and (2) as an article that describes the dataset. It is possible for an article to cite both the URL and the article of the dataset. The number shown in Figure 5 is for the overall citations (both articles and URLs), with duplicated citations removed. The number of citations varied heavily across datasets. The dataset available in the Johns Hopkins University Dashboard [9] was the most popular dataset and was cited 454 times. Of the top 10 datasets, 9 were from the epidemiology domain. However, a low number of citations does not necessarily indicate that a dataset has little impact; it may simply reflect that the dataset has not yet had enough time to accrue citations (i.e. more recently published datasets). Table 5 presents the top 10 cited datasets in our study. The Johns Hopkins University Dashboard [9] had a large number of citations both as an online data link and as a publication. Worldometers [33] and CDC [31] are high-impact data sources for COVID-19 case updates and were frequently cited as external data links. They are used in the Johns Hopkins University Dashboard but, since they do not accrue citations indirectly, their impact may be underestimated. The remaining seven datasets were almost all cited as articles published on PubMed and had few or no URL citations.
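The overall citation count described above, in which an article citing both the URL and the describing paper is counted once, amounts to taking the union of two sets of citing articles. The following is a minimal sketch under that reading; the function name and the article identifiers are hypothetical placeholders, not data from this study.

```python
def overall_citations(url_citing_articles, paper_citing_articles):
    """Count distinct citing articles across URL and paper citations."""
    return len(set(url_citing_articles) | set(paper_citing_articles))


# Hypothetical identifiers: one article cites both the URL and the paper.
url_citers = ["PMC0000001", "PMC0000002", "PMC0000003"]
paper_citers = ["PMC0000003", "PMC0000004"]

print(overall_citations(url_citers, paper_citers))
```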

Although no extensive analyses have been carried out on availability, accessibility and type of COVID-19 datasets, discussion on the collection and sharing of COVID-19 data has received great attention among the scientific community: Alamo et al. [136] highlighted a variety of significant open data sources and evaluated the limitations and readability of available data. They concluded that notable progress was achieved by certain scientific communities, particularly among epidemiologists, healthcare specialists, the machine learning community and data scientists. Several studies also reviewed and explored available COVID-19 data in specific domains. Kalkreuth and Kaufmann [137] reviewed publicly available medical imaging resources for COVID-19 cases worldwide. Rubin [138] reported on recent progress in collecting data of ventilated patients confirmed with COVID-19. Robinson and Yazdany [139] described an initiative to collect data about COVID-19 patients with rheumatic diseases. Khalatbari-Soltani et al. [140] listed a series of important socioeconomic characteristics often overlooked when collecting and reporting social science data related to the pandemic.

In our study, we took a different approach and conducted a systematic review of COVID-19 literature in PMC to identify associated datasets. A number of interesting findings were identified through our analysis. First, although PMC implemented Epidemiology datasets constituted more than half of our dataset collection, while imaging datasets accounted for 2.3%, indicating the need to develop more datasets for the latter and for related domains, which will probably require worldwide collaboration in order to grow to the same size as epidemiology datasets. As for data format, although FAIR [5] recommends the RDF (Resource Description Framework) format, no dataset in this study has adopted RDF, probably because common machine-readable formats such as CSV, JSON and TXT are easier to understand. We observed two major types of practices in licensing data usage. Data owners who use established data repositories often use a variety of existing data licenses to grant data usage and sharing. On the other hand, data owners who publish datasets on individual webpages prefer to specify their own terms and policies. Overall, 76.6% (98/128) of data owners allow non-commercial use of data and specify the degree of openness by releasing data usage policies. The data update frequency relied heavily on the objectives of creating the dataset. Among the 75 datasets available only as online sources, the majority were updated regularly for public use. However, for the 41.4% (53/128) of datasets that are associated with publications, the authors collected and maintained the datasets themselves for different purposes. Five articles aimed at describing how the COVID-19 databases were built, and they discuss data collection, storage and visualization. The remaining 48 articles focused on modeling, predictions or other analysis related to COVID-19. The authors of these analysis articles kept not only the data but also the code and tools they used in their own studies. The datasets mentioned in these articles represent the collection of raw data that the authors used as input for their analysis. Such data are often limited to a period of time and contain a relatively small number of cases.

We observed two approaches for citing datasets: (1) URL citations: citing URLs that lead to the data sources and (2) article citations: citing the article that describes the dataset. After examining the articles that cited datasets in the full text, we also discovered two major purposes of citing datasets: (1) citing a dataset as the data source used in the study and (2) citing a dataset as a general reference. Researchers are typically more likely to have used the dataset if they cite it directly as a URL. On the other hand, when citing a dataset as an article, the authors are more likely to mention it as a general reference instead of citing the data sources. This suggests that a larger number of URL citations to a dataset indicates higher reuse. However, we also saw that datasets that aggregate data from several sources can be popular and highly cited, while the data sources they use may not always receive citations. This indicates that indirect citations may need to be considered when assessing the true impact of a dataset in terms of its utility. Additionally, if a dataset is associated with a dedicated description paper, e.g. the Johns Hopkins University Dashboard [9] or the Epidemiological Data from the nCoV-2019 Outbreak [18], other papers that used the dataset may cite it as both URLs and papers.

One limitation of this study is that we limited our analysis to full-text articles in PMC. Although PMC is the largest full-text article repository in the biomedical domain, about one-third (5992/18 332) of LitCovid papers were not included in this study because they were unavailable at PMC. Considering that LitCovid collects articles from PubMed only, the actual number of COVID-19 articles not included in this study could be even higher. In the future, we plan to look into other sources of full-text articles to study COVID-19 dataset status. Additionally, our study did not take into account high-impact datasets cited often by preprints, such as the Public Coronavirus Twitter Dataset [141]. Furthermore, we reviewed only the URLs extracted from articles, instead of other potential types of references that could have been revealed had we reviewed the whole text. There is a chance that we missed data source information stated in plain text. We hope to resolve this problem and to expand the dataset collection by introducing natural language processing techniques in our future studies.

We screened 12 324 COVID-19-related full-text articles in PMC and collected 128 unique dataset URLs. By systematically analyzing the collected datasets in terms of content, accessibility and citation, we observed significant heterogeneity in the way these datasets are mentioned, shared, updated and cited. These findings on current practice in generating, sharing and citing datasets for COVID-19 research can provide valuable insights for future improvements.

• 128 COVID-19 datasets from 12 324 COVID-19 articles were collected for this systematic review.

• We conducted a quantitative analysis of dataset contents, accessibility and citations. 

Supplementary data are available online at Briefings in Bioinformatics.

The original data presented in the study are included in the article/supplementary materials.

UTHealth CCTS Pilot Project (0015300); National Science Foundation (OIA-1937136).

who-director-general-s-opening-remarks-at-the-mediabriefing-on-covid

Keep up with the latest coronavirus research

Coronavirus: indexed data speed up solutions

Phylogenetic network analysis of SARS-CoV-2 genomes

The FAIR guiding principles for scientific data management and stewardship

The TRUST principles for digital repositories

Virus Outbreak Data Network (VODAN)

Do text mining/retrieving full text

An interactive web-based dashboard to track COVID-19 in real time

European Centre for Disease Prevention and Control: COVID-19 case update worldwide

China Centre for Disease Prevention and Control COVID-19 Dashboard

Online repository (MOBS)

WHO Coronavirus Disease (COVID-19) Dashboard

Risk for transportation of coronavirus disease from Wuhan to other cities in China

Estimating case fatality ratio of COVID-19 from observed cases outside China

Analysis of early transmission dynamics of nCoV in Wuhan

Pattern of early human-to-human transmission of Wuhan

Open COVID-19 Data Curation Group. Open access epidemiological data from the COVID-19 outbreak

Singapore Ministry of Health

COVID19 Outbreak tracking and forecast

RTaLTmMLF_K_5EVCc/edit#gid=783518927

COVID-19 in Italy: dataset of the Italian civil protection department

Latest Situation of Coronavirus Disease (COVID-19) in Hong Kong

Impact of international travel and border control measures on the global spread of the novel 2019 coronavirus outbreak

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)

The effect of human mobility and control measures on the COVID-19 epidemic in China

Serial interval of COVID-19 among publicly reported confirmed cases

Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts

Serial interval of novel coronavirus (COVID-19) infections

The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak

Centers for Disease Control and Prevention: Cases, Data, and Surveillance

India COVID-19 Statewise Status

Tracking the impact of COVID-19 in India

Rhode Island COVID-19 response data

Oxford COVID-19 Government Response Tracker

King County COVID-19 data dashboards

Google News COVID-19

Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study

COVID-19 Projections

State-level social distancing policies in response to the 2019 novel coronavirus in the US

An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China

COVID-19) Data

COVID-19) in the UK

Israeli COVID-19 Database

COVID-19 cases in the Indian health system

Assessing differential impacts of COVID-19 on black communities

Projected early spread of COVID-19 in Africa through 1

Quantifying the impact of physical distance measures on the transmission of COVID-19 in the UK. BMC Med

Data on the 2019 Novel Coronavirus Outbreak

Introductions and early spread of SARS-CoV-2 in the New York City area

Estimates of the severity of coronavirus disease 2019: a model-based analysis

Early dynamics of transmission and control of COVID-19: a mathematical modelling study

Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: a data-driven analysis

Covid-19) Data in the United States

A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys

The transmissibility of novel coronavirus in the early stages of the 2019-20 outbreak in Wuhan: exploring initial point-source exposure sizes and durations using scenario analysis

Estimating the burden of United States workers exposed to infection or disease: a key factor in containing risk of COVID-19 infection

Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data

Estimating the infection and case fatality ratio for coronavirus disease (COVID-19) using age-adjusted data from the outbreak on the Diamond Princess cruise ship

Online forecasting of COVID-19 cases in Nigeria using limited data

Characterization of the COVID-19 pandemic and the impact of uncertainties, mitigation strategies, and underreporting of cases in South Korea, Italy, and Brazil

Day level information on covid-19 affected cases

Disease 2019 cases in Italy

GISAID

Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Chaos game representation dataset of SARS-CoV-2 genome

Repurposing didanosine as a potential treatment for COVID-19 using single-cell RNA sequencing data

Transcriptomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in COVID-19 patients

SARS-CoV-2 receptor ACE2 and TMPRSS2 are primarily expressed in bronchial transient secretory cells

SARS-CoV-2 receptor ACE2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific cell subsets across tissues

Genomic epidemiology of SARS-CoV-2 in Guangdong Province

Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States

ARTIC nanopore protocol for nCoV2019 novel coronavirus

NCBI Virus

Collection of 3D Print Models of SARS-CoV-2 virions and proteins

The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application

Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies

Network-based drug repurposing for novel coronavirus 2019-nCoV/ SARS-CoV-2

COVID-19 Cases on ECMO in the ELSO Registry

Projecting hospital utilization during the COVID-19 outbreaks in the United States

COVID-19 Ventilator Projects and Resources with FAQs

A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis

In silico identification of vaccine targets for 2019-nCoV

Inhibition of SARS-CoV-2 infections in engineered human tissues using clinical-grade soluble human ACE2

Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data

ICU capacity management during the COVID-19 pandemic using a process simulation

Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy

CORD-19: the Covid-19 open research dataset

Dataset for country profile and mobility analysis in the assessment of COVID-19 pandemic

Evidence from internet search data shows information-seeking responses to news of local COVID-19 cases

Data for understanding the risk perception of COVID-19 from Vietnamese sample

United Nations Population Fund

Estimated effectiveness of symptom and risk screening to prevent the spread of COVID-19

Open data resources for fighting COVID-19

COVID-19: a survey on public medical imaging data resources

Global Effort to Collect Data on Ventilated Patients With COVID-19

The COVID-19 global rheumatology alliance: collecting data in a pandemic

Importance of collecting data on socioeconomic determinants from the early stage of the COVID-19 outbreak onwards

Tracking social media discourse about the COVID-19 pandemic: development of a public coronavirus twitter data set