key: cord-0756516-52e0nw9a
authors: Dagliati, Arianna; Malovini, Alberto; Tibollo, Valentina; Bellazzi, Riccardo
title: Health informatics and EHR to support clinical research in the COVID-19 pandemic: an overview
date: 2021-01-18
journal: Brief Bioinform
DOI: 10.1093/bib/bbaa418
sha: 94fce17aa5c490f3281c27618d83cc1046529d33
doc_id: 756516
cord_uid: 52e0nw9a

The coronavirus disease 2019 (COVID-19) pandemic has clearly shown that major challenges and threats for humankind need to be addressed with global answers and shared decisions. Data and their analytics are crucial components of such decision-making activities. Rather interestingly, one of the most difficult aspects is the reuse and sharing of accurate and detailed clinical data collected by Electronic Health Records (EHR), even though these data are of paramount importance. EHR data, in fact, are not only essential for supporting day-by-day activities, but can also be leveraged for research and support critical decisions about the effectiveness of drugs and therapeutic strategies. In this paper, we concentrate on collaborative data infrastructures to support COVID-19 research and on the open issues of data sharing and data governance that COVID-19 has brought to light. Data interoperability, healthcare process modelling and representation, shared procedures to deal with different data privacy regulations, and data stewardship and governance are seen as the most important aspects to boost collaborative research. Lessons learned from the COVID-19 pandemic can be a strong element to improve international research and our future capability of dealing with fast-developing emergencies and needs, which are likely to become more frequent in our connected and intertwined world.

The coronavirus disease 2019 (COVID-19) pandemic has clearly shown that major challenges and threats for humankind need to be addressed with global answers and shared decisions. Data and their analytics are crucial components of such decision-making activities. To this end, several initiatives have started to allow national and international sharing of different types of COVID-19 data, including molecular data (from sequences to drug targets [https://www.covid19dataportal.org/] [1]), epidemiological data (https://www.ecdc.europa.eu/en/covid-19-pandemic) and, finally, data on policies and intervention strategies (https://www.coronanet-project.org/).

Rather interestingly, one of the most difficult aspects turned out to be the reuse and sharing of accurate and detailed clinical data collected by Electronic Health Records (EHR), even though these data are of paramount importance. EHR data, in fact, are not only essential for supporting day-by-day activities, but can also be leveraged for research and inform decisions about the effectiveness of drugs and therapeutic strategies. As also reported by Moore et al. [2], EHR and health information systems have a 2-fold nature: they must provide flexible and robust point-of-care solutions and, at the same time, they can feed clinical research with invaluable real-world data.

Two recent retractions of papers published in leading medical journals [3, 4] attest to how critical the process of clinical data collection, curation and sharing is.

The COVID-19 emergency has suddenly exposed current problems and limitations, including old and new barriers, as well as existing opportunities.

On the one hand, the pandemic has required a sudden physical reorganization of hospital wards and a full redesign of clinical workflows. Sometimes this has forced a redesign of parts of the hospital information systems, including the introduction of new terms and definitions, with a consequent delay in data collection and subsequent analysis [5]. Moreover, the pandemic has pushed telemonitoring of COVID and non-COVID patients, suddenly putting telemedicine solutions into practice [6, 7]. On the other hand, the need for coordinated multicentre studies has been made clear. Although a number of consolidated international projects and consortia had the chance to show the potential of their approaches, the lack of standardization and interoperability has still emerged as one of the major obstacles to efficient and effective data sharing. Moreover, strict privacy regulations and political concerns are slowing down international collaborations. Coordinating Institutional Review Board (IRB) processes among centres across continents, for instance, is very complex and sometimes turns out to be an insurmountable challenge.

The COVID-19 pandemic triggered an unprecedented growth of collaborative efforts, deployment of analysis frameworks and an extraordinary generation of scientific literature, much of which has been made available as preprints: nearly 8000 papers have been uploaded to medRxiv (https://www.medrxiv.org/). Given not only the rapid evolution of the COVID-19 pandemic, but also the fast pace at which novel knowledge about the disease is accumulating and new tools for refining that knowledge are being developed, conducting a systematic review appears futile.

A PubMed and Scopus search using the query 'COVID-19' AND 'Electronic Health Record*' AND 'shar*' retrieved a relatively small number of manuscripts (Supplementary Material S1) pertinent to this overview. Thus, while including this evidence, we extend the overview on the basis of our knowledge of health informatics initiatives to collect international federated EHR data based on de facto standards for clinical data sharing, such as Informatics for Integrating Biology and the Bedside (i2b2) (https://www.i2b2.org/) and Observational Health Data Sciences and Informatics (OHDSI) (https://ohdsi.org/).

Although the main purpose of the overview is to underline the importance of international and collaborative informatics infrastructures (described in the section 'Clinical and epidemiological data collected from EHRs into standardized formats'), we further investigate initiatives to create open access data portals (in 'Open Data portals'), leveraging our knowledge of institutional schemes such as the ones promoted by the European Union (EU). To the best of our knowledge, while several countries collected clinical data at a national level (https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/covid-19-response/coronavirus-covid-19-research-platform/, https://www.aihw.gov.au/covid-19), none of them made these data accessible to independent researchers outside government establishments.

As thoroughly described in [8, 9] and in [10], digital technologies for big data analytics, next-generation telecommunication networks and artificial intelligence might play a crucial role in tackling major problems related to the management and containment of the pandemic. Among these disruptive digital technologies, we explored those that might be used to gather and handle patients' data and that can potentially be integrated into health informatics systems: data lakes and blockchains. PubMed and Scopus queries and results including the key terms 'COVID-19', 'Data Lake*' and 'Blockchain*' are reported in the Supplementary Material S1.

Having introduced some of the available health informatics tools (i.e. collaborative data infrastructures, databases and digital technologies) that have the potential to accelerate discoveries about the epidemiology, pathophysiology and healthcare system dynamics of COVID-19, we discuss open challenges of data sharing and data governance and foreseeable future directions to enhance and support clinical research in the COVID-19 pandemic.

System interoperability, shared knowledge of structured processes, and common standards and terminologies are the pillars of data sharing. The actual implementation of data-sharing systems can then serve different purposes: data can be gathered and released in different formats, and data silos organized accordingly for specific aims.

As noted in [11], to date, healthcare professionals do not have access to robust data-sharing systems for large-scale, real-time analysis. This is an important limitation for epidemiological studies and for developing treatment protocols, especially given the absence of clinical trials and the obstacles to rapidly setting them up. The authors state that, 'when considering COVID-19, the insight we could gain from a pooled, publicly available dataset analysed by researchers in academic institutes and industry is invaluable and necessary'. We agree with this claim; furthermore, we underline the importance of fast access to clear and shared policies, which should be promptly communicated and easy to implement in healthcare informatics systems.

Cosgriff et al. [11] envision 'a unifying multinational COVID-19 electronic health record waiting for global researchers to apply their methodological and domain expertise'. Although the adoption of internationally shared EHR systems still seems a non-viable solution, especially due to data protection regulations, there are a few promising initiatives. Some of these are based on experiences and infrastructures developed before the COVID-19 pandemic, which were aimed at gathering clinical and epidemiological data and at making these data available to international researchers for joint studies and meta-analyses.

In the following, and in Table 1, we report several current national and international initiatives promoted by governments, academia and industry for providing shared informatics infrastructures and databases.

This overview is not meant to be exhaustive, and we are confident that many other collaborative projects will be initiated and released in the coming months.

The simplest approach to quickly release shared data is to rely on existing resources and integrate them with specific COVID-19 information. This is the approach followed by the UK Biobank (http://biobank.ndph.ox.ac.uk/showcase/exinfo.cgi?src=COVID19), which is receiving COVID-19 test data for UK Biobank participants and frequent updates on deaths, inpatient hospital admissions (including intensive care) and primary care data. The great advantage of this kind of approach is that clinical, omics and patient-generated data are already coupled in standardized formats, and that both the information infrastructures and the data protection policies needed to give access to patient-level data are already in place. Among other initiatives that leverage pre-existing communities and data-sharing networks is the one promoted by the OHDSI community (https://www.ohdsi.org/covid-19-updates/), namely the project CHARYBDIS (Characterizing Health Associated Risks, and Your Baseline Disease In SARS-COV-2) (https://data.ohdsi.org/Covid19CharacterizationCharybdis/). Data are collected from international general practice, hospital and outpatient specialist EHRs and organized following the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [12]. The project aims to describe baseline demographic and clinical characteristics, treatments and specific outcomes among individuals diagnosed with COVID-19, also using seasonal influenza infections as a benchmark. Preliminary results indicate the feasibility of exploiting these data for developing risk scores for hospital admission, intensive care unit admission and fatality [13] and for detailed characterization and phenotyping of hospitalized patients [14].

Within the 4CE consortium (https://covidclinical.net/), a large international community of researchers set up a network of 96 hospitals across five countries to answer some of the clinical and epidemiological questions around COVID-19 through data harmonization, analytics and visualizations. Contributors used the i2b2 or OMOP platforms to map their data to a CDM. The 4CE initiative showed how an international consortium was able to rapidly harmonize and integrate EHR data, thanks to close collaboration among clinicians and bioinformaticians who understood both the technical characteristics and the clinical relevance of the data. Preliminary results [15] show abnormal laboratory trajectories in several tests and discuss the importance of interoperability and data alignment.

International collaborations could certainly be facilitated by the creation of national cohorts that, as is advisable, will be made accessible to international researchers. A National Institutes of Health effort, called the National COVID Cohort Collaborative (N3C) (https://ncats.nih.gov/n3c/about), aims to build a centralized US data resource [16]. Data are systematically collected from EHRs and include clinical, laboratory and diagnostic information. Harmonized data are mapped into an OMOP CDM and shared with participating partners via a cloud-based research environment. From the health informatics point of view, the most interesting and ambitious goal of N3C is to create, on top of the data collected into a CDM, an analytics platform that enables collaborative analyses and can be reused for other diseases in the future.

Finally, the Secure COllective Research (SCOR) consortium has developed a secure infrastructure using advanced privacy and security technologies to support data federation, query and analytics in a distributed manner. SCOR is based on the Collective protection of medical data (MedCO) data analytics platform, which exploits homomorphic encryption and secure multiparty methods to ensure privacy while still enabling multicentric analysis. A total of 24 centres around the world are involved in this project [17].
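To give a flavour of how a multicentric count can be computed without any centre disclosing its own figure, the following is a minimal Python sketch of additive secret sharing, one of the simplest secure multiparty techniques. It is an illustration under our own assumptions and not the MedCO implementation: the function names, the modulus and the example counts are all hypothetical.

```python
import secrets

# Modulus for share arithmetic; an arbitrary choice made for this sketch.
PRIME = 2**61 - 1


def make_shares(value, n_parties):
    """Split an integer into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(local_counts):
    """Aggregate per-centre counts while only ever combining random-looking shares."""
    n = len(local_counts)
    # Each centre splits its own count into one share per participating centre.
    all_shares = [make_shares(count, n) for count in local_counts]
    # Centre j adds up the j-th share it received from every centre.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published; their total equals the true sum.
    return sum(partial_sums) % PRIME
```

With three hypothetical centres holding 12, 7 and 30 cases, `secure_sum([12, 7, 30])` recovers the total of 49, while each individual share is a uniformly random number that carries no information about the local count on its own.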

Data collected at patient level from EHRs have invaluable potential for accelerating knowledge discovery and for supporting clinical research and practice. Nevertheless, their value could be augmented by integrating them with ancillary information derived from open datasets on national policies, surveillance plans, socio-economic factors and population characteristics, all of which could influence both available treatments and outcomes and should therefore be evaluated as potential confounders. We reviewed some of the available data portals created ad hoc for COVID-19 and hosted by pre-existing platforms. Detailed information is reported in Table 1.

Open data best practices have been promoted and supported by the EU through several projects, among which is the European Data Portal (https://www.europeandataportal.eu/en/about/european-data-portal), containing a wide range of metadata and datasets organized according to specific categories (including economy, public sector, health, population and society). As a response to the pandemic, the portal has created a collection of datasets directly or indirectly related to COVID-19, with information on COVID-19 research, epidemiology and healthcare systems. The EU Open Data Portal (https://data.europa.eu/euodp/en/home) gives access to open data published by EU institutions and bodies, including those published by the European Centre for Disease Prevention and Control (https://www.ecdc.europa.eu/en/covid-19/data-collection). The EU Open Data Portal contains the latest available public data on COVID-19, including daily situation updates, epidemiological curves and global geographical distributions.

At a global level, the World Health Organization (WHO) provides a Coronavirus Disease Dashboard (https://covid19.who.int/), which allows rapid visualization of contagion trends over time, as well as querying and retrieval of epidemic summary statistics by country.

As previously mentioned, governments' responses to the pandemic are important factors to consider and possibly include in integrated analyses. The CoronaNet Research Project compiles a database of fine-grained actions that governments are taking to defeat the coronavirus, including travel bans and investments in the public health sector (https://www.coronanet-project.org/index.html). Thanks to the collaboration of political, social and public health science scholars, the project has collected >10 000 international policy announcements [18]. Data, reports and analytics tools can be downloaded from its site.

Another key aspect to consider in gaining valuable insight to support research on COVID-19 is the impressive amount of scientific literature produced in a very short time frame. The COVID-19 Open Research Dataset (CORD-19) (https://www.semanticscholar.org/cord19) provides a daily updated corpus of >130 000 scholarly articles about COVID-19. The goal is to support the application of natural language processing approaches and to help researchers overcome possible information overload.

Given the rapidly evolving situation and the amount of heterogeneous data collected during the pandemic, data lakes seem another promising opportunity to explore, especially for the implementation of machine learning frameworks. Data lakes are storage repositories that hold a vast amount of raw data in its native format, including both structured and unstructured data. Such storage solutions can be rapidly set up in the presence of adequate infrastructures to manage them; on the other hand, querying data lakes can be challenging due to their unstructured nature. Although it is important to underline industry efforts to release COVID-19 data lakes, for example by Amazon (https://aws.amazon.com/it/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/), Microsoft (https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-data-lake/) and C3.AI (https://c3.ai/products/c3-ai-covid-19-data-lake/), here we focus on specific matters related to the collection of medical data. As further discussed in the following sections, data governance is often associated with cumbersome processes that slow down data gathering efforts and affect data availability. In [19], the authors propose a data lake-based strategy for clinical data repositories to achieve legal interoperability for research purposes. In the COVID-19 pandemic, the proposed approach allows legally compliant datasets to be rapidly combined and made available, responding to the need to harmonize international legal requirements and supporting global collaborations in contexts where data governance heterogeneity might hinder rapid analyses and responses. Data lakes can provide fast solutions for big data collection. However, once non-harmonized data are collected, they can hardly yield meaningful insights or clinical value unless they are carefully processed. In [20], the authors propose a semantic approach based on automated ontology mapping and merging, where data lakes are used to make ontologies interoperate for learning object retrieval and reuse, from local to global ontology. In this way, data lakes were exploited to extract information from heterogeneous data sources and generate real-time statistics and reports.

Blockchains provide decentralized computational architectures and data management technology, so that actions on data (such as transactions) take place in a decentralized manner. One of their main features of interest is that they provide security, anonymity and data integrity without the control of third-party organizations [21]. Blockchains have been identified as a key technology for fighting COVID-19 in several case applications and service opportunities [22-24], such as contact tracing, supply chain management, online education, e-government and patient information sharing, where blockchains are seen as a possible solution to preserve privacy and to share clinical information quickly and accurately.
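The data integrity property mentioned above rests on chaining records through cryptographic hashes. The following minimal Python sketch illustrates only that ingredient; it deliberately omits consensus, networking and decentralization, and it does not reproduce any of the cited systems. All function names and payload fields are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """Hash a block's content together with the hash of its predecessor."""
    record = json.dumps(
        {"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block that commits to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "prev": prev_hash,
            "hash": block_hash(index, payload, prev_hash),
        }
    )
    return chain


def is_valid(chain):
    """Recompute every hash and link; any edit to an earlier block is detected."""
    prev_hash = GENESIS_HASH
    for position, block in enumerate(chain):
        expected = block_hash(block["index"], block["payload"], block["prev"])
        if (
            block["index"] != position
            or block["prev"] != prev_hash
            or block["hash"] != expected
        ):
            return False
        prev_hash = block["hash"]
    return True
```

Because each block stores the hash of the previous one, altering a logged data-sharing event after the fact invalidates the chain from that point onwards, which is what allows participants to audit the log without trusting a central party.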

As reported in [25], the use of blockchain can facilitate the creation of generalizable predictive systems in healthcare and help contain infections. The authors conducted a rigorous analysis of the opportunities and limits of blockchain adoption within the COVID-19 pandemic and conclude that blockchain, applied to the health sector, can offer effective opportunities to improve prevention activities, the management of clinical risk, patient data and EHR data, as well as scientific research and the dissemination of scientific knowledge. Fusco et al. [25] underline how the adoption of blockchain systems as a bridge to ensure cross-communication might overcome the issue of interoperability among different EHR systems and allow the rapid collection and sharing of healthcare data while respecting privacy and security. Also pertinent to the focus of this overview, we want to highlight the opportunities of using blockchain to improve the exchange of health records, especially to enhance patient-centric interoperability and user-centred medical research [26] and to ensure secure and effective EHR data sharing that allows personal medical data to remain under the control of the patient [27].

The COVID pandemic made clear that the health informatics community agrees on, and strongly demands, unified frameworks for sharing and exchanging digital epidemiological data and, in accordance with data protection regulations, for facilitating the flow of information between health workers, stakeholders, policy makers and the public.

The demand for digital data sharing also raised some crucial discussion points.

Beyond obvious issues related to different international health systems and organizations, a substantial lack of cohesive data models in EHR and poor interoperability emerged during the pandemic, spotlighting a very well-known weakness of health information technology (IT) infrastructures [28].

In fact, while noteworthy efforts have been devoted to syntactic and semantic interoperability over the last 50 years (http://www.hl7.org, https://loinc.org/, http://www.snomed.org/, https://www.npu-terminology.org/), the way in which individual-level data are collected and coded can be extremely different even between institutions within the same country.

Even if there is general agreement that FAIR data principles (Findability, Accessibility, Interoperability and Reusability) should also apply to EHR data, practice often reveals a disappointing reality [29, 30].

Data sharing initiatives could be dramatically limited by such heterogeneity of data formats and standards. As a result, data require a long and painful pre-processing phase, consisting of mapping variables between coding standards and releases, before they can be shared with others to contribute to multicentric studies.
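The variable mapping step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the site names, local codes and target concept labels are invented placeholders, not entries from any real terminology, and real mappings also have to deal with units, versions and one-to-many correspondences.

```python
def harmonize(records, mapping):
    """Attach a shared concept to each record; set aside codes with no mapping.

    records: list of dicts with at least 'site' and 'local_code' keys.
    mapping: dict from (site, local_code) to a shared concept label.
    """
    mapped, unmapped = [], []
    for record in records:
        key = (record["site"], record["local_code"])
        if key in mapping:
            # Keep the original fields and add the harmonized concept.
            mapped.append({**record, "concept": mapping[key]})
        else:
            # Unmapped codes are returned for manual curation, not dropped.
            unmapped.append(record)
    return mapped, unmapped


# Hypothetical example: two sites code the same laboratory test differently.
EXAMPLE_MAPPING = {
    ("site_a", "CRP_SERUM"): "CONCEPT_C_REACTIVE_PROTEIN",
    ("site_b", "LAB-0042"): "CONCEPT_C_REACTIVE_PROTEIN",
}
EXAMPLE_RECORDS = [
    {"site": "site_a", "local_code": "CRP_SERUM", "value": 12.5},
    {"site": "site_b", "local_code": "LAB-0042", "value": 8.1},
    {"site": "site_b", "local_code": "LAB-9999", "value": 3.0},
]
```

Calling `harmonize(EXAMPLE_RECORDS, EXAMPLE_MAPPING)` returns the two records that share a concept together with the one record whose local code still needs curation, which is typically where most of the pre-processing effort goes.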

Studies of the COVID-19 pandemic are affected more than others by extreme heterogeneity in data standardization. This is mostly due to the rapid spread and evolution of the epidemic and to the limited time that research institutes had to organize data collection in a homogeneous way and to define and share vocabularies representing a common 'core' set of new concepts.

To date, initiatives to improve interoperability among systems for the diagnosis and treatment of COVID-19 are still relatively limited and are driven mostly by the collaborative efforts presented in the previous sections of our paper. Standards for data models have been exploited, such as i2b2 for the i2b2-Accrual to Clinical Trials (ACT) ontology, reported in (...). Another interesting initiative has been reported by Mishra et al. [31]. The authors adapted the FHIR-based architecture for infectious disease surveillance for sexually transmitted diseases to the general problem of outbreaks, showing the potential of this emerging technology in the COVID-19 scenario. Shared data models based on openEHR modelling, such as the one presented in [32], should be more widely exploited to facilitate data exchange.

Let us finally note that interoperability mostly involves EHR vendors and providers. Effective data exchange will be possible only if different companies work together to implement new strategies for systems communication, thus allowing patients' data, currently trapped in single EHR silos, to become easily available to clinicians, researchers and patients themselves for coordinated analyses and actions. Ad hoc informatics infrastructures to share EHR data would probably need to be jointly supported by governments and by public and private entities to facilitate this process, as foreseen for genomics in [33].

EHR data are often gathered in a way that is blind to the underlying clinical processes. Although this is somehow manageable during routine practice, in emergency situations it represents a critical bottleneck for assessing patients' risks and tailoring interventions. Difficulties in gathering EHR data into cohesive narratives imply partial views of patients' risk and the loss of the essential timeline of health data [34]. For this reason, it is always important to have a formal representation of the healthcare process underlying the EHR that explicitly describes the context and assumptions of the specific implementation [35].

In addition to being capable of sharing patients' careflows, EHR workflows should be flexible, embed information on clinical practice processes and quickly adapt to sudden changes in clinical processes and guidelines.

An important step towards building process-aware systems is to learn processes in an objective way, possibly from the EHR data themselves. Analytics techniques such as process mining [36] allow for the discovery, conformance checking and enhancement of processes, establishing a strong relation between the process model and the reality captured by the data. If processes were learned in a systematic and structured way, they could be easily integrated into hospital information systems, thus accelerating the response to unexpected changes in clinical practice and easing the adaptation of EHR workflows to emergencies.
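As a hedged illustration of the discovery step, the sketch below computes a directly-follows graph, a basic structure that many process discovery approaches build on, from a timestamped event log. It is a toy example in Python rather than the method of [36]; the event log layout (case identifier, timestamp, activity) and the activity names are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count how often activity a is immediately followed by b within a case.

    event_log: iterable of (case_id, timestamp, activity) tuples, where
    timestamps are comparable values (here, hours since admission).
    """
    traces = defaultdict(list)
    for case_id, timestamp, activity in event_log:
        traces[case_id].append((timestamp, activity))

    counts = Counter()
    for events in traces.values():
        # Order each patient's events in time before pairing neighbours.
        events.sort(key=lambda event: event[0])
        activities = [activity for _, activity in events]
        counts.update(zip(activities, activities[1:]))
    return counts


# Hypothetical log with two patients following partly different careflows.
EXAMPLE_LOG = [
    ("p1", 0, "triage"),
    ("p1", 2, "swab"),
    ("p1", 5, "ward admission"),
    ("p2", 1, "triage"),
    ("p2", 3, "swab"),
    ("p2", 4, "ICU admission"),
]
```

In the example log, the transition from triage to swab is observed for both patients, whereas the two admission pathways are each observed once; comparing such counts before and after a reorganization is one simple way to see how a careflow has changed.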

Process awareness is often fragmentary and incomplete, as are the data from which it is derived. This is an issue that health informaticians face even for chronic diseases that have been studied, and for which data have been collected, for decades. To efficiently study the processes related to a novel disease with a global spread such as COVID-19, it is necessary to create evidence from a large amount of structured data collected in a very short time period but at an international level. This is an unparalleled research opportunity to create shared systems for global-scale analyses of clinical and process data, especially if these repositories can be coupled with biological knowledge and omics data.

The global spread of COVID-19 raised questions and challenges regarding privacy and security compliance. Among these, it is worth mentioning the substantial differences in data sharing and protection regulations among countries (e.g. the General Data Protection Regulation [https://eur-lex.europa.eu/eli/reg/2016/679/oj] in Europe and the Health Insurance Portability and Accountability Act [37] in the United States) and the involvement of businesses that are not regulated by privacy rules but can collect information potentially useful for facing the COVID-19 pandemic. Besides these aspects, there are also potential cybersecurity issues that entities should account for.

As a consequence, health organizations are exceedingly risk-averse towards data sharing, even when regulations permit such activities. For this reason, some European groups [38] acknowledged 'an ethical obligation to use the research exemption clause of the European General Data Protection Regulation during the COVID-19 pandemic' in order to share digital health data for research purposes and support global collaborative health research efforts.

Research studies involving human subjects require IRB approval by local institutions, and this process follows internal procedures. IRB protocol preparation and approval processes typically differ between institutions, even within the same country. The key differences lie in the structure and contents of the documentation, the bureaucratic steps needed to prepare it, the time required by the institution to process and approve it, and the number and nature of the changes requested by the IRB. Such differences represent a limitation for multicentric studies, since each participating centre must deal with the IRB approval procedure independently according to internal regulations, causing a lack of synchronization between institutions and possible delays in starting research activities [39].

IRB approval is often tailored to specific scientific questions, but these can evolve rapidly, especially when dealing with new research challenges like the COVID-19 pandemic. There is therefore a need to be able to rapidly update IRB approvals based on new requirements. For example, the possibility of obtaining approvals for new specific tasks proposed in the context of a research activity that is already under way could facilitate and speed up the whole process.

As a consequence of the spread of research initiatives on the COVID-19 pandemic, research institutions are often involved in multiple activities, either by sharing data about local patients with different consortia, such as the 4CE consortium (https://covidclinical.net/) and the N3C partnership (https://ncats.nih.gov/n3c/about), or by starting independent internal research projects. Data from the same patients are typically shared between multiple studies, increasing the probability of non-independence of findings deriving from apparently unrelated research initiatives. Also, datasets are often generated incrementally and shared or analysed according to temporal 'releases' of increasing size, depending on the cumulative number of new cases collected.

Once published, results (summary statistics, odds ratios, etc.) from multiple scientific studies addressing the same question are often combined through meta-analysis, 'a quantitative, formal, epidemiological study design used to systematically assess previous research studies to derive conclusions about that body of research' [40]. By combining results, meta-analysis increases the statistical power of the planned analyses. In this context, the potential non-independence of the findings reported in the studies considered could represent a serious limitation, and researchers should account for this possible bias when performing meta-analyses.
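The effect of non-independence can be made concrete with a small numerical sketch. The Python function below implements standard inverse-variance fixed-effect pooling, which is only one of several meta-analytic approaches and is not prescribed by [40]; the effect estimates and standard errors used in the accompanying example are invented for illustration.

```python
import math


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance fixed-effect pooling of study-level estimates.

    Each study is weighted by 1 / SE^2; the pooled standard error is
    sqrt(1 / sum of weights). Independence between studies is assumed.
    """
    weights = [1.0 / se**2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se
```

Suppose a hypothetical study reports a log odds ratio of 0.2 with a standard error of 0.1. If a second publication based on the same patients is entered as if it were independent, `fixed_effect_pool([0.2, 0.2], [0.1, 0.1])` returns the same estimate but a standard error reduced by a factor of the square root of two, that is, an apparent gain in precision that no new data support.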

Starting from these observations, there is a clear need for ad hoc systems able to keep track of the different data releases generated by centres and shared between independent research initiatives. By keeping track of data versioning and sharing history, researchers could have a more precise overview of the data used in different analyses, allowing them to distinguish between independent and non-independent studies and thus reduce potential methodological bias.
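What such a tracking system might record can be sketched in a few lines of Python. This is a deliberately simplified illustration and not a description of any existing tool: the class, its methods, the salted-hash pseudonymization and the centre, study and patient identifiers are all hypothetical, and a real system would need proper key management and governance around it.

```python
import hashlib


def pseudonymize(patient_id, salt):
    """Replace a local identifier with a salted SHA-256 digest."""
    return hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()


class ReleaseRegistry:
    """Track which data releases exist and which studies have used them."""

    def __init__(self):
        self._releases = {}  # (centre, version) -> set of pseudonymized ids
        self._usage = {}     # study name -> set of (centre, version)

    def register_release(self, centre, version, patient_ids, salt):
        """Record the pseudonymized content of one release of one centre."""
        self._releases[(centre, version)] = {
            pseudonymize(pid, salt) for pid in patient_ids
        }

    def record_use(self, study, centre, version):
        """Note that a study analysed a given, previously registered release."""
        if (centre, version) not in self._releases:
            raise KeyError(f"unknown release: {centre} {version}")
        self._usage.setdefault(study, set()).add((centre, version))

    def patients_in(self, study):
        """Union of pseudonymized patients over all releases used by a study."""
        ids = set()
        for key in self._usage.get(study, ()):
            ids |= self._releases[key]
        return ids

    def shared_patients(self, study_a, study_b):
        """Number of patients contributing to both studies."""
        return len(self.patients_in(study_a) & self.patients_in(study_b))
```

If a centre registers a first release with two patients and a second, cumulative release with three, and two studies each use one of them, `shared_patients` reports an overlap of two, flagging that the studies should not be treated as independent in a later meta-analysis.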

Rather interestingly, the importance of data governance and data stewardship in data reuse was strongly advocated by the International Medical Informatics Association (IMIA) some years ago [41, 42]. IMIA proposed the data steward as a key actor to 'convey a fiduciary (or trust) level of responsibility toward the data'. Moreover, 'data governance is the process by which responsibilities of stewardship are conceptualized and carried out'. We can conclude that the key responsibilities of the data steward are not only to comply with privacy regulations, but also to keep track of the different ways in which the same data sources are used and of how the evidence based on these data is extracted.

The COVID-19 pandemic requires multi-institutional data sharing strategies able to deal with the manifold challenges that society is facing. Among them, the availability of reliable clinical data can further boost understanding of the disease by deepening insights into its time-varying nature, enabling investigation of the impact of different therapeutic strategies and, finally, informing decision-making.

From the researchers' perspective, the possibility of integrating EHR-derived information about patients' disease condition, treatments, interventions and clinical exams with other data sources is of paramount importance for a deeper comprehension of COVID-19 disease mechanisms and severity manifestations. Overmyer et al. [43] adopted a multiomic approach by quantifying thousands of different biomolecules from patients with and without COVID-19 in relation to their disease severity and outcomes. The integration of multiomics data showed good performance in predicting COVID-19 severity, while also highlighting informative features. A web-based tool (covidomics.app) allows the scientific community to further explore the generated data. The Severe COVID-19 genome-wide association study (GWAS) Group [44] performed a genome-wide case-control association study on severe COVID-19 patients and controls and identified a cluster of genes representing a genetic susceptibility locus in patients with respiratory failure. Shen et al. [45] performed proteomic and metabolomic analysis of COVID-19 sera, identified differentially expressed factors correlating with disease severity and showed dysregulation of multiple immune and metabolic components in clinically severe patients. Taken together, these results confirm the relevance of an integrated approach to COVID-19 disease characterization.

There are a number of important issues that need to be addressed to achieve such ambitious goals. First, the level of interoperability of EHR, both syntactic and semantic, is still far too low. Even if noteworthy efforts have been made over the last 50 years, ontologies, terms, languages and criteria of EHR design are still not properly standardized. This is not due to the unavailability of adequate solutions [46, 47], but rather to the combination of two factors: an underestimation of interoperability needs in the design of EHR, sometimes due to lock-in policies of software vendors [48], and an insufficient capability of describing and formally representing healthcare processes and their information needs. The latter aspect is of fundamental importance to achieve interoperability and data interpretation in a comprehensive manner, considering the very nature of clinical data and their operational contexts. It is important to mention that the COVID-19 emergency has made this problem even more difficult, since hospitals and healthcare providers had to suddenly change their organization and careflows. As reported in our paper, other threats to successful data sharing are represented by different privacy regulations in different world regions, as well as by the data protectionism that has recently emerged [49]. However, new and more flexible forms of informed consent, as well as a more proactive and receptive role of IRB committees, are supporting international data sharing initiatives even in this complex scenario. Finally, data sharing also implies the introduction of stricter control on the process, not only to comply with privacy regulations, but also to avoid uncontrolled use of data for deriving evidence. To this end, a promising strategy is the introduction of a data steward who may apply data governance policies to support institutions in overseeing data sharing not only from a legal viewpoint, but also from a holistic perspective, for the benefit of individuals and of society. Data stewards may also support the more general aim of data quality, which is currently left to the policies of the individual research networks.

Finally, it is important to remark that EHR and hospital data are not the only sources of clinical data that are important for understanding the determinants of disease evolution. It is now clear that, in the challenge of managing the pandemic, the better primary and secondary care have been coordinated with each other and with social care organizations, the more effective the control has been [50]. The IT infrastructure in place to support such coordination is able to collect data that can be extremely useful for research purposes. For example, in the UW Medicine initiative, primary and secondary care have been strongly linked for the benefit of a better local management of the pandemic [51]. The collection of pre-hospital data has been demonstrated to be an important source of information for clinical research, as shown in [52]. More work on such integration can be extremely valuable to improve the quality of data interpretation.

Lessons learned from the COVID-19 pandemic can be a strong element to improve international research and our future capability of dealing with fast-developing emergencies and needs, which are likely to be more frequent in our connected and intertwined world.

• The coronavirus disease 2019 (COVID-19) pandemic has clearly shown that data and their analytics are crucial for handling this world-wide emergency.

• An important but difficult aspect is related to sharing of accurate and detailed clinical data collected by Electronic Health Records (EHR).

• EHR data are not only essential for supporting day-by-day activities, but they can also leverage research and support critical decisions about effectiveness of drugs and therapeutic strategies.

• In this paper, we concentrate our attention on the current efforts related to collaborative data infrastructures to support COVID-19 research and on the open issues related to data sharing and data governance that COVID-19 has brought to light.

• Data interoperability, healthcare processes modelling, different data privacy regulations, and data stewardship and governance are seen as the most important aspects to boost collaborative research.

Supplementary data are available online at Briefings in Bioinformatics.

Phylogenetic network analysis of SARS-CoV-2 genomes

Ideas for how informaticians can get involved with COVID-19 research

Retraction: cardiovascular disease, drug therapy, and mortality in COVID-19

Retraction: hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis

Mitigating the effects of a pandemic: facilitating improved nursing home care delivery through technology

Leveraging health system telehealth and informatics infrastructure to create a continuum of services for COVID-19 screening, testing, and treatment

Rapid design and implementation of an integrated patient self-triage and self-scheduling tool for COVID-19

Digital technology and COVID-19

Emerging technologies to combat the COVID-19 pandemic

A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact

Data sharing in the era of COVID-19

Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership

Seek COVER: development and validation of a personalized risk calculator for COVID-19 outcomes in an international network

Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study

International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium

NIH launches platform to serve as depository for COVID-19 medical data

SCOR: a secure international informatics infrastructure to investigate COVID-19

COVID-19 Government Response Event Dataset (CoronaNet v.1.0)

Policy-aware data lakes: a flexible approach to achieve legal interoperability for global research collaborations

Towards an ontology proposal model in data lake for real-time COVID-19 cases prevention

An overview of blockchain technology: architecture, consensus, and future trends

The role of blockchain to fight against COVID-19

How can blockchain help people in the event of pandemics such as the COVID-19

Survey of decentralized solutions with mobile devices for user location tracking, proximity detection, and contact tracing in the COVID-19 era

Blockchain in healthcare: insights on COVID-19

Managing patient medical record using blockchain in developing countries: challenges and security issues

Health information exchange with blockchain amid COVID-19-like pandemics

Digital health and the state of interoperable electronic health records

Providing an integrated access to EHR using electronic health records aggregators

Designing a system for patients controlling providers' access to their electronic health records: organizational and technical challenges

Public health reporting and outbreak response: synergies with evolving clinical standards for interoperability

Development of an openEHR template for COVID-19 based on clinical guidelines

Including all voices in international data-sharing governance

Physics of the medical record: handling time in health record studies

CLIN-IK-LINKS: a platform for the design and execution of clinical data transformation and reasoning workflows

Process mining

The politics of the Health Insurance Portability and Accountability Act

COVID-19: putting the General Data Protection Regulation to the test

Time required for Institutional Review Board review at one veterans affairs medical center

Meta-analysis in medical research

Trustworthy reuse of health data: a transnational perspective

Health data use, stewardship, and governance: ongoing gaps and challenges: a report from AMIA's 2012 Health Policy Meeting

Large-scale multi-omic analysis of COVID-19 severity

Genomewide association study of severe Covid-19 with respiratory failure

Proteomic and metabolomic characterization of COVID-19 patient sera

The characteristics and capabilities of the available open source health information technologies supporting healthcare: a scoping review protocol

Evaluation of openEHR repositories regarding standard compliance

Implications of an emerging EHR monoculture for hospitals and healthcare systems

Digital protectionism? Antitrust, data protection, and the EU/US transatlantic rift

Strengthening the UK primary care response to COVID-19

Responding to COVID-19: the UW medicine information technology services experience

COVID-19 preliminary case series: characteristics of EMS encounters with linked hospital diagnoses

The authors gratefully acknowledge Luigia Scudeller (IRCCS Policlinico di Milano), Antonio Bellasi (Hospital Papa Giovanni XXIII, Bergamo) and John H. Holmes (University of Pennsylvania, Philadelphia) for the insightful comments and discussions. The authors also thank the 4CE consortium for giving us the opportunity to participate in this important international initiative.