key: cord-0932875-jmgkjjzm
authors: Bastarache, Lisa; Brown, Jeffrey S.; Cimino, James J.; Dorr, David A.; Embi, Peter J.; Payne, Philip R.O.; Wilcox, Adam B.; Weiner, Mark G.
title: Developing real-world evidence from real-world data: Transforming raw data into analytical datasets
date: 2021-10-14
journal: Learn Health Syst
DOI: 10.1002/lrh2.10293
sha: 02c38fab497ced5bed7c001d8e257e2dd31e6815
doc_id: 932875
cord_uid: jmgkjjzm

abstract: Development of evidence-based practice requires practice-based evidence, which can be acquired through analysis of real-world data from electronic health records (EHRs). The EHR contains volumes of information about patients, including physical measurements, diagnoses, exposures, and markers of health behavior, that can be used to create algorithms for risk stratification or to gain insight into associations between exposures, interventions, and outcomes. But to transform real-world data into reliable real-world evidence, one must not only choose the correct analytical methods but also understand the quality, detail, provenance, and organization of the underlying source data, and address differences in these characteristics across sites when conducting analyses that span institutions. This manuscript explores the idiosyncrasies inherent in the capture, formatting, and standardization of EHR data and discusses the clinical domain and informatics competencies required to transform raw clinical, real-world data into high-quality, fit-for-purpose analytical data sets used to generate real-world evidence.

EHRs are a valuable resource for researchers. They can be analyzed with a variety of methods, from multivariate regression to machine learning, and may be used to support both cross-sectional and longitudinal studies. 1-3 But regardless of study design and method, all EHR-based research requires recognition of data quality issues 4 as well as data curation and cleaning. 5 Raw EHR data must be structured into analytical datasets that are formatted to fit the input requirements of an analytical function. This task is difficult given the chaotic nature of EHR data, which are collected for nonresearch purposes, namely patient care and hospital administration. 6 Challenges include erroneous or ambiguous recording of information, ascertainment bias, the temporary or provisional nature of some findings, and shifting context that can change the meaning of a given data point.

Decisions about the data included in the EHR are rarely, if ever, neutral. Facts about patients are almost exclusively recorded when they are in contact with the healthcare system, and the nature of what is recorded depends heavily on who cares for the patient, the intended use of the information, and why the patient is seeking care in the first place. 7 If a patient goes to the emergency room for a broken arm, the fact that they have insomnia is unlikely to be recorded. If they go to a neurologist, that diagnosis is much more likely to be noted. Rather than causing these conditions, a primary care visit is simply an opportunity to ascertain their existence. Ascertainment bias can be difficult to detect and may have profound consequences for downstream analyses. The transformation of raw data into analytical data sets requires multiple decisions, and it is difficult to know a priori which decisions will impact an analysis.
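To make the shape of this transformation concrete, the following is a minimal sketch (in Python with pandas, using small hypothetical tables and codes chosen only for illustration) of turning long-format EHR extracts into a one-row-per-patient analytical dataset. Every choice embedded in it, such as which codes count as evidence or which summary statistic represents a repeated laboratory value, is one of the decisions discussed in the sections that follow.

```python
import pandas as pd

# Hypothetical long-format EHR extracts: one row per recorded event.
diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd10":      ["E11.9", "E11.9", "I10", "E11.9"],
    "date":       pd.to_datetime(["2020-01-05", "2020-03-02", "2020-02-10", "2020-04-01"]),
})
labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "loinc":      ["4548-4", "4548-4", "4548-4"],   # hemoglobin A1c
    "value":      [7.2, 6.9, 5.4],
    "date":       pd.to_datetime(["2020-01-05", "2020-06-01", "2020-02-10"]),
})

# One analytical row per patient: a disease evidence count and a summarized
# lab value. Both choices below (distinct-date code count, median as the
# summary statistic) are modeling decisions, not neutral defaults.
dx_counts = (diagnoses[diagnoses["icd10"].str.startswith("E11")]
             .groupby("patient_id")["date"].nunique()
             .rename("t2dm_code_days"))
a1c_median = (labs[labs["loinc"] == "4548-4"]
              .groupby("patient_id")["value"].median()
              .rename("a1c_median"))

analytical = (pd.concat([dx_counts, a1c_median], axis=1)
              .fillna({"t2dm_code_days": 0})
              .reset_index())
print(analytical)
```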
Making these decisions should be a collaborative process, one that combines the data scientist's understanding of the underlying data with the clinician's domain expertise. However, these decisions are often delegated to the data scientist alone. 8 When clinical expertise is sought, the expert may make decisions that are clinically correct but nevertheless inappropriate for the analysis because they are not informed by the actual content of the data. Conversely, the data scientist may make decisions that are technically reasonable but insensitive to clinical realities. The following sections describe issues that the data scientist and the domain expert must recognize to produce high-quality research. The sections are organized by the data types commonly used in analytical datasets (diagnoses, measurements, and medications) and describe common issues that arise in using real-world EHR data for research. The examples provided below will have varying relevance to a specific research application of EHR data, depending on the study design and the input requirements of the analytical methods. But we hope that they convey a sense of the complexity of EHR data and emphasize the importance of data literacy and the need for collaboration between clinical domain experts and data scientists in EHR research.

Analytical data sets frequently require information on the presence or absence of diagnoses. To populate such fields, data scientists often look for assertions of a particular diagnosis, using data like International Classification of Diseases (ICD) codes, problem lists, or text mentions in notes. However, information in these data sources can be misleading. Errors may be introduced through simple typos or miscommunications. But even accurately recorded information can be misleading because of the nature of diagnosis itself. Rather than a statement of fact that can be expressed as a yes/no value, a diagnosis is a statement of probability that can fluctuate over time. For some diseases, there is disagreement even among experts about the correct clinical criteria that should be used to establish a diagnosis. 9 Because the EHR captures data throughout the diagnostic process, a patient's record may accumulate evidence for a diagnosis that is ultimately ruled out.

Diagnoses can be extracted from clinical notes using natural language processing. 10-12 Unstructured text contains detailed information about patients that may be essential to accurately ascertain a disease phenotype, but structured data like ICD codes are often used as an alternative. Compared with unstructured text, ICD codes are easy to manipulate and are a ubiquitous component of EHR systems. However, data scientists should be aware of potential biases. Despite concerted efforts to standardize ICD coding across health systems, financial incentives and clinician styles can distort coding practices. 13 Moreover, ICD codes are subject to semantic drift, whereby the meaning of a code changes over time. 14 The transition from ICD-9 to ICD-10 in 2015 led to a heightened awareness of this problem, and several studies demonstrated changes in the frequencies of some diagnoses during that period. 15,16 But semantic drift can happen for more subtle reasons, including minor revisions to the ICD coding structure or changes in the local EHR tools used to translate unstructured text to ICD codes. 17
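Semantic drift can sometimes be caught with a simple screen before any modeling begins. Below is a hedged sketch (Python/pandas, with hypothetical column names, expecting a datetime `date` column) that tracks the monthly prevalence of a code family and flags abrupt jumps relative to a trailing baseline. The flagging heuristic is an assumption for illustration; a flagged month is a prompt for human review, not proof of drift.

```python
import pandas as pd

def monthly_code_prevalence(dx: pd.DataFrame, code_prefix: str) -> pd.Series:
    """Fraction of patients seen each month who received a code with the prefix.

    Expects columns: patient_id, code, date (datetime dtype).
    """
    dx = dx.assign(month=dx["date"].dt.to_period("M"))
    seen = dx.groupby("month")["patient_id"].nunique()
    coded = (dx[dx["code"].str.startswith(code_prefix)]
             .groupby("month")["patient_id"].nunique())
    return (coded / seen).fillna(0.0)

def flag_shifts(prev: pd.Series, window: int = 6, ratio: float = 1.5) -> pd.Series:
    """Return months where prevalence exceeds the trailing median by `ratio`.

    A crude screen for drift, new-code introductions, or guidance changes.
    """
    baseline = prev.rolling(window, min_periods=3).median().shift(1)
    return prev[baseline.notna() & (prev > ratio * baseline)]
```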
To deal with the inherent ambiguity of diagnoses in the EHR, supporting information can be used to improve accuracy. Researchers can use information redundancy to refine a phenotype. For example, for phenotypes based on diagnostic codes, requiring more than one code recorded on different dates can improve accuracy. 18 Supporting evidence from orthogonal data sources may also be useful: a diabetes mellitus phenotype may be improved by incorporating objective laboratory parameters such as an elevated hemoglobin A1c. Clinical interventions can likewise be used to refine a phenotype. For example, a case of hypothyroidism might be defined as having both diagnostic code evidence and a prescription for levothyroxine.

FIGURE 1: Semantic shift in an International Classification of Diseases (ICD) code causes a shift in prevalence. The National Center for Health Statistics (NCHS) and the Centers for Medicare and Medicaid Services (CMS) periodically add new codes and update guidance on the use of existing ICD-CM codes, which can radically shift coding practices at an institution. Shown is an example of the change in prevalence of codes relating to shock at a 200-bed hospital.

The above strategies are not foolproof. For example, a patient with type 2 diabetes may have normal laboratory values because their disease is controlled with medication (eg, if blood glucose levels are well controlled in a patient with diabetes, the hemoglobin A1c may be normal). 19,20 Requiring the presence of treatment as a marker for disease can limit the cohort to treated patients. Furthermore, given off-label usage, the data scientist must be careful about assuming that the appearance of a medication means the patient has the expected condition. Finally, the markers within the EHR that are used to increase certainty may also correlate with disease severity. This conflation of certainty with severity does not invalidate the phenotype but should be recognized in the interpretation of analyses that apply it. For these reasons and others, even a carefully designed phenotype algorithm will inevitably generate false positives and false negatives. Adjustments designed to improve the positive predictive value, such as requiring multiple instances of a diagnosis or a supporting laboratory value, often come at the expense of the negative predictive value, or vice versa. Figure 2 examines the overlap of different sources of evidence for a diagnosis at two sites, illustrating that they do not always agree. The degree of overlap among phenotype components may vary from site to site, suggesting that the accuracy of the phenotype may vary across institutions.

FIGURE 2: Venn diagrams of the overlap of suggestive diagnoses, medications, and laboratory results for type 2 diabetes at two different institutions. The different ratios of overlap of data elements for the diabetes computable phenotype suggest that algorithm behavior differs between the two sites. These site-level differences in the proportions of patients with different markers of disease may be accompanied by differences in other characteristics that may impact the performance of predictive algorithms developed at one site and applied at another.
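As an illustration of these refinement strategies, here is a minimal sketch (Python, with hypothetical per-patient feature names) of a composite type 2 diabetes rule that combines the two-code requirement with orthogonal laboratory and medication evidence. The OR-of-paths structure and the medication choice are assumptions for the example, not a validated algorithm; the A1c cutoff of 6.5% is the standard diagnostic threshold.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientEvidence:
    t2dm_code_days: int        # distinct dates with a type 2 diabetes code
    max_a1c: Optional[float]   # highest recorded hemoglobin A1c, if any
    on_metformin: bool         # any prescription for a suggestive medication

def is_t2dm_case(p: PatientEvidence) -> bool:
    """Composite rule: redundancy (two code-days) or one code-day plus
    orthogonal evidence (elevated A1c or a suggestive medication)."""
    if p.t2dm_code_days >= 2:
        return True
    if p.t2dm_code_days == 1:
        if p.max_a1c is not None and p.max_a1c >= 6.5:
            return True
        if p.on_metformin:
            return True
    return False

# A treated, well-controlled patient with a single coded visit still
# qualifies via the medication path; an untreated patient with one code
# and a normal A1c does not. Both behaviors are deliberate trade-offs.
print(is_t2dm_case(PatientEvidence(1, 5.6, True)))   # True
print(is_t2dm_case(PatientEvidence(1, 5.6, False)))  # False
```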
Analytic data sets often require laboratory results. EHRs typically store this information in structured fields, so retrieving the data is relatively trivial. In contrast to the uncertainty often attached to diagnoses, laboratory measurements give the impression of relative objectivity and computability. However, laboratory results have their own idiosyncrasies. Patients often have multiple measurements for a given laboratory test, but analytical datasets often require a single value per person, as is the case when a measurement is used as a covariate in a multivariate regression. This leads to a common conundrum: What is the best way to summarize multiple data points into a single value? Should the median value be used? The maximum? The earliest or most recently measured? How should the timing and cadence of the measurements factor into the decision? The answer has downstream consequences. A study of EHR laboratory results showed that the way values are summarized affects the ability to replicate known genetic associations: for most laboratory tests the median value performed best, but for some the maximum or the first value performed best. 21 Related issues exist with longitudinal analyses of multiple measurements, where the focus is on change in laboratory parameters over time. While a change of a certain magnitude between two individual values may be easy to define, it is more difficult to assess a sustained change from baseline. As with many EHR-phenotyping conundrums, the question of how best to work with repeated measurements has no simple answer; it depends on the specific laboratory test and the intended use of the values.

Laboratory values can also be misleading when they are interpreted without an understanding of their context. Some laboratory tests, such as a comprehensive metabolic panel or blood count, are ordered routinely, while others are ordered in response to patient complaints or to follow up a previously abnormal result. Therefore, the presence of some laboratory results in a patient's record may increase the likelihood that the patient has a particular disease, even if the test result is normal. For example, a patient with multiple test results for phenylalanine levels is almost certainly being evaluated or monitored for phenylketonuria, regardless of whether the values are normal.

FIGURE 3: Differences in the mean values of common laboratory results measured in the inpatient vs outpatient setting.

Vital signs (temperature, heart rate, blood pressure, respiratory rate, height, weight) are typically collected at ambulatory encounters, which can be regularly scheduled or sporadic, sometimes prompted by specific patient concerns and other times because the patient presented for a routine checkup. Inpatient vital signs are typically, but not necessarily, ordered on a schedule related to severity of illness: once a day, once a shift, or many more times per day. Vital signs are sometimes obtained by manual methods and sometimes measured through automated devices at much greater frequency, including continuous telemetry. Some of the challenges associated with using vital signs are similar to those of laboratory results, including multiple measurements, context-dependent interpretation, incorrect data transcription, and terminology mapping. But vital signs are distinguished by the density of the data. Some vital signs, like respiratory rate and temperature, can be measured on a near-continuous basis during a hospital stay. Vital sign values, particularly automated measures, can carry a great deal of noise, so summary measures that represent the maximum or minimum may be affected by sporadic, erroneously high or low values. Given the potential for large numbers of vital signs to be available in the source data, operational decisions are sometimes made to reduce the volume of data stored in data warehouses and recorded in EHRs. The methods by which the source data are filtered can differ; for example, a feed may retain only the highest, lowest, and median value for a given time interval. Some vital signs, like heart rate and blood pressure, are typically measured with reasonable precision, while others, notably respiratory rate, are often estimated, resulting in unexpected uniformity in many recorded values. 25
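The following sketch (Python/pandas, hypothetical column names) shows how several candidate summaries of a repeated measurement can be computed side by side, plus a trimmed variant that blunts the effect of sporadic erroneous extremes. The trimming is one possible mitigation, assumed here for illustration; which column to carry into the analytical dataset remains a study-specific decision.

```python
import pandas as pd

def summarize_measurements(df: pd.DataFrame) -> pd.DataFrame:
    """Per-patient summaries of a repeated measurement.

    Expects columns: patient_id, value, date (datetime). Returns one row per
    patient with several candidate summaries; choosing among them is an
    analytical decision with downstream consequences.
    """
    df = df.sort_values("date")
    g = df.groupby("patient_id")["value"]
    out = pd.DataFrame({
        "median": g.median(),
        "max": g.max(),
        "first": g.first(),   # earliest value, since rows are date-sorted
        "last": g.last(),
        "n_obs": g.size(),
    })
    # Trimmed maximum: drop each patient's top and bottom 5% of values first,
    # so a single mistranscribed reading does not define the summary.
    out["trimmed_max"] = (df.groupby("patient_id")["value"]
                            .apply(lambda v: v[(v >= v.quantile(0.05)) &
                                               (v <= v.quantile(0.95))].max()))
    return out.reset_index()
```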
As with measurements, a prescription order in the clinical record may have the appearance of an unambiguous, objective fact: the provider intended for the patient to take the medication, and the patient was administered the treatment or filled the prescription and adhered to the regimen. In reality, a great deal of uncertainty is associated with medication data, especially regarding whether the prescription was filled and the duration over which the medication was taken. It is well known that patients do not always fill the prescriptions they are given or take the medications they are dispensed. 26 Methods have been developed to infer adherence based on fill and refill patterns and patterns of prescriptions written, 27 but it has been shown that the resulting estimates are highly sensitive to slight changes in definitions. 28 Therefore, not all similar-appearing patterns in medication orders reflect a similar degree of exposure to a drug. Medications may be discontinued because of side effects, ineffectiveness, cost, or resolution of the problem they treat. 29

Once the data scientist has evidence (eg, via refill patterns) that a patient was on a medication, the next issue is how to represent that medication in the analytical data set. While standards like RxNorm allow encoding of medication information at a very granular level, including ingredient, dose, and form, one can also group medications by core ingredient or even group related drugs under the same drug class. 30 Combination products pose additional challenges. The data scientist and clinical expert need to collaborate to strike a balance between features that group many drugs too coarsely into a single category and a very granular coding that distinguishes all medication variations but produces too many features to be analyzed well.
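One widely used fill-pattern measure is the proportion of days covered (PDC). Below is a minimal sketch of a PDC calculation from hypothetical pharmacy fill records; real implementations must also handle carryover of overlapping fills, hospital stays, and censoring, which is exactly where the definitional sensitivity noted above creeps in.

```python
from datetime import date, timedelta

def proportion_of_days_covered(fills, start: date, end: date) -> float:
    """PDC over an observation window.

    `fills` is a list of (fill_date, days_supply) tuples. Days covered by
    overlapping fills are counted once; carryover is not shifted forward,
    which is one of several defensible conventions that change the estimate.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)
    window_days = (end - start).days + 1
    return len(covered) / window_days

# Three 30-day fills over a 181-day window, with a gap between refills.
fills = [(date(2021, 1, 1), 30), (date(2021, 2, 15), 30), (date(2021, 4, 10), 30)]
print(round(proportion_of_days_covered(fills, date(2021, 1, 1), date(2021, 6, 30)), 2))
```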
Researchers may take different approaches to EHR phenotyping, depending on the requirements of the project. A disease phenotype can often be ascertained rapidly using ICD billing codes alone, defining cases as individuals who have one or perhaps multiple disease codes. This process can be scaled to generate a phenome-wide snapshot of the patient population, which can be used in methods that require capturing hundreds or thousands of different disease labels. 31 Some projects require higher-quality phenotypes than can be generated using ICD codes alone. In this case, researchers may develop computable phenotypes that combine multiple datatypes (eg, diagnosis codes, text mentions, medications) to increase the specificity and/or sensitivity of the phenotype. 32,33 Ideally, a computable phenotype should be designed with input from both clinical domain experts and data scientists.

Having an evaluation process for a computable phenotype is critical. Because of the different perspectives between and among data scientists and clinicians, disagreement about the best way to define a computable phenotype is common. One way to settle these disagreements is to test assumptions against real data. The evaluation process for a computable phenotype usually requires a gold standard of expert-reviewed charts. Surrogate markers like genetics may also be used, when available, to assess phenotype quality. 34
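As a sketch of such an evaluation, the snippet below (Python, hypothetical inputs) compares a phenotype's case assignments with chart-review labels and reports positive predictive value and sensitivity. In practice, the reviewed sample is usually drawn and weighted more carefully than this simple paired comparison assumes.

```python
def evaluate_phenotype(algorithm_case: list[bool], chart_case: list[bool]) -> dict:
    """PPV and sensitivity of a phenotype against chart-review labels.

    Both lists are aligned per patient: algorithm_case[i] is the computable
    phenotype's call, chart_case[i] is the expert reviewer's judgment.
    """
    tp = sum(a and c for a, c in zip(algorithm_case, chart_case))
    fp = sum(a and not c for a, c in zip(algorithm_case, chart_case))
    fn = sum(c and not a for a, c in zip(algorithm_case, chart_case))
    return {
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Tightening a definition typically raises PPV while lowering sensitivity;
# computing both makes that trade-off explicit.
print(evaluate_phenotype([True, True, False, False], [True, False, True, False]))
```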
With initiatives such as common data models (CDMs) and shared phenotype standards (an example of a shared phenotype definition appears in Figure 6), the informatics community has grown more adept at harmonizing EHR data. 38 The National COVID Cohort Collaborative (N3C) is a good example of an initiative that used past innovations and knowledge to quickly build a broadly accessible cohort of over 5 million patients from disparate health systems. 39 But while CDMs and phenotype standards greatly increase the efficiency of cross-institutional research, data scientists using these tools should remain mindful of the caveats described above when analyzing and interpreting results. Phenotypes designed for CDMs may be guaranteed to execute smoothly, but their quality will still vary with the complexities of the underlying EHR data. 40

Developing high-quality clinical evidence from real-world clinical data requires that data scientists and clinicians collaborate, communicate, and iterate on both the development of the analytical data set and the conduct of the analysis. Data scientists in the clinical domain do not need formal clinical education, but to participate deeply in the research process as true collaborators, they require special training in the idiosyncrasies of clinical data and their impact on the interpretability of results. 19 Clinicians should be involved in the process of optimizing clinical phenotypes and need to work with data scientists to support data-driven decisions about the appropriate granularity and timing of definitional components. Biostatisticians and epidemiologists also provide important insights into study design and the handling of missing or noisy data. 41,42

To support reproducibility of findings, more work is needed to develop reporting standards for the results of EHR analyses. An analysis of EHR data performed under one set of assumptions, even if well informed by expert opinion, may be spuriously correct or incorrect. Worse still, the investigative team may have performed an analysis under multiple assumptions and reported only the version that supports their hypothesis. Therefore, the traditional style of reporting results of analyses of real-world data in a manner similar to randomized controlled trials (RCTs), with a single point estimate and confidence interval for the association between an exposure and outcome, is suboptimal. This approach has led to conflicting literature 43-46 where it is difficult to understand the underlying cause of the different results and their implications for generalizability. 47

FIGURE 6: Flow chart for identifying type 2 diabetes in PheKB. Developing a computable phenotype for diabetes illustrates many of the issues highlighted in this article. Diabetes can be either type 1 or type 2. While type 1 diabetics always require insulin, type 2 diabetics sometimes require insulin. Some patients, especially those who receive insulin, may have accumulated evidence for diagnoses of both type 1 and type 2 diabetes over time, so identifying the type of diabetes a patient has from diagnosis codes may be challenging. PheKB provides an algorithm for type 2 diabetes that excludes patients who have ever had a diagnosis of type 1 diabetes. That decision likely increases the positive predictive value of the phenotype but lowers its sensitivity. The flow diagram implies that the diagnosis of type 2 diabetes requires a type 2 diagnosis plus an abnormal laboratory test result, or a type 2 diagnosis plus a suggestive medication, or two diagnoses of type 2 diabetes. While this is a single definition, it allows multiple paths to a diagnosis that could be differentially present at different sites. This definition is one of many that could be developed based on the specific data source and use case.

EHR-based research has earned a place in the research ecosystem. To be successful, researchers must always be mindful of the complexities of EHR data, many of which are described in this article, and remain vigilant for unexpected challenges that could compromise the science.

References
1. Reimagining the research-practice relationship: policy recommendations for informatics-enabled evidence-generation across the US health system.
2. Achieving a nationwide learning health system.
3. Implementing risk stratification in primary care: challenges and strategies.
4. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data.
5. Caveats for the use of operational electronic health record data in comparative effectiveness research.
6. Electronic health records to facilitate clinical research.
7. Disease associations depend on visit type: results from a visit-wide association study.
8. Empowering the data science scientist.
9. SLE: reconciling heterogeneity.
10. Natural language processing: an introduction.
11. Clinical information extraction applications: a literature review.
12. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review.
13. Medicare upcoding and hospital ownership.
14. Desiderata for controlled medical vocabularies in the twenty-first century.
15. Transition to the ICD-10 in the United States: an emerging data chasm.
16. Early impact of the ICD-10-CM transition on selected health outcomes in 13 electronic health care databases in the United States.
17. Impact of ICD-10-CM transition on mental health diagnoses recording. EGEMS (Wash DC).
18. Current scope and challenges in phenome-wide association studies.
19. Performance of a computable phenotype for identification of patients with diabetes within PCORnet: the patient-centered clinical research network.
20. A comparison of phenotype definitions for diabetes mellitus.
21. LabWAS: novel findings and study design recommendations from a meta-analysis of clinical labs in two independent biobanks.
22. Trends in the diagnosis of vitamin D deficiency.
23. Trends in use of high dose vitamin D supplements exceeding 1,000 or 4,000 international units daily.
24. LOINC, a universal standard for identifying laboratory observations: a 5-year update.
25. Is everyone really breathing 20 times a minute? Assessing epidemiology and variation in recorded respiratory rate in hospitalised adults.
26. Medication adherence measures: an overview.
27. Medication (re)fill adherence measures derived from pharmacy claims data in older Americans: a review of the literature.
28. Pitfalls of medication adherence approximation through EHR and pharmacy records: definitions, data and computation.
29. Understanding reasons for treatment discontinuation, attitudes and education needs among people who discontinue type 2 diabetes treatment: results from an online patient survey in the USA and UK.
30. Analyzing U.S. prescription lists with RxNorm and the ATC/DDD index.
31. Using Phecodes for research with the electronic health record: from PheWAS to PheRS.
32. A framework to support the sharing and reuse of computable phenotype definitions across health care delivery and clinical research applications. EGEMS (Wash DC).
33. Optimizing identification of resistant hypertension: computable phenotype development and validation.
34. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data.
35. Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts.
36. Electronic clinical laboratory test results data tables: lessons from Mini-Sentinel.
37. Transforming and evaluating electronic health record disease phenotyping algorithms using the OMOP common data model: a case study in heart failure.
38. Making work visible for electronic phenotype implementation: lessons learned from the eMERGE network.
39. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment.
40. Aggregating electronic health record data for COVID-19 research: caveat emptor.
41. Review: a gentle introduction to imputation of missing values.
42. Pooled and harmonized study designs for epidemiologic research: challenges and opportunities.
43. Oral fluoroquinolones and the risk of retinal detachment.
44. Association between oral fluoroquinolone use and retinal detachment.
45. The use of pioglitazone and the risk of bladder cancer in people with type 2 diabetes: nested case-control study.
46. Pioglitazone and bladder cancer: a propensity score matched cohort study.
47. A review of predictive analytics solutions for sepsis patients.

Acknowledgments: The authors thank Jeffrey Goldstein (Northwestern Memorial Hospital) for his help creating Figure 3. None of the authors have any conflicts of interest to report, nor did they receive any funding in support of the manuscript development.