key: cord-0821935-cizcg9s5 authors: Thomas, Jason A; Foraker, Randi E; Zamstein, Noa; Morrow, Jon D; Payne, Philip R O; Wilcox, Adam B title: Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C) date: 2022-03-31 journal: J Am Med Inform Assoc DOI: 10.1093/jamia/ocac045 sha: 904cdba54cd4c821d9ad73ab9965f3fb0a36a786 doc_id: 821935 cord_uid: cizcg9s5 OBJECTIVE: To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original data set (n = 1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5,819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses on the population-level and of densely-tested zip codes (which contained most of the data) were similar between original and synthetically-derived data sets. Analyses of sparsely-tested populations were less similar and had more data suppression. 
CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

Background and significance

We aimed to evaluate N3C synthetic data in a manner that can inform users with a wide range of intended use cases and definitions for synthetic data fitness for use. [25] The utility of synthetic health data has been evaluated in other work [15, 19, 20, 26-30] outside of N3C, which applied a variety of approaches to validating synthetic data. [31] However, N3C synthetic data utility has only been evaluated once before. Recently, the N3C synthetic data validation task team evaluated the utility of N3C synthetic data (MDClone, Be'er Sheva, Israel) across three use cases, one of which had a geospatial and temporal focus. [32] Foraker et al. (2021) found the synthetic data had high utility for construction of a single aggregate epidemic curve of COVID-19 cases. However, they also showed that rural zip codes with smaller population counts were more likely than urban zip codes to have their zip code labels censored (suppressed) in the synthetic data; censoring replaces a categorical variable's value with the word 'censored'. Zip code censoring aims to protect the privacy of patients with particularly uncommon, and thus identifiable, features. To date, no analyses have been conducted on the N3C synthetic data to assess utility for analyses by individual zip codes and/or aggregate indicators beyond case counts (e.g. percent positive) over time. In this paper, we describe the N3C synthetic data validation task team methods and results focused on evaluating whether synthetic N3C data can be used for geospatial and temporal epidemic analyses.
Our replication studies focused on what we deemed were important and common analyses to be performed, such as epidemic curves for key indicators and creation of public-facing dashboards. [33] [34] [35] Our validation included replication of studies and general utility metrics [31] for: analyses at the zip code level over time, construction of epidemic curves, and aggregate population characteristics. We believe these approaches balance the need to provide broad utility results for a wide range of analyses while also providing specific validation results relevant to analyses of common interest. The MDClone ADAMS Synthetic Engine [MDClone Ltd., Be'er Sheva, Israel] derives a novel, synthetic data set from input data, specifically computed to preserve the statistical properties, correlations, and higher-order relationships between variables while containing none of the individuals from the original data. The synthetic process fits new data points to a derived, multi-dimensional model so that information cannot be learned with certainty about any one individual in the population that cannot be learned about a group of other similar individuals. An authenticated researcher specifies the patient cohort of interest from the underlying local data lake using the graphical query tool in ADAMS. The user selects the variables to be included in the output and can specify temporal relationships of interest. The derivative synthetic data set is then computed from the original data for the selected cohort and variables, without exposing the user to the underlying original data. For continuous variables, such computationally derived synthetic data are inherently privacy-preserving because, unlike de-identified data, the synthetic data process begins with a statistical model of the original data and samples entirely novel points to fit that model, maintaining the distribution, density, and co-variance between and among features within that model. 
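As a rough illustration of the sampling idea described above, one can fit a simple statistical model to continuous features and draw entirely novel points from it. This is only a sketch under strong simplifying assumptions (a multivariate normal model over two made-up features, age and length of stay); MDClone's actual model is proprietary and far richer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical continuous features (age, length of stay) for an
# "original" cohort; values are illustrative only.
original = rng.multivariate_normal([50, 5], [[100, 8], [8, 4]], size=10_000)

# Fit a simple statistical model (here, just a mean vector and covariance
# matrix) and sample entirely novel points from it. This illustrates
# sampling from a fitted model rather than perturbing real rows.
mu, cov = original.mean(axis=0), np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=10_000)

# The synthetic sample preserves the distribution and covariance but
# contains none of the original points.
print(np.round(mu, 1), np.round(synthetic.mean(axis=0), 1))
```

No synthetic row corresponds to any original row; only the model, not the data, is shared between the two samples.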
There is no one-to-one correspondence between points in the original data set and sampled points in the computed synthetic derivative; this prevents information from being learned from the latter about individuals from the former, other than their overall inter-variable relationships and their population-level descriptive properties. Indeed, if the process is run repeatedly on the same source data, each set of output will be unique, all sharing the same statistical properties as the source but none identical to it or to one another. For categorical variables (e.g., ZIP codes, genders), the finite number of categories presents the inherent possibility of an inference attack or other privacy-threatening methodology. [36] [37] [38] The MDClone engine therefore enforces a number of proprietary techniques, beyond the control of the user, to mitigate this risk. [39] The rows (i.e., patients) in the source database are sorted by the discrete values of categorical variables, thus grouping identical rows. If in any group there are fewer than a system-defined number of rows, termed κ, some of these discrete values are replaced by the word "censored." This process, which reduces the variance in the categorical variables, is repeated until all rows are members of groups of size ≥ κ. Each resulting unique group of rows includes a matrix representing the group's associated numeric variables. Because knowledge of some of the numeric values might permit an attacker to discover something about other variables, MDClone replaces each matrix with an alternative matrix of similar statistical properties. Specifically, the rows are clustered into sets of < κ rows, minimizing the scaled Euclidean distance between data points and preserving, within each cluster, statistical characteristics for every pair of variables. 
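The grouping-and-censoring step described above can be sketched as follows. This is a simplified illustration only, with made-up column names and a toy κ; MDClone's actual algorithm is proprietary, and the subsequent matrix-replacement step (preserving pairwise statistics within each cluster) is not sketched here.

```python
import pandas as pd

def censor_small_groups(df, categorical_cols, kappa=10):
    """Replace rare categorical values with 'censored' so that rows fall
    into larger groups. A simplified illustration only; the production
    algorithm is proprietary and more sophisticated."""
    df = df.copy()
    # Censor the most specific column first, then coarser ones, until
    # every unique combination of categorical values has >= kappa rows
    # (or no columns remain to censor).
    for col in reversed(categorical_cols):
        sizes = df.groupby(categorical_cols)[col].transform("size")
        if (sizes >= kappa).all():
            break
        df.loc[sizes < kappa, col] = "censored"
    return df

# Hypothetical row-level data: one well-represented zip and one rare zip.
tests = pd.DataFrame({
    "zip": ["98101"] * 12 + ["99999"] * 2,
    "result": ["neg"] * 14,
})
out = censor_small_groups(tests, ["zip", "result"], kappa=10)
print(out["zip"].value_counts().to_dict())  # {'98101': 12, 'censored': 2}
```

The rare zip code disappears from the output, which mirrors how zip codes with few tests were suppressed in the synthetic data set studied here.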
As there is an unlimited number of possible alternative matrices for each cluster that satisfies this requirement, the algorithm selects each solution randomly, resulting in an irreversible process and preventing the re-creation of the original data from the result. Finally, to protect against a difference attack based on knowing the exact size of the original population [40], the number of rows that are created to fit the overall data model is altered slightly. This small, arbitrary change in the number of rows prevents an attacker from deducing the exact size of the original population but does not affect the overall statistical properties of the resulting data set. The N3C data analyzed include individual-level EHR data enriched with social determinants of health (SDOH) at the 5-digit zip code level. The data have been harmonized into the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) v5.3.1 [3, 41] and are the same data sets described in a previous N3C synthetic data validation use case. [32] The N3C LDS as of November 30, 2020 (which included 34 data source partners) was used as the data source. MDClone received a copy of the LDS, then transformed these data from the N3C harmonized data model into MDClone's data model. Afterwards, the data required for the study team's analyses were extracted by MDClone from the transformed LDS for use as the "original" data set. A synthetic derivative of this transformed original data set was then created by MDClone. MDClone provided both the original and synthetic data sets to the research team for evaluation within the N3C secure enclave environment (flowchart Figure S4 in supplement). Information on the MDClone data model and pre-processing steps specific to this study is described further in the supplement and in general in past analyses of MDClone data.
[27] Both the original and synthetic data were formatted as a single table. Censored zip codes were those present within the original data but not found within the synthetic data set (n=11,222), either because the zip code was suppressed by labeling it 'censored' or because it was removed within the synthetic data set to protect privacy. Conversely, uncensored zip codes were defined as discrete zip codes found in both the original and the synthetic data (n=5,819). All analyses were conducted solely by one author (JAT). All code was written in Python (v3.6.10) and, as required by N3C, run within the secure N3C enclave using the Palantir Foundry Analytic Platform (Palantir Technologies, Denver, CO). The entirety of code used in this analysis is contained within a single Foundry Code Workbook using a saved Spark environment to preserve required software versions and dependencies. The code workbook and source data have been stored within the N3C enclave so that they may inform and be reused in future validation work. Descriptive statistics were calculated and reported in Table 1 for age, number of unique zip codes present, LOS, and admission date after positive test, stratified by patients who were tested, positive, admitted, and who died during admission. The number of unique zip codes present excluded null or censored zip codes. The difference between original and synthetic values was reported as the raw synthetic difference (synthetic minus original). The difference as a percentage of the original value was reported as the synthetic difference percentage (raw synthetic difference/original). We constructed aggregate epidemic curves using each data set spanning January 1 through November 30, 2020. The following key indicators were calculated and visualized: tests, cases (reproduced from Foraker et al., 2021, to view others in context), percent positive, admissions, and deaths during admission.
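The two difference metrics defined above (raw synthetic difference and synthetic difference percentage) can be computed as below. The summary values here are hypothetical stand-ins; the real numbers live in the manuscript's Table 1.

```python
import pandas as pd

# Hypothetical summary values for two Table 1 rows; the real numbers
# are reported in the manuscript's Table 1.
summary = pd.DataFrame(
    {"original": [43.0, 1854968], "synthetic": [43.0, 1854950]},
    index=["median_age", "n_tested"],
)

# Raw synthetic difference: synthetic minus original.
summary["raw_diff"] = summary["synthetic"] - summary["original"]
# Synthetic difference percentage: raw difference as a share of the original.
summary["diff_pct"] = 100 * summary["raw_diff"] / summary["original"]
print(summary)
```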
Each indicator had the following daily metrics calculated: count (discrete indicators) or value (continuous indicators), 7-day midpoint moving average, and 7-day slope (the count or daily value minus its value six days prior). To assess the statistical difference between original and synthetic epidemic curves, we conducted the paired two-sided t-test (scipy v1.5.3, stats.ttest_rel) and two-sided Wilcoxon signed-rank test (scipy v1.5.3, stats.wilcoxon) for all metrics across all indicators (Table 2), treating each data set's daily results as a pair. To assess the distribution of tests by zip code and the threshold of zip code censoring, we calculated the total number of tests per zip code in the original and synthetic data. In the synthetic data, we excluded rows with a censored (n=44,337; 2.4%) or null (n=444,092; 23.9%) zip code. In the original data, we excluded rows with a null (n=444,380; 24.0%) zip code. We computed the 99th, 97.5th, and 90th percentiles of tests per zip code in the original data. The distributions of tests by zip code were plotted as a histogram (Figure 2) with the synthetic and original data overlaid. Additionally, we calculated the distribution of tests by zip code in the original data that were censored in the synthetic data, then plotted the result as a histogram (Figure S3 in supplement). We then calculated the difference in patients' SDOH values within the original data, comparing patients whose zip codes were censored within the synthetic data to those whose zip codes were not censored (Table 3). Next, we assessed synthetic epidemic curves' performance at the zip code level, focusing on zip codes with relatively abundant data. We created a list of zip codes from the original data in the 99th percentile (n=171) by total number of tests, then removed any zip codes without an uncensored matched zip code pair in the synthetic data (n=0).
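The daily metrics and paired tests described above can be sketched as follows. The daily counts here are simulated stand-ins for one indicator's original and synthetic curves; the real analysis used the N3C extracts and scipy v1.5.3.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", "2020-11-30", freq="D")
# Hypothetical daily counts standing in for one indicator's original and
# synthetic epidemic curves.
original = pd.Series(rng.poisson(100, len(days)).astype(float), index=days)
synthetic = original + rng.normal(0, 3, len(days))

def daily_metrics(s):
    return pd.DataFrame({
        "count": s,
        # 7-day midpoint moving average: window centered on each day
        "avg7": s.rolling(7, center=True).mean(),
        # 7-day slope: the value minus its value six days prior
        "slope7": s - s.shift(6),
    })

orig_m, syn_m = daily_metrics(original), daily_metrics(synthetic)
# Paired tests treat each day's original/synthetic values as a pair.
for col in orig_m.columns:
    both = pd.concat([orig_m[col], syn_m[col]], axis=1).dropna()
    t_p = stats.ttest_rel(both.iloc[:, 0], both.iloc[:, 1]).pvalue
    w_p = stats.wilcoxon(both.iloc[:, 0], both.iloc[:, 1]).pvalue
    print(f"{col}: paired t p={t_p:.3f}, Wilcoxon p={w_p:.3f}")
```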
We randomly sampled ten zip codes from the list and constructed epidemic curves for these zip codes' original and synthetic data (Figures 3-4). Each epidemic curve was constructed using the same date range, methods, and metrics as the aggregate epidemic curves described above with the following change: we only assessed the tests and admissions indicators due to the infrequency of death during admission at the zip code level and manuscript space limitations. We compared the difference in monthly counts of tests, cases, and admissions between the original data and paired uncensored synthetic zip codes. To do so, we calculated each data set's number of tests, cases, and admissions for every zip code stratified by month, for each month the zip code had ≥ 1 test. Then, the data sets were outer merged on month and zip code (Figure 5). Synthetic error, defined as the synthetic monthly count minus the original monthly count, was computed for every zip code and month pair. The distribution of synthetic error was visualized (Figure 6) for tests, cases, and admissions. All visualizations (Plotly v4.14.1, Plotly Technologies Inc.) were interactive, allowing N3C enclave users to zoom in/out, pan, and hover to see values and/or labels. In this manuscript, static figures are presented. Log scales were avoided when possible and, when used, annotated to draw attention to the scale. Visualizations that overlaid both data sets adhered to consistent style conventions. We encoded synthetic and original data sources as red and blue, respectively. Vertical overlaid bars were set to an opacity of 0.35 to 1) provide contrast between the two data sets and 2) allow additional tracings, such as 100% opacity 7-day moving averages used in epidemic curves, to be seen on top of the bars. All visualizations were created using colorblind-safe color mappings.
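The outer merge and synthetic error computation described above can be sketched as below. The monthly counts are hypothetical; the real analysis computed them from the row-level N3C extracts.

```python
import pandas as pd

# Hypothetical monthly test counts per zip code for each data set.
orig = pd.DataFrame({"zip": ["98101", "98101", "98052"],
                     "month": ["2020-03", "2020-04", "2020-04"],
                     "tests": [120, 340, 15]})
syn = pd.DataFrame({"zip": ["98101", "98101", "98052"],
                    "month": ["2020-03", "2020-04", "2020-05"],
                    "tests": [118, 335, 3]})

# Outer merge on month and zip so months present in only one data set
# are retained (their missing counts become 0).
merged = orig.merge(syn, on=["zip", "month"], how="outer",
                    suffixes=("_orig", "_syn")).fillna(0)
# Synthetic error: synthetic monthly count minus original monthly count.
merged["synthetic_error"] = merged["tests_syn"] - merged["tests_orig"]
print(merged)
```

Note that the outer merge surfaces zip-month pairs where one data set recorded tests and the other did not, which is exactly where the largest errors appear.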
Categorical mappings encoding values besides data source (synthetic or original) used hexadecimal color codes found in the seaborn colorblind palette [42, 43]. Each visualization was qualitatively tested for colorblind deuteranopia, protanopia, and tritanopia interpretability by one member of the research team (JAT) using Color Oracle. [44]

Post-hoc privacy evaluation

A post-hoc assessment was done to determine the privacy preservation of the synthetic data produced for this study, by addressing the possibility that the presence of an individual in the original data set could be inferred from the synthetic data. Specifically, we queried whether there were any rows in the synthetic data set that share identical attributes across continuous and categorical values with rows in the original data set. Since exact matches across continuous variables are expected to be rare [45], we also examined whether subjects with a unique value in a categorical variable, or bearing a rare combination of categorical values, were reproduced in the derivative synthetic data set. There were nearly two million tested patients (original n=1,854,968; synthetic n=1,854,950) in each data set. As seen in Table 1, the overall central tendencies of variables of interest were similar between the synthetic data and original data, especially for age and percent positive/admitted/died. The raw synthetic difference was zero, rounded to two decimal points, in roughly one third (18/50) of Table 1 rows. The variable with the greatest synthetic difference was unique zip codes, with a 65-98% reduction in unique zip codes. Median LOS and IQR for admitted patients were exactly the same, yet the mean LOS was 6.48 (±290.81) and 8.32 (±10.66) days for original and synthetic values, respectively.
The extreme LOS standard deviation observed in the original data was due to an erroneous outlier. A single row in the original data had an extreme negative LOS [~-44,000 days; ~-120 years], and 11 rows had a LOS of -1. The synthetic data also had negative LOS values (n<10), but the values were greatly attenuated, ranging from -1 to roughly -175. As a result of noticing this extreme LOS, all columns in the original and synthetic data were assessed for implausible outliers likely to be the result of data quality issues. None were found. In our statistical analysis, no differences were found between the aggregate epidemic curves besides the 7-day average of percent positive [(t-test p-value=0.025; Wilcoxon p-value=0.072), Table 2]. Differences were observed in SDOH values between patients whose zip codes were uncensored in the synthetic data and patients whose zip codes were censored (Table 3). The largest differences were found in the total population of the zip code and age. Patients with uncensored zip codes lived in more populous zip codes (median total population: uncensored=28,479, censored=7,935) and were younger (median age: uncensored=43.5, censored=48.7). The randomly sampled top 1% paired zip codes' epidemic curves are presented in Figure 3 and Figure 4. The 90th, 97.5th, and 99th percentiles for total tests by zip code in the original data were 125, 784, and 1,636 tests, respectively (see Figure 2A). Thus, a small minority of zip codes account for the vast majority of total tests. There were 15,108 (88.7%) unique zip codes in the original data with <100 total tests and 11,039 (64.7%) with <10 tests. Above this threshold (n≥10 tests), the synthetic data mimic the original data distribution closely (see Figure 2B).
There were 17,041 unique zip codes and 5,819 unique uncensored zip codes in the original and synthetic data, respectively. The vast majority of censored zip codes were those with <10 total tests in the original data (mean=2.9±2.4; median=2, IQR=3; max=16), as seen in Figure S3 of the supplement. The absolute value of pairwise synthetic error stratified by month and zip code increased as the original data count increased (see Figure 6; supplement Table S1). Thus, as the sample size of the data increased, so did the absolute synthetic error, and vice versa. The synthetic error for tests ranged from an IQR=2 when the original value of tests was between 0 and 19 to an IQR=9 when the original value of tests was between 250 and 1,705. All synthetic error was positive for zip codes in the bin with an original count of zero. All other bins' synthetic error across key indicators was skewed negative, indicating that the synthetic data had lower counts than the original data. In the post-hoc privacy assessment, 6,839 of 1,854,975 rows (0.37%) in the synthetic data set contained the same values in all 13 columns as corresponding rows in the original data set. However, this included numerous values that were null or missing; all but six of the 6,839 rows included at least eight missing values among the 13 variables, which greatly mitigates the likelihood of a meaningful identity disclosure, particularly given the vastly larger number of rows compared to columns. In a second run of the synthetic algorithm, none of the six rows with fewer than eight missing values appeared again in the new synthetic derivative, indicating that the initial replication was due to chance rather than individual characteristics of the rows.
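The exact-match query behind this post-hoc assessment can be sketched as below. The column names and values are hypothetical 3-column stand-ins for the 13-variable data sets.

```python
import pandas as pd

# Hypothetical 3-column stand-ins for the 13-variable data sets; column
# names and values are illustrative only.
original = pd.DataFrame({"age": [34, 51, 67],
                         "zip": ["98101", "98052", "10001"],
                         "result": ["neg", "pos", "neg"]})
synthetic = pd.DataFrame({"age": [34, 48, 70],
                          "zip": ["98101", "98052", "10001"],
                          "result": ["neg", "pos", "pos"]})

# An inner merge on every column returns rows whose values match exactly
# across the two data sets.
exact = original.merge(synthetic, on=list(original.columns), how="inner")
print(len(exact), "exact-match row(s)")  # 1 exact-match row(s)
```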
When the rows in both data sets were grouped into unique combinations of their categorical values (Supplemental Figure S2), groups (or equivalence classes) of individuals with fewer than ten members existed in the original data set but did not appear in the synthetic data set; this is consistent with the censoring algorithm's minimum equivalence class of ten rows, chosen in conformance with a generally accepted cutoff. [46, 47] Overall, analyses on the population level and of densely-tested zip codes (which contained most of the data) were similar between original and synthetically-derived data sets. Analyses of sparsely-tested populations with smaller sample sizes were notably less similar and had more data suppression, which is in agreement with prior work. [19, 32] Synthetic data most closely matched the original data on aggregate data tasks such as aggregate epidemic curves (Figure 1) and broad summary statistics (Table 1). At the aggregate level, only one metric (percent positive, 7-day average) across all indicators showed a significant difference between synthetic and original aggregate epidemic curves (Table 2). Scarcity of data - as data collection used in this manuscript tapered off in November - is likely a contributing factor to the difference. The summary statistics of both data sets' populations shown in Table 1 were similar. Major exceptions were the number of unique zip codes, due to censoring in the synthetic data, and attenuation in the synthetic data of a single extreme outlier (~-44,000 day LOS) caused by a data quality issue in the original data. Other erroneous negative LOS values persisted within the synthetic data, yet the bulk of the remaining erroneous values were a LOS of -1, which has been reported as a data quality issue attributed to daylight saving time.
[48, 49] Thus, we show that synthetic data can reduce the impact of data quality issues by removing or attenuating erroneous outliers, with the aim of protecting the privacy of rare, and thus identifiable, data. At the zip code and month level, the absolute synthetic error was well behaved: it increased predictably as the size of the original data increased (Figure 5 & Supplementary Table 1). Therefore, the amount of synthetic error is predictable, which gives users the ability to estimate the level of error in their data of interest. Additionally, the synthetic error relative to the original data value is likely small enough for most uses of synthetic data. For example, a zip code in the synthetic data with a monthly positive count of 6-49 is off from the original data by an average of -0.59±2.63. The overrepresentation of negative tests in the original data by 8.5-fold (Table 1) appears to bias synthetic error. Since it is impossible to have a count below zero, the synthetic data cannot add privacy-preserving noise in the negative direction for zip code monthly counts equal to zero. Consequently, the synthetic data systematically underestimate the monthly count of key indicators in zip codes with the most tests, cases, and deaths, and overestimate them in zip codes with the least. Our results relate to Flaxman et al. (2020), which observed a similar effect resulting from a non-negativity constraint in the US Census' TopDown differential privacy algorithm. [12] The magnitude of the synthetic error skewing negative in a smaller concentration of zip codes increased as a key indicator became less frequent, which is fundamentally a signal problem in low-density data sets and is not specific to synthetic data generation.
The top 1% most tested zip codes' epidemic curves provide users with 10 qualitative examples of densely tested zip codes. Overall, the synthetic data closely matched the start and end dates of the original data and followed the overall trend of the original data over time (e.g. Figure 3A matched a spike in late April). The ten examples show users the 99th-percentile best-case scenario of key indicator original data availability and synthetic data performance at the zip code level, yet the size and testing density of N3C data will likely continue to increase. Our findings show the importance of understanding the characteristics and limitations of the original data, since we found these biases affected synthetic data utility. Data biases resulting in poorer performance of software tools, clinical guidelines, and other applications for groups underrepresented in source data have been previously reported for separate tasks. In our data, patients from sparsely tested, often rural, regions were more likely to have their zip code censored (Table 3). As a consequence, rural zip codes, which are already underrepresented in the original data, become even less available to directly analyze. Additionally, patients with censored zip codes were older, potentially due to older patients traveling from sparsely tested regions to receive care offered at distant academic medical centers which participate in N3C. Traditional de-identification methods would likely censor or suppress zip codes with few tests as well, or group them together into higher-level geographic regions. Thus, it is important to view our findings in relation to common alternatives.
While our results demonstrate the utility of using synthetic data for a broad range of geospatial analyses, a caveat to synthetic data use is its limited utility for analyzing rural N3C populations, since nearly all zip codes with <10 tests were censored and these were much more likely to be rural within the original data. Suppression of non-zero counts <10 is a common convention within state and federal guidelines to avoid inadvertent disclosure of protected health information for publicly released data. [46, 47, 54] Analyses such as choropleth maps at the zip code level including sparsely tested regions would benefit from using the LDS to obtain access to all zip codes without suppression, or from generating and using a different MDClone synthetic data set that reports geospatial data at a lower level of granularity (e.g. 3-digit zip codes). Our results may inform future N3C discussions about data set balancing, ranging from 1) creation of artificially balanced hybrid data sets to improve statistical models' performance on underrepresented data [50, 55], to 2) source partners sending a random sample of negative tests alongside all positive tests, or 3) expansion of data ingestion from rural regions. Whether these synthetic data are "good enough" hinges on a fitness for use determination to be made by each user. The authors believe the data will be useful enough for a wide variety of use cases. Educational software engineering projects or pandemic preparedness tool development could be especially well-served by these data. A major limitation of the data, however, is that they are output in a different data model than the OMOP CDM. [41] Thus, tools built on the synthetic data would not be transferable to run on the LDS without modification. Other users may find the synthetic data well suited to rapid, iterative hypothesis generation/testing without the delays of acquiring the relatively more restricted LDS.
[3] We performed a basic privacy assessment consistent with analyses in other published studies of synthetic data. This post-hoc assessment demonstrated a lack of meaningful matches between the original and synthetic data, indicating that the data in the synthetic data set do not represent specific individuals from the original data set. Matches of values across data sets were rare (<0.4% of rows), non-informative (the vast majority occurred for rows with sparse data), and random (matches were not duplicated in additional derived data sets). In addition, the absence of unique rows across categorical values demonstrated that value matches on categorical variables, which would be more common, do not single out unique individuals. These results are not unexpected, given the applied algorithm's approach to generating synthetic data and censoring. While we consider this privacy assessment sufficient for this study, which focuses on demonstrating the utility of a synthetic data set for analysis, more work can be done in evaluating privacy with synthetic data approaches. We are currently completing an independent and more rigorous evaluation of synthetic data privacy using an adversarial network approach. Additional research is still needed to evaluate synthetic data privacy validation approaches and the actual risk of information gain if variable sets are matched. These issues are beyond the scope of this paper but represent the challenges in advancing the use of synthetic data. Others who have studied practical and legal implications of synthetic data sets have recommended their use over de-identified data [56]; this paper demonstrates the utility of synthetic data for geographic and temporal analyses, which is a specific functional advantage over Safe Harbor de-identified data. To date, no privacy analysis has been published on these synthetic data to provide context for their utility in relation to their privacy. N3C is currently assessing the privacy of the data used in this study.
In a forthcoming manuscript, N3C will be able to quantify the privacy-utility tradeoff of these data by pairing the privacy analysis with these results. Such a full privacy analysis is well beyond the scope of the present paper. However, the methodology described above reflects that the synthetic data process, when computing a derivative data set for a user-defined patient cohort and selection of properties, is inherently privacy-preserving. While the algorithm maintains the statistical properties - and therefore the utility - of the data, the underlying original data are not visible to the user during the synthesis process. Categorical values are censored, when necessary, to mitigate their inherent exposure to inference attack. The synthetic algorithm is intentionally non-reversible, with multiple layers of protection against privacy attack. The mathematical calculation of alternate matrices is based on scaled Euclidean distance between data points and is by nature non-reversible. The size of the output population is modified slightly, without altering the statistical model, to further thwart potential attack. The data used in this manuscript do not reflect the current size nor state of the N3C LDS. Other statistical techniques, such as equivalence testing, Bhattacharyya distance [57, 58], or adversarial challenges [28], could be used in the future to compare similarity between epidemic curves. The Wilcoxon signed-rank and paired t-tests assume the null hypothesis that the original and synthetic data sets are equivalent. Equivalence testing, which flips the null hypothesis, may be better suited. Equivalence testing was not used in this manuscript due to the challenge of selecting an equivalence bound without knowing what threshold(s) data end-users would find most applicable.
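As an illustration of one of the alternatives mentioned above, a minimal Bhattacharyya distance between two binned epidemic curves might look like the following. The curve values are made up; this measure was not applied in the present study.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions defined
    over the same bins. Inputs are nonnegative counts; each is normalized
    to sum to 1 before comparison."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient; 1 = identical
    return -np.log(bc)

# Hypothetical daily case counts binned identically for both curves.
original_curve = [5, 20, 60, 90, 40, 10]
synthetic_curve = [6, 18, 62, 88, 42, 9]
print(bhattacharyya_distance(original_curve, original_curve))   # ~0 (identical)
print(bhattacharyya_distance(original_curve, synthetic_curve))  # small positive value
```

Unlike a paired test, this yields a single divergence value per curve pair, which could be thresholded against a pre-registered equivalence bound.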
Additionally, adjustments for multiple testing were not made for differences between synthetic and original epidemic curves; had they been, no p-values would have been < 0.05. Future work conducting equivalence testing specific to well-defined, high-impact use cases may be merited. However, the effort required to do so in an ad hoc manner may suggest that the LDS is a better alternative in those cases. In future work, the effect of data quality on synthetic data may be worth studying through generation of synthetic data at each cycle of iterative data quality improvement.

Overall, the synthetic data are promising for a wide range of use cases, including population-level summary statistics, epidemic curves for the data in aggregate and for the most densely tested zip codes, and analyses necessitating monthly counts of key indicators for the top third of zip codes by number of tests. However, analyses requiring unsuppressed zip code labels for populations with <10 tests may be better served by the LDS. Biases found in the original data, namely an underrepresentation of positive tests and of tests in rural zip codes, were reflected in the synthetic data. Therefore, it is important to understand the limitations and biases of the original data in addition to those of the synthetic data derived from it. We expect the user base of N3C synthetic data to be heterogeneous and the use cases of the data to be broad, resulting in a wide range of fitness-for-use definitions.

To date, there is no published evaluation that quantifies the privacy afforded by this synthetic data set specifically, nor of the MDClone system itself broadly, to contextualize this synthetic data set's utility in relation to a privacy-utility tradeoff; such evaluations are beyond the scope of this work. Future privacy evaluations of MDClone will not necessarily reflect the privacy of the synthetic data analyzed in this study unless the same data set and/or the same MDClone system version and parameters are evaluated.
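The effect of a multiple-testing adjustment described above can be sketched with a Bonferroni correction, one common (and conservative) choice; the paper does not specify which correction would apply, and the p-values below are purely illustrative, not results from the study.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of
    tests performed, capping the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values from paired comparisons of original vs
# synthetic epidemic curves across several zip codes (illustrative only)
raw = [0.003, 0.020, 0.045, 0.300, 0.800]
adjusted = bonferroni_adjust(raw)
print([round(p, 6) for p in adjusted])  # [0.015, 0.1, 0.225, 1.0, 1.0]

# Fewer comparisons remain below the 0.05 threshold after adjustment
print(sum(p < 0.05 for p in raw), sum(p < 0.05 for p in adjusted))  # 3 1
```

Because the study ran many paired tests across zip codes, any such correction shrinks the set of nominally significant differences, which is why the text notes that no p-values would have survived at the 0.05 level.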
Our evaluation of the N3C synthetic data utility provides users the ability to assess whether the synthetic data are fit for use through its combination of general-purpose data utility assessments and visualized replications of analyses of common interest.

[Figure caption fragment: ... and admissions (right column). Synthetic data are more similar to original data when indicator density is higher. Overall, synthetic data closely match overall trends and closely match start and end dates.]
Acknowledgments: Elizabeth Zampino; partners from NIH and other federal agencies: Amit Saha, Satyanarayana Vedula; Synthetic Data Domain Team, including Yujuan Fu, Nisha Mathews, Ofer Mendelevitch.

Data were provided from the following institutions:
• Great Plains IDeA-Clinical & Translational Research
• Maine Medical Center (U54GM115516: Northern New England Clinical & Translational Research (NNE-CTR) Network)
• Wake Forest University Health Sciences (UL1TR001420: Wake Forest Clinical and Translational Science Institute)
• Science Center at Houston (UL1TR003167: Center for Clinical and Translational Sciences)
• Advance Clinical Translational Research (Advance-CTR)
• Rutgers, The State University of New Jersey (UL1TR003017: New Jersey Alliance for Clinical and Translational Science)
• Loyola University Chicago (UL1TR002389: The Institute for Translational Medicine (ITM))
• New York University (UL1TR001445: Langone Health's Clinical and Translational Science Institute)
• Children's Hospital of Philadelphia (UL1TR001878)

References (titles as extracted; numbering lost during conversion):
• A call to strengthen data in response to COVID-19 and beyond
• Ethics and informatics in the age of COVID-19: challenges and recommendations for public health organization and public policy
• Rationale, design, infrastructure, and deployment
• The National COVID Cohort Collaborative: Clinical Characterization and Early Severity Prediction (medRxiv)
• HIPAA Privacy Rule and Its Impacts on Research
• §164.514: Other requirements relating to uses and disclosures of protected health information (CFR-2011-title45-vol1-sec164-514)
• Guidelines for Producing Useful Synthetic Data
• General and specific utility measures for synthetic data
• Protecting GANs from membership inference attacks at low cost
• Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing
• Are Synthetic Data Derivatives the Future of Translational Medicine?
• Differential privacy in the 2020 US census: what will it do?
• Quantifying the accuracy/privacy tradeoff
• Privacy in the age of medical big data
• Utility of privacy preservation for health data publishing
• Transitioning from CDC's Indicators for Dynamic School Decision-Making (released September 15, 2020) to CDC's Operational Strategy for K-12 Schools through Phased Mitigation (released February 12, 2021) to Reduce COVID-19
• Operational Strategy for K-12 Schools through Phased Mitigation
• State-By-State Summary of Public Health Criteria in Reopening Plans
• Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies
• Ensuring electronic medical record simulation through better training, modeling, and evaluation
• Publishing volumes in major databases related to Covid-19
• Citizen science, public policy
• A Global Digital Citizen Science Policy to Tackle Pandemics Like COVID-19
• Juran's quality handbook
• The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures
• Spot the difference: comparing results of analyses from real patient data and synthetic derivatives
• Evaluating the utility of synthetic COVID-19 case data
• Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy
• On the Utility of Synthetic Data: An Empirical Evaluation on Machine Learning Tasks
• Seven Ways to Evaluate the Utility of Synthetic Data
• The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data
• An interactive web-based dashboard to track COVID-19 in real time
• Our World in Data
• Identifying inference attacks against healthcare data repositories
• k-anonymity: a model for protecting privacy
• Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
• United States Patent 10977388: Computer system of computer servers and dedicated computer clients specially programmed to generate synthetic non-reversible electronic data records based on real-time electronic querying and methods of use thereof
• Observational Health Data Sciences and Informatics: OMOP CDM v5.3.1
• Waskom M and the seaborn development team: mwaskom/seaborn
• Color Oracle: Design for the Color Impaired
• Every Needle in a Haystack: finding fingerprints in a Safe Harbor dataset using a single common lab test
• Washington State Department of Health: Guidelines for working with small numbers
• Guide to protecting the confidentiality of personally identifiable information (PII): recommendations of the National Institute of Standards and Technology (Special Publication 800-122)
• National Library of Medicine Training Conference poster
• Machine Learning and Health Care Disparities in Dermatology
• Race/Ethnic Differences in the Associations of the Framingham Risk Factors with Carotid IMT and Cardiovascular Events
• Face recognition vendor test part 3: demographic effects (National Institute of Standards and Technology)
• Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry
• Healthy People 2010 criteria for data suppression
• Synthetic Generation of Clinical Skin Images with Pathology
• Privacy and Synthetic Datasets
• Real-time tracking of non-rigid objects using mean shift
• Synthetic data in the civil service

The entirety of the code used in this analysis is contained within a single Palantir Foundry Code Workbook, using a saved Spark environment to preserve required software versions and dependencies. The code workbook and source data have been stored within the National COVID Cohort Collaborative (N3C) enclave (https://covid.cd2h.org/enclave) so that they may inform and be reused in future validation work. To view the N3C Data Enclave and data access requirements, please navigate to the N3C website.
This study was approved by the Washington University and University of Washington Institutional Review Boards.