key: cord-144221-ohorip57
authors: Kapoor, Mudit; Malani, Anup; Ravi, Shamika; Agrawal, Arnav
title: Authoritarian Governments Appear to Manipulate COVID Data
date: 2020-07-19
doc_id: 144221
cord_uid: ohorip57

Because SARS-CoV-2 (COVID-19) statistics affect economic policies and political outcomes, governments have an incentive to control them. Manipulation may be less likely in democracies, which have checks to ensure transparency. We show that data on disease burden bear indicia of data modification by authoritarian governments relative to democratic governments. First, data on COVID-19 cases and deaths from authoritarian governments show significantly less variation from a 7-day moving average. Because governments have no reason to add noise to data, lower deviation is evidence that data may be massaged. Second, data on COVID-19 deaths from authoritarian governments do not follow Benford's law, which describes the distribution of leading digits of numbers. Deviations from this law are used to test for accounting fraud. Smoothing and adjustments to COVID-19 data may indicate other alterations to these data and a need to account for such alterations when tracking the disease.

There are several possible explanations for the high burden in democracies. First, democracies are on average richer (higher per capita income and higher health expenditure as a percentage of gross domestic product) than other regimes. They can afford more tests, resulting in higher case and death counts. Second, democracies are more open to travel and trade. This facilitates the spread of COVID-19 across borders. Third, democracies may, idiosyncratically, have a larger elderly population, which is more vulnerable to COVID-19. Fourth, most democratic countries are north of 40° latitude. Fifth, perhaps authoritarian regimes have greater control over their population.
They may be better able to enforce social distancing and limit mobility, both of which reduce spread of the disease.

These explanations presume that the data on COVID-19 burden are reliable. However, the press has raised questions about the credibility of COVID-19 data reported by countries. Stories regarding data manipulation have emerged for China [3], Iran [4], Indonesia [5], and the US [6]. Therefore, it is important to investigate statistically the reliability of the COVID-19 data being reported across regimes. In democracies, with freedom of the press, separation of powers, and an active opposition, there may exist checks and balances that prevent governments from manipulating the data. Authoritarian regimes have greater latitude to manipulate data. Such governments have indeed been criticized for manipulating other types of data [7-10]. These governments have an incentive to use information as a form of social control [11-15].

Here we show evidence of manipulation of COVID-19 data by authoritarian regimes relative to democratic regimes. First, data from authoritarian governments show significantly less variation from a 7-day moving average. Because governments have no reason to add noise to data, lower deviation is evidence that data may be massaged. Second, data from authoritarian governments do not follow Benford's law, which describes the distribution of leading digits of non-manipulated numbers. These discrepancies do not provide direct evidence that the lower burden reported by authoritarian governments is due to data manipulation. However, they do provide indirect evidence: these modifications likely have a purpose, and a plausible purpose is suppressing bad news.

Ensuring the credibility of data is not a coronavirus-specific concern. Data manipulation has been a perennial concern in public health and economics.
There are notable instances of data fabrication in research [16], disease surveillance [17, 18], and measurement of economic conditions [7-10]. There are many statistical methods for detecting fraud [16, 19, 20]. Here we focus on two types of tests. One compares moments of the distribution of data across sites [19, 21-23], specifically variance [16, 19, 20, 22, 24-26]. The other looks at digit preference that deviates from Benford's law [19, 20, 27, 28].

There is a strong positive association between fluctuations in the COVID-19 data reported by different countries and their "democratic-ness". Figure 2 plots the natural logarithm of the mean of the squared deviation of daily cases and deaths per million people, respectively, from the 7-day moving average against the EIU's overall democracy index score. Not only do authoritarian regimes report fewer cases and deaths, but there also seems to be more random variation in the data from more democratic nations.

Aggregated data on COVID-19 across all countries in each of the 4 regime categories provide further visual evidence that there is less variation in case data in authoritarian regimes. Figures S2a and S2b plot daily cases and deaths per million people, respectively, around a 7-day centered moving average of those indicators for each regime type. In addition to a lower rate of reported cases and deaths, there is almost no fluctuation in the data from authoritarian or hybrid regimes. Variation in the data appears to increase as one moves to a higher category of democratic-ness.

Regression analysis (Table 1)

Although it is unlikely that features that affect the level of COVID-19 burden also affect the variation in that burden, we estimate a version of the regression in Table 1 with controls for GDP per capita, health and trade as a percent of GDP, share of population over 65, and an indicator for countries above 40 degrees latitude.
While greater democratic-ness is no longer associated with additional variability in cases, it continues to be associated with significantly greater variability in daily deaths per million people (Table S1).

Figure 3 presents the results of our analysis for cumulative case and death data when our screening criterion is that growth in the 7-day centered moving average is greater than 7.5%. (Results of tests under other screening criteria, presented in Tables S2, are roughly consistent.) One cannot reject, at the 1% significance level, that Benford's Law describes the distribution of first digits of cumulative cases for any regime type. However, one can reject at the 1% level that Benford's Law describes the distribution of first digits of cumulative deaths for authoritarian regimes, hybrid regimes, and flawed democracies, while it cannot be rejected for full democracies.

Validation with ECDC data. All of the analyses reported above were also conducted with data from the ECDC, and the results are very similar.

Analysis of compliance with Benford's Law suggests that, for authoritarian regimes, hybrid regimes, and flawed democracies, data on cases comply but data on deaths do not, while for full democracies the data comply with Benford's Law for both cases and deaths. Deaths may be more politically salient than cases and, therefore, more likely to be manipulated. First, because the infection fatality rate of COVID-19 is close to 1%, cases are less consequential than COVID-19 deaths. Second, deaths reflect state capacity better than cases do. Total cases are largely determined by transmissibility and infectiousness of the disease, and the total number of tests. Total deaths are influenced by, in addition to these factors, the health infrastructure, including availability of medical personnel and beds. Governments may be able to credibly blame low levels of testing on global shortages rather than government policy.
Personnel and beds, however, require long-term investments in medical education and construction. Therefore, a high death rate may imply the government has performed poorly for some time.

This study has several limitations. One is that, while we establish an association between data smoothing and government regimes, there may be potential confounders not included here that could alter the conclusions of the study. Second, no causal link has been established between government regimes and data smoothing. Third, the study does not present methods to obtain less biased estimates of cases. Comparison of multiple sources of information or indirect methods of measuring COVID-19, such as SARI cases or orders of caskets, are worth exploring. A fourth limitation is that the paper presents only two major methods of detecting manipulation. There are others, and these may reveal a greater degree of manipulation.

The results here raise significant questions about the reliability of the data being reported by different countries and highlight the need for a degree of caution when making projections using such data. It may be appropriate to put in place systems for ongoing monitoring for fraud, as are used for clinical trials [23, 29-32].

Data. Data on the type of regime in different countries come primarily from the and authoritarian regimes (scores ≤ 4). Given the arbitrary and discontinuous nature of the boundaries between these categories, we also directly use the numerical scores in our empirical analyses. We also employ data from other measures of democracy, such as Freedom House's Democracy, the Varieties of Democracy Index, and the Polity5 index of the Polity Project; these are described in the supplement. For validating results from JHU data, we use data on cases and deaths from the European Centre for Disease Prevention and Control (ECDC).
The ECDC data are similar to JHU, except that they do not contain presumptive positive cases, defined as cases that have been confirmed by state or local labs, though not by national labs such as the CDC [39]. We do not employ World Health Organization (WHO) data on COVID cases and deaths because of a change in the reporting time for WHO numbers on March 18, 2020 that makes it difficult to compare WHO numbers before and after that date. Aggregate WHO data, on the one hand, and JHU and ECDC data, on the other, are very similar, with the exception of the period from February 12-16, 2020. We choose to use ECDC data rather than WHO data to validate results using JHU data because of errors found in the WHO data [39].

Country-level demographic and economic information (country-level per capita income, health and trade expenditure as a proportion of GDP, and the share of population over age 65) for the year 2017/2018 are drawn from the World Bank Open Database [40]. Missing values were substituted with regional averages. We used data from the 165 independent states and two territories for which the EIU produced scores. This covered more than 99% of the world's population. COVID-19 data were available for only 161 countries: Hong Kong was classified as part of China, and there were no data for Comoros, Lesotho, North Korea, Tajikistan, and Turkmenistan. These countries accounted for more than 99% of the total cases and deaths across the world.

Data availability. All the data used for this study are publicly available and, along with code for all statistical analyses, will be posted in a GitHub repository by the corresponding author.

One way to detect manipulation is to look for abnormal statistics (such as the moments of the distribution) of the variable [19, 21-23]. It is difficult to identify abnormal means because one may not observe actual cases separately from the numbers reported by countries.
A challenge for identifying abnormal variation in data is that there is no obvious baseline for normal variation. However, because the virus may not care about regime type, a comparison of variation across regime types may highlight abnormalities. In general, differences in variation across regime types cannot a priori distinguish whether one type suppressed variation or another type added variation. However, it is unlikely that higher variation is associated with manipulation, because countries gain little from adding variation to their data [16, 19, 20, 22, 24-26]. By contrast, manipulating data can lead to reduced variation if care is not taken to reintroduce "normal" levels of variation [41]. Therefore, we investigate whether authoritarian governments manipulate data by testing whether their COVID-19 data are "smoothed" relative to those of democratic governments.

To determine if the difference in data variation between authoritarian and democratic regimes is statistically significant, we employ regression analysis. Our dependent variable is a measure of variation in burden. We compose this variable in three steps. First, we calculate a 7-day centered moving average of daily cases (deaths) for each day in each country. Second, we calculate the square of the deviation of the observed daily cases (deaths) from that moving average for each country. Third, we add one to the squared deviation, divide that by population (in millions), and then take the natural logarithm. Our treatment variable is the country's score on the EIU's Democracy Index, Freedom House's Democracy, the Varieties of Democracy Index, or the Polity5 index of the Polity Project. Our regressions also include a constant. While our observations are at the country-day level, we cluster standard errors at the country level to account for autocorrelation in COVID-19 burden.

Another way to detect manipulation is to see if data follow patterns that are common in non-manipulated data.
One such pattern is that the leading significant digits of a number (or mantissa) have a distribution such that $\Pr(\text{mantissa} < t/10) = \log_{10} t$ for $t \in [1, 10)$ [28]. This pattern is known as Benford's Law, and a wide assortment of data obey it [42-45]. Data have been checked against this distribution to test for fraud in accounting data [46], campaign contributions [47], and scientific data [48]. We investigate whether governments manipulate data by testing whether the COVID-19 data on cumulative cases and deaths across different regimes (authoritarian, hybrid, flawed democracy, and full democracy) conform to Benford's Law.

Before we can test COVID-19 case and death data against Benford's Law, we must decide whether these data are appropriate to test against the law. A concern is that early during an epidemic and after infections plateau, the data will contain many repeated values. These repeats may be the result of true case counts but still violate the law. Therefore, we look at portions of the time series of COVID-19 data during which cases and deaths are rising. Specifically, we test data ("screened data") from periods during which the growth rate of the 7-day moving average of cases and deaths is greater than some cutoff k, where k is 5%, 7.5%, or 10%.

To implement the test, we look only at the first digit of the screened case and death data. According to Benford's Law, $\Pr(\text{first significant digit} = d) = \log_{10}(1 + d^{-1})$, for $d = 1, 2, \ldots, 9$. We group countries into the 4 regimes (authoritarian, hybrid, flawed democracy, and full democracy) defined by the EIU's democracy index. Within each category, we compare the observed frequency of each digit d in the case data against the frequency predicted by Benford's Law using Pearson's chi-squared test.

Natural logarithm of the mean of squared deviations of observed daily cases and deaths per million people from a 7-day centered moving average, by EIU democracy index score. Notes.
Case and death data are from Johns Hopkins University. The democracy index score is from the EIU's Democracy Index. We compute the 7-day centered moving average of daily cases and deaths. We compute the square of daily deviations of the observed cases (deaths) from the 7-day centered moving average and add one to it. Then for each country we divide this daily deviation by population in millions, compute the mean for each country, and take the natural logarithm.

Note: *** p<0.01, ** p<0.05, * p<0.1. 95% confidence intervals are in parentheses. The errors are clustered at the country level. Our unit of analysis is the "country-date". The dependent variable is the natural logarithm of the squared deviation of the observed value from the 7-day centered moving average plus one, per million people, for each country on a daily basis, from the date when the first case was noted until June 30, 2020. The Freedom House democracy score ranges from 0 to 100; to make it comparable to the EIU democracy score, it is divided by 10. The VDEM score ranges from 0 to 1; to make it comparable to the EIU democracy score, it is multiplied by 10. Similarly, the modified Polity5 score ranges from -10 (strongly autocratic) to +10 (strongly democratic); to make it comparable, we add 10 to the score and divide it by 2.

Actual frequency of first significant digit in COVID-19 total cases and deaths during periods in which the 7-day centered moving average grows faster than 7.5% daily, frequency predicted by Benford's Law, and test of the difference, by regime type.

Total COVID cases, deaths per million people, and case fatality ratio, by government regime, over time.

New daily cases and deaths per million people and 7-day moving average of the same, by government regime.

Ordinary least squares regression of deviations from a moving average on measures of democracy.
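The two checks described in the methods can be illustrated with a minimal sketch. This is not the authors' code: the function names (`centered_ma7`, `log_sq_deviation`, `benford_chi2`) are hypothetical, days without a full 7-day window are simply skipped (an assumption; the text does not say how the ends of each series are handled), and the chi-squared test is assumed to use 8 degrees of freedom (9 digit categories minus one). The growth-rate screening step is omitted for brevity. The variation measure follows the daily form used as the regression's dependent variable.

```python
import math
from collections import Counter


def centered_ma7(series):
    """7-day centered moving average; None where the full window is unavailable."""
    out = [None] * len(series)
    for t in range(3, len(series) - 3):
        out[t] = sum(series[t - 3:t + 4]) / 7.0
    return out


def log_sq_deviation(series, pop_millions):
    """ln((squared deviation + 1) / population in millions), per day with a full window."""
    ma = centered_ma7(series)
    return [
        math.log(((x - m) ** 2 + 1.0) / pop_millions)
        for x, m in zip(series, ma)
        if m is not None
    ]


def benford_expected(d):
    """Benford probability that the first significant digit equals d (1..9)."""
    return math.log10(1.0 + 1.0 / d)


def first_digit(n):
    """First significant digit of a positive integer."""
    return int(str(n)[0])


def benford_chi2(counts):
    """Pearson chi-squared statistic and p-value (8 d.o.f.) for positive integer counts."""
    digits = Counter(first_digit(n) for n in counts if n > 0)
    total = sum(digits.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * benford_expected(d)
        stat += (digits.get(d, 0) - expected) ** 2 / expected
    # Closed-form chi-squared survival function for an even number of d.o.f. (here 8).
    half = stat / 2.0
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(4))
    return stat, p_value
```

In this sketch, a perfectly flat series yields a squared deviation of zero on every day, so the measure reduces to ln(1 / population in millions); a small p-value from `benford_chi2` indicates that the observed first digits depart from the Benford frequencies.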
The Atlantic
The Guardian
China's Statistical System in Transition: Challenges, Data Problems, and Institutional Innovations
Measuring Economic Growth from Outer Space
Reconsidering regime type and growth: lies, dictatorships, and statistics
How Much Should We Trust the Dictator's GDP Growth Estimates?
The 1937 census and the limits of Stalinist rule
Why resource-poor dictators allow freer media: A theory and evidence from panel data
Government control of the media
China's strategic censorship
Informational autocrats
Central statistical monitoring: Detecting fraud in clinical trials
Analysing the quality of routine malaria data in Mozambique
Incentives for reporting disease outbreaks
The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials
Statistical techniques to detect fraud and other data irregularities in clinical questionnaire data
Fraud and misconduct in medical science
Are these data real? Statistical methods for the detection of data fabrication in clinical trials
A key risk indicator approach to central statistical monitoring in multicentre clinical trials: method development in the context of an ongoing large-scale randomized trial
Detecting fabrication of data in a multicenter collaborative animal study
Statistical techniques for the investigation of fraud in clinical research
14: Statistical aspects of the detection of fraud. Fraud and misconduct in biomedical research
The law of anomalous numbers
A Statistical Derivation of the Significant-Digit Law
Guidelines for Quality Assurance in Multicenter Trials: A Position Paper
Ensuring trial validity by data quality assurance and diversification of monitoring methods
A statistical approach to central monitoring of data quality in clinical trials
Data fraud in clinical trials
Economist intelligence unit democracy index in relation to health services accessibility: a regression analysis
Religion and Volunteering in Context: Disentangling the Contextual Effects of Religion on Voluntary Behavior
Economic and political determinants of the effects of FDI on growth in transition and developing countries
Political Regime Characteristics and Transitions
COVID-19 deaths and cases: how do sources compare?
World Bank Open Data
Forensic Economics
Assessing the integrity of tabulated demographic data
A taxpayer compliance application of Benford's law
On the peculiar distribution of the US stock indexes' digits
The effective use of Benford's law to assist in detecting fraud in accounting data
Breaking the (Benford) Law
Not the First Digit! Using Benford's Law to Detect Fraudulent Scientific Data

The errors are clustered at the country level. The dependent variable is the natural logarithm of the squared deviation of the observed value from the 7 day centered moving average plus one per million people for each country on a daily basis, from the date when the first case was noted till