key: cord-0193881-z2i8z6ms
authors: McGovern, Stephen
title: Spectral Processing of COVID-19 Time-Series Data
date: 2020-08-13
sha: 4ca0379d2d1141e74d30de87546421298c54bf17
doc_id: 193881
cord_uid: z2i8z6ms

The presence of oscillations in aggregated COVID-19 data not only raises questions about the data's accuracy, but also hinders understanding of the pandemic. A spectral analysis is presented, and the oscillations in the data are replicated using sinusoidal resynthesis. The precise behavior of the seven-day moving average is also discussed, specifically the cause of its jaggedness and the phase error it introduces. In comparison, other filtering techniques and Fourier processing produce superior smoothing and have zero phase error. Both of these are presented, and they are extended to isolate several frequency ranges. This extracts some of the same short-term variability that is resynthesized, and it shows that fluctuations with periods between 8 and 21 days are present in U.S. mortality data.

Multiple groups are aggregating data on COVID-19, including [1]-[4]. By and large, they rely on a large number of other sources for their data. A comprehensive list of data sources is given by [1]. For many countries and regions, counts of daily new cases and daily new deaths show clear oscillations. Over the course of just a few days, the numbers of cases and deaths fluctuate wildly. For example, between July 6 and 7, the number of deaths in Arizona jumped from 1 to 117 [3]. On June 19, Brazil had 55,209 new cases, and on June 21 it had 16,851 [4].

A number of studies have examined the oscillations [5]-[11]. It has been shown that the phases of the oscillations align with one another more often than not [11]. Cases typically reach a maximum close to Friday before falling to a minimum close to Monday [6]. A few different hypotheses have been proposed to explain the oscillations.
New York City and Los Angeles County maintain their own data on COVID-19. According to [7], dates in these datasets are backdated, and the oscillations are not present in their mortality data. It was also shown that oscillations in the number of infections resulted from oscillations in the number of tests being administered on a given day. Even if the oscillations are caused by data acquisition practices, their presence is problematic for more than just mathematical analysis. They indicate that the data points contain a large amount of error. Identifying the sources of error could give insight into the shortcomings of data acquisition practices, and thereby lead to more effective pandemic response strategies.

By their nature, data aggregators rely on many sources of information, and there is an absence of any standard procedures for reporting and acquiring data. Estimation of the total number of infections has been greatly hindered by tests not being available. A number of seroprevalence studies [12] have sought to help assess prevalence. Additionally, analyses of excess deaths have indicated that the true death toll is probably substantially higher than what is reported by data aggregators [13].

The seven-day moving average has become an immensely popular method for suppressing oscillations in COVID-19 data. It is in widespread use by both news agencies and researchers. While superficially simple, its precise behavior is complex and unintuitive. Generally speaking, oscillations are attenuated rather than removed. The moving average also flips the phase of oscillations that lie within specific frequency ranges.

Many papers have looked at trend analysis and forecasting. Some of these, such as [14]-[16], have employed alternative smoothing methods such as exponential smoothing. In contrast, this paper employs methodology from signal and audio processing theory. Spectral techniques are used for improved data smoothing as well as the extraction of shorter-term fluctuations.
Additionally, spectral analysis and modeling techniques are used to resynthesize time-series oscillations. As a source of aggregated COVID-19 data, the repository from [1] is used.

Spectral analysis was performed in [7]-[10] to study the 7-day oscillation. Oscillations shorter than 7 days have also been observed by multiple researchers, such as [7], [9], [10]. Our own spectral analysis found 7-day oscillations for both cases and deaths in many, but not all, countries. A number of countries exhibit a 3.5-day oscillation, and a few were found to have 2.33-day oscillations. Both of these periods are 7 days divided by an integer (2 and 3, respectively); thus, they are potentially harmonics. Harmonics are frequently observed in physical systems, and they are always present in more complex periodic waveforms (e.g., square waves and triangle waves).

Three papers [5], [8], [10] used first differences in their spectral analyses. Conceptually, a first difference is very similar to a derivative. Both amplify higher frequencies and attenuate lower ones. First differences have a phase response that varies linearly with frequency, while true derivatives have a phase response that is constant. For time-series data, attenuation of low frequencies has the effect of removing long-term trends. For COVID-19 data, it also makes the peaks for the 3.5-day and 2.33-day oscillations more prominent relative to the peak for the 7-day oscillation. Spectrograms of the derivatives are shown in Fig. 1 for six countries.

As of this writing, there are only around 130 days in the COVID-19 time-series that are mathematically significant to this paper's analysis. From a signal processing perspective, this is an extraordinarily small number of data points. Sections of the daily counts also contain exponential growth and decay. Selecting different time-ranges can cause the locations of spectral peaks to wobble and sometimes disappear altogether.
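The comparison between first differences and true derivatives can be checked numerically. The following is a minimal sketch, assuming daily sampling (frequencies in cycles per day) and the textbook responses $1 - e^{-j\omega}$ for a first difference and $j\omega$ for an ideal derivative; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

# Frequencies in cycles/day, from just above 0 up to Nyquist (0.5 cycles/day).
f = np.linspace(0.01, 0.5, 50)
w = 2 * np.pi * f

# First difference y[n] = x[n] - x[n-1] has response 1 - exp(-jw).
h_diff = 1 - np.exp(-1j * w)
# An ideal derivative has response jw.
h_deriv = 1j * w

mag_diff = np.abs(h_diff)          # equals 2*sin(w/2): grows with frequency
phase_diff = np.angle(h_diff)      # equals pi/2 - w/2: linear in frequency
phase_deriv = np.angle(h_deriv)    # equals pi/2 at every frequency: constant

# How much a first difference boosts a 3.5-day component relative to a 7-day one.
gain_ratio = (np.abs(1 - np.exp(-2j * np.pi / 3.5))
              / np.abs(1 - np.exp(-2j * np.pi / 7)))
```

Under these assumptions the 3.5-day component is raised by a factor of $2\cos(\pi/7) \approx 1.8$ relative to the 7-day component, compared with exactly 2 for an ideal derivative, which is consistent with the shorter-period peaks becoming more prominent.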
Applying window functions is also problematic, as the data becomes blurred, leaving spectral features more difficult to discern. It was observed that when Fourier transforms were instead applied to the derivatives of the time-series, spectral variations were reduced and features became clearer. This is illustrated in Fig. 2 using a sliding time-window. In this paper, derivatives are calculated in the frequency domain. Time-domain calculations can also work; however, special care may be needed to account for the spectral and phase artifacts.

Reproducing time-series oscillations can aid in understanding the weekly progression of the epidemic. It can be used to forecast minimums, maximums, and plateaus in the daily counts of infections, deaths, tests, and other data. It can also help identify varying data acquisition practices, and it is potentially useful for detecting irregularities in data.

Given a time-series $x[n]$ and its $N$-point Fourier transform $X[k]$, the magnitude and phase angle of the frequency components are respectively given by

$$ |X[k]| = \sqrt{\operatorname{Re}\{X[k]\}^2 + \operatorname{Im}\{X[k]\}^2}, \tag{1} $$

$$ \angle X[k] = \operatorname{atan2}\big(\operatorname{Im}\{X[k]\},\ \operatorname{Re}\{X[k]\}\big). \tag{2} $$

Using Eqs. 1 and 2, and zero-indexed arrays, the time-series can then be reconstructed as

$$ x[n] = \frac{1}{N}\sum_{k=0}^{N-1} |X[k]| \cos\!\left(\frac{2\pi k n}{N} + \angle X[k]\right). \tag{3} $$

Using sinusoidal resynthesis, the oscillations were recreated for the three aforementioned frequencies. The derivative of the time-series was taken, followed by the Fourier transform. Then Eq. 3 was symbolically integrated with respect to $n$, and the summation was taken only for values of $k$ that corresponded to one of the three frequencies. The result is a waveform having only the oscillations in the original time-series.

The input dataset is aggregated on a daily basis. To improve time alignment, each data point was treated as having occurred at 12 noon on its respective day. Resynthesis calculations were then performed using minute-level time-resolution, and the result of this is shown in Fig. 3 for five countries.
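The resynthesis procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the input is a synthetic series (a slow trend plus oscillations at 7, 3.5, and 2.33 days), its length is chosen so that those periods fall exactly on Fourier bins, and the function name `resynthesize` is hypothetical.

```python
import numpy as np

def resynthesize(x, periods_days, t_days):
    """Rebuild only the oscillations of a real daily series x at the given periods.

    Assumes N / period lands on (or very near) an integer Fourier bin.
    """
    n_pts = len(x)
    spec = np.fft.rfft(x)
    k = np.arange(len(spec))
    # Derivative with respect to n, taken in the frequency domain.
    deriv = 1j * (2 * np.pi * k / n_pts) * spec
    mag, phase = np.abs(deriv), np.angle(deriv)

    # Daily sample n is treated as occurring at noon, i.e. at t = n + 0.5 days.
    n_cont = np.asarray(t_days, dtype=float) - 0.5
    out = np.zeros_like(n_cont)
    for period in periods_days:
        kk = int(round(n_pts / period))
        # Integrating cos(2*pi*k*n/N + phase) over n gives (N / (2*pi*k)) * sin(...).
        # The factor 2/N accounts for the conjugate-mirror bin of a real signal.
        amp = (2.0 / n_pts) * mag[kk] * n_pts / (2 * np.pi * kk)
        out += amp * np.sin(2 * np.pi * kk * n_cont / n_pts + phase[kk])
    return out

# Synthetic stand-in for daily counts (not real COVID-19 data).
N = 126
n = np.arange(N)
trend = 1000 + 500 * np.sin(2 * np.pi * n / N)
weekly = (100 * np.cos(2 * np.pi * n / 7 + 0.3)
          + 40 * np.cos(2 * np.pi * 2 * n / 7 + 1.0)
          + 15 * np.cos(2 * np.pi * 3 * n / 7 - 0.5))
x = trend + weekly

t = np.arange(N * 1440) / 1440          # minute-level grid, in days
osc = resynthesize(x, [7, 7 / 2, 7 / 3], t)
osc_at_noon = resynthesize(x, [7, 7 / 2, 7 / 3], n + 0.5)
```

Because the synthetic trend contains no energy at the three selected bins, `osc_at_noon` reproduces the `weekly` component. With real data the trend leaks into neighboring bins, so the match would only be approximate.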
The behavior of COVID-19 time-series data can be better understood and better characterized by altering its spectral properties. In fact, spectral alteration is the precise mechanism by which the seven-day moving average smooths data. In this study, three methods were used for modifying the spectral content of U.S. time-series data: moving average filtering, infinite impulse response (IIR) filtering, and frequency domain processing. As an important first and final step, the data underwent pre- and post-processing.

In the context of digital signal processing, the seven-day moving average is a finite impulse response (FIR) filter. Its time-centered form is non-causal, and it can be denoted by

$$ y[n] = \frac{1}{7}\sum_{m=-3}^{3} x[n-m], \tag{4} $$

which has the transfer function

$$ H(z) = \frac{1}{7}\sum_{m=-3}^{3} z^{-m}. \tag{5} $$

Substituting $e^{j2\pi f}$ for $z$ in Eq. (5) and taking the absolute value results in the frequency response. It follows that the frequency response in decibels is given by

$$ H_{\mathrm{dB}}(f) = 20\log_{10}\left|\frac{1}{7}\sum_{m=-3}^{3} e^{-j2\pi f m}\right|, \tag{6} $$

where $f$ is the frequency in cycles per day. The continuous-frequency phase response is obtained by substituting $H(e^{j2\pi f})$ for $X[k]$ in Eq. 2. The frequency and phase responses for the seven-day moving average are shown in Fig. 4. The three nulls in the plot occur for frequencies with periods of 7 days, 3.5 days, and 2.33 days. The presence of nulls indicates that the filter will completely remove the respective frequencies. All other frequencies, including those with periods shorter than 7 days, are still present. Although they are reduced in magnitude, what remains is enough to be the sole cause of the jaggedness seen in media reports. The jaggedness is illustrated in Fig. 5. To further muddle the data, frequencies on certain intervals have inverted phases. This is illustrated in the phase response in Fig. 4. A second application of the same moving average filter will correct the phase inversions.

B. Processing Methodology

1) Pre- and Post-Processing: Padding the input dataset is beneficial to all three spectral processing methods. It allows the moving average to run to the end of the time-series.
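The moving-average properties described in this section (the three nulls, the inverted phases, and their removal by a second pass) can be checked numerically. The sketch below assumes the centered seven-point kernel with equal weights of 1/7 and uses a synthetic 5-day cosine as the test input; the function name `ma7_response` is hypothetical.

```python
import numpy as np

def ma7_response(f):
    """Response of the centered 7-day moving average; f in cycles/day."""
    m = np.arange(-3, 4)
    h = np.exp(-2j * np.pi * np.outer(np.atleast_1d(f), m)).sum(axis=1) / 7
    return h.real  # imaginary parts cancel because the kernel is symmetric

# Nulls at periods of 7, 3.5, and 2.33 days.
nulls = ma7_response([1 / 7, 2 / 7, 3 / 7])

# A 5-day oscillation (0.2 cycles/day) is attenuated but not removed,
# and its gain is negative, i.e. its phase is inverted.
gain_5day = ma7_response(0.2)[0]
gain_db = 20 * np.log10(abs(gain_5day))

# Time-domain check on a synthetic 5-day cosine.
n = np.arange(70)
x = np.cos(2 * np.pi * 0.2 * n)
kernel = np.ones(7) / 7
once = np.convolve(x, kernel, mode="same")     # inverted and attenuated
twice = np.convolve(once, kernel, mode="same")  # gain is squared, so positive
```

In this sketch the 5-day component passes through a single moving average at roughly -12.7 dB with its sign flipped. The second pass squares the gain, which restores the sign at the cost of further attenuation. Samples within three days of either end (six after two passes) are affected by the array boundaries and are excluded from any comparison.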
Padding also facilitates the initialization of the IIR filters, and it can help prevent artifacts that frequency domain processing would otherwise leave at the ends of the modified time-series. Before processing, 28 days of synthetic data were added to both ends of the original data. Each synthetic data point was calculated by linearly extrapolating from the nearest existing data points located at distances of $7m$ and $7(m + 1)$ days, where $m$ is an integer. Numbers extrapolating to less than 0 were set equal to 0. After processing, sections of the output data corresponding to extrapolated data were removed.

2) IIR Filters: Elliptic filters are a type of IIR filter. Their steep frequency roll-off permits strong frequency isolation. Frequency-dependent phase changes are a common artifact of signal processing filters. However, these artifacts are canceled out if filtering is applied twice in opposite directions. Four elliptic filters were created using Python's SciPy library. Comparable functionality is found in Matlab and Octave. The dataset was filtered twice: backwards first and then forwards. The filter design parameters are given in Table I. Frequency response plots for two passes of two of the listed filters are shown in Fig. 4.

3) Frequency Domain Processing: The Fourier transform will convert a time-series to and from the frequency domain. From there, its spectral content is readily modified. The general formula for such operations can be written as

$$ y[n] = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(x[n]) \cdot H_s[k]\big), \tag{7} $$

where $\mathrm{FFT}$, $\mathrm{FFT}^{-1}$, and $H_s[k]$ are the Fourier transform, the inverse Fourier transform, and a computed spectrum, respectively. Four spectra were computed for $H_s[k]$. A "brick wall" spectrum could be calculated by setting $H_s[k]$ equal to one or more unit step sequences mirrored about Nyquist. However, for reasons that will become clearer in Sec. III-C, "brick wall" spectra were closely approximated by a spectrum derived from $H_e[k]$, the discretely sampled transfer function of the single-pass filters described in Table I.
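The three processing steps above (padding, two-pass elliptic filtering, and frequency domain processing) can be sketched together with SciPy. Several choices here are assumptions rather than the paper's settings: Table I is not reproduced in this text, so the 0.1 dB pass-band ripple and 40 dB per-pass stop-band attenuation are placeholders; only the 8-day and 7-day edge periods come from the text. The padding rule is one reading of the same-weekday extrapolation, and $H_s[k] = |H_e[k]|^2$ is assumed because that is the magnitude response of two passes. The helper names and the input series are hypothetical.

```python
import numpy as np
from scipy import signal

FS = 1.0  # one sample per day

def pad_weekly(x, n_pad=28):
    """Extend x at both ends by same-weekday linear extrapolation, floored at 0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    head, tail = np.empty(n_pad), np.empty(n_pad)
    for d in range(1, n_pad + 1):
        m = (d + 6) // 7  # whole weeks back to the nearest real sample
        # Tail point at index n-1+d: real samples 7m and 7(m+1) days earlier.
        b, a = x[n - 1 + d - 7 * m], x[n - 1 + d - 7 * (m + 1)]
        tail[d - 1] = b + m * (b - a)
        # Head point at index -d: real samples 7m and 7(m+1) days later.
        b, a = x[-d + 7 * m], x[-d + 7 * (m + 1)]
        head[n_pad - d] = b + m * (b - a)
    return np.concatenate([np.clip(head, 0, None), x, np.clip(tail, 0, None)])

# Low-pass design: edge periods from the text, ripple/attenuation assumed.
order, wn = signal.ellipord(wp=1 / 8, ws=1 / 7, gpass=0.1, gstop=40, fs=FS)
sos = signal.ellip(order, 0.1, 40, wn, btype="low", output="sos", fs=FS)

def two_pass(sos, x):
    """Filter backwards first, then forwards, so the phase shifts cancel."""
    backward = signal.sosfilt(sos, x[::-1].copy())[::-1]
    return signal.sosfilt(sos, backward)

def fourier_filter(sos, x):
    """Multiply the spectrum by |H_e[k]|^2 (assumed form of H_s[k])."""
    freqs = np.fft.rfftfreq(len(x), d=1 / FS)
    _, h_e = signal.sosfreqz(sos, worN=freqs, fs=FS)
    return np.fft.irfft(np.fft.rfft(x) * np.abs(h_e) ** 2, n=len(x))

# Synthetic stand-in for 130 days of daily counts (not real data).
n = np.arange(130)
daily = 800 * np.exp(-((n - 65) / 30.0) ** 2) * (1 + 0.3 * np.cos(2 * np.pi * n / 7))

padded = pad_weekly(daily, 28)
smooth_iir = two_pass(sos, padded)[28:-28]
smooth_fft = fourier_filter(sos, padded)[28:-28]
max_diff = np.max(np.abs(smooth_iir - smooth_fft))
```

How small `max_diff` turns out to be depends on the assumed design parameters and on how quickly the filter's ringing decays relative to the 28 days of padding, so it is reported here rather than asserted. Second-order sections are used because steep elliptic designs can be numerically fragile in transfer-function form.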
The moving average was applied to the COVID-19 data. The result is plotted in Fig. 5. Using the parameters from Table I, the elliptic filters and the frequency domain method were also used to process the data.

1) Low-pass #1, shown in Fig. 6: Pass-band and stop-band edge periods of 8 and 7 days, respectively, were used. Even so, some of the bandwidth associated with the 7-day oscillation was able to pass through.

2) Low-pass #2, shown in Fig. 7: The line is again visibly smoother. Also, there are fewer lower-frequency oscillations. As in Fig. 6, the elliptic filters and the Fourier method are in almost perfect agreement.

3) High-pass #1, shown in Fig. 8: The long-term trends are effectively removed. As in the two low-pass methods, the elliptic filters and the Fourier method are once again in near perfect agreement.

4) Band-pass #1, shown in Fig. 9: The seven-day oscillation is fairly well isolated. The resultant waveform is much more sinusoidal in nature than that produced by the high-pass filter. Like the other operations, the two methods agree very well.

5) Band-pass #2, shown in Fig. 10: A band of oscillations with periods between 8 and 21 days is separated. The intent of this trial is partially to isolate some of the oscillations visible in Fig. 6. There are still notable oscillations despite having 80 dB of attenuation on all oscillations with periods shorter than 8 days. The two methods still agree well; however, their difference is slightly more visible than in the other trials.

Significant spectral differences were observed among different countries. These differences are often evident in the time-series data. For example, the United Kingdom, Sweden, and Brazil each had a large peak for 3.5-day oscillations in at least one of their spectra. Upon reviewing the respective time-series, a double-humped oscillation is observed.
Provided that there is no biological explanation, this must result from some kind of tangible difference in how information is being collected, reported, and aggregated.

The oscillations in the time-series were recreated using sinusoidal resynthesis. This effectively recreated the minimums, the maximums, and the general shape of the time-series data, including double-humps. This gives us some ability to predict the behavior of the time-series over the course of a week.

IIR filters and frequency domain processing can be used to selectively manipulate the spectral properties of COVID-19 time-series data. When tuned appropriately, both methods produce virtually identical results. The superior suppression of higher-frequency components smooths data far more effectively than the seven-day moving average. It would be difficult to argue that statistically sensitive calculations should be carried out using a statistically erroneous seven-day moving average. Additionally, the isolation of higher-frequency components could be useful for predicting and modeling short-term variations in observed data. Using these two methods, a band of oscillations longer than 8 days and shorter than 21 days was identified in U.S. mortality data. The cause and significance of this are currently not known.

COVID-19 data repository
European Centre for Disease Prevention and Control
The COVID tracking project
A seven-day cycle in COVID-19 infection and mortality rates: Are inter-generational social interactions on the weekends killing susceptible people?
COVID-19: daily fluctuations, a weekly cycle, and a negative trend
Oscillations in U.S. COVID-19 incidence and mortality data reflect diagnostic and reporting factors, mSystems
Assessment of the weekly fluctuations of the Covid-19 cases in Italy and worldwide
Oscillatory dynamics in infectivity and death rates of COVID-19
Rapidly evaluating lockdown strategies using spectral analysis: the cycles behind new daily COVID-19 cases and what happens after lockdown
Globally Coherent Weekly Periodicity in the Covid-19 Pandemic
Lessons from a rapid systematic review of early SARS-CoV-2 serosurveys
Excess deaths from COVID-19 and other causes
Trend analysis and forecasting of COVID-19 outbreak in India
Modeling and forecasting for the number of cases of the COVID-19 pandemic with the curve estimation models, the Box-Jenkins and exponential smoothing methods
Day level forecasting for coronavirus disease (COVID-19) spread: analysis, modeling and recommendations