title: A flexible smoother adapted to censored data with outliers and its application to SARS-CoV-2 monitoring in wastewater
authors: Courbariaux, Marie; Cluzel, Nicolas; Wang, Siyun; Maréchal, Vincent; Moulin, Laurent; Wurtzer, Sébastien; Obépine consortium; Mouchel, Jean-Marie; Maday, Yvon; Nuel, Grégory
date: 2021-08-04

abstract: A sentinel network, Obépine, has been designed to monitor SARS-CoV-2 viral load in wastewaters arriving at wastewater treatment plants (WWTPs) in France as an indirect macro-epidemiological parameter. The sources of uncertainty in such a monitoring system are numerous, and the concentration measurements it provides are left-censored and contain outliers, both of which bias the results of usual smoothing methods. Hence the need for an adapted pre-processing in order to evaluate the real daily amount of virus arriving at each WWTP. We propose a method based on an auto-regressive model adapted to censored data with outliers. Inference and prediction are produced via a discretised smoother, which makes it a very flexible tool. This method is validated both on simulations and on real data from Obépine. The resulting smoothed signal shows a good correlation with other epidemiological indicators and is currently used by Obépine to provide an estimate of virus circulation over the watersheds corresponding to about 200 WWTPs.

A sentinel network, Obépine (Observatoire épidémiologique dans les eaux usées, https://www.reseau-obepine.fr/), has been designed to monitor SARS-CoV-2 viral load in wastewaters arriving at about 200 wastewater treatment plants (WWTPs) in France (as of November 2021, representing more than one third of the French population) as an indirect macro-epidemiological parameter. This survey was initiated in March 2020 in WWTPs of Greater Paris, during the first epidemic wave in France [Wurtzer et al., 2020]. In this system, samples are currently typically taken twice a week, resulting in missing data on most days. The sources of uncertainty in such a monitoring system are numerous [Li et al., 2021, Arabzadeh et al., 2021]. These include, for example:

• sampling variance (although sampling at the WWTPs is integrated over 24 hours),
• variance from the Reverse Transcriptase (RT) and quantitative or digital Polymerase Chain Reaction (qPCR or dPCR) processes used to measure the concentration of SARS-CoV-2 RNA genes in the samples,
• uncertainties in other analytical steps in the labs, such as virus concentration, genome extraction and the presence of inhibitors.

Using the raw viral load measurements right away to monitor the pandemic can therefore be misleading, as a large variation in the measured concentration can be due either to a real variation in virus concentration or to a quantification error. These data therefore need pre-processing in order to provide an accurate estimate of the real daily amount of virus arriving at each WWTP and to evaluate the uncertainty on this estimate, as also underlined by Arabzadeh et al. [2021], who tested, for this purpose, 13 concurrent smoothing methods on data from 4 Austrian WWTPs. A solution for estimating the underlying true concentrations (from which the noisy measurements are derived) is to exploit the time dependence of the successive measurements. This temporal dependence is indeed due to the (never observed) underlying process, while the measurement noises from one time step to another are assumed to be independent.
A natural way of exploiting this time dependence to denoise the signal is Kalman filtering [Kalman, 1960, Kalman and Bucy, 1961]. This is, for example, what is done for concentration monitoring by Leleux et al. [2002], and it is one of the methods tested by Arabzadeh et al. [2021]. Concentration measurements provided by the Obépine network are left-censored because of detection and quantification limits in (RT-)qPCR [a problem pointed out by Luo et al., 2013, for instance]. This drawback, which is not taken into account in the work of Arabzadeh et al. [2021], results in a non-linear dependence between the measurements and the underlying process which cannot be handled by the classic Kalman filter. Such a censoring obstacle to Kalman filtering can also arise in many other fields [such as visual tracking, due to camera frame censoring; see Miller et al., 2012, for instance]. The Extended Kalman Filter (EKF) handles this problem by local linearization, but this linearization is sometimes not accurate enough, which can lead to large errors [Wan and Van Der Merwe, 2000]. The Unscented Kalman Filter [UKF, Julier et al., 1995] offers a solution to this problem with an enhanced statistical linearization around the state estimates. The Particle Filter [PF, MacKay, 1992] also handles censored data, but at a high computational cost. Finally, the Tobit Kalman filter [Allik et al., 2015b] is designed to handle censoring and has a reduced computational cost compared to the PF [Allik et al., 2015a]. In this approach, the censored data are replaced, within an Expectation-Maximization algorithm, by their conditional expectation given the current parameters. However, we wish not only to predict in real time the value of the underlying process being measured, but also to reconstruct its past values and to estimate the error of the measurement system. To do so, a backward filter has to be computed and combined with the previously described (forward) filters, resulting in smoothers [Haykin, 2004]. Data from the Obépine network moreover contain numerous outliers (which may be caused by the vagaries of sampling or by a handling error, for example) that can bias a smoothing procedure treating them as regular measurements. Aravkin et al. [2017] review adaptations of Kalman smoothing robust to such outliers (and also to abrupt changes of the underlying state and to switching linear regressions), mainly obtained by modifying the loss function to be minimized. However, those adaptations do not apply to the censoring scheme. We here focus on the one-dimensional setting and propose a Smoother adapted to Censored data with OUtliers (SCOU) that addresses both the censoring and the outlier detection problems through a discretization of the state space of the monitored quantities, and that more generally permits a very high degree of modelling flexibility. The proposed model and its implementation are described in Section 2. Section 3 provides an illustration and validation of our approach using numerical simulations. Section 4 gives an example of application to data from the Obépine network.

Our method is based on the following state-space model:

X_t = η X_{t-1} + δ + ε_{X,t}, with ε_{X,t} ~ N(0, σ²) i.i.d.,
Y*_t = X_t + ε_{Y,t} if O_t = 0 (regular measurement), with ε_{Y,t} ~ N(0, τ²) i.i.d.,      (1)
Y*_t ~ U([a, b]) if O_t = 1 (outlier), with O_t ~ B(p) i.i.d.,
Y_t = Y*_t if Y*_t ≥ ℓ, and Y_t is only known to lie below ℓ otherwise (left-censoring),

where:

• t is the time index (ranging from 1 to n).
• X_t ∈ R is the real quantity at time t, and X = (X_t)_{t∈{1,...,n}} is the vector of real quantities (to be recovered).
• Y_t ∈ R is the measurement at time t. Y_t is generally only partially observed.
We note T ⊂ {1, ..., n} the set of times t at which Y_t is observed. Y = (Y_t)_{t∈T} = Y_T is the vector of measurements. Y* is an accessory latent variable corresponding to a non-censored version of Y. I_E denotes the indicator variable of the event E (this notation is also used below).
• η ∈ R, δ ∈ R, σ ∈ R+, p ∈ [0, 1] and τ ∈ R+ are parameters (to be estimated).
• ℓ is the threshold below which censoring applies.
• O_t ∈ {0, 1} is, for any t ∈ T, the indicator variable of the event "Y*_t is an outlier", and O = (O_t)_{t∈T}. B(p) stands for the Bernoulli distribution of parameter p and U([a, b]) for the uniform distribution on the interval [a, b]. a and b have to be chosen; they can, for example, correspond to quantiles (respectively very close to 0 and very close to 1) of the empirical marginal distribution of Y.

Figure 1 depicts this model. In this figure, the arrows represent the direct dependency relations between the variables. We can see the auto-regressive nature of the latent process X, while the Y_t are independent conditionally on X.

Figure 1: Diagram of the proposed auto-regressive model. (X_t)_{t∈{1,...,n}} is the underlying auto-regressive process (to be recovered), (Y_t)_{t∈T} is the measured quantity process, Y*_t is a non-censored version of Y_t, O_t is the indicator variable for the event "Y*_t is an outlier", the ε_{X,t} are tiny innovations and the ε_{Y,t} are the measurement errors.

We are interested in the distribution of (X_t | Y_T = y_T) for every t ∈ {1, ..., n}. This distribution can be computed from the forward (F) and backward (B) quantities, which are defined as follows:

F_t(x) = P(X_t = x, Y_{T∩{1,...,t}} = y_{T∩{1,...,t}}),      (2)

B_t(x) = P(Y_{T∩{t+1,...,n}} = y_{T∩{t+1,...,n}} | X_t = x),      (3)

with B_n(x) = 1 (by convention); for t = n − 1, ..., 1, B_t is obtained from B_{t+1} by a backward recursion, and the F_t by the corresponding forward recursion (both are made explicit in their discretized form below). In the strictly Gaussian case, the forward and backward quantities can be computed recursively in closed form. Censorship and outliers unfortunately prevent the use of these closed formulas. We have hence developed a discrete version of these computations, which makes it easy to adapt to many possible extensions of the model (censorship, but also outliers and possibly heteroscedasticity, for instance). The set of values that can be taken by X, approximated by [a, b], is discretized into D values and becomes X, with, for example, X = {a, a + ∆, a + 2∆, ..., b} and ∆ = (b − a)/(D − 1). We then return to the framework of hidden Markov chains with D states, the D hidden states being the D possible values for each X_t. The transition matrix, π, is calculated as follows:

π(x, x') = F_N(x' + ∆/2; ηx + δ, σ²) − F_N(x' − ∆/2; ηx + δ, σ²) + I_{x'=a} F_N(a − ∆/2; ηx + δ, σ²) + I_{x'=b} [1 − F_N(b + ∆/2; ηx + δ, σ²)],

where I_{A=a} is the indicator variable for the event {A = a} and F_N(·; µ, σ²) is the Gaussian cumulative distribution function with mean µ and variance σ². The emission term e_t(x), the likelihood of the observation at time t given X_t = x, follows from model (1): e_t(x) = 1 when t ∉ T (no information), e_t(x) = (1 − p) f_N(y_t; x, τ²) + p/(b − a) when y_t is observed and not censored (with f_N the Gaussian density), and e_t(x) = (1 − p) F_N(ℓ; x, τ²) + p (ℓ − a)/(b − a) when the observation is censored. The forward and backward quantities, defined in Equations 2 and 3, can then be calculated recursively from π and e:

• F_1(x) = e_1(x)/D and, for t ∈ {2, ..., n}, F_t(x) = e_t(x) Σ_{x'∈X} F_{t−1}(x') π(x', x),
• B_n(x) = 1 (by convention) and, for t ∈ {n, ..., 2}, B_{t−1}(x) = Σ_{x'∈X} π(x, x') e_t(x') B_t(x').

Numerical tricks like logarithmic re-scaling are moreover used to calculate e, π and the forward and backward quantities in order to avoid underflow and overflow problems. The parameters η, δ, σ, τ and p maximizing the likelihood, which is given by Σ_{x∈X} F_n(x), could be obtained with the Expectation-Maximization algorithm. However, this is made challenging by the discretization and renormalization, which is why numerical optimization is preferred. The parameters are more precisely obtained by the Nelder-Mead numerical optimization method (as implemented in the R function optim).
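To make these recursions concrete, here is a minimal R sketch of the discretized forward-backward pass under model (1). It is illustrative code written for this description, not the Obépine implementation: the function name forward_backward and its interface are ours, while the transition, emission and re-scaling steps follow the definitions above.

```r
# Illustrative sketch of the discretized smoother (not the Obepine code).
# grid: the D discretization points; y: measurements, NA on missing days;
# censored: logical flags for left-censored days; ell: censoring threshold.
forward_backward <- function(y, censored, grid, eta, delta, sigma, tau, p, ell) {
  D <- length(grid); n <- length(y)
  a <- min(grid); b <- max(grid); half <- (grid[2] - grid[1]) / 2

  # Transition matrix pi(x, x'): integrate the N(eta * x + delta, sigma^2)
  # density over each destination cell; boundary cells absorb the tails.
  trans <- outer(grid, grid, function(x, xp)
    pnorm(xp + half, eta * x + delta, sigma) -
      pnorm(xp - half, eta * x + delta, sigma))
  trans[, 1] <- trans[, 1] + pnorm(a - half, eta * grid + delta, sigma)
  trans[, D] <- trans[, D] + 1 - pnorm(b + half, eta * grid + delta, sigma)

  # Emission e_t(x): 1 on missing days; otherwise a mixture of the Gaussian
  # measurement (weight 1 - p) and the uniform outlier component (weight p).
  e <- matrix(1, n, D)
  for (t in which(!is.na(y))) {
    e[t, ] <- if (censored[t])
      (1 - p) * pnorm(ell, grid, tau) + p * (ell - a) / (b - a)
    else
      (1 - p) * dnorm(y[t], grid, tau) + p / (b - a)
  }

  # Forward and backward passes with per-step renormalization (the
  # "logarithmic re-scaling" mentioned above); the forward normalizing
  # constants accumulate the log-likelihood.
  fwd <- matrix(0, n, D); bwd <- matrix(1, n, D)
  fwd[1, ] <- e[1, ] / D
  ct <- sum(fwd[1, ]); fwd[1, ] <- fwd[1, ] / ct; log_lik <- log(ct)
  for (t in 2:n) {
    fwd[t, ] <- e[t, ] * as.vector(fwd[t - 1, ] %*% trans)
    ct <- sum(fwd[t, ]); fwd[t, ] <- fwd[t, ] / ct; log_lik <- log_lik + log(ct)
  }
  for (t in n:2) {
    bwd[t - 1, ] <- as.vector(trans %*% (e[t, ] * bwd[t, ]))
    bwd[t - 1, ] <- bwd[t - 1, ] / sum(bwd[t - 1, ])
  }
  post <- fwd * bwd
  post <- post / rowSums(post)  # P(X_t = x | Y_T = y_T), one row per t
  list(posterior = post, log_lik = log_lik)
}
```

The returned log_lik is the quantity a Nelder-Mead search would maximize, e.g. by passing its negation to optim(..., method = "Nelder-Mead") with σ, τ and p re-parameterized through log and logit transforms to respect their domains, mirroring the estimation strategy described above.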
For any t, the marginal distribution of (X_t | Y_T = y_T) (and hence its mean and standard deviation) is given by

P(X_t = x | Y_T = y_T) = F_t(x) B_t(x) / Σ_{x'∈X} F_t(x') B_t(x'),

for any x ∈ X, the probability that X_t (once discretized) takes the value x conditionally on Y_T = y_T. We similarly show that:

P(X_{t+1} = x' | X_t = x, Y_T = y_T) = π(x, x') e_{t+1}(x') B_{t+1}(x') / B_t(x).

Thus, to simulate from P(X_{1,...,n} | Y_T = y_T), one can proceed sequentially, simulating x_1 from P(X_1 = · | Y_T = y_T) and then each x_{t+1} given the simulated x_t from the conditional distribution above. The probability for each observation Y_t to be an outlier (knowing Y_T = y_T and for the a priori probability p) can be computed as follows:

P(O_t = 1 | Y_T = y_T) = Σ_{x∈X} P(X_t = x | Y_T = y_T) · p u_t / e_t(x),      (4)

where u_t is the likelihood of the observation at time t under the uniform outlier component (u_t = 1/(b − a) if y_t is observed and not censored, u_t = (ℓ − a)/(b − a) if it is censored). Y_t is then flagged as an outlier when this probability exceeds a threshold h, where h has to be chosen according to the targeted false positive and false negative rates.

The proposed estimation is first evaluated on artificial data. The simulations are designed to study the ability of the algorithm to produce a good estimation of the parameters, to identify the outliers correctly and to adequately predict the conditional distribution of the underlying process. Artificial data are simulated according to model (1) with n = 150 time steps, 50% of which are observed (a minimal simulation sketch in R is given below). We set a and b respectively to the quantiles 0.02% and 99.98% of the marginal distribution of Y. For the sake of consistency with the study presented in Section 4, parameters are set so as to be realistic with regard to the Obépine data: σ = 0.3, τ = 0.6, η close to 1 and δ close to 0. We carry out five experiments:

• (1), (2) and (3): no censoring, no outliers, δ = 0 and η = 1, with the discretization step ∆ set to 0.02, 0.1 and 0.7 respectively,
• (4): ∆ = 0.1, with the censoring threshold ℓ set for each simulated data set such that 16% of the Y are censored (medium censoring level), an outlier rate of p = 7%, η = 0.99 and δ = 0.001,
• (5): the same setting as experiment (4) but with 31% of censored data (high censoring level).

Each experiment is replicated 100 times. For each simulation, we compare our approach, SCOU, to:

• a 2-parameter Kalman smoother implemented in the DLM R package [Petris, 2010]. In this method, censoring and outliers are not taken into account and δ and η are not estimated (δ is set to 0 and η is set to 1). Those hypotheses correspond to the settings of experiments (1), (2) and (3), which aim at checking that the two methods give identical results when ∆ is small enough,
• the moving average smoother,
• the LOESS [Atkeson et al., 1997], which consists of locally weighted polynomial regressions.

As in the study by Arabzadeh et al. [2021], the parameters of the two last methods (the number of days of the moving average smoother and the span parameter of the LOESS) are chosen by leave-one-out cross-validation so as to minimize the Root Mean Squared Error (RMSE) for the prediction of the left-out observations. When (left-)censoring occurs, feeding the observed censored values to those smoothers would result in a poor recovery of the underlying auto-regressive signal. We compensate for this by instead taking, as the observed value for those smoothers, the estimated mean of the corresponding normal distribution right-truncated at ℓ (estimated by assuming that the observations follow a normal distribution left-censored at ℓ), which improves their performance in all cases.

The first, second and third experiments (no outliers, no censoring, δ = 0 and η = 1) show, as expected, that the 2-parameter Kalman smoother implemented in the DLM R package and our method give identical results for ∆ = 0.02. The two methods show close performances for ∆ = 0.1 and a substantial degradation when ∆ = 0.7, as shown in Figure 2, where the RMSE for the prediction of X is computed for each of the 100 repetitions of each of the three experiments.

Figure 2: RMSE obtained for the prediction of the true underlying signal by the 2-parameter exact smoother and by our method on data sets simulated with no outliers, no censoring, δ = 0 and η = 1, and with varying values of the discretization step ∆. As expected, in this ideal setting, the 2-parameter Kalman smoother implemented in the DLM R package and our method give identical results for ∆ = 0.02.
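The simulation design above can be reproduced in a few lines of R. This is again our own illustrative sketch, not the paper's code: the helper name simulate_scou and the default bounds a = -3, b = 6 and threshold ell are assumptions, whereas in the experiments the threshold is tuned so that 16% or 31% of the Y are censored.

```r
# Illustrative simulation from model (1) (our sketch, not the paper's code).
simulate_scou <- function(n = 150, eta = 0.99, delta = 0.001, sigma = 0.3,
                          tau = 0.6, p = 0.07, a = -3, b = 6, ell = 0.3,
                          obs_rate = 0.5) {
  x <- numeric(n)
  x[1] <- rnorm(1, 0, sigma)                  # arbitrary initialization
  for (t in 2:n) x[t] <- eta * x[t - 1] + delta + rnorm(1, 0, sigma)
  o <- rbinom(n, 1, p)                        # outlier indicators O_t
  y_star <- ifelse(o == 1,
                   runif(n, a, b),            # outliers: uniform on [a, b]
                   x + rnorm(n, 0, tau))      # regular measurements
  censored <- y_star < ell                    # left-censoring below ell
  y <- ifelse(censored, ell, y_star)          # censored values reported as "< ell"
  y[runif(n) > obs_rate] <- NA                # only ~50% of days observed
  data.frame(t = seq_len(n), x = x, y = y, censored = censored, outlier = o)
}

set.seed(1)
sim <- simulate_scou()
mean(sim$censored[!is.na(sim$y)])  # empirical censoring rate; tune ell to hit 16% or 31%
```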
In what follows, we focus on the two other experiments (medium and high censoring levels, p = 7%, η = 0.99, δ = 0.001 and ∆ = 0.1). Figure 3 illustrates, on an example, the results of our method on data simulated within the fourth experimental setting (medium censoring level). On this illustration, we can see that the learning process and the use of the smoother make it possible to finely predict the trajectory of the underlying process X (in red) from the observations Y_T = y_T (represented by dots), while adequately taking into account the time interval during which the censoring applies. On this figure, the more pink-colored the points are, the higher their estimated probability of being an outlier. For a detection threshold h set lower than 0.95, two out of the four simulated outliers (pink crosses and circles, two of them being censored) are identified. The 23-day moving average and the LOESS (for span = 0.24) smoothers, whose parameters are selected by leave-one-out cross-validation, give rather good reconstructions but fail to reconstruct the trajectory when censoring or outliers occur.

Figure 4 shows that the parameters are correctly estimated with the proposed method. One can however notice a slight negative bias in the η estimates, which might be due to the asymmetric impact of estimation error, with a strong degradation if the η estimate exceeds one. The parameters learned by our method are indeed close on average to the parameters used for the simulation of the data. The parameters learned with the 2-parameter Kalman smoother (σ-dlm and τ-dlm) present an estimation bias, which illustrates the necessity of taking into account the censoring of the data and the outliers when they exist. The a posteriori probability that an observation Y_t is an outlier, P(O_t = 1 | Y_T = y_T), is computed as specified in Equation 4 with, for each simulation, the estimated parameters. The ROC (Receiver Operating Characteristic) curves are computed for the detection of the simulated outliers (around 525 out of about 7,500 simulated observations in each experiment, consistent with the 7% outlier rate).

Figure 4: Parameter estimates, for 16% and 31% of censored data (resp. white and cyan boxes).

Finally, we evaluate the ability of our method to predict the distribution of X conditionally on the set of observed Y, i.e. the distribution of (X | Y_T = y_T). As illustrated in Figure 5, in both the medium and high censoring level experimental settings, the RMSEs obtained by our method are significantly lower than the ones obtained with the 2-parameter Kalman smoother, with the LOESS method and with the moving average, even when their parameters (respectively the span and the number of days) are chosen so as to minimize the RMSE obtained by leave-one-out cross-validation. As for the variance prediction, the coverage rates of the 95% prediction intervals of our method, derived from the predicted distributions of (X_t | Y_T = y_T), are on average close to the target of 95%, with a median coverage rate of 93% (compared to 91% with the 2-parameter Kalman smoother) in the medium censoring level setting (resp. 93% and 85% in the high censoring level setting), as illustrated in Figure 6.

Figure 6: Coverage rates of the prediction intervals for the prediction of X, with 16% and 31% of censored data (resp. white and cyan boxes).
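Tying the sketches above together, the outlier probabilities of Equation (4) and the prediction-interval coverage just discussed can be read off the output of the hypothetical forward_backward helper; the names below are again ours, and, for brevity, outlier_prob applies the uncensored form of Equation (4) to every observed day.

```r
# Posterior outlier probability (Equation 4): average P(O_t = 1 | X_t = x, y_t)
# over the smoothed posterior of X_t (uncensored form only, for brevity).
outlier_prob <- function(y, post, grid, tau, p, a, b) {
  u <- p / (b - a)                            # outlier component of e_t(x)
  sapply(seq_along(y), function(t) {
    if (is.na(y[t])) return(NA_real_)
    g <- (1 - p) * dnorm(y[t], grid, tau)     # regular component of e_t(x)
    sum(post[t, ] * u / (u + g))
  })
}

# Equal-tailed 95% prediction interval from the discretized posterior of X_t.
pred_interval <- function(post, grid, level = 0.95) {
  lo <- (1 - level) / 2
  t(apply(post, 1, function(w) {
    cdf <- cumsum(w)
    c(grid[which(cdf >= lo)[1]], grid[which(cdf >= 1 - lo)[1]])
  }))
}

grid <- seq(-3, 6, by = 0.1)                  # assumed grid, step delta = 0.1
fb <- forward_backward(sim$y, sim$censored, grid, eta = 0.99, delta = 0.001,
                       sigma = 0.3, tau = 0.6, p = 0.07, ell = 0.3)
p_out <- outlier_prob(sim$y, fb$posterior, grid, tau = 0.6, p = 0.07, a = -3, b = 6)
pi95 <- pred_interval(fb$posterior, grid)
mean(sim$x >= pi95[, 1] & sim$x <= pi95[, 2])            # empirical coverage rate
obs <- !is.na(sim$y)
table(flagged = p_out[obs] > 0.95, truth = sim$outlier[obs])  # detections vs. truth
```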
The developed smoothing method aims to provide an estimate of the actual amount of viral genome arriving at each WWTP and to assess the uncertainty of this estimate. The concentration measurements provided by Obépine are adjusted beforehand for rainfall and for wastewater sources other than households, which can otherwise distort the conclusions when rainfall is abundant, by diluting the water arriving at the WWTPs [Cluzel et al., 2022]. The direct application of SCOU to the flow-adjusted measurements shows heteroscedasticity of the residuals (estimated by y_T − E(X_T | Y_T = y_T)): their variance increases with y (even after removal of the numerous measurements identified as outliers and of the censored measurements). Besides, the underlying process, X, follows the dynamics of an epidemic: during exponential growth, it is expected to be multiplied by a roughly constant factor from one day to the next. For both these reasons, a logarithmic transformation of the Obépine data was performed prior to applying the proposed method. Hence, for this data set, Y is the logarithm of the flow-adjusted measured concentrations, X is the logarithm of the actual virus concentrations (to be estimated), and ℓ is the logarithm of the limit below which censoring occurs (typically log(1000 U/L) for the quantification of the SARS-CoV-2 E gene, where U stands for RNA Units, but it can fluctuate from one day to another due to flow adjustments).

The application of SCOU to real data from the Obépine network is illustrated for two WWTPs in Figure 7. As shown by the successive predictions, once the parameters are fixed, the predictions are rather stable from one day to the next. The estimates of the smoothing parameters for the 190 Obépine WWTPs with enough available observations (at least 10 measurements) are illustrated in panel (A) of Figure 8. The estimated δ parameters are above 0 while the η parameters are very close to 1. Hence, when no new information is provided (no new Y is observed), the smoother predicts, on average, an increase in X values. The uncertainty of the monitoring system is evaluated through the parameter τ, the distribution of whose estimates over the 190 Obépine WWTPs is illustrated in panel (B) of Figure 8. The average value of τ is 0.54 log(U/L). This corresponds to a standard deviation of about 0.65x, where x is the real E-gene RNA concentration (in U/L), to which the standard deviation of the measurement error is assumed proportional in our model (assuming natural logarithms, a lognormal measurement error with log-scale standard deviation τ = 0.54 has standard deviation x √((e^{τ²} − 1) e^{τ²}) ≈ 0.67x, consistent with this order of magnitude). The allocation of this uncertainty between the sampling error, the RT-qPCR or RT-dPCR error and other possible sources of error throughout the system is however not known. Importantly, the resulting smoothed signal is well correlated with the logarithm of the local COVID-19 incidence rates, and this correlation is most of the time greatly enhanced by the proposed smoothing step, as depicted in Figure 9 for the 15 cities with enough data available on both sides. On this figure, the correlations are only computed for dates at which raw data was available, and correspond to the best correlation for a time lag ranging from 1 to 3 days in one direction or the other. They are computed for a period ranging from the beginning of the second wave to the 21st of May 2021.
Those correlations are not expected to be higher since, contrary to incidence rates, the indicator also reflects asymptomatic people infected by SARS-CoV-2 and is biased neither by people getting tested outside their city (during holidays, for instance) nor by varying testing policies. In order to produce comparable values from one WWTP to another, Obépine moreover performs a scaling of the data after smoothing [Cluzel et al., 2022]. This scaling takes into account, for example, the maximum amount of virus measured during the epidemic wave that occurred in Autumn 2020 in France, the range of volume treated by the WWTP and the specifics of the laboratory that analyzed the sample by RT-qPCR or RT-dPCR. The final computed indicator (called WWI, for WasteWater Indicator) shows good behavior with regard to the corresponding incidence rates [Cluzel et al., 2022].

Figure 9: Correlation of raw Obépine data and of flow-adjusted and smoothed Obépine data with the logarithm of incidence rates of the corresponding cities, for 15 cities with enough data available on both sides, for a lag time ranging from 1 to 3 days in both directions and for a period ranging from the beginning of the second wave to the 21st of May 2021.

We developed a method to smooth one-dimensional time series consisting of successive censored measurements with outliers, when the associated measurement uncertainty is not known and the measured quantities have an auto-regressive nature. By discretizing the state space of the monitored quantities, the proposed method has the advantage of being easily adaptable to the specificities of the data (such as measurement censoring and the occurrence of outliers). An experiment on artificial data validates the proposed inference and prediction method. Our method has then been successfully applied to data generated during the Obépine monitoring of SARS-CoV-2 genome concentration in WWTPs [Cluzel et al., 2022]. Importantly, the proposed smoothing procedure enhances the correlation of the data with other epidemiological indicators such as the incidence rate of COVID-19. The time lag between the two signals is moreover just a few days [Cluzel et al., 2022], making the WWI a credible alternative to the evaluation of the incidence rate through massive individual testing. This approach may prove especially valuable if massive testing campaigns wane, notably with the advancement of the vaccination campaign and the availability of self-tests to the general public. Both of these factors may indeed induce a progressive but significant decline in participation in testing within a few months, and a significant dwindling of the population surveyed to monitor the pandemic, potentially making it even more partial than it is now.

The proposed method could be further developed. First, the underlying X process could have a longer time dependency than an AR(1) process. We could thus develop an AR(p) version of this method (with p > 1) to handle this. Besides, the behavior of the marginal X process, and thus its parameters, is expected to change as we move from a propagation stage of the epidemic to a decreasing stage [Berestycki et al., 2021]. Joint treatment of the WWTP time series could compensate for the lack of individual data to face this problem. One could, for example, automatically detect common breakpoints corresponding to a change in the parameters η or δ.
Another possibility would be either to use extrinsic knowledge of the reproduction factor of the epidemic as input data, or to add it as a latent variable that slowly evolves from one day to the next. Another way to proceed would be to deduce, from other available epidemiological data, the shape of the signal to be found in wastewater (and thus an adequate smoothing), based on a fine modeling of the whole pathway of SARS-CoV-2 from the human population to wastewater, such as the one proposed by Nourbakhsh et al. [2021]. However, such a mechanistic representation includes a large number of unknowns (actual number of infected individuals in the population, rate of RNA degradation in wastewater, ...), which makes it difficult to exploit for the reconstruction purposes aimed at here.

References

Allik et al. (2015a). Nonlinear estimators for censored data: a comparison of the EKF, the UKF and the Tobit Kalman filter.
Allik et al. (2015b). The Tobit Kalman filter: an estimator for censored measurements.
Arabzadeh et al. (2021). Data filtering methods for SARS-CoV-2 wastewater surveillance.
Aravkin et al. (2017). Generalized Kalman smoothing: modeling and algorithms.
Atkeson et al. (1997). Locally weighted learning. Lazy Learning.
Berestycki et al. (2021). The effects of heterogeneity and stochastic variability of behaviours on the intrinsic dynamics of epidemics. medRxiv.
Cluzel et al. (2022). A nationwide indicator to smooth and normalize heterogeneous SARS-CoV-2 RNA data in wastewater.
Haykin (2004). Kalman Filtering and Neural Networks.
Julier et al. (1995). A new approach for filtering nonlinear systems.
Kalman and Bucy (1961). New results in linear filtering and prediction theory.
Kalman (1960). A new approach to linear filtering and prediction problems.
Leleux et al. (2002). Applications of Kalman filtering to real-time trace gas concentration measurements.
Li et al. (2021). Uncertainties in estimating SARS-CoV-2 prevalence by wastewater-based epidemiology.
Luo et al. (2013). Modelling HIV-1 2-LTR dynamics following raltegravir intensification.
MacKay (1992). Bayesian interpolation.
Miller et al. (2012). Kalman filter-based tracking of multiple similar objects from a moving camera platform.
Nourbakhsh et al. (2021). A wastewater-based epidemic model for SARS-CoV-2 with application to three Canadian cities.
Petris (2010). An R package for dynamic linear models.
Wan and Van Der Merwe (2000). The unscented Kalman filter for nonlinear estimation.
Wurtzer et al. (2020). Evaluation of lockdown effect on SARS-CoV-2 dynamics through viral genome quantification in waste water.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Yvon Maday, Jean-Marie Mouchel, Vincent Maréchal, Laurent Moulin and Sébastien Wurtzer brought forward the scientific problem. Marie Courbariaux contributed to the design of the algorithm, performed experiments on artificial and real data, and wrote the paper. Grégory Nuel contributed to the design of the algorithm and coordinated the experiments and the writing of the paper. Nicolas Cluzel and Siyun Wang prepared the data provided by Obépine, contributed to the design of the algorithm, performed experiments on artificial and real data and discussed the results. Nicolas Cluzel, Siyun Wang, Jean-Marie Mouchel, Yvon Maday and Vincent Maréchal proofread and discussed the content of the paper. Yvon Maday moreover coordinated the exchanges about the article within the Obépine consortium.

This work was carried out within the Obépine project, funded by the French Ministry of Higher Education, Research and Innovation (Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation). Financial support was also obtained from the French National Center for Scientific Research and Sorbonne Université. We would like to thank the WWTPs and the laboratories that contribute day to day to the Obépine network.
In particular, the LCPME (Isabelle Bertrand and Christophe Gantzer) produced three quarters of the raw data that were used to produce Figure 9. The experimental choices of this work were informed by exchanges with the members of the Obépine consortium, notably Karine Laurent. The WWI values produced by Obépine are freely available for all WWTPs covered by the Obépine network on the data.gouv.fr public platform. Incidence rate data are partially available in open access for 22 Public Establishments of Intermunicipal Cooperation (EPCI) and can be found on the same platform.