key: cord-0536445-keg45rtg
authors: Eckardt, Matthias; Mateu, Jorge; Greven, Sonja
title: Generalised functional additive mixed models with compositional covariates for areal Covid-19 incidence curves
date: 2022-01-20
journal: nan
DOI: nan
sha: 27b30733d26b430cb8e1f4d710eda7ac1068d557
doc_id: 536445
cord_uid: keg45rtg

We extend the generalised functional additive mixed model to include (functional) compositional covariates carrying relative information of a whole. Relying on the isometric isomorphism of the Bayes Hilbert space of probability densities with a subspace of the $L^2$, we include functional compositions as transformed functional covariates with constrained effect function. The extended model allows for the estimation of linear, nonlinear and time-varying effects of scalar and functional covariates, as well as (correlated) functional random effects, in addition to the compositional effects. We use the model to estimate the effect of the age, sex and smoking (functional) composition of the population on regional Covid-19 incidence data for Spain, while accounting for climatological and socio-demographic covariate effects and spatial correlation.

Understanding the infectious disease dynamics of the Covid-19 (Coronavirus disease 2019) pandemic and its potential associations with different exogenous environmental, socio-economic and health-related variables has become an important challenge of current interdisciplinary research. Although massive data is collected, the long-term interplay of the local numbers of daily Covid-19 cases with sets of different (potentially time-varying) risk factors still remains an open and challenging topic. In particular, this includes the effects of the composition of the local population (age, sex, smoking etc.) on the spread of the disease, which has not been investigated to the best of our knowledge. This paper aims to fill this gap by extending the generalised functional additive mixed model (GFAMM) of Scheipl et al. (2016) to the case where parts of the covariate set are finite or infinite compositions, i.e. multivariate or functional covariates carrying relative information of a whole. The proposed formulation allows to model the local Covid-19 dynamics conditional on various types of exogenous variables, including among others the population density (a scalar), the average temperature (a function of time), the smoking status (a finite composition), the age composition (a functional composition or infinite composition or density) and the regional structure (a grouping factor with spatial correlation). To this end, using areal Covid-19 incidence data collected in Spain daily until the vaccination onset, the local incidence curves are modelled taking a functional data analysis (Ramsay and Silverman, 1997) perspective.

In functional data analysis, the response curves are considered as realisations of some stochastic process with continuous support, such as time. While the process itself could theoretically be observed for any point at arbitrary resolutions, the curves are only measured on a discrete grid. A suitable class of functional data analysis techniques for the present purpose are functional regression models which can, in general, be summarised into scalar-on-function (where only the covariates are functions), function-onscalar (where response curves are related to scalar covariates), and function-on-function models (where both the response and the regressor are functions). See Morris (2015) and Greven and Scheipl (2017a) for a general review of different functional regression specifications. Regressions for non-Gaussian functional responses (e.g. counts) include e.g. the generalised function-on-scalar model (Goldsmith et al., 2015) and also the GFAMM (Scheipl et al., 2016) , which provides a flexible regression framework for possibly non-Gaussian functional responses on potentially irregular or sparse grids using basis function representations of the fixed and/or random effects of scalar and/or functional covariates. We note that a complementary approach for functional regression models observed on equidistant grids is presented in Morris and Carroll (2006) and subsequent work, which uses a Bayesian estimation approach based on wavelet transformations of the functions that is, however, not suited to generalised functional responses. A general comparison of both the basis function and the wavelet transformation approaches is given by Morris (2017) and Greven and Scheipl (2017b) .

Although different (generalised) functional regression specifications exist, only a very small body of the literature discusses extensions of functional regression models to the case where the responses or parts of the covariate set are finite or infinite compositions. Predominantly, the literature focused on extensions to density-valued (functional compositional) responses while extensions to density-valued covariates appeared only rarely (see Petersen et al. (2021) for a recent review of different statistical approaches to densityvalued quantities). Apart from a Hilbertian random variable treatment of density-valued explanatory variables (Sierra et al., 2015) or density-valued explanatory variables and responses (Park and Qian, 2012) embedded into the standard L 2 space, density-valued responses were covered in the additive regression model of Han et al. (2020) , where the outcome is mapped to the space of unrestricted square integrable functions in a preprocessing step, and also the nonlinear space formulation of Petersen and Müller (2019) . However, while the L 2 space approach does not account for the constrained nature of the response, the additive regression and the nonlinear space formulations are affected by instabilities of the pre-transformations (see Happ et al., 2019) and do not allow for a straightforward interpretation of the regression coefficients, respectively. Different from the above approaches, Arata (2017) ; Talská et al. (2018) and Maier et al. (2021) applied a Bayes Hilbert space formulation (van den Boogaart et al., 2014) to include densityvalued responses into a functional regression framework where the estimation uses the centred log ratio (clr) transformation to map the response to a subspace of the L 2 with integration-to-zero constraint. Although this approach provides a promising framework for density-valued response regression, extensions which (additionally) allow for densityvalued explanatory variables remain extremely limited. A first linear Bayes Hilbert space regression model for scalar responses and density-valued covariates was proposed by Talská et al. (2021) using a constrained spline representation (Machalová et al., 2021) and extended further by Scimone et al. (2021) , allowing for both density-valued responses and covariates. While this model allows for linear effects of the functional composition on the response, extensions to generalised (functional) additive models where parts of the predictor are finite or infinite compositions still remains an open topic. We fill this gap within the flexible GFAMM framework. We build on Verbelen et al. (2018) , who presented an extended predictor for the generalised additive model for scalar responses with compositional covariates, by extending this idea to functional responses and functional compositions as covariates. In particular, both finite and infinite compositions are included into a general structured predictor through suitable basis function representations that account for the constrained covariate nature.

The remainder of this paper is structured as follows. Section 2 briefly reviews the history of the pandemic in Spain and recent findings on potential risk factors for the spread of the disease that inform our selection of covariate effects. It also provides information on the data sources and (generated) variables used in the analysis, and presents a descriptive analysis of the different covariates included in the GFAMM. An introduction to this model is presented in Section 3. In particular, extensions of the covariate set to finite and infinite compositions are discussed in Section 3.3. An application of the proposed extension to Spanish Covid-19 incidence data is given in Section 4. The paper concludes with a discussion in Section 5.

Within three months of the official notification of a small regional outbreak in Wuhan, China, in late December 2019, Spain was facing one of the highest infection and, in particular, mortality rates among the European countries (Soriano and Barreiro, 2020) . Within four weeks of the first tourist-based case on the Island of La Gomera in January and the first domestic hospitalisation on 15th February 2020, Spain witnessed a large but spatially strongly heterogeneous increase in numbers of Covid-19 infections with a clear concentration in large metropolitan conurbations (Henríquez et al., 2020) . These marked regional differences and early local peaks in e.g. Madrid, may in part be explained by the regional mobility from and to the Spanish capital (cf. Mazzoli et al., 2020) . Using data for Catalonia, Coma Redon et al. (2020) provided some evidence that early local Covid-19 cases may have been masked by excess of influenza notifications between 4th February 2020 and 20th March 2020 as polymerase chain reaction (PCR) tests were restricted to hospital-admitted patients only and general practitioners were asked to diagnose Covid-19 infections without PCR confirmation. Due to this uncertainty, early cases and deaths in private and nursing homes may have been excluded from the official reports for this period. However, previously undetected symptom-based cases and deaths were subsequently added to the official notification system such that we can plausibly assume that any remaining bias can be neglected in the present study.

In response to the rapid increase in mortality, particularly among the elderly (potentially multi-morbid) parts of the population, and the transmission dynamics of the disease, the Spanish National Government imposed a series of global regulatory interventions to suppress the spread of the virus. A first strict lockdown excluding only essential services (e.g. food, health) and some economic subsectors was imposed on 14th March. This measure was tightened in subsequent actions by placing strict entry refusal at the Spanish borders on 17th March and prohibiting any non-essential activities within the period from 30th March to 12th April 2020. In consequence, these restrictions yielded a dramatic reduction in overall regional mobility. Ended on 21st June, this lockdown remains the only global action of the Spanish Government against the Covid-19 pandemic. Although facing severe increases in numbers of Covid-19 cases in the second half of 2020 and early 2021, the public health responsibilities were delegated back to the local governments of the autonomous communities by 21st June and only local but no further global restrictions were reimposed. These (mostly soft) local measures, however, show a strong heterogeneity, and information on the exact timing, nature and extend of the different imposed restrictions was not available from official sources for our study.

Apart from a higher risk of Covid-19 infections caused by the increase in immunosenescence for the older ages (Crimmins, 2020), a clear association of higher age with the development of severe symptomatic Covid-19 infections, hospitalisation and fatality rates is stressed in the literature (see e.g. Tiruneh et al., 2021) . Besides certain health conditions and comorbidities (Du et al., 2021) , Wolff et al. (2021) identified smoking as an important co-factor. In particular, active smokers or non-active smokers with a clear smoking history face an increased risk for the development of symptomatic Covid-19 (Hopkinson et al., 2021; Gülsen et al., 2020) . Apart from these findings, e.g. Moosa and Khatatbeh (2021) reported a close relation between densely populated regions and contact rates between different (potentially infected) individuals, which positively impact the disease transmission and, in turn, the reproduction rate of the disease. This idea is also supported by Paez et al. (2021) who reported a clear positive effect of mass transport systems on the incidence. These authors also reported a positive association of the disease with wealthier regions, i.e. regions with a higher GDP per capita, with a potential explanation via a connection between wealth and the regional level of globalisation, i.e. international trade and travel.

The effect of climatological and environmental covariates on the spread of the disease is less consistent. In line with results on similar pathogens, which suggest that the virus is more stable and transferable in conditions of low temperature and low humidity, Paez et al. (2021) found a negative effect of higher values in temperature and humidity on the incidence of the disease, contrasted with a positive impact of sunshine. In a systematic review, Mecenas et al. (2020) found a positive effect of cold and dry weather conditions on the seasonal viability and transmissibility of Covid-19. Takagi et al. (2020) reported an inverse association of temperature, air pressure, and ultraviolet light with the prevalence of the disease, while Hossain et al. (2021) draw mixed conclusions based on both positive and negative effects of the weather characteristics. Shahzad et al. (2020) highlighted a clear positive effect of bad air quality on the transmission of Covid-19 whereas temperature serves only as a contributory factor, with higher temperatures reducing the spread of the disease. Summarising the findings of 23 articles in a systematic review, McClymont and Hu (2021) found a clear association of temperature and Covid-19 and also a significant association of humidity and Covid-19 which, however, was derived from mixed results. Using a nonlinear effect specification, Wu et al. (2020) found a negative association of high temperature and also high humidity with the daily number of Covid-19 cases and associated deaths.

The data was compiled from different sources, most commonly information provided by the regional governments. It originates from a collaborative data project by the geovoluntarios community (https://www.geovoluntarios.org), Centro de Datos Covid-19 and ESRI Spain and provides information on the daily numbers of Covid-19 cases for 52 Spanish provinces, each of which subsumes numerous local administrative units. It covers the period from 5th January 2020 to 19th January 2021 until just before vaccinations became more widespread. Covid-19 cases are defined as probable infections without test information or confirmed infections based on positive test results derived from (i) PCR, antibody, and antigen detection or Elisa techniques, and (ii) reported by other laboratories -showing a clear majority of results derived from positive PCR tests. In contrast, notifications based on antibody tests (with less precise timing information on the time of infection) contributed only at rather small and also spatially-varying rate. Restricted to the first Spanish Covid-19 wave only, the highest regional proportion of antibody based test results relative to all cases reported appeared for the provinces of Cuenca (5.06%) and Albacete (2.34%). We thus use the complete incidence counts based on all tests. To account for a potential delay between the date of the test and the notification date caused by the individual testing procedures, the dates were shifted back by the provider using a three days lag. Different from data used in this study, official periodical data releases on the Covid-19 pandemic through the Spanish National Government cover only information at a coarser level of spatial aggregation, i.e. 18 so-called autonomous communities, to which the 52 Spanish provinces belong. Note that 2.05 % (resp. 0.02 %) of the Spanish population received the first (resp. second) vaccination before 19th January 2021. Due to this relatively small proportion of partly vaccinated inhabitants, any potential confounding effects of the vaccination action on the incidence data is assumed to be ignorable.

To investigate (potential time-varying) effects on the spread of the pandemic over space and time, we linked this data to climatological, socio-economic and demographic information, recorded for the provinces and time period under study (see Table 1 in the Supplement for a detailed description of the variables). Daily climatological information including the average daily temperature (in • C), humidity (in %), maximal wind speed (in km/h), sun hours (in h) and precipitation (in mm) were collected for each province from the state meteorology agency (AEMET) and the ministry of agriculture, fisheries, and food (MAPA). The original data exhibited some missing values, in particular 0.5% for average temperature, 0.77% for maximal wind speed, 2.07% for sun hours, 5.34% for precipitation, and 0.5% for humidity -most likely caused by transmission or technical problems. Any of these missing values were imputed by linear interpolation using the imputeTS package (Moritz and Bartz-Beielstein, 2017) in R. As information on the daily solar exposure was not provided at all for Malaga, missing values for sun hours for Malaga were replaced by average values computed from the neighbouring provinces, i.e. Cadiz and Granada. For the precipitation variable, the reported values show a large number of zeros in the daily amount of precipitation at the province level (ranging from 36 days at minimum to 215 days at maximum) as well as skewness with extreme peaks of 150mm. For this reason, we summarised the original information into a binary variable indicating the absence or presence of rain per province on a daily basis. In addition to this rain indicator, we computed the log transformation of the nonzero precipitations. Next, all weather information was shifted using a 5-days lag to account for the time lag between infection and symptom onset, as we want to investigate weather effects on the infection probability and assume symptom onset to be strongly correlated with the timing of the performed Covid-19 test. While the 5-day lag is supported by the findings of e.g. Linton et al. (2020) who reported an average incubation period (defined as the time from the infection to symptom onset) by around 5 days, we tested for the effect of different lag specifications in a sensitivity analysis.

Regional socio-demographic information on the number of inhabitants, the gross domestic product (GDP) per capita, the proportion of males, and age pyramids (0-100 years) were collected from public data provided by the national statistics institute. This source was also used to compute the smoking composition of the population (categorised in daily, occasional, ex-and non-smokers) at the provided coarser level of the autonomous communities, which we then assigned to all provinces within this community.

In addition to the above data, we generated different variables to control for the regional and geographical characteristics of the individual spatial units. First, to account for a potential impact of public mass transportation systems on the transmission and spread of the virus, we generated a binary variable indicating whether or not a province offers access to a metro or subway system. In addition, we generated a second binary variable indicating whether or not a province offers direct access to the Mediterranean Sea or the Atlantic Ocean. Besides a higher population density in the coastal regions compared to the inland provinces (except for the metropolitan regions), the coastline is commonly strongly affected by high numbers of incoming tourists during the summer and public vacation periods, which might serve as an acceleration factor for the risk of infection. Finally, to account for (i) potential temporal variation in the regional notification systems, and (ii) the effect of the global lockdown measures, we generated six weekday dummy variables treating Sundays as reference and three global lockdown dummies. Each of these lockdown indicators corresponds to one of the three subsequent global lockdown periods with different measures imposed during the first wave in 2020, i.e. (i) May, 11th to May, 24th, (ii) May, 25th to June, 7th, and (iii) June, 8th to June, 21st.

The response curves show strong regional heterogeneity, with the highest numbers of cases in the provinces of Madrid and Barcelona. The highest peak (y = 6750) appeared for Madrid on September 18th, 2020, contrasted with e.g. only y = 77 on that date recorded for the province of Lleida. To better understand the similarities among the spatial disease patterns, we calculated the regional incidence rates per 100,000 inhabitants, and its average version, computed as the mean rate per region over all 381 days (see Figure  1 ). The incidence curves (left panel) reflect a clear positive deviation for Madrid (green) and also Lleida (blue) from the mean curve (red) for the period from June to October 2020 -with a temporal delay in the second wave for Madrid compared to Lleida. The corresponding regional averages (see right panel) indicate a clear spatial pattern, with Palencia and Cuenca showing the highest average incidence rates with 20.5 and 19.66 cases per 100,000 inhabitants, respectively. These high average values are contrasted with relatively small reported rates for e.g. Lugo (6.77). The clustering of small and high average rates suggests a positive spatial autocorrelation structure in the data, which is supported by highly significant results for Moran's I (Moran, 1950) index (restricted to continental provinces). See Figures 1 and 2 in the Supplement for the spatial distribution over all Spanish provinces and communities including the Spanish islands and African enclaves. Fig. 1 . Distribution of daily Covid-19 cases per 100,000 inhabitants over time and space: (a) mean (red) and regional incidence for all provinces (grey), as well as for Madrid (green) and Lleida (blue) over 381 days, and (b) average incidence over 381 days at province level, restricted to continental Spain. Figure 2 illustrates the patterns of the scalar, functional and compositional covariates. Both the averaged spatial patterns over time of mean temperature and solar exposure (left column of this plot) show a general increase from the northern to the southern parts of Spain. The temporal means for humidity and precipitation reflect some spatial heterogeneity with higher values computed for the northern areas contrasted with lower values for both variables at the Mediterranean coastline. Different from the other four weather variables, the maximum wind speed exhibits little spatial correlation, with the highest values reported for Gipuzkoa in the north. Over time (central column), both average temperature and sun hours show high values during the summer contrasted with low values in the winter. At the same time, humidity, precipitation and maximum wind speed reflect less clear temporal patterns. Although some variation and higher peaks are shown for humidity and wind speed in winter, spring and autumn compared to the summer period, the mean precipitation levels remain constantly at low values over time.

For the compositional covariates depicted in the right column of this plot we found a clear dominance of non-smokers over daily and ex-smokers in all 52 provinces, with occasional smokers (occ) constituting the smallest part of the regional populations. For the sex composition, a small dominance of females over males exists in all provinces. The normalised age curves computed from the age pyramids show a clear mode, with the largest population mass around 50 years. A second smaller mode and large variation can be seen for younger ages of around 10 years, while the densities decrease roughly monotonically and consistently across provinces for ages older than 55 years. Finally, the spatial patterns for the socio-demographic time-constant variables, shown in the bottom two right panels of Figure 2 reflect a clear spatial variation of the individual GDP (in 10,000 Euro), with lower values in the south contrasted with higher values in the northern provinces of Spain and also Madrid. The population density shows a strong heterogeneity with the highest values for the metropolitan provinces of Madrid and Barcelona, but also the provinces of Bizkaia and Gipuzkoa.

3.1. The generalised functional additive mixed model We extend the GFAMM specification of Scheipl et al. (2016) to include compositional and functional compositional covariates, such as for gender, smoking status and age composition in the present context. We adopt a general representation in the form of a structured additive regression model with Y i (t) ∼ F(µ i (t)), where Y i (t) in our setting is the number of Covid-19 cases in province i = 1, . . . , 52 at time t ∈ T . In general, we assume that Y i (t) pointwise follows a (here count) distribution F with conditional expectation E [Y i (t)| x it , t] = µ i (t) and is recorded over a domain T , here covering the 381 days of observations. To account for potential overdispersion of the response, we here assume a quasi-Poisson model for the Covid-19 incidences such that the variance is related to the mean through the overdispersion parameter ξ, i.e. Var [Y i (t)|x it , t] = µ i (t)ξ. To achieve high flexibility of the model, the mean µ i (t) is related to an additive predictor η i (t) through a known link function g,

Each additive term f r (x rit , t) is a smooth function of the argument t of the outcome -also implying smoothness of the response mean µ i (t) -and of a subset x rit of the complete covariate set x it .

The above formulation allows for linear, nonlinear and time-varying effects of grouping factors, functional and potentially time-varying scalar covariates, as well as functional Distributional characteristics of the climatological, compositional and socio-demographic covariates: temporally averaged spatial variation (left column), daily variation over time (central column, mean in red) of regional weather characteristics at province level, and regional variation of compositional and scalar covariates (right column). random effects, where the form of f r (x rit , t) is determined by the covariates in x rit . For example, for a functional intercept that varies over t, f r (x rit , t) simplifies to β 0 (t). For a smooth effect of a scalar x ir that is constant over t, f r (x rit , t) becomes f r (x ir ) (and f r (x ir , t) in the time-varying case), whereas linear effects of x ir that vary over t correspond to x ir β(t). Linear time-varying effects of a functional covariate x ir (s), s ∈ S, are included as S x ir (s)β(s, t)ds, while a concurrent effect for S = T can be included as f r (x ir (t), t) or f r (x ir (t)). For a grouping variable c with M levels, scalar and functional random effects γ c and γ c (t) are included as zero mean Gaussian variables with a potentially general specified correlation structure, and as Gaussian processes

This specification also allows to control for the spatial correlation of the different levels of c = i (formalised through the precision matrix of a Markov Random Field (MRF) with known correlation based on the planar neighbourhood structure) in the construction of (potentially spatially correlated) smooth residual curves for the individual locations. In the case of multiple random effects, a mutual independence assumption is placed between the individual random terms. See Scheipl et al. (2015 Scheipl et al. ( , 2016 for a full detailed list of potential covariate specifications.

However, compositional and functional compositional covariates have not yet been included into this framework, as occur in our context due to the available information on the age-, sex-and smoker-compositions of the provinces. To include both kinds of effects into the GFAMM framework, we extend the predictor η i (t) by η comp i (t) summarising the effects of the compositional and functional compositional covariates. We note that this approach is similar in spirit to Verbelen et al. (2018) , who introduced an extended additive predictor to incorporate the effect of a compositional covariate in the context of generalised additive models for scalar responses. We extend this approach to functional responses and functional compositional covariates and give details in Section 3.3 below.

Having n functional observations y i (t) on a grid of T i points t i = (t i1 , . . . , t iTi ) ⊤ available, the model can be fitted through a penalised (quasi-)likelihood approach based on the ( n i=1 T i )-vectors y = (y 1 ⊤ , . . . , y n ⊤ ) ⊤ and t = (t 1 ⊤ , . . . , t n ⊤ ) ⊤ of the concatenated response curves and their arguments, respectively, where y i = (y i1 , . . . , y iTi ) with y il = y i (t l ), l = 1, . . . , T i . Here, T i ≡ 381 =: T for all i. On the grid, we can write the predictor as η il = R r=1 f r (x ritl , t l ). Each of the R terms f r (x r , t), containing evaluations f r (x ritl , t l ), l = 1, . . . , T i , i = 1, . . . , n in a vector, can then be represented through a tensor product basis function expansion

denotes the row tensor product of the matrices A (h × a) and B (h × b), with 1 d the d-vector of ones and · the element-wise multiplication, and Φ xr and Φ tr contain the evaluations of the (K xr respectively K tr ) marginal basis functions for the covariate effects and over t, respectively. The effect shape is determined by ϑ r representing an unknown vector of coefficients. To provide sufficient flexibility of the model, the approximation uses a large set of basis functions, which is further regularised by a corresponding quadratic penalty term for the coefficients in the (quasi-) log-likelihood pen(ϑ r |λ tr , λ xr ) = ϑ r ⊤ (λ xr P xr ⊗ I Ktr + λ tr I Kxr ⊗ P tr )ϑ r .

Here, λ xr and λ tr are positive smoothing parameters, P xr and P tr are known and fixed positive (semi)definite marginal penality matrices corresponding to the basis matrices Φ xr and Φ tr , and I Kxr and I Ktr are identity matrices of dimensions K xr and K tr , respectively (see Scheipl et al., 2015 Scheipl et al., , 2016 . The unknown coefficients can then be estimated through a penalised (quasi-)maximum likelihood approach, maximising ν|y) is the (quasi-)log-likelihood implied by the respective response distribution (see Scheipl et al., 2016) optionally depending on additional nuisance (e.g. dispersion) parameters ν.

The marginal basis matrices and corresponding penalty matrices are suitably chosen depending on the specified covariate effects. A full description is given in Scheipl et al. (2015 Scheipl et al. ( , 2016 ; we here list some common choices also used in the model for the Covid-19 data for illustration and completeness, restricting to the case of equal grids for ease of presentation. For covariate effects that are constant over t, Φ tr = 1 nT is a vector of length nT containing ones and P tr = 0, while smooth time-varying effects are achieved when choosing Φ tr as a matrix of spline evaluations with P tr a corresponding penalty matrix (e.g. based on finite differences of B-spline coefficients). Functional intercepts β 0 (t) are obtained through Φ xr = 1 nT and P xr = 0. For effects xβ and xβ(t) that are linear in x, Φ xr changes to Φ xr = x ⊗ 1 T where x = (x 1 , . . . , x n ) ⊤ and P xr = 0. In case of a nonlinear effect specification for x, i.e. f (x) and f (x, t), Φ xr corresponds to a suitable marginal spline basis matrix over x and P xr is specified accordingly.

For linear effects of a functional covariate x(s), s ∈ S, a tensor product spline representation for β(s, t) is used with marginal spline basis functions Φ ks , k s = 1, . . . , K xr over S and Φ kt , k t = 1, . . . , K tr over T . This yields

marginal penalty matrices corresponding to the chosen marginal spline bases. In practice, the integral is approximated using numerical integration. A concurrent effect f (x(t)) or f (x(t), t) of a functional covariate x(s), s ∈ T , is constructed analogous to f (x, t) above.

Finally, functional random intercepts for groups c = 1, . . . , M , c(i) being the group level of observation i, are associated with a marginal basis Φ xr = δ c(i)m i=1,...,n m=1,...,M ⊗ 1 T , with δ cm the indicator for c = m . The matrix P xr then is a M × M precision matrix defining the dependence structure between levels of c.

We now introduce the new compositional predictor η comp i (t). We first discuss existing methods for the case of finite compositions as covariates and scalar responses in Section 3.3.1, before introducing the proposed extensions to functional compositional covariates and/or (generalised) functional responses in 3.3.2.

Making use of the core principles of compositional data analysis (Aitchison, 1986) , we formalise scalar-valued covariates which describe D parts of a whole summing to a constant -such as the regional smoking status proportions -as compositions of D parts living on the simplex

This space, equipped with the perturbation x⊕ y = cls(x 1 y 1 , . . . x D y D ) and the powering Pawlowsky-Glahn and Egozcue, 2001) . Noting this isometric correspondence, a central idea in compositional data analysis is to map the provided relative information isometrically to R D−1 , perform well-established statistical analysis methods there, and then potentially back-transform the result onto S D using inverse operations.

Common operations include first the centred log ratio transform

where m(x) is the geometric mean of x. The clr projects the composition onto the clrplane H D , a (D − 1)-dimensional subspace of R D whose components add to zero. By contrast, the second transform ilr (Egozcue et al., 2003) returns (D − 1) coordinates with respect to an orthonormal system on the clr-plane H D , which is equivalent to the logitfunction used in logistic regression for D = 2. For D > 2, infinitely many orthonormal basis systems exist. As we will use the ilr only internally for estimation, the choice does not affect the interpretation and we use pivot coordinates (Fišerová and Hron, 2011), for which the (D − 1)-dimensional vector ilr(x) has components

Making use of the isometric isomorphism between the Aitchison and the Euclidean geometry established through the clr and the ilr transformations, linear effects of compositional covariates can be modeled using x, b A = clr(x), clr(b) = ilr(x), ilr(b) (e.g. Verbelen et al., 2018) . In particular, with b = ilr −1 (β) the inverse of the regression coefficients in R D−1 , the compositional effect in the predictor is

Recalling Barceló-Vidal et al. (2011) , b represents the simplicial gradient of the predictor with respect to the composition x, and can be interpreted as the direction of perturbation of compositions on the S D which yields the largest effect on the outcome (cf. Verbelen et al., 2018) . That is, for

Further, to quantify the effect of a change in the composition on the predictor, Verbelen et al. (2018) suggested to perturb the composition into the direction of each part. For example, a change in the relative ratio of the first compositional component of x by some α = 1, while keeping the relative ratios for all other components constant, leads to a perturbation of x by cls ′ = cls(α, 1, . . . , 1) ⊤ and a resulting change on the predictor by b, cls ′ A = log(α) clr 1 (b), where clr j is the the j-th component of the clr transformation, j = 1, . . . , D. In particular, for a log-link relation of the expected outcome and the predictor, as under the present model, the effect of a relative ratio change in the first component on the response scale simplifies to a change by the factor α clr 1 (b) .

The above formulation of (2) allows for a direct extension of the GFAMM to include compositional covariates by including the (D − 1) ilr-transformed coordinates as scalar covariates with linear effects, such that Φ xr = ilr(X) ⊗ 1 T and P xr = 0 for

Treating density functions as infinite (functional) compositions, we extend the previous results to such covariates. The idea is to use an isometric isomorphism between the space of functional compositions and a subspace of the L 2 space of functions via a functional clr transform, and to then treat the transformed functional composition as a functional covariate within the GFAMM framework using a suitably adapted basis function specification.

A suitable space in this context is the Bayes Hilbert space of densities Boogaart et al., 2014) . It generalises the Aitchison geometry from compositional data and provides a suitable geometric framework for the analysis of density functions. We here focus on some basic properties that are relevant in our setting and refer to van den Boogaart et al. (2014) for a more formal definition and further mathematical details. Analogous to S D , B 2 (T ) has a vector space structure with perturbation and powering operations. For f, h ∈ B 2 (T ), t ∈ T and α ∈ R, the perturbation (⊕) and powering (⊙) operations are defined by

where f, h ∈ B 2 (T ) and |·| is the Lebesgue measure of the argument. In particular, noting

can be shown to be a separable Hilbert space and to be isometrically isomorph to the subspace L 2 0 (T ) of functions in L 2 (T ) integrating to zero with the usual L 2 metric. While this allows a transformation of densities to the L 2 (T ), the additional integrationto-zero constraint of L 2 0 (T ) needs to be accounted for and, in general, prohibits a direct application of standard functional data analysis techniques to the transformed densities.

For the GFAMM, functional compositions x(s), s ∈ S, are included into the regression with a linear effect in terms of the scalar product in B 2 , using the equivalence

fulfilling an integration-to-zero constraint for each t. Thus, we can estimate the effect similarly to a linear function-on-function regression term, with the modification of this additional constraint. We achieve this through the specification of a tensor product basis, which places an integration-to-zero constraint on the marginal basis for β over s (but not on the marginal basis over t) to get terms with integrationto-zero-for-each-t constraints (see Wood, 2017, Chapter 5.6) .

In a post estimation step, the functional composition surface b(s, t) with b(., t) ∈ B 2 (T ) for all t can be computed through the inverse clr transformation, b(., t) = clr −1 (β(., t)) for each t. Similar to finite compositions, b(., t) can then be interpreted as the preferential direction in which to perturb the functional composition to yield the largest increase in the outcome, from

The regression model for our application includes scalar covariates mostly with timevarying effects (i.e. as linear concurrent functional terms). Based on expert opinion and the inspection of the roughly constant effect patterns, we considered the three lockdown indicators x l,i , l = 1, 2, 3, the rain indicator x rain,i , the weekday indicators x d,i , d = 1, . . . , 6, the GPD x gdp,i and the transport system indicator x tra,i to have timeconstant effects. In constrast, recalling the hypothesised impact of overcrowded areas on the disease transmission, in particular during the initial stages of the pandemic, and the strong variation in the size of the population over the different seasons for the coastal regions, we modelled the effect of the population density x dens,i and the coastline indicator x sea,i through linear concurrent functional effects. To account for the observed time-varying patterns of the lagged temperature (x temp,i ), sun hours (x sun,i ), humidity (x hum,i ) and wind speed (x wind,i ) and the log transformed non-zero precipitation (x lprec,i ) (summarised into x w,i = (x temp,i , x sun,i , x hum,i , x wind,i , x lprec,i )) ⊤ , we considered a smooth, time-varying effect specification in an initial modelling strategy. However, as the estimated effect surfaces of all covariates were roughly constant over time and monotone in the specific covariates, all weather covariates were specified to have time-constant smooth effects to reduce the complexity of the model. In addition to these terms, we also considered a concurrent tensor product smoother for x temp,i and x hum,i to control for the interaction between these two terms reported in the literature. To account for the observed spatial correlation among the provinces, a spatially correlated functional random effect γ i (t) is included, using a MRF specification for the marginal basis Φ xr . The structure for this MRF was derived from a Gabriel graph (Matula and Sokal, 1980) to control for the strong economic and social interrelations of continental Spain and the Spanish islands and African enclaves. In addition, we included independent smooth functional random intercepts γ 0comi (t) for the 18 community spatial units to control for potential spatially nested effects and unobserved heterogeneity of the local Covid-19 measures on community level. For the compositional covariates, we included the effect of the smoker status composition x smoke,i = (x daily,i , x occ,i , x ex,i , x non,i ) ⊤ as a time-constant linear function-on-composition term (internally using the ilr transformation). The sex composition x sex,i = (x male,i , x f em,i ) ⊤ effect was modelled with a time-varying linear function-on-composition term to account for the strong heterogeneity in proportions of males and females within the public health and the nursing sectors -with a clear majority of female workers -which yielded high numbers of infected females already at the beginning of the pandemic. Finally, for the age densities x age,i we considered a linear function-on-functional composition term (internally specified through a tensor product interaction smooth of the clr transformed age curves).

Combining these terms and writing κ i = {x it , t, γ i (t), γ 0,comi (t); t ∈ T }, the expected number of Covid-19 cases E [y i (t)|κ i ] for province i is specified through the following regression equation

where log(N i ) is an offset for the population size N i in province i and (t − 5) denotes a 5-day lag. 

We now discuss the results of the proposed model for the Spanish Covid-19 data. All computations were performed in R version 4.1.1 (R Core Team, 2021), using in particular the compositions (van den Boogaart et al., 2021) and refund (Goldsmith et al., 2020) packages. The model was implemented using the pffr() function from the refund package. Basis sizes before application of possible constraints are as follows. The marginal basis functions of the smooth effects in the t direction were specified through penalised B-splines (Eilers and Marx, 1996) using K tr = 30 knots (yielding around 1 knot per 13 days), and K tr = 28 knots for the global functional intercept. Covariate effect marginal bases were chosen smaller due to the smoothness of effects: K xr = 10 for the smooth effects of temperature and humidity, K xr = 9 for sun hours and K xr = 7 for wind speed. Tensor product interactions were specified with 5 × 5 knots, which applied to the smooth interaction of temperature and humidity as well as to the function-on-function effect for clr(age). Under the above specification, the model explains 96% of the deviance. The estimated dispersion parameter under the quasi-Poisson specification is 15.1. Table 1 shows the estimated covariate effects which are treated as time-constant, where the reported pvalues are based on a Wald test for the null hypothesis that each parametric term is zero (Wood, 2017) . All effects except for the indicators for the 2nd and 3rd global lockdown periods, and the 2nd ilr component are significant at a significance level of α = 0.05. We found a negative effect of the first global lockdown period (Lockdown 1), indicating a reduction by around 12% of the daily numbers of Covid-19 cases by the imposed measures (compared to the trend under no lockdown). The weekday effects show a clear positive impact on the numbers of Covid-19 notifications, which is similar for Monday through Friday and smaller for Saturday, compared to Sunday. This heterogeneity over the weekdays may be due to daily variation in the availability at local authority levels and of tests. In line with the findings of Paez et al. (2021) , the coefficient for transport suggests an increase in Covid-19 cases by around 56% if higher-order transit systems are available. The expected incidence decreases with increasing GDP by around 31% per 10.000 EUR, ceteris paribus. Lastly, the rain indicator (representing days with non-zero levels of precipitation using a 5 day lag) shows a positive effect, leading to around 5% more Covid-19 cases after rainy days.

The first row of Figure 3 shows the functional intercept and the linear functional (time-varying) effects of the scalar covariates. The functional intercept (left) has its highest peak during the second Covid-19 wave, with maximum numbers of Covid-19 infections in mid-September. The effect for the population density at province level (central panel) reflects a clear positive impact on the expected number of infection notifications, with the strongest impact during the early stages of the first wave up to mid-March 2020. This finding is consistent with the association of densely crowded areas with the spread of the disease stated in the literature. The effect of coast (right panel) suggests smaller incidences for coastal compared to non-coastal provinces, in particular starting from mid-April 2020 onwards and reaching a minimum at around mid-July. This negative effect might be explained by an increased public risk awareness and protective travelling behaviour caused by the aftermaths of the recent Covid-19 and lockdown experiences. Indeed, facing the massive impact of the disease on the Spanish population and health system during the first wave, overcrowded regions including the coast and metropolitan conurbations suffered a larger exodus of the population, and rural areas and the countryside became a favourite travelling destination. In addition, imposed national travelling restrictions and strict quarantine regulations for incoming and/or homecoming travellers yielded a strong reduction in numbers of international tourists and travellers. In a recent paper, Sun et al. (2021) reported a reduction of global scheduled flights for Spain by over 90% for April to June 2020 compared to those month in 2019, which decreased to 65.7% for July 2020. The observed negative effect of coast on the number of Covid-19 cases slowly vanishes towards the fall and winter of 2020, which could potentially be due to less protective individual travelling behaviour.

The estimated nonlinear time-constant concurrent effects of the lagged weather covariates and the interaction surface for the lagged mean temperature and lagged humidity are depicted in the two central rows of . Effects are given on the predictor level (on the log-mean) and pointwise 95% confidence bands are shaded in gray. Central rows: nonlinear time-constant concurrent effects for different weather characteristics on the log-mean number of daily Covid-19 cases considering a 5 day lag. Upper panels: effects for lagged mean temperature, average sun hours, and maximum wind speed on the log-mean number of daily cases. Lower panels: effects for the average lagged humidity, log transformed non-zero levels of the precipitation variable, and interaction effect surface for lagged mean temperature and lagged humidity (including main effect functions of temperature and humidity). Bottom row: effects of the compositional covariates smoking status (left, clr-transformed), sex (middle, ilr-transformed) and age (right, clr-transformed) on the log-mean number of Covid-19 cases.

creasing effect on the expected incidence. Finally, the interaction surface of the lagged average temperature and the lagged humidity (lower right panel) suggests a positive effect for low temperatures and low levels of humidity on the spread of the disease dynamics, contrasted with a negative impact of high temperatures and low humidity levels. The interaction indicates that smooth main effects of temperature and humidity should be interpreted with care, as they average over parts of the interaction surface that depend on the data range. This might potentially explain the mixed results on the effects of climate variables on the spread of the disease observed in different climatological regions, and is also seen in our sensitivity analysis below.

The effects of the compositional covariates on the expected number of daily notifications are shown in the final row of Figure 3 . To interpret the time constant effect of the individual smoking habits, we obtained the simplicial gradient via inverse ilr transformation of the corresponding coefficients on ilr-level, see Table 1 , indicating that the largest increase in the expected incidence is obtained by increasing the proportion of smokers (see Supplement, simplicial gradient roughly (1, 0, 0, 0)). Applying the clr transformation to the simplicial gradient (see Section 3.3.1) allows to evaluate the effect of a multiplicative change in the relative ratio of one component while holding all other ratios constant. Depicted as sum-to-zero constrained effect estimates for the clr transformed composition with corresponding 95% confidence intervals, we found a clear positive effect of a larger fraction of daily (and less so ex-smokers) on the disease incidence, contrasted with a negative effect of a larger fraction of occasional and in particular non-smokers (see bottom left panel) which is in line with the results of Hopkinson et al. (2021) and Gülsen et al. (2020) . A relative ratio increase by 10% for daily smokers yields a multiplicative increase in the expected daily Covid-19 incidence by the factor 1.1 5.747 = 1.729, i.e. by 73%. An analogous 10% relative ratio increase for non-smokers (1.1 −5.009 ) yields a 38% decrease in the expected incidence. The estimated effect of the sex log-ratio (central panel) suggests a negative effect of an increase in the male-to-female ratio on the mean Covid-19 counts, particularly early in the pandemic. A possible explanation could be the described heterogeneity among the sexes in terms of employment in high-contact jobs such as in the retail and medical fields.

The right panel of the final row shows the effects of the clr-transformed age compositions on the disease dynamics, with estimated effect surface constrained to fulfill an integration-to-zero constraint, β(., t) ∈ L 2 0 (S), for each time point t. The effect surface shows clear variation over time and over the different ages. The strongest positive effects on the number of Covid-19 cases appear for the younger and also for the very old parts of the population, with a clear mode for around 25 year olds. To interpret the effect of the age distribution on the incidence, we applied the inverse clr transformation to the estimated surface β(., t) ∈ L 2 0 (S) for each t, to obtain for each time point the direction b(., t) ∈ B 2 (S) of change in the age composition leading to the largest increase in the mean incidence analogously to (3). Inspecting the time trend of b(., t) depicted in Figure 5 of the Supplement, all age curves show a clear mode for the younger ages and a second, but smaller, mode for around 80 year olds, with small variations in the exact density shape over time. This suggests that provinces with high proportions of young people (and to a lesser extend old people) are more strongly affected by Covid-19 cases (see Supplement for a discussion of the results).

A discussion of the results for both the spatially correlated functional random intercepts per province and the spatially uncorrelated community-specific functional random intercepts is given in the Supplement. Both effects exhibit some variation in sign and effect size over the 52 provinces and 19 communities, respectively (see Figure 4 in the Supplement).

Comparing the fitted and the observed incidence curves ( Figure 6 in the Supplement) shows only small deviations of the fitted from the observed daily values over the study period. The scaled Pearson residuals and the autocorrelation (Figure 7 in the Supplement) suggest overall good model fit, with some amount of heterogeneity in variation and autocorrelation of the residuals, in particular some structure corresponding to the three Covid-19 waves, remaining.

We also performed sensitivity analyses to assess the effects of considering different lags for the weather covariates as well as a different spatial neighbourhood specification. Overall, results were largely similar and general conclusions did not change (see the Supplement for a detailed discussion). Note that while the interaction surface for temperature and humidity was stable under different model specifications and data sets, the main effects did change when considering continental Spain only, due to differences in the variable distribution over which the main effects average the interaction surface.

This paper has extended the generalised functional additive mixed model to the case when some of the covariates in the predictor are finite or infinite (functional) compositions summing or integrating to a whole. We use an equivalence between the scalar product in the Bayes Hilbert space and a constrained linear (functional) term for the clr transformed compositional covariate to embed the new model terms into the existing model framework. For the transformed functional composition, the linear effect was modelled with a tensor product basis with a bivariate spline for the effect function, placing centering constraints on the marginal basis for the covariate effect to account for the integration-to-zero constraint. Although not considered here due to the increased model complexity given the sample size, the proposed model in principle also allows to include nonlinear effects of the transformed (functional) covariates.

The proposed model was applied to spatially correlated daily Covid-19 count data for Spain to quantify potential impacts of population compositional, socio-economic, weather and regional covariates on the disease dynamics. The sampled data was collected from 5th January 2020 up to 19th January 2021, just before a large scale nationwide immunisation program was imposed in February 2021, which minimises unknown effects on our results of the regionally varying and heterogeneous vaccination regimes. The available data has some limitations. First, the reported numbers on a given day likely represent a mixture of counts over neighbouring (unobserved) true dates of symptom onset, given that the reconstruction of symptom onset dates by a three day lag from the positive test results is only an approximation. There is even more uncertainty regarding the true infection date, even if the average incubation period is 5 days. We may thus underestimate the weather effects, if the lagged weather variables only approximately measure weather on the date of infection. However, we did not detect large variations of the estimated results in our sensitivity analysis considering alternative lag specifications. Second, the data does not provide separate infection counts for subgroups of the population according to sex, age and smoking habits. While we incorporate the effects of these variables on overall infections via compositional covariates measuring the composition of the population, we have to acknowledge the typical risks of ecological inference here. For instance, for the increasing effect of a larger share of smokers in the population on the Covid-19 incidence, we cannot distinguish whether this is due to a higher infection risk for smokers or due to a higher risk of Covid-19 positive smokers to infect others. Third, while the available data for 52 provinces and 381 days has better temporal and spatial resolution than other publicly available data sets, the data size still limits the complexity of the model in terms of the number of non-linear and/or time-varying effects. Taking these limitations into account, our model highlights a clear effect of the population composition according to sex, age and smoking habits, of weather variables (rain, temperature, wind speed and humidity), of GDP, population density, coast and public transit on the number of Covid-19 notifications, as well as spatial and temporal heterogeneity.

Supplemental Materials to Eckardt, Mateu and Greven: Generalised functional additive mixed models with compositional covariates for areal Covid-19 incidence curves Table 1 gives a description of the covariates used in the model for the Spanish Covid-19 incidence curves.

The spatially correlated functional random intercepts per province and the spatially uncorrelated community-specific functional random intercepts are shown in itive deviation of the province of Lleida (red) from the overall intercept for the second wave in July 2020 contrasted with a negative deviation of Madrid (green). The Markov Random Field basis used to specify this effect was derived from a Gabriel graph (Matula and Sokal, 1980 ) specification using the gabrielneigh function of the spdep (Bivand et al., 2013; Bivand and Wong, 2018) package in R (R Core Team, 2021). This function generates a subgraph of a Delaunay triangulation constructed from the centroids of the individual areal units collected in a set of point locations D. Within D two distinct locations s and s ′ are said to be Gabriel neighbours if the closed disc with diameter equal to the distance between s and s ′ does not contain any alternative point z in D. That is, s and s ′ are connected whenever d(s, s ′ ) ≤ min z∈D ((d(s, z) 2 + d(s ′ , z) 2 ) 1/2 ). Different from the commonly shared border specification, which does not allow isolated regions, i.e. islands, and potentially yields a higher number of neighbours for the individual areal units, graph based approaches such as the Gabriel graph specification also work well for complex spatial structures including disconnected subregions or islands. Taking into account the strong economic and social interrelations of the continental provinces and the Spanish islands and African enclaves and the inter-regional interplay in the Spanish Covid-19 disease history, the Gabriel graph specification seems to preferable.

The constructed neighbourhood structures under the Gabriel and also the commonly shared border based definitions are visualised in Figure 3 . Under the Gabriel graph speci- fication, two provinces (Girona, Tenerife) are connected to one alternative province while, at maximum, three provinces (Avila, Burgos, Malaga) are connected to six alternatives provinces.

The estimated spatially uncorrelated functional random intercepts for the larger areal units communities are depicted in the right panel of this plot. This effect controls for unobserved spatial heterogeneity at the community level, where different local measures were imposed after the global lockdown (captured through the three global lockdown indicators in our model) ended. For example, both communities Catalonia (red) and Madrid (green), containing the provinces of Lleida and Madrid, respectively, show clear positive deviations from zero for the second and third Covid-19 waves. The largest variation among the community-specific functional intercepts is shown during the end of the global lockdown period and early stages of the second Covid-19 wave from June and July 2020 onwards. These differences may be due to differences in the regional measures which were imposed after the global lockdown at only the local level. 

The simplicial gradient b derived from the inverse ilr operation of the corresponding coefficients and its clr transformation are reported in Table 2 .

We estimated an increase of the relative ratio for daily smokers by 10% to increase the expected daily incidence by the factor 1.1 5.747 = 1.729 or about 73%. To see what an increase in the relative ratio for daily smokers by 10% translates to, we can look at regional smoking compositions and at a perturbation of the original compositional vectors. For example, applied to the regional smoking composition of Castellón (a province located at the Mediterranean coast in the eastern part of Spain), an increase in the relative ratio for daily smokers by 10% would change the compositional vector x Cast = (0.248, 0.021, 0.205, 0.527) ⊤ (daily, occasional, ex-, non-smokers) tox Cast = x Cast ⊕ C(1.1, 1, 1, 1) ⊤ = C(1.1 · x Cast,1 , x Cast,2 , x Cast,3 , x Cast,4 ) ⊤ = (0.266, 0.020, 0.200, 0.514) ⊤ .

To interpret the estimated coefficient surface, the clr transformations of the simplical gradient b(., t) of the age compositions for selected ages and dates are reported in Table 3 . Investigating the effect of a multiplicative change in the relative ratio of one functional component (age) at a particular date t, suggests that a 10% relative ratio increase of 5 year olds while holding all other ratios constant yields a decrease in Covid-19 cases on March, 1st 2020 by the factor 1.1 −0.13 = 0.988, i.e. by 1.2 %. On the same date, we found an increase of 3% in case of a corresponding relative ratio increase of 25 year olds (1.1 0.34 = 1.03).

In addition, we computed the simplicial gradient b(., t) through the inverse clr operation of the effect surface. The inverse clr transformed age effects for selected dates are depicted in Figure 5 . All panels support the above findings and highlight the relevance of the younger (and to a smaller extend also of the older) age groups on the spread of the disease. In particular, we found a clear increase of expected Covid-19 cases for regions with high proportions of younger ages (mode around 25) compared to only smaller increases in disease notifications for regions with larger proportions of above fifty year olds (mode around 75). The shape of the effect varies slightly over the different waves. 

The fitted against the observed response curves for all 52 Spanish provinces are plotted in Figure 6 . While most plots plots suggest small deviations of the fitted curves (red) from the reported values (black), some regions show larger deviations between both curves including e.g. the provinces of Ávila, Huesca and Lugo. For all three provinces, the model underestimates the reported numbers which might, in part, be explained by the neighbourhood structure. For example, located in northeastern Spain, the province of Huesca has a higher incidence rate in October compared to its neighbouring provinces of Lleida, Navarra, Tarragona and Zaragoza. Likewise, Ávila located in central-western Spain has a higher incidence rate in March and August to November 2020 compared to most of its neighbouring provinces of Caceres, Madrid, Salamanca, Segovia, Toledo and Villadolid. The fit to the data could potentially be improved further by additionally including a province-specific spatially uncorrelated functional random intercept; however given the model complexity relative to the data size we refrain from this additional model extension. Figure 7 shows different residual plots computed for the extended generalised functional additive mixed model. The computed scaled Pearson residuals of the upper panels and the corresponding histogram shown in the lower right panel of this plot suggest that some amount of heterogeneity in variation of the residuals remains. While a large number of the residuals takes values close to zero over the observed period (upper left panel), the pattern of the scaled Pearson residuals against the fitted values shows some remaining heterogeneity over the functional domain with peaks corresponding to the three Covid-19 waves. Investigating the variance of the residuals, which should be around 1, we found that 20.99% and 5% of the scaled Pearson residuals have absolute value greater than 1 and 2, respectively, indicating no unusually high number of large values.

The autocorrelation of the scaled Pearson residuals (lower left panel) also reflects some structure, again corresponding to the three Covid-19 waves. While the remaining structure could likely be reduced by additionally incorporating a spatially uncorrelated functional random intercept per province, by increasing the number of basis functions over time and/or including province-specific weekday effects, we refrained from these model extensions due to the increased model complexity relative to the sample size. 

To investigate the effect of the MRF specification in the extended GFAMM, we replaced the Gabriel graph neighbourhood specification of the spatially correlated functional random intercepts at province level through a commonly shared border neighbourhood approach. Different from the Gabriel graph approach used in the model discussed in the main document, this neighbourhood specification does not allow for isolated regions such as the African enclaves or the Spanish islands and we thus had to decrease the data set accordingly, leaving 47 provinces and 15 communities. The resulting model yielded an explained deviance of 96.2% and an estimated scale parameter of 16.025. In general, most effects of Table 4 and Figure 8 have the same sign and are similar in size and shape. However, the effect of lagged humidity, lagged average temperature and the smoking composition is somewhat different under the common shared borders approach. At the same time, the interaction surface for the lagged average temperature and lagged humidity shows a similar effect pattern compared to the Gabriel graph based model. The observed difference might be explained through the differences in the variable distributions of the distinct populations under study. Comparing the global averages computed over the complete study period and all 52 provinces of Table 5 with the corresponding global mean values for the continental and the non-continental provinces, we found that the Spanish islands and African enclaves show higher values for temperature, wind speed, sun hours, humidity and also males, contrasted with lower average numbers in rain, GDP and the population density and, in particular, the average number of daily Covid-19 cases. Investigating the smoking proportions of the complete data and the subset of continental and non-continental provinces, we found slight differences between the Spanish islands and African enclaves provinces compared to continental Spain.

To investigate potential variations of the lagged average temperature and lagged humidity pattern between the continental and non-continental Spanish province, we plotted the joint distribution of both variables in Figure 9 . As suggested by Table 5 , both the recorded temperature and humidity values are lower for the continental (left panel) Spanish provinces, ranging from 18% to 95% (humidity) and −13.3 • to 34.6 • centigrade, compared to the non-continental provinces, ranging from 38% to 95% (humidity) and 6.4 • to 35.4 • centigrade, and the non-continental provinces in particular have high probability mass for both high temperature and humidity. This difference in the bivariate distribution means that the main effects average over different parts of the interaction surface for humidity and temperature, leading to no main effect (continental provinces) vs. a negative effect (all provinces) for temperature and a negative main effect (continental provinces) vs. no main effect (all provinces) for humidity. This illustrates that smooth main effects are difficult to interpret in the presence of a smooth interaction.

Note that the differences in results are mainly due to the reduced data set and not to the neighbourhood specificiation, as we will see below when discussing results from a Gabriel graph based model for the reduced data set.

Comparing the estimated spatially correlated and uncorrelated functional random intercepts of the commonly shared border based model (Figure 10 ) with the random effect terms under the Gabriel graph specification shows some differences, also in sign. We note that the Gabriel and the commonly shared border approaches yield different neighbourhood specifications and different numbers of regional neighbours, e.g. Lleida has 5 neighbours under the commonly shared border approaches compared to 3 under the Gabriel graph specification, such that some differences in spatial smoothing are not unexpected. To better understand the variation in the effect between the Gabriel graph and the commonly shared boarder based models, we additionally fitted a Gabriel graph model restricted to continental Spain only. This model explains 96.4% of the deviance (scale = 14.288). Table 6 and Figure 11 give the corresponding results. Results are similar to those for the commonly shared border neighbourhood specification, reinforcing the conclusion that differences seen there to results in the main paper are due to the reduced data set rather than the specified neighbourhood structure. Besides the same differences in the main effects for temperature and humidity compared to the main results (while the interaction stays similar), we mainly observe some differences in the size while not general trend of the estimated smoking effect.

To account for the uncertainty in the lag-specification for the weather covariates (given the imperfect reconstruction of symptom onset and variation in the incubation period), we constructed identical models to the one in the main paper using a 4-day, 6-day, 7day, and 8-day lag for the concurrent weather effects and interaction surface. In all models, we specified the spatial neighbourhood structure of the MRF through a Gabriel graph approach. For all models, we found only small differences of the estimated scale parameters (with the smallest scale parameter under the 5-day lag specification) while all models (slightly lower for the 4-day lag model) explained 96% of the deviance (see Table  7 ). The constant coefficients for all alternative lag specifications are reported in Tables 8  to 11 . Results are overall similar to the main model in effect size and significance, however with the effect of the first lockdown indicator becoming smaller and non-significant for 4-day, 6-day and 7-day lag models. The largest difference in effects is seen in the weekday effects of the 4-day lag model, where positive effects of weekdays compared to Sunday become smaller and even slightly negative for Wednesday and Sunday. The other effects either stay roughly constant (e.g. transport or GDP) or change somewhat in size (e.g. rain, between 0.02 and 0.07), but not in sign or significance. Overall, effects are relatively stable across lag specifications. All four lag models support the reported structure of the functional intercept and smoothly time-varying effects of the population density and coast and the effects of the compositional covariates (i.e. all non-weather variables) (see Figures 13 to 16) . For the non-linear effects and interaction effect of the lagged weather variables themselves, general trends are mostly similar to the 5-day lag model, although effect function shapes are not always identical and effects tend on average to be somewhat smaller than for the 5-day model. However, for the non-linear effects of the lagged average temperature and lagged humidity we observe some variation under the 6-day lag specification, with humidity having no effect and the trend for average temperature reverting, while the corresponding confidence band completely contains the zero line. The main effects plus interaction surface, however, shows similar overall trends with smaller values for both low temperature and low humidity or both high temperature and high humidity. Overall, the observed patterns are consistent with our hypothesized 5-day lag best capturing the time lag between observed incidence and infection, with results mostly not being sensitive to that choice. Finally, all four different lag-specifications barely affect the spatially correlated or the spatially uncorrelated functional random intercepts depicted in Figures 17 to 20. 

The Statistical Analysis of Compositional Data

A Functional Linear Regression Model in the Space of Probability Density Functions. Discussion papers 17015

on-scalar regression and principal component analysis

A general framework for functional regression modelling

Rejoinder (for a general framework for functional regression modelling)

The effect of smoking on covid-19 symptom severity: Systematic review and meta-analysis. Pulmonary Med

Additive functional regression for densities as responses

A general framework for multivariate functional principal component analysis of amplitude and phase variation

The first months of the covid-19 pandemic in spain

Current smoking and covid-19 risk: results from a population

Impact of weather on covid-19 transmission in south asian countries: An application of the arimax model

Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data

Compositional splines for representation of density functions

Additive density-on-scalar regression in Bayes Hilbert spaces with an application to gender economics

Properties of gabriel graphs relevant to geographic variation research and the clustering of points in the plane

Effects of mobility and multi-seeding on the propagation of the covid-19 in spain

Weather variability and covid-19 transmission: A review of recent research

Effects of temperature and humidity on the spread of covid-19: A systematic review

The density paradox: Are densely-populated regions more vulnerable to covid-19?

Notes on continuous stochastic phenomena

imputeTS: Time Series Missing Value Imputation in R

Comparison and contrast of two general functional regression modelling frameworks

Wavelet-based functional mixed models

A spatio-temporal analysis of the environmental correlates of covid-19 incidence in spain

Functional regression of continuous state distributions

Geometric approach to statistical analysis on the simplex

Fréchet regression for random objects with Euclidean predictors

Modeling probability density functions as data objects

R: A Language and Environment for Statistical Computing

R Foundation for Statistical Computing

Functional Data Analysis

Generalized functional additive mixed models

Functional additive mixed models

A look at the spatiotemporal mortality patterns in italy during the covid-19 pandemic through the lens of mortality densities

Effects of climatological parameters on the outbreak spread of covid-19 in highly affected regions of spain

Element enrichment factor calculation using grain-size distribution and functional data regression

Why such excess of mortality for covid-19 in spain?

Covid-19 pandemic and air transportation: Successfully navigating the paper hurricane

Higher temperature, pressure, and ultraviolet are associated with less covid-19 prevalence: Meta-regression of japanese prefectural data

Compositional scalar-on-function regression with application to sediment particle size distributions

Compositional regression with functional response

The effect of age on the incidence of covid-19 complications: a systematic review and meta-analysis

2021) compositions: Compositional Data Analysis

Unravelling the predictive power of telematics data in car insurance pricing

Risk factors for covid-19 severity and fatality: a structured literature review

Generalized Additive Models: An Introduction with R