A New Data Integration Framework for Covid-19 Social Media Information
Lauren Ansell; Luciana Dalla Valle
2021-10-08

The Covid-19 pandemic presents a serious threat to people's health, resulting in over 250 million confirmed cases and over 5 million deaths globally. In order to reduce the burden on national health care systems and to mitigate the effects of the outbreak, accurate modelling and forecasting methods for short- and long-term health demand are needed to inform government interventions aimed at curbing the pandemic. Current research on Covid-19 is typically based on a single source of information, specifically on structured historical pandemic data. Other studies are exclusively focused on unstructured online-retrieved insights, such as data available from social media. However, the combined use of structured and unstructured information is still uncharted. This paper aims to fill this gap by leveraging historical as well as social media information with a novel data integration methodology. The proposed approach is based on vine copulas, which allow us to improve predictions by exploiting the dependencies between different sources of information. We apply the methodology to combine structured datasets retrieved from official sources with a large unstructured dataset of information collected from social media. The results show that the proposed approach, compared to traditional approaches, yields more accurate estimations and predictions of the evolution of the Covid-19 pandemic.

The outbreak of the Covid-19 disease has infected and killed millions of people globally, resulting in a pandemic with enormous global impact. This disease affects the respiratory system, and the viral agent that causes it spreads through droplets of saliva, as well as through coughing and sneezing. As an extremely transmissible viral infection, Covid-19 is causing significant damage to countries' economies because of its direct impact on the health of citizens and the containment measures taken to curtail the virus (Pinheiro et al., 2021). In the UK, Covid-19 has had serious implications for people's health and the healthcare services, with more than 8 million confirmed cases and 150,000 deaths, as government figures show. There is thus widespread interest in accurately estimating and assessing the evolution of the pandemic over time. Current studies on Covid-19 are typically based on a single source of information. Most of them implement quantitative analyses focusing on historical data to produce forecasts of the pandemic. For example, one study used clinical data of Covid-19 patients in a meta-analysis of clinical symptoms and laboratory results to explain discharge and fatality rates. Rahimi et al. (2021) considered data on confirmed and recovered cases and deaths, the growth rate and the trend of the disease in Australia, Italy and the UK. The authors predicted the epidemiology of the disease in the three countries with mathematical approaches based on susceptible, infected, and recovered (SIR) cases and susceptible, exposed, infected, quarantined, and recovered (SEIQR) cases, comparing them with the Prophet machine learning algorithm and logistic regression. Machine learning methods were also adopted, for example, by DeCaprio et al.
(2020), who implemented logistic regression and gradient boosted trees on health risk assessment questionnaires and medical claims diagnosis data to predict complications due to Covid-19. Ahmed et al. (2021) performed descriptive, diagnostic, predictive and prescriptive analyses of the pandemic applying neural networks and other machine learning algorithms, focusing on different pandemic symptoms. Pinheiro et al. (2021) employed network analytics and machine learning models, using a combination of anonymized health and telecommunications data to understand the correlation between population movements and virus spread and to predict possible new outbreaks. However, some authors in the literature criticize approaches which are exclusively based on official quantitative information. Wynants et al. (2020) and Jewell et al. (2020) noted that existing Covid-19 studies based on historical data and prediction models for the pandemic are "poorly reported, at high risk of bias and underperforming". Another strand of research focuses on different types of data, analysing textual information with Natural Language Processing (NLP) methods. For example, Liu et al. (2021) adopted the Latent Dirichlet Allocation (LDA) model to allocate research articles into different research topics pertinent to Covid-19 according to their abstracts. An LDA model was also employed by Wang et al. (2021), who applied it to question and answer data about Covid-19 in Chinese online health communities. Text data in the form of social media insights are also used by other contributors in the literature to evaluate and predict the progression of the Covid-19 pandemic. For example, one study employed a lag correlation analysis with data collected from Google Trends, the Baidu Search Index and the Sina Weibo Index. Sina Weibo messages were also analysed by Liu et al. (2020), Peng et al. (2020) and Zhu et al. (2020). Liu et al. (2020) carried out a statistical analysis based on the Fisher exact test, calculated death rates with the Kaplan-Meier method and established risk factors for mortality using multivariate Cox regression. Peng et al. (2020) implemented a kernel density analysis and an ordinary least squares regression to identify the spatiotemporal distribution of Covid-19 cases in the main urban area of Wuhan, China. Zhu et al. (2020) calculated descriptive statistics and applied a time series analysis to the data. Baidu Search Index information was analysed by Qin et al. (2020) using different statistical methods, including subset selection, forward selection, lasso regression, ridge regression and the elastic net. Google Trends searches, Wikipedia page views and Twitter messages were gathered by O'Leary and Storey (2020), who implemented a regression analysis to show that online-retrieved information provided a leading indication of the number of people in the USA who became infected with and died from the coronavirus. However, contributions in the literature which are exclusively based on either official or online information as stand-alone sources do not take into account the drawbacks affecting the data and the potential synergies between different data sources. On the one hand, due to limited testing capacity, official data on confirmed cases are unlikely to reflect the true Covid-19 numbers. On the other hand, social media data are generated by users on a voluntary basis and may not capture information about the entire population.
Therefore, predictive models built on a single source of information might generate biased results. The goal of this paper is to develop a state-of-the-art data integration framework, leveraging the dependencies between historical and online data to provide more accurate evaluations and predictions of the Covid-19 dynamics. Our approach is based on vine copulas, which are very flexible mathematical tools, able to correctly capture the dependence structures between different variables (Czado, 2019). Integration of different data sources using copulas and Bayesian networks was first proposed by Dalla Valle (2014) and, later, by Dalla Valle and Kenett (2015). However, the approach adopted by the authors was based on data calibration (Dalla Valle, 2017c). In this paper, our aim is to propose a comprehensive novel data integration framework, able to improve data modelling and forecasting, along the lines of the paper by Ansell and Dalla Valle (2021), who successfully applied a similar approach to small-size natural hazard datasets. So far, the application of copulas and vines to pandemic data has been limited to studying the implications of Covid-19, especially in the financial field, rather than to directly calculating forecasts of pandemic trends. For example, Maneejuk et al. (2021) implemented a Markov-switching dynamic copula with Student-t distribution to explore Covid-19 shock effects on energy markets. Sifat et al. (2021) used a quantile regression model estimated via vine copula to show that speculation in energy and precious metal futures is more prevalent in crisis periods such as the Covid-19 pandemic. This paper is the first to propose a statistical methodology based on vine copulas, able to exploit and integrate official and social media data, to accurately model the spread of Covid-19. The methodology is applied to structured historical pandemic data and to a large unstructured online-retrieved dataset, relevant to the UK geographical area. The results show that our approach performs better than traditional approaches, which do not take into account associations between official and online information, in estimating and predicting the Covid-19 dynamics.

The remainder of the paper is organised as follows. Section 2 describes the different data sources used in the analysis; Section 3 illustrates the vine copula methodology; Section 4 reports the results of the analysis; finally, concluding remarks are presented in Section 5.

The structured and unstructured data used in this paper were collected daily between the 21st April 2020 and the 9th May 2021. As structured data, we considered the number of newly admitted patients (Admissions), the number of hospital cases (Hospital), the number of patients on ventilation (ICU Beds), the number of tests (VirusTests), the number of positive cases (Cases) and the number of deaths (Deaths). The first four variables were gathered from the UK Government dashboard, while the last two variables were downloaded from the Johns Hopkins University database. This information was available in cumulative form, therefore to obtain the daily time series the previous day's total was subtracted from the current day's total. As unstructured data, we collected Google Trends information on the number of searches for the keywords Covid-19, coronavirus, first wave, second wave and variant, using the gtrendsR package from the R software (Massicotte and Eddelbuettel, 2021; R Core Team, 2020).
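As an illustration of this collection step, the Google Trends queries and the cumulative-to-daily conversion could be carried out in R roughly as follows. This is a minimal sketch: the object and column names are hypothetical, and only the keywords and date range stated above are taken from the text.

```r
library(gtrendsR)

# Google Trends interest for the pandemic-related keywords, restricted to the UK
# and to the collection window used in the paper. Note that for windows longer
# than a few months Google Trends returns coarser (e.g. weekly) values, so a
# daily series may require repeated shorter-window queries.
keywords <- c("Covid-19", "coronavirus", "first wave", "second wave", "variant")
gt <- gtrends(keyword = keywords, geo = "GB", time = "2020-04-21 2021-05-09")
google_interest <- gt$interest_over_time

# Official series are published in cumulative form: differencing the cumulative
# totals recovers the daily counts (the first value is kept as-is).
cumulative_to_daily <- function(x) c(x[1], diff(x))
# daily_deaths <- cumulative_to_daily(official$cumulative_deaths)  # hypothetical column
```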
In addition, we retrieved Twitter messages containing the same keywords used to perform the Google Trends searches, using the rtweet R package (Kearney, 2019). Batches of 18,000 tweets were collected three times a day, every day. We proceeded with data cleansing by removing tweets written from outside the UK and those produced from locations with fewer than 10 tweets. We also removed duplicate tweets and those sent by automated accounts which contained factual information about daily case numbers or retweets of news stories. Finally, we removed tweets directly addressed to foreign political leaders or politicians, obtaining a final large Twitter dataset of 577,231 tweets. From the Twitter data, we considered the total number of tweets as well as the sentiment scores calculated using two different lexicons: Bing and Afinn (Hu and Liu, 2004), which are available in the R tidytext package (Silge and Robinson, 2016). The Bing lexicon classifies words as positive or negative. The Bing sentiment score for each tweet is calculated by counting the number of positive words used in the tweet and subtracting from this the number of negative words. The Afinn lexicon scores words between -5 and +5. The Afinn sentiment score is calculated by multiplying the score of each word by the number of times it appears in the tweet; these scores are then summed to derive the overall sentiment score.

Figure 1 shows the trace plots of the Covid-19 official and social media time series. The panels depict the variables (from top to bottom) Admissions, the Afinn sentiment scores (Afinn), the Bing sentiment scores (Bing), Cases, Deaths, the Google Trends searches (Google), Hospital, ICU Beds, the total number of tweets (Tweets) and VirusTests. We notice a higher volume of online messages produced at the beginning of the collection period and spikes corresponding to periods of more heated online discussion. In addition, the daily reporting of official structured data shows high variability, and there is often a lag in the UK government figures due to under-reporting at the weekend. These drawbacks of the official data could be overcome by data integration with social media information, which does not suffer from reporting lags.

The copula is a function that allows us to bind together a set of marginals, to model their dependence structure and to obtain the joint multivariate distribution (Joe, 1997; Nelsen, 2007; Dalla Valle, 2017b,a). Sklar's theorem (Sklar, 1959) is the most important result in copula theory. It states that, given a vector of continuous random variables $X = (X_1, \ldots, X_d)$ with marginal distribution functions $F_1, \ldots, F_d$ and joint distribution function $F$, there exists a unique copula $C$ such that
$$F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d); \theta) = C(u_1, \ldots, u_d; \theta),$$
where $F_j(x_j) = u_j$, with $u_j \in [0, 1]$, are called u-data, and $\theta$ denotes the set of parameters of the copula. The joint density function can be derived as
$$f(x_1, \ldots, x_d) = c(F_1(x_1), \ldots, F_d(x_d); \theta) \prod_{j=1}^{d} f_j(x_j),$$
where $c$ denotes the $d$-variate copula density. The copula allows us to determine the joint multivariate distribution and to describe the dependencies among the marginals, which can potentially all be different and can be modelled using distinct distributions. In this paper, we adopt the two-step inference functions for margins (IFM) approach (Joe and Xu, 1996), estimating the marginals in the first step, and then the copula, given the marginals, in the second step. Given the different characteristics of the ten marginals, we fitted a different model for each of the ten time series. Further, we extracted the residuals $\varepsilon_j$, with $j = 1, \ldots, d$, from each marginal model and applied the relevant distribution functions to obtain the u-data $F_j(\varepsilon_j) = u_j$ to be plugged into the copula.
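As a minimal sketch of this first IFM step in R, assuming a data frame `dat` holding one column per series and a numeric `time` covariate (hypothetical names), a GAMLSS marginal can be fitted with the location parameter depending on time and its residuals transformed into u-data; the specific families used for each marginal are described next.

```r
library(gamlss)

# First IFM step: fit a marginal model with the location parameter mu depending
# on time (SHASHo2 shown here as an example of the GAMLSS families used below).
fit_adm <- gamlss(Admissions ~ time, data = dat, family = SHASHo2())

# gamlss() residuals are normalized quantile residuals, approximately standard
# normal under a well-fitting model; the probability integral transform maps
# them to u-data on [0, 1].
u_adm <- pnorm(residuals(fit_adm))

# Repeating this for all ten series yields the u-data matrix passed to the
# vine copula in the second IFM step.
```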
The best fitting model for the Admissions marginal was the SHASHo2 model. This model belongs to the family of GAMLSS distributions, where GAMLSS stands for Generalised Additive Models for Location, Scale and Shape. GAMLSS are very flexible models, which include a wide range of continuous and discrete distributions (Stasinopoulos et al., 2017). The SHASHo2 model, developed by Jones and Pewsey (2009), is also known as the Sinh-Arcsinh original type 2 distribution and depends on four parameters: the location parameter $\mu$, the scaling parameter $\sigma$, the skewness parameter $\nu$ and the kurtosis parameter $\tau$; its probability density function (pdf) is given in Jones and Pewsey (2009). We assumed that the parameter $\mu$ of the SHASHo2 model is related to time, as explanatory variable, through an appropriate link function, with coefficient $\beta$ (for more details, see Rigby and Stasinopoulos (2005); Rigby et al. (2019)).

We fitted the Afinn marginal with a reparametrized version of the Skew Student t type 3 (SST) model, which, similarly to the previous marginal, belongs to the family of GAMLSS distributions and depends on four parameters: the mode ($\mu$), scaling ($\sigma$), skewness ($\nu$) and kurtosis ($\tau$); its pdf can be found in Fernández and Steel (1998). Similarly to the SHASHo2 model, for the SST model we assumed that the parameter $\mu$ is related to time, as explanatory variable, through an appropriate link function, with coefficient $\beta$.

The best model for Bing was the Normal-Exponential-t (NET) distribution. This is again a four-parameter continuous distribution, belonging to the GAMLSS family, which was introduced by Rigby and Stasinopoulos (1994). The parameters are: mean ($\mu$), scaling ($\sigma$), first kurtosis parameter ($\nu$) and second kurtosis parameter ($\tau$). The pdf of the NET model, which involves $\Phi$, the cumulative distribution function of the standard normal distribution, is given in Rigby and Stasinopoulos (1994). As with the previous marginals, we assumed that the parameter $\mu$ of the NET model is related to time.

We fitted the Cases marginal with an ARIMA-GARCH model with Student's t innovations. This model combines the features of the autoregressive integrated moving average (ARIMA) model with the generalized autoregressive conditional heteroskedastic (GARCH) model, allowing us to capture time series volatility over time (for more information see, for example, Hyndman and Athanasopoulos (2018)). The GARCH model is typically denoted as GARCH(p, q), with parameters p and q, where p is the number of lagged residual errors and q is the number of lagged variances. The ARIMA(p, d, q)-GARCH(p, q) model can be expressed as
$$\phi(B)(1 - B)^d x_t = a + \theta(B)\varepsilon_t,$$
$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2,$$
where the first line is the ARIMA part of the model, while the second line is the GARCH part of the model. Here, $x_t$ are the original data values, $B$ is the backshift operator, $a$ is a constant, $\phi(B) = 1 - \phi_1 B - \ldots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \ldots + \theta_q B^q$ collect the autoregressive parameters $\phi_i$ and the moving average parameters $\theta_i$; $\alpha_i$ and $\beta_j$ are the parameters of the GARCH part of the model, and the innovations $\varepsilon_t$, with conditional variance $\sigma_t^2$, follow a Student's t distribution.

The best model for the Deaths marginal was the SHASHo model, whose acronym stands for Original Sinh-Arcsinh distribution. This model is very similar to the SHASHo2 described above, and its pdf is also given in Jones and Pewsey (2009). As for the other marginals fitted with GAMLSS-type models, we assumed that the parameter $\mu$ of the SHASHo depends on time. Since Google includes values equal to zero, we fitted a Tweedie Generalised Linear Model for this marginal (Dunn and Smyth, 2018).
The Tweedie distribution has nonnegative support and can have a discrete mass at zero, making it useful to model responses that are a mixture of zeros and positive values. The Tweedie distribution belongs to the exponential family; for $t = 1, \ldots, T$, its mean depends on the time covariate $\zeta_t$ through the associated regression coefficient $b$, while its variance is governed by the dispersion parameter $\phi$, with the extra parameters $p$ and $q$ controlling the mean and variance of the distribution, respectively (Dunn and Smyth, 2018).

The best model for the Hospital marginal was the ARIMA-GARCH model with Student's t innovations, as described above for the Cases marginal. The best fitting model for the ICU Beds marginal was the SHASHo model described above. We fitted the Tweets marginal with a Skew Exponential Power type 4 (SEP4) model, which is a four-parameter distribution belonging to the GAMLSS family (Rigby and Stasinopoulos, 2005). This is a "spliced-shaped" distribution with pdf
$$f_Y(y \mid \mu, \sigma, \nu, \tau) = \frac{c}{\sigma} \exp\left\{ -|z|^{\nu} \, 1(y < \mu) - |z|^{\tau} \, 1(y \geq \mu) \right\},$$
for $-\infty < y < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$, $\nu > 0$ and $\tau > 0$, where $z = (y - \mu)/\sigma$, $c = (\Gamma(1 + \nu^{-1}) + \Gamma(1 + \tau^{-1}))^{-1}$, $\Gamma$ is the gamma function and $1(\cdot)$ is the indicator function. Note that $\mu$ is the mode of $Y$. Here we assumed that the parameter $\mu$ is related to time, as explanatory variable. The best model for the VirusTests marginal was the ARIMA-GARCH model with Student's t innovations, as described above, fitted to the number of tests rescaled by a factor of 1,000.

A vine copula (or vine) represents the pattern of dependence of multivariate data via a cascade of bivariate copulas, allowing us to construct flexible high-dimensional copulas using only bivariate copulas as building blocks. For more details about vine copulas see Czado (2019). In order to obtain a vine copula we proceed as follows. First, we factorise the joint distribution $f(x_1, \ldots, x_d)$ of the random vector $X = (X_1, \ldots, X_d)$ as a product of conditional densities
$$f(x_1, \ldots, x_d) = f_d(x_d) \cdot f_{d-1|d}(x_{d-1}|x_d) \cdot f_{d-2|d-1,d}(x_{d-2}|x_{d-1}, x_d) \cdots f_{1|2,\ldots,d}(x_1|x_2, \ldots, x_d). \quad (4)$$
The factorisation in (4) is unique up to re-labelling of the variables and it can be expressed in terms of a product of bivariate copulas. In fact, by Sklar's theorem, the conditional density of $X_{d-1}|X_d$ can be easily written as
$$f_{d-1|d}(x_{d-1}|x_d) = c_{d-1,d}\big(F_{d-1}(x_{d-1}), F_d(x_d); \theta_{d-1,d}\big) \cdot f_{d-1}(x_{d-1}), \quad (5)$$
where $c_{d-1,d}$ is a bivariate copula, with parameter vector $\theta_{d-1,d}$. Through a straightforward generalisation of Eq. (5), each term in (4) can be decomposed into the appropriate bivariate copula times a conditional marginal density. More precisely, for a generic element $X_j$ of the vector $X$ we obtain
$$f_{X_j|v}(x_j|v) = c_{X_j,\nu; v_{-}}\big(F_{X_j|v_{-}}(x_j|v_{-}), F_{\nu|v_{-}}(\nu|v_{-}); \theta_{X_j,\nu; v_{-}}\big) \cdot f_{X_j|v_{-}}(x_j|v_{-}), \quad (6)$$
where $v$ is the conditioning vector, $\nu$ is a generic component of $v$, $v_{-}$ is the vector $v$ without the component $\nu$, $F_{X_j|v_{-}}(\cdot|\cdot)$ is the conditional distribution of $X_j$ given $v_{-}$ and $c_{X_j,\nu; v_{-}}(\cdot,\cdot)$ is the conditional bivariate copula density, which can typically belong to any family (e.g. Gaussian, Student's t, Clayton, Gumbel, Frank, Joe, BB1, BB6, BB7, BB8, etc.; for more information on copula families, see Nelsen (2007)), with parameter $\theta_{X_j,\nu; v_{-}}$. The $d$-dimensional joint multivariate distribution function can hence be expressed as a product of bivariate copulas and marginal distributions by recursively plugging Eq. (6) into Eq. (4). For example, let us consider a 6-dimensional distribution. Then, Eq. (4) translates to
$$f(x_1, \ldots, x_6) = f_6(x_6) \cdot f_{5|6}(x_5|x_6) \cdot f_{4|5,6}(x_4|x_5, x_6) \cdots f_{1|2,\ldots,6}(x_1|x_2, \ldots, x_6). \quad (7)$$
The second factor $f_{5|6}(x_5|x_6)$ on the right-hand side of (7) can be easily decomposed into the bivariate copula $c_{5,6}(F_5(x_5), F_6(x_6))$ and the marginal density $f_5(x_5)$:
$$f_{5|6}(x_5|x_6) = c_{5,6}\big(F_5(x_5), F_6(x_6); \theta_{5,6}\big) \cdot f_5(x_5).$$
On the other hand, the third factor on the right-hand side of (7) can be decomposed using Eq. (6) as
$$f_{4|5,6}(x_4|x_5, x_6) = c_{4,5;6}\big(F_{4|6}(x_4|x_6), F_{5|6}(x_5|x_6); \theta_{4,5;6}\big) \cdot f_{4|6}(x_4|x_6).$$
Therefore, one of the possible decompositions of the joint density $f(x_1, \ldots, x_6)$ is given by the following expression, which includes the product of marginal densities and copulas, which are all bivariate:
$$f(x_1, \ldots, x_6) = \prod_{j=1}^{6} f_j(x_j) \cdot c_{1,2} \cdot c_{1,3} \cdot c_{3,4} \cdot c_{1,5} \cdot c_{5,6} \cdot c_{2,3;1} \cdot c_{1,4;3} \cdot c_{3,5;1} \cdot c_{1,6;5} \cdot c_{2,4;1,3} \cdot c_{4,5;1,3} \cdot c_{3,6;1,5} \cdot c_{2,5;1,3,4} \cdot c_{4,6;1,3,5} \cdot c_{2,6;1,3,4,5}. \quad (8)$$
Eq. (8) is called a pair copula construction. Note that in the previous equation the notation has been simplified, setting $c_{i,j;D} = c_{i,j;D}\big(F_{i|D}(x_i|x_D), F_{j|D}(x_j|x_D); \theta_{i,j;D}\big)$, where $D$ denotes the conditioning set of each pair.

Two particular types of vines are the Gaussian vine and the independence vine. The first one is constructed using solely Gaussian bivariate pair-copulas as building blocks, so that each conditional bivariate copula density $c_{X_j,\nu; v_{-}}(\cdot,\cdot)$ in Eq. (6) is a Gaussian copula. The second one is constructed using only independence pair-copulas, which are simply given by the product of the marginal distributions of the random variables. In this latter case each conditional bivariate copula density $c_{X_j,\nu; v_{-}}(\cdot,\cdot)$ in Eq. (6) is an independence copula, implying absence of dependence between the variables.

Pair copula constructions can be represented through a graphical model called regular vine (R-vine). An R-vine $V(d)$ on $d$ variables is a nested set of trees (connected acyclic graphs) $T_1, \ldots, T_{d-1}$, where the variables are represented by nodes linked by edges, each associated with a certain bivariate copula in the corresponding pair copula construction. The edges of tree $T_k$ are the nodes of tree $T_{k+1}$, $k = 1, \ldots, d-2$. Two edges can share a node in tree $T_k$ without the associated nodes in tree $T_{k+1}$ being connected. In an R-vine, two edges in $T_k$, which become two nodes in tree $T_{k+1}$, can be joined by an edge in $T_{k+1}$ only if they shared a common node in tree $T_k$, but they are not necessarily connected. Figure 2 shows the 6-dimensional R-vine represented in Eq. (8). Each edge corresponds to a pair copula density (possibly belonging to a different family) and the edge label corresponds to the subscript of the pair copula density, e.g. edge 2,4;1,3 corresponds to the copula $c_{2,4;1,3}$.

In order to estimate the vine, its structure as well as the copula parameters have to be specified. A sequential approach is generally adopted to select a suitable R-vine decomposition, specifying the first tree and then proceeding similarly for the following trees. For selecting the structure of each tree, we followed the approach suggested by Aas et al. (2009) and developed by Dissmann et al. (2013), using the maximal spanning tree algorithm. This algorithm defines a tree on all nodes (named spanning tree), which maximizes the sum of absolute pairwise dependencies, measured, for example, by Kendall's $\tau$. This specification allows us to capture the strongest dependencies in the first tree and to obtain a more parsimonious model.
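A hedged sketch of this sequential selection using the VineCopula package in R is given below, assuming `u_data` is the T x 10 matrix of u-data obtained in Section 3.1 (the object name is ours). The same routine also carries out the pair-copula family selection by AIC described in the next paragraph.

```r
library(VineCopula)

# Sequential (Dissmann-type) R-vine selection: each tree is chosen as the
# maximum spanning tree on absolute Kendall's tau, and a pair-copula family is
# then selected for every edge; familyset = NA allows all implemented families
# (elliptical, Archimedean, BB mixtures and their rotations).
rvine_fit <- RVineStructureSelect(u_data,
                                  familyset     = NA,
                                  selectioncrit = "AIC",
                                  treecrit      = "tau",
                                  method        = "mle")
```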
Given the selected tree structure, a copula family for each pair of variables is identified using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). This choice is typically made amongst a large set of families, comprising elliptical copulas (Gaussian and Student's t) as well as Archimedean copulas (Clayton, Gumbel, Frank and Joe), their mixtures (BB1, BB6, BB7 and BB8) and their rotated versions, to cover a large range of possible dependence structures. For an overview of the different copula families, see Joe (1997) or Nelsen (2007). The copula parameters $\theta$ for each pair-copula in the vine are estimated using the maximum likelihood estimation (MLE) method, as illustrated by Aas et al. (2009). The R-vine estimation procedure is repeated for all the trees, until the R-vine is completely specified.

In order to evaluate the suitability of the proposed vine copula model in relation to other methods, we produced one-day-ahead out-of-sample predictions and we compared them to the original data. Let $X = \{X_t; t = 1, \ldots, T\}$ be the 10-dimensional time series of Covid-19 and social media data. Our aim is to forecast $X_{T+1}$ based on the information available at time $T$. In order to do that, we adopted the forecasting method described by Simard and Rémillard (2015). Before fitting the vine, we extracted the residuals from the marginals, as explained in Section 3.1, and obtained the u-data. Next, after fitting the vine, we simulated $M$ realizations from the vine copula. Hence, we calculated the predicted values for each simulation, using the inverse cdf and the relevant fitted marginal models. More precisely, we applied the inverse transformation to the $M$ realizations from the vine copula to obtain the residuals, which we then plugged into the marginal models to get the predicted values of the official variables. Then, we calculated the average prediction over all simulations, $\hat{X}^{Avg}_{T+1}$, and used it to forecast $X_{T+1}$. The prediction interval of level $(1 - \alpha) \in (0, 1)$ for $X_{T+1}$ was calculated by taking the estimated quantiles of order $\alpha/2$ and $1 - \alpha/2$ amongst the simulated data. We denote by $\hat{X}^{l}_{T+1}$ and $\hat{X}^{u}_{T+1}$ the lower and upper values of the prediction intervals.

In order to compare and contrast the accuracy of predictions for different models, we made use of two indicators: the mean squared error (MSE) to evaluate point forecasts and the mean interval score (MIS), proposed by Gneiting and Raftery (2007), to assess the accuracy of the prediction intervals. The MSE for each variable $j = 1, \ldots, d$ was calculated as
$$MSE_j = \frac{1}{S} \sum_{t=T+1}^{T+S} (x_{t,j} - \hat{x}_{t,j})^2,$$
where $x_{t,j}$ is the observed value for each variable at each time point $t$, $\hat{x}_{t,j}$ is the corresponding predicted value, $T + 1$ denotes the first predicted date, while $T + S$ indicates the last predicted date. The 95% MIS for each variable, at level $\alpha = 0.05$, was computed as
$$MIS_j = \frac{1}{S} \sum_{t=T+1}^{T+S} \left[ (\hat{x}^{u}_{t,j} - \hat{x}^{l}_{t,j}) + \frac{2}{\alpha}(\hat{x}^{l}_{t,j} - x_{t,j}) \, 1(x_{t,j} < \hat{x}^{l}_{t,j}) + \frac{2}{\alpha}(x_{t,j} - \hat{x}^{u}_{t,j}) \, 1(x_{t,j} > \hat{x}^{u}_{t,j}) \right],$$
where $\hat{x}^{l}_{t,j}$ and $\hat{x}^{u}_{t,j}$ denote, respectively, the lower and upper limits of the prediction intervals for each variable at each time point, and $1(\cdot)$ is the indicator function.

We now present the results of the analysis of the official and online-retrieved Covid-19 data. First, we analysed the information gathered on Twitter, cleaning and stemming the tweets and producing graphical representations of the data using wordclouds. Figure 3 displays the wordcloud obtained from the collected tweets discussing Covid-19 in the UK. The most frequent words are related to "people" and the effects of the pandemic on them.
We can also notice the names of the most prominent politicians and words related to political decisions. Figure 4 shows the sentiment wordcloud created from the collected tweets, obtained with the Bing method. This data visualization highlights the positive words in blue and the negative words in pink. The most popular positive words are related to the "support" received throughout the pandemic, while the most popular negative words are related to the worst consequences of Covid-19 on the health of individuals.

Table 1 lists the parameter estimates, obtained via the MLE method, of the best fitting models for the marginals, as described in Section 3.1. Standard errors are in brackets. As an example, Figure 5 shows the fit of the residuals for the Tweets marginal. The top panel displays the QQ-plot comparing the Gaussian theoretical quantiles with the sample quantiles, the middle panel illustrates the observations (black line) and the in-sample predictions obtained from the fitted SEP4 model (red line), while the bottom panel shows the histogram of the resulting u-data. The plots clearly show an excellent fit of the SEP4 model to the marginal, as demonstrated by the points in the QQ-plot aligning well with the main diagonal, the in-sample predictions overlapping the observed data and the shape of the u-data histogram being close to a uniform pattern.

Once the marginals were estimated, we derived the corresponding u-data from the residuals, as illustrated in Section 3.1. Then, we carried out fitting and model selection for the vine copula using the R package VineCopula (Nagler et al., 2021). Figure 6 displays the first tree of the vine copula for the Covid-19 data. The nodes are denoted with blue squares, with the numbers corresponding to the margins reported on them. On each edge, the plot shows the name of the selected pair copula family and the estimated copula parameter expressed as Kendall's $\tau$. In order to estimate the vines, we adopted the Kendall's $\tau$ criterion for tree selection, the AIC for copula family selection and the MLE method for estimating the pair copula parameters. As is clear from Figure 6, the total number of tweets plays a central role in the vine, linking the official Covid-19 variables to the social media variables. The total number of tweets and the Google searches are contiguously related. Likewise, the Bing and Afinn sentiment scores are directly associated. Tweets is also directly connected to the number of deaths, the total number of tests and the number of hospital cases. The symmetric Gaussian copula, which is often employed in traditional multivariate modelling, was identified only once as the best fitting copula, linking the number of patients on ventilation and the number of new admissions. On the contrary, the selected copula families include the Student's t copula, which is able to model strong tail dependence, the Tawn copula and Archimedean copulas such as the Clayton, Gumbel and Joe, which are able to capture asymmetric dependence, and mixture copulas such as the BB8 (Joe-Frank), which can accommodate various dependence shapes. Most of the associations between the variables are positive. The strongest associations are between the official Covid-19 variables ICU Beds and Admissions, between ICU Beds and Deaths, and between the Bing and Afinn sentiment scores. Also, Deaths and Tweets are mildly associated.
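For reference, the estimated vine can be inspected directly from the fitted VineCopula object; this is a minimal sketch assuming `rvine_fit` is the object returned by the structure selection step sketched earlier.

```r
library(VineCopula)

# Selected pair-copula families, parameter estimates and Kendall's tau for
# every edge of the fitted R-vine.
summary(rvine_fit)

# Draw the first tree of the vine, analogous in spirit to Figure 6.
plot(rvine_fit, tree = 1)
```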
In this section, we constructed out-of-sample predictions using the proposed vine methodology, which integrates official and social media Covid-19 variables. We then compared the predictions obtained with our methodology with those yielded by two traditional approaches. The former is based on vines built exclusively from Gaussian pair copulas, which are the most common in applications, but are restricted to symmetric dependence and the absence of tail dependence. The latter approach assumes independence among the ten time series under consideration and therefore calculates predictions ignoring any association between official and online information. Out-of-sample predictions based on the proposed model were constructed as illustrated in Section 3.3, considering the vine copula estimated as explained in Section 4.3 until the 1st April 2021 and using it to predict the period between the 2nd April 2021 and the 9th May 2021.

Tables 2 and 3 show a similar model performance across the ten variables. According to both the MSE and MIS indicators, the vine copula approach outperforms the other two approaches for predicting the variables Deaths, Hospital and VirusTests. The Gaussian vine approach also performs well with several variables, while the independence vine approach outperforms the other two approaches only for the variables Admissions and Afinn. The official Covid-19 variables Admissions, Cases, Deaths, Hospital, ICU Beds and VirusTests are generally better predicted by the vine method than by the Gaussian and independence methods. This last approach assumes no dependence between any of the variables involved in the model. Hence, this approach implies the absence of any association between the official and the social media variables, and therefore the lack of contribution of online-generated information in predicting the official Covid-19 variables. On the contrary, the vine approach assumes the presence of a dependence structure between the variables and, in particular, between the official and social media insights. Therefore, the better performance of the vine compared to the independence model demonstrates the usefulness of social media information in forecasting official Covid-19 variables. The prediction of online-generated information (Afinn, Bing, Google and Tweets) also benefits from data integration. Indeed, most of the social media variables are more accurately forecasted by a vine model, particularly the Gaussian one. This indicates that the Gaussian approach, characterized by a symmetric dependence structure, is flexible enough to model the social media variables.

Figure 7 shows the forecasts and prediction intervals for the official Covid-19 variables Admissions, Cases (first row), Deaths, Hospital (second row), ICU Beds and VirusTests (third row), obtained with the vine copula methodology for the period between the 2nd April 2021 and the 9th May 2021. The black lines denote the observed values, the inner red lines denote the predicted values and the outer dotted red lines denote the 95% prediction intervals. We notice that the intervals predicted by the vine copula method capture most of the dynamics of the official Covid-19 variables, indicating that the proposed methodology is able to leverage social media information for forecasting official Covid-19-related data.
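As a sketch of the comparison exercise described in this section, the two benchmark vines can be obtained by restricting the candidate pair-copula families, and the simulation-based forecasts and accuracy scores can be computed along the following lines. Here `u_data`, `rvine_fit` and `invert_marginal()` are hypothetical objects (the last standing for the inverse-cdf step through the fitted marginals), so this is an illustration of the workflow under those assumptions rather than the exact implementation used for the results above.

```r
library(VineCopula)

# Benchmark 1: Gaussian vine - the same sequential selection, but every pair
# copula restricted to the Gaussian family (family code 1 in VineCopula).
gauss_vine <- RVineStructureSelect(u_data, familyset = 1, selectioncrit = "AIC")

# Benchmark 2: independence vine - keep the selected structure but set every
# pair copula to the independence copula (family code 0, no parameters).
d <- ncol(u_data)
indep_vine <- RVineMatrix(Matrix = rvine_fit$Matrix,
                          family = matrix(0, d, d),
                          par    = matrix(0, d, d),
                          par2   = matrix(0, d, d))

# One-day-ahead forecast from a fitted vine: simulate M realizations on the
# u-scale, map them back through the inverse cdf of the fitted marginals
# (invert_marginal() is a hypothetical helper), then average and take
# quantiles for the 95% prediction interval.
forecast_one_step <- function(vine, M = 10000) {
  u_sim <- RVineSim(M, vine)                        # M x d matrix of u-data
  x_sim <- apply(u_sim, 1, invert_marginal)         # d x M matrix on data scale
  list(point = rowMeans(x_sim),
       lower = apply(x_sim, 1, quantile, probs = 0.025),
       upper = apply(x_sim, 1, quantile, probs = 0.975))
}

# Accuracy measures used for the comparison: mean squared error and the mean
# interval score of Gneiting and Raftery (2007).
mse <- function(x, xhat) mean((x - xhat)^2)
mis <- function(x, lo, hi, alpha = 0.05) {
  mean((hi - lo) +
       (2 / alpha) * (lo - x) * (x < lo) +
       (2 / alpha) * (x - hi) * (x > hi))
}
```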
In this paper, we proposed a new methodology aimed at obtaining more accurate forecasts than traditional approaches for variables measuring the Covid-19 dynamics. The proposed methodology is based on the integration of Covid-19 variables collected from official UK sources with online-generated social media insights, relevant to the same geographical area. Together with official Covid-19 information related to infection counts and the pressure on the national health service, we also gathered Google Trends searches and Twitter microblogging messages involving keywords related to the Covid-19 pandemic. From the tweets, we considered the volume as well as the sentiment scores, to investigate the feelings of people towards the pandemic. Our methodology is based on vine copulas, which are able to model the dependence structure between the marginals, and thus to take advantage of the association between official Covid-19 and social media variables. We tested our approach by calculating out-of-sample predictions and comparing the vine copula method with two traditional approaches: the first based on a vine constructed with all Gaussian copulas, and the second based on independence between variables. The results show that the vine copula method outperforms the other two approaches for predicting the number of deaths, hospital cases and tests, demonstrating that our methodology is able to leverage social media information to obtain more accurate predictions of Covid-19 effects. In some cases, the Gaussian vine copula method is selected, showing that the vine data integration approach still achieves the best performance, although some variables are less affected by asymmetries and tail dependence. The proposed methodology will support policy makers in understanding, monitoring and combating the pandemic, assisting key medical and governmental actors in making informed decisions and in efficiently and effectively planning and allocating necessary resources. Further investigations including additional social media information will be the object of future work. Also, the proposed approach could be extended using a 7-day rolling average in the model to adjust for the delay due to UK government figures being under-reported at the weekend. Another extension will involve Bayesian inference, which would allow us to incorporate other information, such as experts' opinions, in the model. In addition, the use of more sophisticated machine learning approaches could be envisaged for deriving the sentiment variables to improve the proposed methodology.
Pair-copula constructions of multiple dependence
A framework for pandemic prediction using big data analytics
Social media integration of flood data: A vine copula-based approach
Analyzing dependent data with vine copulas
Official statistics data integration using copulas
Copula and vine modeling for finance
Copulas and vines
Data integration
Social media big data integration: A new approach based on calibration
Official statistics data integration for enhanced information quality
Building a covid-19 vulnerability index
Selecting and estimating regular vine copulae and application to financial returns
Generalized linear models with examples in R
On bayesian modeling of fat tails and skewness
Strictly proper scoring rules, prediction, and estimation
Mining and summarizing customer reviews
Forecasting: principles and practice
Predictive mathematical models of the covid-19 pandemic: underlying principles and value of projections
Multivariate models and multivariate dependence concepts
The estimation method of inference functions for margins for multivariate models
Sinh-arcsinh distributions
rtweet: Collecting and analyzing twitter data
Retrospective analysis of the possibility of predicting the covid-19 outbreak from internet searches and social media data, china
Covid-19 patients' clinical characteristics, discharge rate, and fatality rate of meta-analysis
Characteristics and outcomes of a sample of patients with covid-19 identified through social media in wuhan, china: observational study
Tracing the pace of covid-19 research: Topic modeling and evolution
Time-varying co-movement analysis between covid-19 shocks and the energy markets using the markov switching dynamic copula approach
gtrendsR: Perform and Display Google Trends Queries
VineCopula: Statistical Inference of Vine Copulas
An introduction to copulas
A google-wikipedia-twitter model as a leading indicator of the numbers of coronavirus deaths. Intelligent Systems in Accounting
Exploring urban spatial features of covid-19 transmission in wuhan based on social media data
Using network analysis and machine learning to identify virus spread trends in covid-19
Prediction of number of cases of 2019 novel coronavirus (covid-19) using social media search index
R: A Language and Environment for Statistical Computing
Analysis and prediction of covid-19 using sir, seiqr and machine learning models: Australia, italy and uk cases
Generalized additive models for location, scale and shape
Robust fitting of an additive model for variance heterogeneity
Distributions for modeling location, scale, and shape: Using GAMLSS in R
The covid-19 pandemic and speculation in energy, precious metals, and agricultural futures
tidytext: Text mining and analysis using tidy data principles in R
Forecasting time series with multivariate copulas
Fonctions de répartition à n dimensions et leurs marges
Flexible regression and smoothing: using GAMLSS in R
Information needs mining of covid-19 in chinese online health communities
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Limited early warnings and public attention to coronavirus disease 2019 in china

This work was supported by the European Regional Development Fund project Environmental Futures & Big Data Impact Lab, funded by the European Structural and Investment Funds, grant number 16R16P01302.