key: cord-0274423-kcze4dpd
authors: de Hoffer, A.; Vatani, S.; Cot, C.; Cacciapaglia, G.; Conventi, F.; Giannini, A.; Hohenegger, S.; Sannino, F.
title: Variant-driven multi-wave pattern of COVID-19 via Machine Learning clustering of spike protein mutations
date: 2021-07-24
journal: nan
DOI: 10.1101/2021.07.22.21260952
sha: a754b0400744ff8af13ce99cd241366dd91f307e
doc_id: 274423
cord_uid: kcze4dpd

Never before such a vast amount of data has been collected for any viral pandemic than for the current case of COVID-19. This offers the possibility to answer a number of highly relevant questions, regarding the evolution of the virus and the role mutations play in its spread among the population. We focus on spike proteins, as they bear the main responsibility for the effectiveness of the virus diffusion by controlling the interactions with the host cells. Using the available temporal structure of the sequencing data for the SARS-CoV-2 spike protein in the UK, we demonstrate that every wave of the pandemic is dominated by a different variant. Consequently, the time evolution of each variant follows a temporal structure encoded in the epidemiological Renormalisation Group approach to compartmental models. Machine learning is the tool of choice to determine the variants at play, independent of (but complementary to) the virological classification. Our Machine Learning algorithm on spike protein sequencing provides a simple and unbiased way to identify, classify and track relevant virus variants without any prior knowledge of their characteristics. Hence, we propose a new tool that can help preventing and forecasting the emergence of new waves, and that can be used by decision makers to define short and long term strategies to curb the current COVID-19 pandemic or future ones.

in each cluster is counted, after pruning, to obtain the frequency for the occurrence of each cluster over time. The results are 135 shown in Fig. 2 for two choices of the cut in the Ward distance r W : r W = 50 in the left plots and r W = 100 in the right plots. 136 We consider as relevant clusters only the ones that contain at least 1% of the sequences within the full dataset. We identify, show in Fig. 3 the frequency and number of infections assigned to each clade, where we note that GRY corresponds broadly to 157 the Alpha VoC (or variant v2 in our results). These plots confirm that each wave is dominated by a single group of mutations.

By comparing the frequencies, we also observe that our variant v1 matches the clade GV, while v0 groups all the other ones 159 that mainly occurred during the first wave. We also note that our method allows to clearly identify the Delta VoC (v3), while 160 in the Nextstrain clades it is associated with the clade G. This validation demonstrates that the clustering based on the spike 161 protein sequences alone is able to identify relevant variants for the SARS-CoV-2.

Variants as time-ordered cluster chains 163 Having validated our method to define variants in terms of ML clusters, we now turn our attention to study the time evolution . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. VoC) which spins off from v0 closely to v1 and takes over by generating a third wave. We also see the emergence of short lived 189 variants that do not have the power to start a new wave and therefore die off without infecting a sizeable number of individuals. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. 

where I c is the cumulative number of infected, γ is the infection rate (in inverse days) and A is the total affected individuals 196 after the wave (per 100.000 inhabitants). The parameter B controls the timing of the wave, and is of no concern in this study. 197 We recall that the parameter γ encodes the effective diffusion speed of the variant, including not only its intrinsic viral power . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 24, 2021. In the left column we show the total number of sequencing available on GISAID (in colour the ones associated to the relevant VoC or VoI); the middle column shows the number of new daily infected (per 100.000 inhabitants); the right column shows the percentage of each VoC or VoI in the sequencing data. All plots show daily rates, with data smoothened over a period of 7 days. In the middle plots, the data are shown by dots, where blue corresponds to the total and the colours show the number of infected associated to each variant. The solid lines show the result of the fits to the MeRG model (note that only for the UK we fit the "standard variant" -in greenwith two logistic functions). In the left plots, the error derives from the expected statistical variation on the number of daily sequences (after smoothening). For all the plots, the classification in variants derived from the GISAID data.

Thus, we used the logistic function above to fit the epidemiological data, after distributing the new daily infected to each . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 24, 2021. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 24, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 24, 2021. 369 The current paper deals with the time evolution and spread of different variants of SARS-CoV-2 in a given population. A 370 theoretical study of the underlying processes in the framework of the epidemic Renormalisation Group (eRG) approach 19 has 371 recently been presented in a companion paper 22 . In this section, we shall briefly review the relevant formalism.

The eRG approach is inspired by the running of fundamental couplings as a function of the energy scale in particle physics.

Originally proposed 19 as an effective description of epidemic diffusion processes organised around time-scale invariances, 374 it has been extended to account for geographic mobility across different countries 20 , the multi-wave structure of the SARS-

CoV-2 pandemic 38 as well as the impact of the US vaccination campaign 34 . The predictive power of this approach has been 376 demonstrated by accurately describing the impact of non-pharmaceutical interventions 38, 42 and predicting the starting date of 377 the second wave in the fall of 2020 in Europe 21 . An interpretation of the eRG approach as a time-dependent SIR model has also 378 been discussed 35 , while the relation to other epidemiological approaches has been discussed at great depth in this review 36 .

In the companion paper 22 , the eRG framework has been further extended to describe the time evolution of two different variants of a disease. An epidemic coupling strength α i (I c,i ) is introduced for each variant (here i = 1, . . . , n labels the n different variants). The latter is a function of the cumulative number of individuals I c,i that have been infected by this ith variant. The time evolution of the different variants is then described by a set of β -functions

which constitute the core of the Mutation eRG (MeRG) approach. Inspired by the numerical study of a compartmental model and empirically validated by comparing with data from California, the United Kingdom and South Africa, the β -functions were written in the form of gradient equations 22 ( in detail for the case n = 2)

Here, γ k is a measure for the infection rate and A k is the asymptotic number of individuals infected by variant k. Solutions of the β -functions (E2) give cumulative numbers of infected individuals as functions of time for each variant that follow a logistics function

where κ i is a parameter that governs the time of appearance of the variant. For given i, the function β i in (E2) has two zeroes, sequences have 1271 amino-acids. Hence, to remove data with missing pieces, we only keep sequences with at least 1250 amino-acids. Furthermore, we dismiss all sequences containing at least one X in the sequence. This pruning allows us to work 399 only on a high purity dataset. The number of sequences before and after the pruning for England, Wales and Scotland are listed 400 in Table T1 . After the pruning, a significant number of sequences are in the dataset, many of which contain the same sequences 401 for the spike protein.

To accelerate the next step in the analysis, it is convenient to remove repetitions and thus only work on a 402 dataset containing only different sequences (see right column in Table T1 ). In particular, this is necessary when we do a global 403 analysis of the whole dataset, while the time-sampling automatically reduces the number of sequences in the dataset. Table T1 . Number of sequences in the datasets for England, Wales and Scotland before and after the pruning and selection.

The extraction and pruning of the data is done by the python program extraction_country.py, where the name of branches that are chosen to be regrouped (let us call them A and B) are the ones that have the smallest distance between them. 

This method is insensitive to the multiplicity of identical leaves, thus one can use the file country.bin as input.

• Complete Linkage Clustering, where dis(A, B) = max x∈A , y∈B d(x, y) .

As before, this is independent on the leaf multiplicity.

• Unweighted Average Linkage Clustering, where

where |X| is the number of leaves in the branch X. This method is sensitive to the multiplicity of identical leaves, thus is 447 requires country_complete.bin as input.

• Ward's Method, an agglomerative clustering method based on the measure of the average squared distance of points in the cluster to its centre of gravity, or centroid. Hence, the effective distance between two branches is defined by the increase in the above measure in the merged cluster with respect to the two separate ones. In practice:

This method is sensitive to the multiplicity of identical leaves, thus it requires country_complete.bin as input.

Note that we also used this method on the distance matrix country.bin to speed up the computation.

At each step, the algorithm joins branches together, until all leaves are grouped together. This approach can be schematically 451 represented by a dendrogram tree, as shown in Fig. F1 . This step is executed by the program linkage.py. In order to define 452 clusters and variants, we need to define a threshold in the effective distance, which will therefore be equivalent to a horizontal 453 cut of the tree branches. In practice, all clusters whose effective distance is larger than the threshold are used to define the 454 variants. The value of the threshold needs to be determined each time, as it crucially depends on the measure that is employed, 455 and on the samples. Once the threshold is fixed, the clusters and their time evolution can be obtained. This is done by the 456 executable called time.py.

Data from GISAID repository has been grouped on a monthly basis for the England, Wales and Scotland datasets, as shown 459 in Table T2 . Despite the rather large number of monthly recorded data, we found sequences with high replica rate, thus the 460 percentage of different sequences over the whole dataset is also reported in the Table T2 and shown by the orange line in 461 Fig. F2 . It has been noted that the most frequent sequence always represents a remarkable amount of the whole dataset, as 462 shown by the blue line in Fig. F2 for the England dataset. Furthermore, the Levenshtein distance between the most frequent Figure F1 . Dendrogram three for the England dataset, for month 9 (September 2020), built using the Ward's method. On the y-axis we show the effective distance between two clusters at the point when they merge. The branches in colour represent more than 1% of the total number of sequences for a cut at Ward distance 100. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 24, 2021. Table T2 . Number of GISAID recorded sequences from January 2020 (Month 1) to June 2021 (Month 18). Figure F3 . Dataset coverage (left) and number of clusters (right) with respect to threshold (Ward distance) value. 481 To define and follow the time evolution of a given candidate variant, we build time-series of clusters starting from month 1 482 using the following algorithm:

• For each cluster in month i we selected the most frequent sequence and we tried to find a cluster with the same sequence 484 in month i + 1 (strong link). We iterate to build a path until the procedure fails or the last month is reached.

• If the strong link association fails but we still have clusters in consecutive months that are free from strong links, we 486 connected them provided that the distance is less than a given threshold. If two clusters converge to the same node we 487 keep the nearest one (soft link).

The threshold cut used for the soft link definition has been optimised looking at the distribution of the distances between 489 clusters connected with strong links, as shown in Fig. F5 . To preserve the topology based on the LM we choose the soft link 490 threshold as the maximum distance found for a strong link. 491 We define a chain as a list of consecutive clusters connected by strong or weak links as best candidate to study the 492 evolutionary paths of a candidate variant. For each well defined chain we also assign a branching link taking the first cluster 493 of the chain and associate it with the cluster in the previous month that is the closest in terms of the Ward distance. In the 494 following we assign as chain identification number the one of the first cluster in the chain. while the chain #37 (v3) matches the Delta VoC. 499 We also found two other relevant chains, namely chain #11 (v1a) and chain #13 (v1b). Using the branching link it is 500 possible to track also the evolutionary path of each single variant candidate. The Alpha VoC is clearly connected to the chain 501 #1 via cluster #15 (indeed we found about 70 sequences of Alpha VoC in cluster #15). Similarly, the chains #11 and #13 are 502 connected to the original strain and it is likely that they appeared before of the Alpha VoC. Similar results have been found 503 using Scotland (and Wales) data in Fig. F6 and daily cases are shown in Fig. F7 and F8 for Scotland and Wales data.

A comparison of the evolutionary path of the England chains suggests a competitive mechanism between variants. To 507 further investigate this we study the spread of a chain using the distance between consecutive clusters, as shown in Fig. F9 . 508 The original variant and the Alpha VoC show more stable and lower distance values with respect to chains #11 and #13, thus 509 suggesting that such variable can be used as an early discriminant between different evolutionary paths of variants. Namely, 510 variants that show smaller distances between consecutive clusters appear to be more stable and able to ignite epidemiological 511 episodes of exponential increase (waves). A comparison of the evolutionary path for the original variant and the Alpha VoC for 512 England, Wales and Scotland is also shown in Fig. F10 . 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. ; https://doi.org/10.1101/2021.07.22.21260952 doi: medRxiv preprint Figure F6 . Chain results as candidate variants for the Scotland dataset (left) and the Wales one (right). Strong links (black line) and soft link (red line) are reported. Branching links (gray line) where defined are also shown. Figure F7 . Frequency and daily cases for chains as candidate variants (Scotland) Figure F8 . Frequency and daily cases for chains as candidate variants (Wales)

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. ; https://doi.org/10.1101/2021.07.22.21260952 doi: medRxiv preprint Figure F10 . Time evolution comparisono for chains associated with the original strain (left) and Alpha VoC B.1.1.7 (right)

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 24, 2021. ; https://doi.org/10.1101/2021.07.22.21260952 doi: medRxiv preprint

46685 (5%) 2196 (8%) 3508 (6%)