key: cord-0915450-rs83lgxy authors: Scherer, Christiane; Grover, James; Kammeraad, Darby; Rudy, Gabe; Scherer, Andreas title: Analysis of 46,046 SARS-CoV-2 whole-genomes leveraging principal component analysis (PCA) date: 2020-12-22 journal: bioRxiv DOI: 10.1101/2020.12.20.423682 sha: 5054293a79a4236671de5043582d92be69620dd9 doc_id: 915450 cord_uid: rs83lgxy Since the beginning of the global SARS-CoV-2 pandemic, there have been a number of efforts to understand the mutations and clusters of genetic lines of the SARS-CoV-2 virus. Until now, phylogenetic analysis methods have been used for this purpose. Here we show that Principal Component Analysis (PCA), which is widely used in population genetics, can not only help us to understand existing findings about the mutation processes of the virus, but can also provide even deeper insights into these processes while being less sensitive to sequencing gaps. Here we describe a comprehensive analysis of a 46,046 SARS-CoV-2 genome sequence dataset downloaded from the GISAID database in June of this year. Summary PCA provides deep insights into the analysis of large data sets of SARS-CoV-2 genomes, revealing virus lineages that have thus far been unnoticed. panel, it is also important to uncover geographical differences in mutation patterns (17) . However, in order to gain these insights, it is necessary to use suitable bioinformatic methods to analyze these gigantic data sets. So far, phylogenetic methods have been used the most often to analyze mutation patterns (18) (19) (20) . While these methods have advantages, they also have significant drawbacks-when we start analyzing thousands of SARS-CoV-2 genomes, the number of 5 mutations grows to thousands, which exacerbates this problem. In our study, we present another method for analyzing large data sets of gene sequences-namely, principal component analysis (PCA). The original applications of PCA to questions in the field of population genetics helped with deriving the underlying population structure and genetic ancestry of individuals from shared genetic mutations (21, 22) . 10 PCA is now a well-established building block in the search for SNPs in GWAS analyses in humans, animals and microorganisms (23). The validity of this method depends to a large extent on the numerical recoding of the SNPs, the PCA method used and the graphical representation of the resulting eigenvalues (24) . We conducted our investigations with the complete GISAID collection of 46,046 SARS CoV-2 15 sequences from human sample material downloaded on June 15th, 2020. Sequence alignment and variant calling was done with previously described methods (25, 26) using the ASM985889v3 sequence with GenBank identifier NC_045512.2 and RNA identifier MN908947.3 based on the original Wuhan, China sequence published in Nature (27) as reference sequence. The process of data acquisition is described in detail in the supplementary material. 20 In the 46,046 SARS-CoV-22 genomes 8,258 nucleotide positions were found in which variants occurred compared to the reference. After data trimming (see supplementary material), 62 nucleotide positions were excluded resulting in a dataset for PCA consisting of a matrix of 46,046 samples and 8,129 nucleotide positions. The sample metadata was organized into a matrix structure as well. 25 The differentiation ability of PCA is decisively dependent on the coding of the mutation events Two PCA methods were first applied to a high-quality portion of the dataset described above containing 20,750-genomes whose sequences had no ambiguous counts. The divergence score of 30 these samples was in a range of 0 to 0.0008. The matrix for performing PCA thus comprised 20,750 samples and 8,192 variant-bearing positions. This data was recoded to numeric values in the following ways in preparation for performing PCA: PCA1: A classical recoding of the variant matrix was carried out for this analysis. Each deviation from the reference was marked as 1, each match as 0. 35 PCA2: This recoding assigns a unique value to each variant, based on the specific change to the reference sequence: no change to the reference (A remains A, C remains C, etc.) was encoded as 0, and a modified nucleotide position was unambiguously encoded according to the defined mutation event as shown in Table S1 . To our knowledge, this type of coding in preparation for PCA has not been used before. 40 Figure 1 compares the 2D plots of the major eigenvalues obtained from PCA1 (Fig. 1A) and PCA2 (Fig. 1B) . The 2D plot of the two major eigenvalues according to PCA1 shows a clumsy four-arm central cluster with two short central arms and two outer arms, each of which diffusely rejuvenates in parallel with the x axis and the y axis. On the other hand, the 2D plot of the two major eigenvalues from PCA2, which in its numerical encoding of the variant matrix separates a total of 13 events very neatly, separates five clusters which we describe intuitively as center (ce), straight (st), middle right (mr), extreme right (er) and extreme left (el). Figure S1 shows a time sequence of the 2D plot described above which is colored according to the collection month of the sample. As Figure S1 illustrates, the cluster appearing first is the cluster ce, starting in December, followed by cluster st in January. By the end of February, all clusters 5 have been pre-formed, and from March on, all other data points have accumulated around these pre-formed clusters. The increase in the number of genome sequence data points can be found on all five clusters, suggesting that all five basic variants that make up these clusters were active at least until the beginning of June. The advantage of the PCA method is that the complete variant set is analyzed rather than every 10 single variant by itself. In order to be able to assess the effects of sequence quality, this study did not take the approach of excluding sequences on the basis of certain criteria, but instead included quality markers as properties of each sample in the analysis. So, a third PCA was performed: PCA 3: For PCA 3, recoding was performed in the same way as for PCA2, but the complete (human-sample) data set consisting of 46,046 samples and 8,196 positions was used. 15 Figure S2 shows the 3D plots of eigenvalues 1,2 and 10 after PCA 2 and 3. Essentially, PCA using the specific numerical recoding of the variant matrix of 46,046 non quality filtered SARS-CoV-2 genomes finds the same cluster pattern as does PCA using the specific numerical recoding of the 20,750 genomes with high sequence quality ( Fig. 1C and D) . Figure S3 shows a 3D plot of all SARS-CoV-2 sequences from the original data set that show at 20 least one ambiguous nucleotide position. The data points are colored according to the number of their ambiguous counts. You can see that the same five main clusters appear both in this dataset and in the data set without ambiguous counts. Some of the data points with ambiguous counts scatter more distantly around the main clusters. Having quality criteria as an attribute in the metadata, it is therefore now possible to apply validity criteria for assignment to a cluster: As a 25 consequence of the recoding process, a sequence may be shifted more towards reference clusters than it should be, but this can be taken into account if you want to analyze specific phenomena of individual sequences. A variant pattern which lies outside common clusters is, of course, as doubtful as the sequencing is. 30 In order to facilitate communication about SARS-CoV-2 mutation patterns, various nomenclatures have been established and are currently being used in parallel (28) (29) (30) . Alm et al have summarized, in a recently published communication (17) , the relationships between the most common currently used nomenclatures of lineages and clades according to Rambault et al (21) , Nextstrain (29) and 35 GISAID (30) . In our investigations, we use these nomenclatures in order to reference known properties and origins of sequence patterns. Using the 2-D cluster plot resulting from PCA2 of the 20,750 SARS-CoV-2 sequences without ambiguous counts, the clusters were manually separated into individual data sets according to their 40 eigenvalue coordinates ( Fig. 1 C-D) . Cluster ce contained 7714 samples, Cluster st, 2076, cluster mr, 1466, cluster er, 4514 and cluster el, 4900 samples. Eighty samples escaped manual clustering and were marked as "no cluster" (nc). The entire data set and each separate cluster itself were examined to find the frequencies of the variants (see the supplementary materials for the details). All variants found in up to 10% of the sample sets were considered for later analysis. In the entire 45 data set of 20,750 genomes, four nucleotide positions were found which show a dominance of the variant against the reference-namely, positions C241T (97.52%), C3037T (99.97%), C14408T (99.99%) and A23403G (99.95%). These are discussed below. However, the analysis also showed that within the clusters st, mr, er and el, variants were found which were cluster-specific and were present in the respective cluster up to almost 100%. Table S2 shows an overview of those variants found in each cluster that reached a frequency of at 5 least 10% of the cluster's samples. The cluster-specific nucleotide variants are marked in bold print, and are for cluster st, T28144C and C8782T, for cluster el, G28883C, G28882A and G28881A, and for the two right clusters, G25563T. These clusters differ by C1059T only being present in cluster er, while cluster mr keeps the reference nucleotide C1059. Remarkably this cluster marking nucleotide positions matched some marker nucleotide positions of clades and lineages described 10 in the above-mentioned nomenclature references (28) (29) (30) . Our studies show that the variants found in cluster st (see Figure 2 ) largely match the variant patterns of Nextstrain clade 19B, GISAID clade S and lineage A of Rambault et al, cluster el matches Nextsrain clade 20B and GISAID clade GR, and cluster er matches Nextstrain clade 20C and is, together with cluster mr, included in GISAID clade GH. Cluster ce did not match any clusters in their clade system while the clusters 15 we defined through principal component analysis did not have any matches with Nextstrain clades 19A or 20A or GISAID clade G. Nextstrain clade 20A differs from 19A bearing variants C3037T, C14408T and A23403G (29) and GISAID clade G is defined by variants C241T, C3037T and A23403G. These variant nucleotide positions were found across clusters in the majority of SARS-CoV-2 genomes. Figure S4 shows the distribution of different patterns of nucleotide positions 241, 20 3037, 14408 and 23403 with pattern 1: C241 and any reference/variant combination of 3037, 14408 and 23403 (in short C-X-X-X), pattern 2: T-X-X-X, pattern 3: C-T-T-G and pattern 4: T-T-T-G. Pattern 4 is, with 44267 samples, by far the most distributed combination of these positions. Pattern 3 occurs 1168 times, pattern 2, 450 times, and pattern 1, 161 times. All combinations can be found in each cluster. We analyzed the relative frequencies of the patterns in China, UK, USA, 25 and Australia, as well as the remaining countries as combined into their WHO regions. As figure S5 show, pattern 1,2 and 4 mirror more or less exactly the weighting of the GISAID submission countries, whereas pattern 3 has a remarkable weighting in Australia, Americas and the Eastern Mediterranean. The 14408 mutation is associated with changes in the RdRp gene (9), while the other variant positions are associated with variants of the spike protein S-D614G (31), which could 30 be associated with higher infectivity and higher viral loads (3), (32) . Based on our data, it should be noted that the combination of all these variants is the by far most prevalent variant pattern (96% of the samples) for the virus independently of all other variations found. However, due to the ubiquitous distribution of all variations of these nucleotide combinations in all clusters, we excluded them as markers for our further investigations. 35 Defining subclusters by extracting patterns of the variants found in the clusters obtained after PCA For further analysis, the clusters were not manually isolated based on the principal components, but rather the sequences were mapped to a cluster using the cluster-specific markers described 40 (Table S2 ) and then plotted in order to see, if they clustered at the expected cluster locations. Figures 2A and 2B show the 3D plots of eigenvalues 1, 2 and 10, highlighting the cluster mappings st, mr, er, and el. All sequences that did not bear any of the cluster-specific variants were assigned to cluster ce. Figure 2B shows that even in the entire 46,046-sample dataset, the clusters can also be distinguished by the cluster-specific variants obtained by means of PCA and subsequent variant 45 analysis. Subclusters for each such cluster were obtained by first, taking each sample's data for all variant positions of the cluster occurring at least 10% of the time within that cluster and merging that data into one metadata field, then using this metadata field to evaluate which combinations of these variants occurred for how many samples. The most common combinations were identified as subclusters. Details of this method and the systematics to name the subclusters are described in the 5 supplementary material. Figure 3B gives an overview of the subclusters, their variant patterns, and, for each subcluster, the number of genome sequences that could be mapped to it. Figure 3A shows a 2D plot according to the two major eigenvalues based on PCA 3 of the entire 46,046 data set, colored by assignment to these subclusters and shows how close the sequence lines are, especially in the central clusters ce, st and mr. 10 Cluster center (ce) Cluster Center is the first cluster to have formed in the course of the pandemic. Its origin lies in Wuhan (information from the GISAID metadata) and contains the least variant-bearing genome sequences. As a result of our study, we found, in this 19,629-genome cluster, 5 subclusters defined 15 by the pattern of their variant-bearing positions-this pattern is summarized in Figure 3B Cluster straight (st) Cluster st also formed very early in the pandemic in Wuhan. The nucleotide positions of the variants defining the cluster can be found in Table 3 . This 3,545-genome cluster is separated into 35 4 subclusters according to the pattern of their variant bearing positions, as summarized in Figure 3B . The following reference/variant combinations were most commonly found and were grouped into subclusters. Subcluster st-0 with 1,025 sequences has the reference sequence in all these positions, while subcluster st-3 with 1,637 genomes has the variants 18060T, 17858G, and 17747A, while the remaining nucleotide positions correspond to the reference. Subcluster st-5 has 40 variants 9477A, 14805T, 28657T, 28863T and 2979T while the remaining nucleotide positions correspond to the reference. 117 genomes did not fall into any of the subclusters and were grouped into subcluster st-v8. Cluster middle right (mr) The mr cluster was found as a new, previously undescribed cluster. This cluster became visible by means of principal component analysis as used in the present study. It comprises 2,463 genomes which had G25563T as a common feature, along with not having variant T in nucleotide position C1059. (Variant T would have placed these into cluster er-see the extreme right column in Fig. 3B .) The mr cluster is separated into 7 subclusters according to the pattern of their variant-bearing 5 positions, as summarized in Figure 3B . The following reference/variant combinations were most commonly found in cluster mr and grouped into subclusters: Subcluster mr-0 with 230 sequences shows the reference sequence in all these locations, while subcluster mr-2, which comprises 261 genomes, deviates from this by having variants 1594T and 11109T. Subcluster mr-A-0, with 770 genomes, differs from subcluster 30 These clusters and subclusters were also analyzed with regard to the temporal and geographical origin of the samples contained therein, taking into account the country or region of exposure and the month of sampling. To track the variant clusters over time, the collecting month for each sample was used if it had been included in the record. In order to get an impression of the geographical coverage of the data, the information on the 35 countries of exposure contained in the metadata has been categorized into 6 regions according to the WHO classification, with the exception that the three largest submission countries-the United Kingdom (UK), United States of America (USA), Australia (AUS)-and China (CH) as the country of origin of the pandemic were categorized separately from their respective WHO regions. The data was compared with the data status of the WHO Situation Report of 06/08/2020 (33), the 40 last sampling date of the GISAID dataset. It is important to note that the data collection has a strong weighting: the vast majority of isolates in GISAID came from the UK, which accounted for almost half of the total data set with 21,432 samples. The US ranked second in GISAID, with 8,820 sequences, which accounted for about 20% of the GISAID dataset. The remaining 30% of the sequences were accounted for by the other countries and regions. The Central and South American 45 countries of Brazil, Peru, Chile and Mexico were under-represented, as were the Eastern Mediterranean (EM) with countries such as Iran, which according to the WHO report of 06/08/2020 was among the first 10 countries with the most COVID-19 cases, but which had a total of just under 300 sequences-less than a fifth of the number of isolates from the Netherlands, which ranked 4th among the GISAID submitters. A complete analysis of the data is provided in the supplementary material. 5 The different geographical distribution of the clusters in different WHO-regions and in the UK, the USA, Australia and China are shown in Figure 4 with Figure 4A highlighting from which regions and countries the genome sequences assigned to each cluster originate and Figure 4B how the clusters are distributed across regions and countries. Because of the strong weighting of the 10 data sets, each graph shows the total distribution of the clusters or geographic regions for comparison. The sequences of the three most common exposure regions are distributed very differently between the clusters. Sequences from the UK account for just over 50% of the sequences of the cluster ce and about 70% of the sequences of the cluster el, but are found in the other clusters significantly less, namely between 5-15%. Sequences from the USA, on the other 15 hand, are disproportionately distributed between the clusters st, mr and er, where they make up about 45%, 30% and 60% of the sequences. Sequences from Europe are very evenly distributed across all clusters, where they make up about 15-25% of the sequences. The weights that are presented here fit with the results of other authors, who mainly assign cluster el (equivalent to Nextstrain 20C and GISAID GR) to Europe and cluster er ( equivalent to Nextstrain 20B and part 20 of GISAID GH) mainly in North America (29, 31) . The cluster mr, which is described here for the first time as a variant pattern, is the smallest cluster with a share of about 5%, but shows remarkable weightings in the WHO regions which are less represented in the GISAID dataset, such as EM, where it can be assigned 1/3 of the sequences, and such as SEA and Americas, where it can be assigned to 10-15% of the sequences. 25 The different geographical distribution of the subclusters in different WHO-regions and in the UK, the USA, Australia and China are shown in Figure 5 . In Figure 6 , the evolution of this geographical distribution broken down into the months of December through part of June is shown, with June covering only a few days. The subclusters with the most sample counts of up to 10% of the total data set are cluster ce-0 (approx. 25%), el-0 (25%), er-0 (15%) and ce-A-2 (about 10%). The Ce-0, one of the largest subclusters, which contains the fewest variants for reference, is found particularly strongly in China and Africa (AFR), in about 55% of the samples in those regions, and is otherwise found in all regions with shares between 20-25%, except in the USA, where it accounts for only about 10% of all isolates. The smaller fraction in the USA may be due to successful control 40 measures (12, 13). After starting from China, the virus was already found in January in the WHO regions WP, SEA, Europe, and AFR, as well as in the USA. Two subclusters of cluster ce no longer contain sequences that show exposure in China-these are subclusters ce-1 and ce-A-2. Subcluster ce-1 was first found in March, and 80% of its sequences are from the WHO region of Europe and from the UK. This subcluster is disproportionately 45 common in the WHO regions of Europe and the Americas, as well as in the UK. Subcluster ce-A-2 first appears in February, with the vast majority of sequences being from the UK. Figure 6 illustrates that variant-bearing sequence lines already started appearing in China in January with subcluster ce-v3 and ce-A-v3, which are the first to contain variant 11083T. The two Europeandominated subclusters ce-1 and ce-A-2 first appeared later in February (ce-A-2) and March (ce-1). This suggests the conclusion that subcluster ce-v3 consists of precursor sequences of subcluster 5 ce-1, while subcluster ce-A-v3 consists of precursor sequences of subcluster ce-A-2. Subclusters ce-1 and ce-A-2 thus represent a combination of variants that have emerged regionally and have gained regional dominance. In other regions and countries, line ce-v3, which has major shares in WHO regions WP (30%), SEA (25%), EM (15%), and Australia (10%), may have developed quite differently. 10 Subcluster el-0 occurred in February in Europe and developed into a predominantly dominant phenomenon in the UK during the months of March and April. Subcluster el-0 is also seen in May and June in the USA, SEA and Africa. Subclusters el-v2 and el-2 first appeared in March with most sequences coming from the UK. These clusters can thus be seen as further regional diversification from the precursor subcluster el-0. 15 As Cluster er is predominantly found in the US, the beginning of Cluster er-0 in February is remarkable in that it has most of its sequences coming mainly from Europe at that time. The mutation in cluster er-1 is mainly a phenomenon in the USA. While the clusters el and er and their associated subclusters already show a strong regional focus, the picture is much more heterogeneous for the smaller clusters st and mr. 20 The most heterogeneous distribution in relation to the exposure region show subclusters st-0 and st-v8. Subcluster st-0 launched in Europe in February and was found in all other WHO-Regions, mainly USA, Europe, China and SEA. Samples of clusters st-v8 came to a disproportionately high proportion from the USA, Americas, Australia, China and WP. Notable are subclusters st-3, which appears mainly in the USA, and st-5, which appears mainly in the UK and Europe. These two 25 subclusters occurred in February with subcluster st-3 already starting and gaining dominance in the USA and subcluster st-5 starting in Europe/UK. This may be consistent with results from Bedford et al (13) mapping GISAID Clade A, which they could assign to a laxly controlled introduction at the end of January and the beginning of February (13) . After only a few months of the pandemic, st subclusters from Europe/UK already show marked differences from st subclusters 30 from the USA. This shows that the virus continues to change from its initial sequence, resulting in regional sequence patterns that are more and more different from each other, as well as having differences within regions. The small cluster mr shows a particularly high heterogeneity with 7 subclusters. Four subclusters appear with a clear predominance in Europe: mr-0 and parallel mr-B-0 in February, and mr-2 and Although the GISAID data set under consideration is not representative in terms of submissions from regions outside Europe and the US, we were able to identify variant patterns that showed an 45 unusual accumulation for other WHO regions. These regions include countries which have been particularly hard hit by the pandemic, such as Brazil, India and Iran, which have their own dynamics in terms of their infection due to different public controls and different demographics (34) (35) (36) . These variants can develop into separate sequence lines in these respective regions-this fact should not fall out of the scope of scientific surveillance. 5 In this paper, we have described a workflow based on PCA performed on a special numeric coding of the variant matrix with which very large data sets of SARS-CoV-2 genomes can be neatly categorized. We defined 21 new variant patterns, some of which could account for 10 regional effects. As a method, PCA is significantly more robust when analyzing sequences of questionable quality, since it does not try to analyze every single variant individually, but instead takes into account the entire set of variants as a whole. A data quality classification can thus be entered as a sample 15 property in the metadata set and thus taken into account if, in the case of deeper cluster analyses, questionable results were obtained for the corresponding sample. In a data set weighted strongly according to origin and level of sampling, further understanding can still be achieved for underrepresented regions and times. 20 Our analysis, as well as other recent analyses, have shown a high diversity of SARS-CoV-2 genomes in different countries, reflecting the adaptation processes of the virus to its new hosts and their environment. Furthermore, the analysis shows how quickly regional differentiations are already taking place. Some variations may no longer be representative of worldwide phenomena as they are found in individual countries. It is therefore crucial for future research to determine 25 under which circumstances what changes take place. In this context, it is particularly important to have elegant and easy-to-use biomathematical methods to analyze this amount of data in a meaningful way and to support any public health efforts to coordinate sampling from sequences from all over the world in open-source databases like GISAID. Supplementary Materials: 30 -Sampling period of the of the GISAID data set from 06/15/2020 -Geographic and regional distribution of the GISAID data set of 2020-06-15 Figs. S1 to S6 35 Tables S1 to S4 Acknowledgments The authors gratefully acknowledge the very large number of scientists in originating and submitting laboratories who have readily made available SARS-CoV-2 genome sequences and related metadata to the research community. Since their data provide the basis of this study, a detailed acknowledgements table can be found together with the data obtained here in Table S4 . Funding: This study received no fundings. Author contributions: GISAID: Global initiative on sharing all influenza data -from vision to 40 reality Buckland-Merrett, Data, disease and diplomacy: GISAID's innovative contribution to global health Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus The establishment of reference sequence for SARS-CoV-2 and variation analysis Multiple assays in a real-time RT-PCR SARS-CoV-2 panel can mitigate the risk of loss of sensitivity by new genomic variants during the COVID-19 outbreak SARS-CoV-2 Vaccines: Status Report virus), a novel Coronavirus: human-to-human transmission, 20 travel-related cases, and vaccine readiness SARS-CoV-2/COVID-19: Viral Genomics, Epidemiology, Vaccines, and Therapeutic Interventions Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2 The engines of SARS-CoV-2 spread The emergence of SARS-CoV-2 in Europe and North America Cryptic transmission of SARS-CoV-2 in Washington state Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California Coast-to-Coast Spread of SARS-CoV-2 25 during the Early Epidemic in the United States Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) On the origin and continuing evolution of SARS-CoV-2 Phylogenetic network analysis of SARS-CoV-2 genomes Genome-wide patterns of population structure and admixture in West Africans and African Americans Ancient human genomes suggest three ancestral populations for present-day Europeans Concepts and Relevance of Genome-Wide Association Studies Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure Minimap2: pairwise alignment for nucleotide sequences The Sequence Alignment/Map format and SAMtools A new coronavirus associated with human respiratory disease in China A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology Year-letter Genetic Clade Naming for SARS-CoV-2 on Nextstain.org (2020) SARSCoV2-clade-naming) Clade and lineage nomenclature aids in genomic epidemiology studies of active hCoV-19 viruses Geographic and Genomic Distribution of SARS-CoV-2 Mutations. Front Microbiol SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo WHO -World Health Organisation, Coronavirus disease (Covid-19) Situation Epidemiology and transmission dynamics of COVID-19 in two Indian states Evolution and epidemic spread of SARS-CoV-2 in Brazil An emergent clade of SARS-CoV-2 linked to returned travellers from Iran