key: cord-331066-ediowz4s
authors: Chechetkin, Vladimir R.; Lobzin, Vasily V.
title: Ribonucleocapsid assembly/packaging signals in the genomes of the coronaviruses SARS-CoV and SARS-CoV-2: detection, comparison and implications for therapeutic targeting
date: 2020-09-09
journal: Journal of biomolecular structure & dynamics
DOI: 10.1080/07391102.2020.1815581
sha: 
doc_id: 331066
cord_uid: ediowz4s

The genomic ssRNA of coronaviruses is packaged within a helical nucleocapsid. Due to transitional symmetry of a helix, weakly specific cooperative interaction between ssRNA and nucleocapsid proteins leads to the natural selection of specific quasi-periodic assembly/packaging signals in the related genomic sequence. Such signals coordinated with the nucleocapsid helical structure were detected and reconstructed in the genomes of the coronaviruses SARS-CoV and SARS-CoV-2. The main period of the signals for both viruses was about 54 nt, which implies 6.75 nt per N protein. The complete coverage of the ssRNA genome of length about 30,000 nt by the nucleocapsid would need 4.4 × 10^3 N proteins, which makes them the most abundant among the structural proteins. The repertoires of motifs for SARS-CoV and SARS-CoV-2 were divergent but nearly coincided for different isolates of SARS-CoV-2. We obtained the distributions of assembly/packaging signals over the genomes with nonoverlapping windows of width 432 nt. Finally, using the spectral entropy, we compared the load from point mutations and indels during virus age for SARS-CoV and SARS-CoV-2. We found a higher mutational load on SARS-CoV. In this sense, SARS-CoV-2 can be treated as a ‘newborn’ virus. These observations may be helpful in practical medical applications and are of basic interest. Communicated by Ramaswamy H. Sarma

By the end of July 2020, the COVID-19 pandemic had caused more than 18.4 million coronavirus cases and more than 695,000 deaths worldwide (https://www.worldometers.info/coronavirus/). The pandemic is still continuing and the possibility of new disease waves is considered to be very high. The development of efficient medications and vaccines against coronaviruses requires knowledge of the main molecular mechanisms in the virus life cycle and virus-host interaction (Feng et al., 2020; Fung & Liu, 2019; Maier et al., 2015; Saxena, 2020; Xie & Chen, 2020; Ziebuhr, 2016). In this article, we will discuss a specific interaction between nucleocapsid (N) proteins and genomic ssRNA in the coronaviruses SARS-CoV and SARS-CoV-2.

The ssRNA genome of the coronaviruses is packaged within a helical nucleocapsid, while the whole ribonucleocapsid is packaged within a membrane envelope (for a review see, e.g. Masters, 2019; Neuman & Buchmeier, 2016) . The term 'packaging signal' in the coronavirus papers is overwhelmingly attributed to the specific interaction between the genomic RNA and membrane (M) proteins ensuring the transport of the ribonucleocapsid into the membrane envelope (Fosmire et al., 1992; Madhugiri et al., 2016; Masters, 2019; Narayanan, & Makino, 2001; Woo et al., 2019) . The interactions between genomic RNA and N proteins are assumed to be nonspecific and governed mainly by electrostatic effects. The question of how N proteins recognize the related genomic RNA remains unanswered. A similar point of view was long-lastingly prevalent also in the virology community which studied ssRNA viruses with icosahedral capsids. The importance of cooperative weakly specific interactions between ssRNA and capsid proteins has been recognized not long ago (see discussion by Twarock et al. (2018) and references therein). Stockley et al. (2016) suggested and proved experimentally a two-stage model for the assembly of ssRNA viruses with icosahedral capsids. At the first, more rapid, stage RNA binds to the coat proteins to facilitate capsid assembly, whereas at the second, slower, stage RNA is compactly packaged within the capsid. The specific cooperative RNA-coat protein interactions play important role at the both stages. The two stages may be associated with different signals (Chechetkin & Lobzin, 2019) and the whole dynamic process may be called assembly/packaging. 
The generalization of these ideas on viruses with ribonucleocapsid within the membrane envelope like that for coronaviruses assumes three stages related to the complete packaging of the genomic RNA within the envelope: two-staged assembly/packaging of the helical ribonucleocapsid and packaging of the ribonucleocapsid within the envelope. This article is devoted to search for specific signals in the genomic ssRNA sequences related to the two-staged assembly/packaging of the helical ribonucleocapsid. As has been shown previously, the icosahedral symmetry of the capsid strongly affects the large-scale quasi-periodic segmentation in the related viral genomes (Chechetkin & Lobzin, 2019) . The whole ribonucleocapsid structure of coronaviruses also remains invariant under transition by one helical turn. Therefore, the putative weakly specific assembly/packaging signals in the genomic RNA of coronaviruses should be coordinated with the parameters of the helical nucleocapsid (such as the helix pitch, inner and outer diameters) which are established by cryoelectron microscopy (cryo-EM) and other structural methods. In this article, we provide methods for the detection and comparative analysis of assembly/packaging signals in the genomic RNA of the coronaviruses SARS-CoV and SARS-CoV-2 and describe main results of our study.

The quasi-periodic patterns in the genomic DNA/RNA sequences can be efficiently detected with the discrete Fourier transform (DFT). As the periodic patterns generate equidistant series of harmonics in the DFT spectra (see, e.g. Chechetkin & Turygin, 1995; Lobzin & Chechetkin, 2000), long enough patterns can be detected by the iteration of DFT or by the discrete double Fourier transform (DDFT) (Chechetkin & Lobzin, 2017, 2019, 2020a, 2020b). Although the correlation functions are the main tools in this article, our approach is based implicitly and explicitly on DFT and DDFT. Therefore, we begin with the definitions of these operations. Below, we follow the methods developed previously (Chechetkin & Lobzin, 2017, 2020a, 2020b; Chechetkin & Turygin, 1994, 1995; Lobzin & Chechetkin, 2000).

The DFT harmonics corresponding to the nucleotides of type α ∈ (A, C, G, T) in a genomic sequence of length M are calculated as

$$\rho_\alpha(q_n) = M^{-1/2} \sum_{m=1}^{M} \rho_{m,\alpha}\, e^{-i q_n m}, \quad q_n = 2\pi n/M, \quad n = 0, 1, \ldots, M-1. \tag{1}$$

Here, $\rho_{m,\alpha}$ indicates the position occupied by the nucleotide of type α; $\rho_{m,\alpha} = 1$ if the nucleotide of type α occupies the mth site and 0 otherwise. The amplitudes of Fourier harmonics (or structure factors) are defined as

$$F_{\alpha\alpha}(q_n) = \rho_\alpha(q_n)\,\rho_\alpha^{*}(q_n), \tag{2}$$

where the asterisk denotes the complex conjugation. Taking into account the symmetry relationship for the structure factors, the analysis of their spectra can be restricted to the range from n = 1 to

$$N = [M/2], \tag{3}$$

where the brackets denote the integer part of the quotient.
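The DFT step can be illustrated with a minimal sketch. It assumes the M^{-1/2} prefactor of the harmonics and uses NumPy's FFT, which indexes sites from 0 rather than 1; that shift changes only the phases, not the structure factors. The function names `structure_factors` and `period_from_harmonic` are hypothetical, and the relation p = M/n used for the period is the standard one for a harmonic with number n.

```python
import numpy as np

def structure_factors(seq, base):
    """Structure factors F(q_n), n = 1..N with N = floor(M/2), for one nucleotide type."""
    M = len(seq)
    # Indicator sequence: 1 where the site holds `base`, 0 otherwise.
    rho_m = np.array([1.0 if ch == base else 0.0 for ch in seq])
    # DFT harmonics with the assumed M^{-1/2} normalization.
    rho_q = np.fft.fft(rho_m) / np.sqrt(M)
    F = (rho_q * np.conj(rho_q)).real
    N = M // 2
    return F[1:N + 1]  # index k corresponds to harmonic n = k + 1

def period_from_harmonic(n, M):
    """Period in nucleotides for the harmonic number n (standard relation p = M/n)."""
    return M / n

# Toy example: a strictly periodic sequence with period 3.
toy = "ACG" * 20
F_A = structure_factors(toy, "A")
peak_n = int(np.argmax(F_A)) + 1
print(peak_n, period_from_harmonic(peak_n, len(toy)))
```

For the toy sequence all spectral weight of adenine falls on the single harmonic whose period is 3, which is the behaviour a strong triplet periodicity produces in coding regions.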

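The structure factors will always be normalized on the mean spectral values, which are determined by the exact sum rules,

$$f_{\alpha\alpha}(q_n) = F_{\alpha\alpha}(q_n)/\bar{F}_{\alpha\alpha}, \quad \bar{F}_{\alpha\alpha} = \frac{N_\alpha (M - N_\alpha)}{M(M-1)}, \tag{4}$$

where $N_\alpha$ is the total number of the nucleotides of type α in a sequence of length M. Below, we will also use the sums,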

which can be applied to the detection of quasi-periodic patterns or motifs composed of the nucleotides of different types. The period p is measured in terms of the number of nucleotides (these units will always be tacitly implied below) and is calculated as,

The harmonics in DDFT are calculated as

$$\Phi_\alpha(q_{n'}) = (N-1)^{-1/2} \sum_{n=2}^{N} f_{\alpha\alpha}(q_n)\, e^{-i q_{n'} n}, \quad q_{n'} = 2\pi n'/(N-1), \quad n' = 0, 1, \ldots, N-2, \tag{9}$$

where N is defined by Equation (3) and $f_{\alpha\alpha}(q_n)$ are the normalized structure factors (see Equation (4)). A similar transform can be used for the sums defined by Equations (5)-(7). The amplitudes of harmonics are given by

Similarly to DFT, the analysis of the spectra for the amplitudes defined by Equation (10) can be restricted to the range from n' = 1 to

The DDFT amplitudes are normalized as

Generally, equidistant series in DFT spectra also generate the corresponding equidistant series in DDFT spectra with the spectral numbers $k'n'$, $k' = 1, \ldots, k'_{\max}$; $k'_{\max} n' \le N'$, where N' is defined by Equation (11). The number of quasi-periodic patterns can be assessed by the spectral number n' for the peak amplitude $f_{II}(q_{n'})$ as

while their periods in nucleotides are given by

The nucleotide correlation functions (NCF) are determined as,

The circular NCFs used in this article are especially suitable for the detection of periodic patterns. Periodic patterns with a period p produce a series of equidistant peaks at the multiple spacings, $m_0 = kp$, k = 1, 2, … The corresponding mean value is given by

The correlation functions are symmetrical,

This allows us to restrict the analysis of NCF to the range from $m_0 = 1$ to N defined by Equation (3). The normalized deviations,

where

are Gaussian for the random sequences. Similarly to the sums defined by Equations (5)-(7), it is useful to introduce the combinations,

which are also Gaussian for the random sequences. The correlation functions and the DFT structure factors are not independent and are related by the Wiener-Khinchin relationship,

The normalized deviations for NCF can be expressed as,

$$\ldots \bigl(f_{\alpha\alpha}(q_n) - 1\bigr)\, e^{-i q_n m_0}.$$

These deviations are insensitive to the nucleotide composition and genome length but may strongly depend on the dominating underlying periodicities in the genomic sequences. In the viral genomes this is the triplet periodicity p = 3 inherent to the protein-coding regions (for a review and further references see, e.g. Lobzin & Chechetkin, 2000; Marhon & Kremer, 2011). The relationship defined by Equation (27) facilitates the control of the contribution from underlying periodicities to the normalized deviations for NCF by cutting off the dominating peaks and re-normalizing the DFT spectra. Such a procedure can be used for the detection of weaker, longer periodicities against the background of strong short periodicities.
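The link between correlation functions and DFT spectra can be checked numerically with a short sketch. It assumes that the circular NCF for one nucleotide type counts the pairs of identical nucleotides separated by the spacing m_0 with cyclic indexing; the function names are hypothetical. The sketch covers only the Wiener-Khinchin relationship itself, not the cut-off and re-normalization procedure.

```python
import numpy as np

def indicator(seq, base):
    """1.0 where the site holds `base`, 0.0 otherwise."""
    return np.array([1.0 if ch == base else 0.0 for ch in seq])

def circular_ncf_direct(seq, base):
    """K(m0) = sum_m rho_m * rho_{m+m0} with cyclic indexing, m0 = 0..M-1."""
    rho = indicator(seq, base)
    return np.array([np.dot(rho, np.roll(rho, -m0)) for m0 in range(len(rho))])

def circular_ncf_fft(seq, base):
    """Same quantity from the power spectrum (Wiener-Khinchin relationship)."""
    rho = indicator(seq, base)
    power = np.abs(np.fft.fft(rho)) ** 2
    return np.fft.ifft(power).real

toy = "ACGTTGCAAC" * 5          # toy sequence with an exact period of 10
k_direct = circular_ncf_direct(toy, "A")
k_fft = circular_ncf_fft(toy, "A")
print(np.allclose(k_direct, k_fft), k_fft[10])
```

Because both routes give the same values, a peak removed or rescaled in the DFT spectrum changes the correlation function in a controlled way, which is the property the cut-off procedure relies on.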

Throughout this article, we will use the standard statistical criteria corresponding to the probability Pr = 0.05. For the random sequences, the statistics for the DFT and DDFT normalized harmonics defined by Equations (4) and (12) is Rayleighian, whereas the statistics for the normalized deviations defined by Equations (20) and (22)-(24) is Gaussian. To study the distribution of periodic patterns over the genome, we will use a set of nonoverlapping windows of width w. Averaging of the DFT spectra over the windows provides the corresponding periodogram, while averaging of the normalized deviations for NCF over the windows provides the corresponding correlogram (see, e.g. Marple, 1987). Averaging over windows diminishes the effects of indels on the periodicity phasing.

The motifs related to quasi-periodic patterns are presumably the most important for practical applications. For their reconstruction, we developed a method of transitional automorphic mapping of the genome on itself (TAMGI). The algorithm for TAMGI is as follows. Let a step length s be chosen (equal to the detected period of periodic patterns in the problem concerned). Then, the pairs of nucleotides ($N_m$, $N_{m+s}$) separated by the step s are mutually compared when moving site by site along the genomic sequence. If both nucleotides belong to the same type, they both are retained in the genomic sequence; otherwise, the nucleotide $N_m$ is replaced by a void (denoted traditionally by the hyphen). Thus, the nucleotide $N_m$ will be retained if it has at least one neighbor $N_{m-s}$ or $N_{m+s}$ of the same type and will be replaced by a void otherwise. The resulting sequence after TAMGI is composed of the nucleotides of four types (A, C, G, T) and the hyphens '-' denoting voids. Further analysis is reduced to the enumeration of all complete words of length k (k-mers) composed only of nucleotides (voids within the complete words are prohibited) and surrounded by the voids '-' at the 5′- and 3′-ends, -$N_k$-. By definition, the complete words are nonoverlapping. At the next stage, the mismatches to the complete words can be studied. If the presence of periodic patterns is ensured, e.g. by DFT or DDFT, TAMGI with the step s equal to the corresponding period p provides a sequence enriched by the periodic patterns. Thus, TAMGI contains the most frequent motifs related to quasi-periodic patterns and provides their distribution over the genome. As TAMGI also contains a quasi-random fraction, the latter can be partially filtered out by combining TAMGI with the steps s and 2s. The TAMGI method is robust with respect to indels but may depend on the nucleotide content and underlying short periodicities.
Generally, TAMGI may also be extended to noninteger steps s by the best integer approximation of transitional mapping with noninteger s. The latter can be obtained using a set of chains (N 1 

where {ks} means rounding to the nearest integer and {($k_{\max}$ + 1)s} < M. The choice of consecutive pairs in the chains is performed by an algorithm similar to that described above.
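The integer-step version of TAMGI described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the sequence is treated as linear (no wrap-around at the ends), runs of nucleotides touching a sequence end are counted as complete words, and the function names `tamgi` and `complete_words` are hypothetical. The noninteger-step extension is not covered.

```python
def tamgi(seq, s):
    """Keep the nucleotide at site m if it matches the one at m+s or m-s; else write '-'."""
    M = len(seq)
    out = []
    for m, nt in enumerate(seq):
        forward = m + s < M and seq[m + s] == nt
        backward = m - s >= 0 and seq[m - s] == nt
        out.append(nt if (forward or backward) else "-")
    return "".join(out)

def complete_words(mapped):
    """Return (start, word) for every maximal run of nucleotides bounded by voids.

    Starts are 0-based. A sentinel void is appended so that a run reaching
    the 3'-end of the sequence is closed as well (an assumption of this sketch).
    """
    words, start = [], None
    for i, ch in enumerate(mapped + "-"):
        if ch != "-" and start is None:
            start = i
        elif ch == "-" and start is not None:
            words.append((start, mapped[start:i]))
            start = None
    return words

mapped = tamgi("AAGAAC", 3)
print(mapped)                  # AA-AA-
print(complete_words(mapped))  # [(0, 'AA'), (3, 'AA')]
```

In the toy sequence the G at site 3 and the C at site 6 have no partner of the same type at a distance of 3 sites, so they become voids, while the remaining nucleotides form two complete words of length 2.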

The abundance of quasi-periodic patterns in the genomic DNA/RNA sequences can be assessed by the spectral entropy (Balakirev et al., 2003, 2005, 2014; Chechetkin, 2011; Chechetkin & Lobzin, 1996; Chechetkin & Turygin, 1994). The spectral entropy is defined as,

Its mean value,

where C is the Euler constant, (1 − C) = 0.422785…, attains an approximate maximum for the random sequences. The corresponding variance for the spectral entropy is given by

$$\sigma^2(S_\alpha)_{\mathrm{random}} = 0.289868\ldots\, N.$$

The abundance of quasi-periodic patterns in the genomes of different lengths can be assessed by the relative spectral entropies,

The relative spectral entropy serves also for the assessment of the load from point mutations and indels on the genomes or on the particular genes and pseudogenes (Balakirev et al., 2003, 2005, 2014).
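A rough numerical illustration of the spectral entropy is given below. The defining formulas did not survive in this text, so the sketch assumes the form S = −Σ f ln f over the normalized structure factors with n = 1..N, which is consistent with the random-sequence mean proportional to (1 − C)N quoted above. The relative entropy is assumed to be the deviation from the random mean divided by the absolute value of that mean. All function names are hypothetical.

```python
import numpy as np

EULER_C = 0.5772156649015329  # Euler constant C

def normalized_spectrum(seq, base):
    """Normalized structure factors f(q_n), n = 1..floor(M/2), for one nucleotide type."""
    rho = np.array([1.0 if ch == base else 0.0 for ch in seq])
    M, N_a = len(rho), rho.sum()
    F = np.abs(np.fft.fft(rho)) ** 2 / M
    mean_F = N_a * (M - N_a) / (M * (M - 1))   # exact mean over n = 1..M-1
    return F[1:M // 2 + 1] / mean_F

def spectral_entropy(f):
    """Assumed form S = -sum f ln f; harmonics with f = 0 contribute nothing."""
    f = np.asarray(f, dtype=float)
    positive = f[f > 0]
    return float(-np.sum(positive * np.log(positive)))

def relative_spectral_entropy(f):
    """Assumed form: (S - S_random) / |S_random| with S_random = -(1 - C) * N."""
    s_random = -(1.0 - EULER_C) * len(f)
    return (spectral_entropy(f) - s_random) / abs(s_random)

rng = np.random.default_rng(0)
random_seq = "".join(rng.choice(list("ACGT"), size=6000))
periodic_seq = "ACG" * 2000
print(relative_spectral_entropy(normalized_spectrum(random_seq, "A")))
print(relative_spectral_entropy(normalized_spectrum(periodic_seq, "A")))
```

Under these assumptions a random sequence gives a relative entropy close to zero, while a sequence dominated by periodic patterns gives a strongly negative value.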

Early studies based on electron microscopy have revealed that the ribonucleocapsid of coronaviruses is helical, consisting of coils of 9-16 nm in diameter and a hollow interior of approximately 3-4 nm (Macnaughton et al., 1978). Chang et al. (2014) asserted that for the SARS-CoV nucleocapsid an outer diameter of 16 nm and an inner diameter of 4 nm are consistent with cryo-EM observations. The length of a helical turn per pitch is

where d is the diameter of the helix and h is the pitch. According to Chen et al. (2007), the pitch for the SARS-CoV nucleocapsid is h = 14 nm. Taking the distance between RNA bases as 0.34 nm, the positioning of RNA near the inner diameter of the nucleocapsid provides a length of the RNA turn of about 54-56 nt, the positioning of RNA in the middle between the inner and outer diameters would provide a turn length of about 84-87 nt, whereas the positioning of RNA at the outer diameter would provide a turn length of about 153-154 nt. Chang et al. (2009) found multiple (at least three) nucleic acid binding sites in N proteins. Therefore, the intermediate dynamic positioning of RNA in the middle during assembly/packaging cannot be excluded. At the final stage of packaging, ssRNA is assumed to be positioned at the inner diameter of the nucleocapsid in accordance with cryo-EM observations (Chang et al., 2014). We performed the complete combined analysis of the SARS-CoV and SARS-CoV-2 genomes based on DFT, DDFT, NCF and pattern correlation functions (Chechetkin & Lobzin, 2020b) and screened the whole range of putative periods from the shortest period of 2 nt to the large-scale periods comparable to the whole genome lengths. The most interesting results related to the ribonucleocapsid assembly/packaging are presented below.
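The turn lengths at the inner and outer diameters can be reproduced with a small calculation. The expression for the turn length is not legible in this text, so the sketch assumes the standard arc length of one helical turn, the square root of (πd)² + h², and converts it to nucleotides with the 0.34 nm spacing quoted above. The function name `turn_length_nt` is hypothetical, and only the 4 nm and 16 nm diameters from Chang et al. (2014) are evaluated.

```python
import math

def turn_length_nt(diameter_nm, pitch_nm=14.0, base_spacing_nm=0.34):
    """Length of one helical turn in nucleotides, assuming L = sqrt((pi*d)^2 + h^2)."""
    turn_nm = math.hypot(math.pi * diameter_nm, pitch_nm)
    return turn_nm / base_spacing_nm

inner = turn_length_nt(4.0)    # RNA near the inner diameter
outer = turn_length_nt(16.0)   # RNA at the outer diameter
print(round(inner, 1), round(outer, 1))
```

With these inputs the inner-diameter value falls within the 54-56 nt range and the outer-diameter value within the 153-154 nt range given in the text.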

We took for analysis one genomic sequence for SARS-CoV (GenBank accession: NC_004718; M = 29,751; $N_A$ = 8481, $N_G$ = 6187, $N_T$ = 9143, $N_C$ = 5940) as a reference and the genomic sequences for three isolates of SARS-CoV-2 (GenBank accessions:

to assess the impact of point mutations and indels on the detected patterns. Henceforth, the viruses will be denoted by their accessions. Taking into account the transitional invariance of a helix, the main results will be given for NCF. The presence of periodic components in NCF was proved by combining DFT and DDFT. The general overviews of the plots for the normalized NCF deviations defined by Equation (20) are shown in Figures 1 and 2. The overview for MT371038 is closer to that shown in Figure 1, while the corresponding plots for MT295464 are similar to those shown in Figure 2. Then, all plots for NCF were recalculated using Equations (26) and (27) and replacing all highest harmonics in the DFT spectra by the peaks assessed by extreme value statistics (cf. Chechetkin & Lobzin, 2019). The initial ranges of the recalculated plots are shown in the inserts to Figures 1 and 2. The deviations corresponding to the most pronounced patterns are shown explicitly by arrows. Such patterns are quasi-periodic because the corresponding approximately equidistant series can be traced in these plots (the next peaks are shown only for the most pronounced patterns with periodicity p = 54).

For the further analysis and as a cross-check of the above results, the NCF and DFT spectra were computed for the set of nonoverlapping windows of width 432 nt. The 3′-end windows #69 were incomplete for the genomes of NC_004718, MT371038 and MT371037. The characteristics used in our analysis are robust with respect to the length of window. The normalized deviations for NCF were calculated using Equations (26) and (27) and replacing peaks corresponding to the triplet periodicity p = 3 by the heights corresponding to Pr = 0.05 in the Rayleigh spectra. A similar cut-off was used after the calculations of the DFT spectra within windows. The correlograms obtained by averaging of the plots for normalized NCF deviations for the sums defined by Equation (24) are shown in Figure 3. The significance threshold of Pr = 0.05 for the correlograms corresponds to ±1.96/$N_w^{1/2}$, where $N_w$ is the total number of windows. In all genomes the deviations for $m_0$ = 54 were the highest and the deviations with $m_0$ = 108 were significant as well. For SARS-CoV the deviations with $m_0$ = 216 (= 4 × 54) were also significant. The next characteristic high deviations for SARS-CoV were for $m_0$ = 87, while in the isolates of SARS-CoV-2 they were for $m_0$ = 84.
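The window layout and the correlogram threshold follow from simple arithmetic, sketched below for the SARS-CoV reference genome (M = 29,751). The function names are hypothetical; window coordinates are 1-based and inclusive, as in the site ranges quoted in the text.

```python
import math

def nonoverlapping_windows(M, width=432):
    """1-based inclusive (start, end) pairs; the last window may be incomplete."""
    return [(a + 1, min(a + width, M)) for a in range(0, M, width)]

def correlogram_threshold(n_windows, z=1.96):
    """Two-sided Pr = 0.05 threshold for an average over n_windows Gaussian deviations."""
    return z / math.sqrt(n_windows)

wins = nonoverlapping_windows(29751)
print(len(wins), wins[2], wins[-1])
print(round(correlogram_threshold(len(wins)), 3))
```

The genome splits into 69 windows, the last of which is shorter than 432 nt, and window #3 covers sites 865-1296, in agreement with the coordinates used in the text.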

[Figure caption, opening words lost in extraction] … (26) and (27). The initial ranges of plots shown in the inserts were re-calculated by replacing the highest Fourier harmonics by the peaks defined by extreme value statistics in the DFT spectra. The characteristic spacings $m_0$ are explicitly marked by the arrows. The horizontal lines correspond to the significance Pr = 0.05 for the reshuffled random sequences. The panels A-D correspond to the nucleotides of particular types in the genome of SARS-CoV (accession NC_004718).

The corresponding periodograms obtained by the averaging of the DFT spectra over windows were then re-computed by DDFT (Chechetkin & Lobzin, 2020a). Due to the restrictions related to the applicability of DDFT, the left boundary in the DDFT spectra is positioned at n' = 10. Then, the DDFT spectra were renormalized in this range. The resulting DDFT spectra for the sums defined by Equation (7) are shown in Figure 4. Again, the harmonic with n' = 27, p' = 54.2 was reproducibly significant and the highest in the range under study for all genomes. For SARS-CoV the harmonic with n' = 43, p' = 86.4 was also significant, whereas for the isolates of SARS-CoV-2 the harmonic with n' = 42, p' = 84.4 appeared to be insignificant. The harmonic with n' = 57, p' = 114.5 for SARS-CoV can be treated as a distorted and modified doubled period p = 54 (typical of hidden fuzzy repeating patterns). Thus, combining correlograms for NCF with the analysis of periodograms by DDFT clearly reveals the persistently reproducible quasi-periodic patterns with the period p ≈ 54 in all genomes and indicates the relevance of less robust patterns with p ≈ 84 and 87.
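The second transform can be illustrated on a synthetic spectrum. The sketch below applies a DFT to the harmonics n = 2..N of a window spectrum and converts the peak number n' to a period. Two points are assumptions: the upper limit N' is taken as the integer part of (N − 1)/2, and the conversion p' = M n'/(N − 1) is inferred because it reproduces the (n', p') pairs quoted above for windows of width 432. The function names are hypothetical, and the renormalization and the left boundary at n' = 10 are not modeled.

```python
import numpy as np

def ddft_amplitudes(f):
    """DDFT amplitudes for n' = 1..N', given f[k] = normalized structure factor at n = k + 1."""
    f = np.asarray(f, dtype=float)
    x = f[1:]                      # harmonics n = 2..N
    length = x.size                # N - 1
    # Starting the sum at n = 2 only shifts the phases, so the FFT amplitudes are unchanged.
    amplitudes = np.abs(np.fft.fft(x)) ** 2 / length
    return amplitudes[1:length // 2 + 1]

def ddft_period(n_prime, M):
    """Assumed conversion of the DDFT spectral number to a period in nucleotides."""
    N = M // 2
    return M * n_prime / (N - 1)

# Synthetic window spectrum (M = 432): a period of 54 nt gives DFT peaks at n = 8, 16, ..., 216.
M = 432
f = np.zeros(M // 2)
f[7::8] = 1.0
peak = int(np.argmax(ddft_amplitudes(f))) + 1
print(peak, round(ddft_period(peak, M), 2))
```

The equidistant series of DFT peaks collapses into a single dominant DDFT harmonic at n' = 27, which corresponds to a period of about 54 nt under the assumed conversion.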

[Figure caption, opening words lost in extraction] … (26) and (27). The initial ranges of plots shown in the inserts were re-calculated by replacing the highest Fourier harmonics by the peaks defined by extreme value statistics in the DFT spectra. The characteristic spacings $m_0$ are explicitly marked by the arrows. The horizontal lines correspond to the significance Pr = 0.05 for the reshuffled random sequences. The panels A-D correspond to the nucleotides of particular types in the genome of SARS-CoV-2 (accession MT371037).

To assess the distribution of the detected patterns over the genomes, the normalized deviations for NCF were computed in separate windows of width 432 nt as described above. The spacings for NCF $m_0$ were chosen by the correspondence with the periods of the detected patterns and were equal to 54, 84 and 87, respectively. The resulting plots for the sums defined by Equation (24) are shown in Figure 5. The numerical data for the profiles in Figure 5 and for the profiles corresponding to the nucleotides of particular types as well as to the sums defined by Equations (22) and (23) are collected in Supporting Information S1. We assessed the correlations between different profiles by the Pearson correlation coefficients. The NCF profiles for the different genomes were significantly correlated for the same spacings $m_0$, while the profiles with the different spacings can be considered uncorrelated. The coefficients for correlations between profiles for SARS-CoV and three isolates of SARS-CoV-2 at $m_0$ = 54 were 0.623, 0.491 and 0.636 (Pr < 2 × 10^−5 for 69 components). The related coefficients for the correlations MT371038-MT295464, MT371038-MT371037 and MT295464-MT371037 were 0.817, 0.954 and 0.751. Similar but slightly lower values were obtained for the correlations at the two other spacings. As supposed, the motifs detected at the different spacings $m_0$ are related to the different stages of ribonucleocapsid assembly/packaging. Such motifs can be incorporated into the genomic sequence by silent mutations due to the degeneracy of the genetic code. The regular nearby positioning of different assembly/packaging motifs would be too restrictive, because the main function of the genomic sequence is coding for proteins. Therefore, the windows enriched simultaneously by the motifs of different types are especially interesting as well as the windows enriched or depleted by the motifs of the same type. Despite evolutionary divergence between the two viruses and the action of point mutations and indels, some features appear to be remarkably reproducible in all genomes. In particular, in the window #3 (sites 865-1296) the normalized NCF deviations exceeded the significance threshold Pr = 0.05 for all genomes at $m_0$ = 54. Similar but stronger effects were observed for the window #5 (1729-2160); in the latter case for SARS-CoV, this window was also enriched by the motifs with $m_0$ = 87. The window #29 (12097-12528) was enriched by the motifs with $m_0$ = 54; additionally, for all isolates of SARS-CoV-2 this window was enriched by the motifs with $m_0$ = 87. An opposite example with depletion of motifs associated with $m_0$ = 87 can be seen in the window #34 (14257-14688).
These profiles may explain why the mean deviation with $m_0$ = 84 exceeds the deviation with $m_0$ = 87 in the genomes of SARS-CoV-2. In the latter case, despite significant enrichment by the motif with $m_0$ = 87 in some of the windows, there are also windows with significant depletion of this motif.

Similar profiles were also obtained for DDFT harmonics with the spectral numbers n' = 27, 42 and 43. DDFT spectra in windows of width 432 were computed for the sum defined by Equation (7) as described above. The related profiles can be found in Supporting Information S2. The counterpart profiles for the normalized NCF deviations and DDFT harmonics appear to be significantly correlated in the same genomes. Therefore, the characteristic features in both sets of profiles were approximately reproducible. In addition to these features, an extremely high peak for the DDFT harmonic with n' = 27, p' = 54.2 was observed in the window #60 (25489-25920) in the genome of SARS-CoV.

Reconstructed motifs and their positions on the genomes were obtained by TAMGI with the steps s = 54, 84 and 87. The resulting sequences after TAMGI are explicitly reproduced in Supporting Information S3-S6. The data on the total fractions of nucleotides after TAMGI are summarized in Table 1. A simple theoretical consideration shows that the partial fractions of nucleotides after TAMGI for the randomly reshuffled genomic sequences are given by

where $\varphi_\alpha$ is the frequency of nucleotides of the type α retained under reshuffling. Equation (33) was additionally verified by simulations. The frequencies given by Equation (33) are independent of steps and are also reproduced in Table 1 for reference. The variances of frequencies related to particular random realizations are about

$$\sigma^2(\Phi_\alpha) = \Phi_\alpha (1 - \Phi_\alpha)/M; \quad \sigma^2_{\mathrm{total}} = \sum_\alpha \sigma^2(\Phi_\alpha). \tag{34}$$

Equation (34) yields for $\sigma_{\mathrm{total}}$ the value of 0.004, which is much lower than the differences between frequencies for viral and random sequences. In this sense, Table 1 reveals the distinctly nonrandom character of the variations related to the detected quasi-periodic patterns in the viral genomes. The mutual comparison of the total frequencies of nucleotides after TAMGI for the different isolates of SARS-CoV-2 shows their robustness against point mutations and indels.
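The quoted value of about 0.004 can be checked with the nucleotide counts of the SARS-CoV reference genome. The expression for the random fractions is not legible in this text, so the sketch assumes that a nucleotide of type α is retained when at least one of its two neighbors at distance s is of the same type, with independent sites, which gives Φ_α = φ_α[1 − (1 − φ_α)²]. The variances then follow the binomial form stated above. The function names are hypothetical.

```python
import math

def random_tamgi_fractions(counts):
    """Assumed random-sequence fractions: Phi = phi * (1 - (1 - phi)^2) for each type."""
    M = sum(counts.values())
    fractions = {}
    for base, n in counts.items():
        phi = n / M
        fractions[base] = phi * (1.0 - (1.0 - phi) ** 2)
    return fractions

def sigma_total(fractions, M):
    """Square root of the summed binomial variances Phi * (1 - Phi) / M."""
    variance = sum(F * (1.0 - F) / M for F in fractions.values())
    return math.sqrt(variance)

# Nucleotide counts for NC_004718 as listed in the text.
counts = {"A": 8481, "G": 6187, "T": 9143, "C": 5940}
fractions = random_tamgi_fractions(counts)
sigma = sigma_total(fractions, sum(counts.values()))
print(round(sigma, 4))
```

Under this assumption the total standard deviation comes out close to 0.0036, which rounds to the value of 0.004 given in the text.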

The general distributions of k-mers, -$N_k$-, over the length k are presented in Table 2. The period of p ≈ 54 implies the association of 6.75 nt per one N protein (see Subsection 4.2 below). All motifs with k ≥ 6 and their positions on the genomes are enumerated in Supporting Information S7. The profiles for the total numbers of nucleotides within nonoverlapping windows of width 432 nt after TAMGI with the steps s = 54, 84 and 87 are shown in Figure 6. For the incomplete windows #69 these numbers were increased proportionally to obtain estimates for the width of 432 nt. The profiles in Figures 5 and 6 are close but differ in some features. The corresponding Pearson correlation coefficients between the counterpart profiles in Figures 5 and 6 were highly significant, 0.72-0.86. Nevertheless, the highest peaks and the lowest troughs may differ between the counterpart profiles. In particular, the highest peak in Figure 6(A) was observed for the window #8 (sites 3025-3456). The profiles for s = 54 and 87 were slightly biased from the higher values at the 5′-end to the lower values at the 3′-end, although the extreme windows #1 (1-432) comprising the 5′-UTR were depleted of motifs. The numerical values for all profiles in Figure 6 can be found in Supporting Information S7.

The comparison of repertoires of motifs with k ≥ 6 presented in Supporting Information S7 revealed nearly complete correspondence (up to one or two motifs) between motifs for the three isolates of SARS-CoV-2. The divergence between motifs for SARS-CoV and SARS-CoV-2 appeared to be more significant. In particular, at the step s = 54, only 22 hexamer motifs from 102 different motifs (106 in total) in the SARS-CoV genome coincided with those for SARS-CoV-2 and 36 hexamers differed by one letter from the repertoires of hexamers for SARS-CoV-2. The similar comparison for the other steps yielded the coincidence of 18 from 89 different motifs (93 in total) and 38 motifs differing by one letter at s = 84 and the coincidence of 15 from 93 different motifs (102 in total) and 37 motifs differing by one letter at s = 87. This means that the repertoires of relatively long motifs are robust to point mutations and indels within each coronavirus but diverge (and in this sense are specific enough) between the two viruses despite the conservation of the main helical periodicity p ≈ 54 nt. The relationships between motifs found for the assembly/packaging and the other cis-acting elements (Madhugiri et al., 2016) should be established separately. Our study showed that actually any cis-acting element should comprise a contextual surrounding vicinity of several tens of nucleotides up- and downstream of the element.

The occurrences of the motifs determined by TAMGI can be compared with their counterparts in the whole genome. The statistical significance of such motifs in the whole genome can be assessed by the related occurrences in the sequences obtained by the random reshuffling of the genome. Instead of modeling with genome reshuffling, the rigorous theory by Zubkov and Mikhailov (1974) and Karlin and Altschul (1990) can be used for the assessment of motif occurrences (see also Boeva et al., 2006; Suvorova et al., 2014). The total frequencies of nucleotides after TAMGI for the randomly reshuffled genomic sequences were calculated by Equation (33).

Table 2. The occurrences of k-mers, -$N_k$-, in the genomes of SARS-CoV and SARS-CoV-2 after TAMGI with steps s = 54, 84 and 87.

Step s = 54
k     SARS-CoV (NC_004718)    SARS-CoV-2 isolates (three accessions)
1     3618                    3617    3633    3609
2     1834                    1823    1847    1828
3     862                     898     901     901
4     458                     448     449     446
5     218                     226     227     226
6     106                     128     125     126
7     51                      51      52      52
8     28                      23      23      23
9     16                      21      20      20
10    5                       10      10      10
11    4                       2       2       2
12    5                       2       2       2
13    1                       -       -       -

4. Discussion

Short tandem repeats in human genomes are widely used in medical diagnostics and forensics (see, e.g. Baine & Hui, 2019; Butler, 2011; Grover & Sharma, 2016; Kayser, 2017; Sznajder & Swanson, 2019; and references therein). Similar patterns were also found in some prokaryotic genomes (Subirana & Messeguer, 2019). Quasi-repeating patterns in viral genomes are commonly present in a hidden form against the background of frequent random point mutations and indels. Nevertheless, many quasi-repeating patterns remain persistent and robust and contain important information about the molecular mechanisms of the virus life cycle, including genome packaging. Such patterns can be detected and quantified by DFT, DDFT, NCF and other methods. Surprisingly, the quasi-repeating patterns in viral genomes are usually completely ignored when discussing evolutionary and subtyping problems in virology (see, e.g. Andersen et al., 2020; Cagliani et al., 2020; Forster et al., 2020; MacLean et al., 2020; Tang et al., 2020). The general abundance of quasi-periodic patterns in viral genomes can be conveniently assessed by the relative spectral entropy (Subsection 2.5). The more negative the spectral entropy, the higher the abundance of quasi-periodic patterns in the genome. The relevant data for the genomes of SARS-CoV and three isolates of SARS-CoV-2 are summarized in Table 3. For the significance Pr = 0.05, the absolute value of the difference between the total spectral entropies should exceed the threshold 1.96 √2 σ($S_{\mathrm{total,rel}}$) ≈ 0.055. This is actually fulfilled for all three differences between $S_{\mathrm{total,rel}}$ for SARS-CoV and the isolates of SARS-CoV-2, whereas the mutual differences between $S_{\mathrm{total,rel}}$ for the isolates of SARS-CoV-2 are smaller, in accordance with the evolutionary divergence of SARS-CoV and SARS-CoV-2. The values of $S_{\mathrm{total,rel}}$ in Table 3 reveal the higher abundance of periodic patterns in the SARS-CoV-2 genomes in comparison with the SARS-CoV genome.
It can also be said that during virus age the load from point mutations and indels on the genome of SARS-CoV was higher in comparison with the load on the genome of SARS-CoV-2. Within such interpretation SARS-CoV-2 may be treated as a 'newborn' virus.

4.2. How many N proteins are needed for the complete packaging of the SARS-CoV and SARS-CoV-2 ssRNA genomes?

The periods of ssRNA turns packaged within the helical ribonucleocapsid and detected via repeating motifs in the genomic RNA sequences proved to be persistent in the genomes of SARS-CoV and SARS-CoV-2, although the repertoires of related motifs appeared to be divergent. Taking into account that the turn of nucleocapsid is composed of two octamers (Chen et al., 2007) polymerized from dimeric N proteins, the detected period of 54 nt implies that one N protein should be associated with 6.75 nt. This is very close to the estimate obtained by Chang et al. (2014) that one N protein should be associated with 7 nt. Consequently, for genomes of length 30,000 nt typical of coronaviruses, 4.4 × 10^3 N proteins are needed for complete packaging of the genomic ssRNA. The latter estimate significantly exceeds the value suggested by Neuman and Buchmeier (2016), 0.7-2.2 × 10^3 N proteins per virion and the association of each N protein with 14-40 nt of genomic RNA. The flower-like packaging of the helical nucleocapsid within the envelope (see, e.g. Gui et al., 2017; Masters, 2019; and further references therein) implies an integrity of the nucleocapsid and gives evidence against the rods-on-a-string model for the nucleocapsid. Therefore, such a difference in estimates cannot be attributed to a part of the genome remaining uncovered. Presumably, the total number of N proteins per virion is underestimated and the number 4.4 × 10^3 makes N proteins the most abundant in the active phase of the virus life cycle.
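The protein counts in this comparison follow from dividing the genome length by the number of nucleotides associated with one N protein. The short calculation below uses only the figures quoted in the text; the function name `n_proteins_for_coverage` is hypothetical.

```python
def n_proteins_for_coverage(genome_nt, nt_per_protein):
    """Number of N proteins needed to cover the genome at a given footprint."""
    return genome_nt / nt_per_protein

genome_nt = 30_000
ours = n_proteins_for_coverage(genome_nt, 6.75)       # footprint implied by the 54 nt period
chang = n_proteins_for_coverage(genome_nt, 7.0)       # estimate of Chang et al. (2014)
neuman_high = n_proteins_for_coverage(genome_nt, 14)  # Neuman and Buchmeier (2016), 14 nt
neuman_low = n_proteins_for_coverage(genome_nt, 40)   # Neuman and Buchmeier (2016), 40 nt
print(round(ours), round(chang), round(neuman_low), round(neuman_high))
```

The footprints of 14-40 nt reproduce the range of 0.7-2.2 × 10^3 proteins, so the two sets of numbers differ only through the assumed footprint per N protein.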

N proteins of the coronaviruses are promising therapeutic targets (Chang et al., 2014; Lin et al., 2020; Tilocca et al., 2020; Yadav et al., 2020). The advantages of using N proteins for therapeutic targeting are as follows. (i) As N proteins are abundant, the antibodies against them can be used for early diagnostics and in vaccines. (ii) N proteins are multifunctional and participate not only in the assembly/packaging of the ribonucleocapsid but also in the regulation of the replication-transcription processes (Hurst et al., 2010; McBride et al., 2014; Verheije et al., 2010). The interaction between M and N proteins plays an important role in the packaging of the ribonucleocapsid within the envelope (Kuo et al., 2016). (iii) Coronavirus M and N proteins stand out as being the most conserved among structural proteins (Neuman & Buchmeier, 2016). They should be more stable against the load from point mutations and indels, which are especially frequent in viruses. Most vaccines are currently being developed against spike (S) proteins. However, S proteins are rather variable, and in any case multitargeted vaccines will be more efficient than single-target ones. Another strategy is related to the development of RNA vaccines (Kramps & Elbers, 2017) or to the targeting of specific motifs in the viral RNA. The latter can be performed by RNA aptamers, RNA interference (Min & Ichim, 2010) or specially designed RNA-binding proteins (Filipovska & Rackham, 2012; Hall, 2016; Lunde et al., 2007). The assembly/packaging signals look quite promising as targets in the genomic ssRNA. Modified N proteins or their fragments can be used for similar purposes and may introduce defects in the nucleocapsid and make the virus less viable. The incorporation of assembly/packaging motifs into oligonucleotides immobilized on the surface of microarrays may facilitate the detection of coronaviruses by microarrays (for a review on microarrays see, e.g. Dufva, 2009).

Note to Table 3: The standard deviations for the relative spectral entropies $S_{a,rel}$ in random sequences of the same lengths are about 0.010. The standard deviation for $S_{total,rel}$ is twice this value.

Presumably, most of the working motifs (or, more exactly, the complete words defined in Subsection 2.4) participating in the specific ssRNA-N protein interactions are 2-4 nt in length. They are frequent enough (see Table 2), and their coordinated positioning over the genome may provide specific cooperative interaction with N proteins. The close incorporation of longer motifs would be too restrictive because of the protein-coding function of the genomic ssRNA. However, the longer and rarer motifs may be multifunctional and may play the role of cis/trans-elements for other molecular mechanisms during the virus life cycle. This conclusion looks nearly definite for the pairwise motifs at the step s = 84, such as ATTATAATTATAAAT (SARS-CoV; the start sites 22711 and 22795) and ATTATAATTA (isolates of SARS-CoV-2; sites 22766 and 22850; 22810 and 22894; 22757 and 22841, respectively). Note that the positions of these motifs on the genomes are also closely conserved. The same concerns the longest motifs found at s = 54 in the genome of SARS-CoV-2, TATTCAAACAATTGTTG (sites 3213, 3257 and 3204, respectively). The specific binding of N proteins with ssRNA results in the lowering of free energy, which may approximately be assessed by the Boltzmann factor,

$$\exp(-\Delta F / k_B T),$$

where $\Delta F < 0$ is the change in free energy on binding, $k_B$ is the Boltzmann constant and $T$ is the absolute temperature.

Typically, the Boltzmann factor grows at lower temperatures. This means that weakly specific effects should be more pronounced at lower temperatures. Taking into account the huge numbers of species in virus populations, even a small decrease in free energy may produce a significant impact and be advantageous for natural selection.
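The temperature dependence can be illustrated numerically. The sketch below assumes the usual form of the Boltzmann factor, $\exp(-\Delta F / k_B T)$, with free energy expressed per mole so that the gas constant plays the role of $k_B$. The value $\Delta F = -0.1$ kcal/mol and the two temperatures are hypothetical, chosen only to represent a weak binding preference; they are not taken from the article.

```python
import math

# Gas constant in kcal/(mol*K); stands in for k_B when energies are per mole.
R_KCAL_PER_MOL_K = 0.0019872

def boltzmann_factor(delta_f_kcal_per_mol: float, temperature_k: float) -> float:
    """exp(-dF / (R T)); values above 1 correspond to a free-energy decrease."""
    return math.exp(-delta_f_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))

DELTA_F = -0.1  # hypothetical weak free-energy gain, kcal/mol

factor_warm = boltzmann_factor(DELTA_F, 310.0)  # about body temperature
factor_cool = boltzmann_factor(DELTA_F, 293.0)  # about room temperature

# The factor exceeds 1 at both temperatures and is larger at the lower one.
print(factor_cool > factor_warm > 1.0)  # True
```

For a fixed negative $\Delta F$ the exponent grows as $T$ decreases, which is the sense in which weakly specific effects become more pronounced at lower temperatures.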

The methods developed in this article are quite general and can be applied to the detection of assembly/packaging signals in all viral genomes packaged within helical capsids, including the other infectious coronaviruses such as 229E, NL63, OC43, HKU1 and MERS-CoV. The ssRNA genomes of numerous filamentous and rod-shaped plant viruses are also packaged within capsids with helical symmetry (Solovyev & Makarov, 2016; Stubbs & Kendall, 2012). As shown, combining NCF, DFT and DDFT provides efficient tools for the investigation of this problem. It is essential that the dominating triplet periodicity p = 3 typical of protein-coding regions in the viral genomes should be suppressed to discern the longer periodic patterns related to the assembly/packaging signals. After the detection of periodic patterns and the determination of their periods, the underlying motifs can be explicitly reconstructed by TAMGI. Generally, TAMGI can be efficiently used for data mining and the search for cis/trans-elements in genomic sequences. The combined experimental and bioinformatic analysis and the knowledge about the assembly/packaging mechanisms in viral genomes should facilitate the choice of the most efficient strategy in practical medical applications. The regular study of hidden quasi-periodic patterns is of basic interest for virology.
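To make concrete what a motif repeated at a fixed step s means, the following minimal sketch scans a sequence for exact matches between positions i and i + s and reports the maximal matching runs. It is a deliberately simplified illustration and is not the TAMGI procedure used in this article: it handles exact repeats only, uses 0-based positions, and the function name `repeats_at_step` and the toy sequence are hypothetical. The last lines only check that the start sites quoted in the Discussion for the pairwise motifs are indeed 84 nt apart.

```python
def repeats_at_step(seq: str, step: int, min_len: int = 4):
    """Return (start, motif) for every maximal run of positions i with
    seq[i] == seq[i + step] that is at least min_len long (0-based start)."""
    hits = []
    run_start = None
    n = len(seq) - step
    for i in range(n + 1):
        # The extra iteration at i == n closes a run that reaches the end.
        match = i < n and seq[i] == seq[i + step]
        if match:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                hits.append((run_start, seq[run_start:i]))
            run_start = None
    return hits

# Toy sequence: the 10-nt motif ATTATAATTA placed twice, 12 nt apart.
toy = "ATTATAATTA" + "GC" + "ATTATAATTA" + "CG"
print(repeats_at_step(toy, step=12))  # [(0, 'ATTATAATTA')]

# Start sites quoted in the text for the pairwise motifs at s = 84.
site_pairs = [(22711, 22795), (22766, 22850), (22810, 22894), (22757, 22841)]
print(all(b - a == 84 for a, b in site_pairs))  # True
```

On a real genome the same scan would be run with s = 54 or s = 84; quasi-repeats with mismatches would need a tolerant comparison, which is not attempted here.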

No potential conflict of interest was reported by the authors.

The proximal origin of SARS-CoV-2

Practical applications of DNA genotyping in diagnostic pathology

DNA polymorphism in the β-esterase gene cluster of Drosophila melanogaster

Entropy and GC content in the beta-esterase gene cluster of the Drosophila melanogaster subgroup

Computational methods of identification of pseudogenes based on functionality: Entropy and GC content

Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression

Advanced topics in forensic DNA typing: Methodology

Computational inference of selection underlying the evolution of the novel coronavirus, severe acute respiratory syndrome coronavirus 2

The SARS coronavirus nucleocapsid protein-forms and functions

Multiple nucleic acid binding sites and intrinsic disorder of severe acute respiratory syndrome coronavirus nucleocapsid protein: Implications for ribonucleocapsid protein packaging

Recent insights into the development of therapeutics against coronavirus diseases by targeting N protein

Spectral sum rules and search for periodicities in DNA sequences

Levels of ordering in coding and non-coding regions of DNA sequences

Large-scale chromosome folding versus genomic DNA sequences: A discrete double Fourier transform technique

Genome packaging within icosahedral capsids and large-scale segmentation in viral genomic sequences

Detection of large-scale noisy multi-periodic patterns with discrete double Fourier transform. Fluctuation and Noise Letters

Detection of large-scale noisy multi-periodic patterns with discrete double Fourier transform. II. Study of correlations between patterns. Fluctuation and Noise Letters

On the spectral criteria of disorder in non-periodic sequences: Application to inflation models, symbolic dynamics and DNA sequences

Search of hidden periodicities in DNA sequences

Structure of the SARS coronavirus nucleocapsid protein RNA-binding dimerization domain suggests a mechanism for helical packaging of viral RNA

DNA microarrays for biomedical research: Methods and protocols

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2): A review

Modular recognition of nucleic acids by PUF, TALE and PPR proteins

Phylogenetic network analysis of SARS-CoV-2 genomes

Identification and characterization of a coronavirus packaging signal

Human coronavirus: Host-pathogen interaction

Development and use of molecular markers: Past and present

Electron microscopy studies of the coronavirus ribonucleoprotein complex

De-coding and re-coding RNA recognition by PUF and PPR repeat proteins

An interaction between the nucleocapsid protein and a component of the replicase-transcriptase complex is crucial for the infectivity of coronavirus genomic RNA

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes

Forensic use of Y-chromosome DNA: A general overview

RNA vaccines

Analyses of coronavirus assembly interactions with interspecies membrane and nucleocapsid protein chimeras

Structure-based stabilization of non-native protein-protein interactions of coronavirus nucleocapsid proteins in antiviral drug design

Order and correlations in genomic DNA sequences. The spectral approach

RNA-binding proteins: Modular design for efficient function

No evidence for distinct types in the evolution of SARS-CoV-2

Ribonucleoprotein-like structures from coronavirus particles

Coronavirus cis-acting RNA elements

Coronaviruses. In Methods and protocols

Gene prediction based on DNA spectral analysis: A literature review

Digital spectral analysis with applications

Coronavirus genomic RNA packaging

The coronavirus nucleocapsid is a multifunctional protein

RNA interference

Cooperation of an RNA packaging signal and a viral envelope protein in coronavirus RNA packaging

Supramolecular architecture of the coronavirus particle

In Medical virology: From pathogenesis to disease control

Helical capsids of plant viruses: Architecture with structural lability

Bacteriophage MS2 genomic RNA encodes an assembly instruction manual for its capsid

Helical viruses

Satellites in the prokaryote world

Comparative analysis of periodicity search methods in DNA sequences

Short tandem repeat expansions and RNA-mediated pathogenesis in myotonic dystrophy

On the origin and continuing evolution of SARS-CoV-2

Comparative computational analysis of SARS-CoV-2 nucleocapsid protein epitopes in taxonomically related coronaviruses

A modelling paradigm for RNA virus assembly. Current Opinion in Virology

The coronavirus nucleocapsid protein is dynamically associated with the replication-transcription complexes

An in vivo cell-based assay for investigating the specific interaction between the SARS-CoV N-protein and its viral RNA packaging sequence

Insight into 2019 novel coronavirus - An updated interim review and lessons from SARS-CoV and MERS-CoV

Virtual screening and dynamics of potential inhibitors targeting RNA binding domain of nucleocapsid phosphoprotein from SARS-CoV-2

Coronaviruses

Limit distributions of random variables associated with long duplications in a sequence of independent trials