key: cord-0741236-a1t75vjk
authors: Barik, Sailen
title: Genus-specific pattern of intrinsically disordered central regions in the nucleocapsid protein of coronaviruses
date: 2020-07-17
journal: Comput Struct Biotechnol J
DOI: 10.1016/j.csbj.2020.07.005
sha: 886371c8216bdbdcc4dab92b0737211140c65589
doc_id: 741236
cord_uid: a1t75vjk

The nucleocapsid (N) protein is conserved in all four genera of the coronaviruses, namely alpha, beta, gamma, and delta, and is essential for genome functionality. Bioinformatic analysis of coronaviral N sequences revealed two intrinsically disordered regions (IDRs) at the center of the polypeptide. While both IDR structures were found in alpha-, beta-, and gammacoronaviruses, the second IDR was absent in deltacoronaviruses. Two novel coronaviruses, currently placed in the Gammacoronavirus genus, appeared intermediate in this regard, as the second IDR structure could be barely discerned with a low probability of disorder. Interestingly, these two are the only coronaviruses thus far isolated from marine mammals, namely beluga whale and bottlenose dolphin, two highly related species; the N proteins of the viruses were also virtually identical, differing by a single amino acid. These two unique viruses remain phylogenetic oddities, since gammacoronaviruses are generally avian (bird) in nature. Lastly, both IDRs, regardless of the coronavirus genus in which they occurred, were rich in Ser and Arg, in agreement with their disordered structure. It is postulated that the central IDRs make cardinal contributions in the multitasking role of the nucleocapsid protein, likely requiring structural plasticity, perhaps also impinging on coronavirus host tropism and cross-species transmission.
[2, 3]. Such studies revealed that the alpha and beta coronaviruses are of bat origin and infect only mammals, whereas gamma and delta are of avian origin and infect birds and, rarely, mammals. The human pathogen SARS-CoV-2, the deadly agent of the 2019 epidemic, is a betacoronavirus, likely transmitted from bats in Wuhan, China [3-5]. Such studies also established the diversity and commonality of the viral genes in the coronavirus genera and their relative orders. All coronaviruses contain a single-stranded [...] NTD and CTD are both involved in RNA-binding, and the CTD is additionally important for dimerization of the N protein [13, 14]. In contrast, the central region of the N protein, considered a linker between the two terminal domains, is relatively uncharted territory.
A serine/arginine-rich (S/R-rich) stretch has been recognized close to the NTD, and specific amino acid residues have been shown to participate in binding viral RNA [15]. Phosphorylation is a common mechanism of post-translational regulation of proteins, and the N protein has also been shown to be phosphorylated [10, 16-19]; in fact, several earlier entries in GenBank recognized it as 'nucleocapsid phosphoprotein'. The S/R-rich sequence is phosphorylated by glycogen synthase kinase-3 [10], but the full gamut of phosphorylation on N, and its significance and regulation, deserve additional study.

The higher order structure of the recombinant NTD and CTD of the coronaviral N protein has been determined by X-ray crystallography, but that of the full-length N remains elusive. While [...]

The PrDOS results were downloaded in CSV format and converted into Excel files, and the location of the IDR was then noted (Table 1). The disorder probability graphs (Fig. 1) were also plotted using Excel. Since the N sequences of different coronaviruses are of diverse length (~342-468 residues), their IDRs are also positionally shifted. It was, therefore, necessary to manually align the IDR peaks in order to compare the different sequences, and as a result the amino acid numbers on the X-axis, spanning ~121 residues (Fig. 1 [...]

Before embarking on the analysis of N protein IDRs and their potential phylogenetic distribution, it was important to ascertain that the full-length N sequences obey the phylogeny of the viruses to which they belong. Thus, N protein sequences of all coronaviruses were collected in a genus-blind manner and subjected to multiple alignment by Clustal Omega. When the sequences in the Clustal output were then labeled for each viral genus (as categorized in the GenBank annotation), they were in fact found to be clustered by genus (data not shown).
In other words, all the alpha sequences were together, all beta were together, and so on. Thus, the N protein sequence homology independently followed the viral phylogeny, which added confidence to these studies. Only after this step were the sequences analyzed for disorder, which also showed a genus-specific pattern, as presented in Fig. 1. [...] Fig. 1, because they were not fully representative of any genus and would only blur the plot.

Two sequences that did not exactly fit the two-peak pattern of gammacoronaviruses, but were classified as gamma in the literature, had distorted second IDR peaks, although they were above the baseline probability (Fig. 1C). These two coronaviral species were isolated from beluga whale [38] and bottlenose dolphin [39]. Nonetheless, they are not delta viruses, in which the second peak does not exist at all. This is why they are placed under the gamma genus, but tentatively, and labeled as 'deviant' (Fig. 1C, Fig. 2). The significance of these two sequences is discussed later (Section 4).

[...] (Table 1). As shown (Fig. 2), the Ser and Arg residues are most abundant in the first IDR peaks of all types of coronaviruses, without exception. The occurrence of these residues in the second IDR peak was more sporadic and showed some phylogenetic trend. Specifically, the second peak in alphacoronaviruses was also S/R-rich (Fig. 2A), but the abundance largely disappeared in the second peaks of beta and gamma (Fig. 2B, C). In the two 'deviant' gamma viruses, the stretched first peak was S/R-rich, but the density tapered off towards the C-terminal end of the peak (Fig. 2E). A similar trend was noticed in the C-terminal end of the second IDR of both alpha and beta. In such S/R-poor IDR areas, other polar amino acids seemed to predominate, Lys being the most noticeable (unmarked in Fig. 2).
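The two steps used in this analysis, locating IDR segments from per-residue disorder probabilities and then inspecting their Ser/Arg content, can be sketched in Python as follows. This is only an illustrative sketch: the study itself used PrDOS output handled in Excel, and the function names `find_idrs` and `sr_fraction`, as well as the toy probabilities and sequence, are hypothetical and not taken from the paper. The default threshold of 0.5 mirrors the disorder probability cut-off used for Fig. 1.

```python
from typing import List, Tuple


def find_idrs(probs: List[float], threshold: float = 0.5,
              min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) residue ranges whose
    disorder probability stays above `threshold` for at least
    `min_len` consecutive residues."""
    regions = []
    start = None  # 1-based start of the run currently being tracked
    for i, p in enumerate(probs, start=1):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run covered residues start..i-1.
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(probs) - start + 1 >= min_len:
        regions.append((start, len(probs)))
    return regions


def sr_fraction(seq: str, start: int, end: int) -> float:
    """Fraction of Ser (S) and Arg (R) residues in seq[start..end],
    using 1-based inclusive coordinates."""
    segment = seq[start - 1:end]
    if not segment:
        raise ValueError("empty segment")
    return sum(segment.count(res) for res in "SR") / len(segment)


if __name__ == "__main__":
    # Toy values for illustration only (not real PrDOS output).
    toy_probs = [0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.55, 0.3]
    toy_seq = "MSRSAKRS"
    for start, end in find_idrs(toy_probs):
        print(start, end, round(sr_fraction(toy_seq, start, end), 2))
```

Reporting segments as residue ranges keeps the output comparable to a table of IDR locations, while the `min_len` parameter (an assumption of this sketch) allows very short spikes above the threshold to be ignored if desired.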
the two IDR peaks shown here for SARS-CoV-2 and other betacoronaviruses (Fig. 1 [...] was predicted by several methods, as described previously [28] and in Materials and Methods, which produced similar results, and those from PrDOS are presented (see Fig. 2 for the corresponding primary structures). The cut-off threshold was set at the relatively stringent FPR (false positive rate) of 5%, corresponding to a disorder probability of 0.5 (Y-axis). Thus, only the areas of disorder probability >0.5 (above the red dotted line) were considered significant. The "Series" designations of the color-coded graphs are listed in Table 1. As explained in the Materials and Methods, the amino acid numbers (X-axis) do not refer to the full-length protein, but only to the displayed sequence portion, so as to provide a scale of the relative location and length of the two IDR peaks. However, the actual residue numbers are listed in Table 1 [...] in the same box as gamma to clearly demonstrate the difference in the peak, but also see [...]

COVID-19: the first documented coronavirus pandemic in history
Discovery of seven novel Mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavirus
Origin and evolution of pathogenic coronaviruses
The COVID-19 pandemic: A comprehensive review of taxonomy, genetics, epidemiology, diagnosis, treatment, and control
A pneumonia outbreak associated with a new coronavirus of probable bat origin
The molecular biology of coronaviruses
Modular organization of SARS coronavirus nucleocapsid protein
Multiple nucleic acid binding sites and intrinsic disorder of severe acute respiratory syndrome coronavirus nucleocapsid protein: Implications for ribonucleocapsid protein packaging
The SARS coronavirus nucleocapsid protein - Forms and functions
Phosphorylation of the arginine/serine dipeptide-rich motif of the severe acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization
Identification of in vivo-interacting domains of the murine coronavirus nucleocapsid protein
An interaction between the nucleocapsid protein and a component of the replicase-transcriptase complex is crucial for the infectivity of coronavirus genomic RNA
The coronavirus nucleocapsid is a multifunctional protein
Carboxyl terminus of severe acute respiratory syndrome coronavirus nucleocapsid protein: Self-association analysis and nucleic acid binding characterization
Specific interaction between coronavirus leader RNA and nucleocapsid protein
The virion N protein of infectious bronchitis virus is more phosphorylated than the N protein from infected cell lysates
The severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by 14-3-3-mediated translocation
Glycogen synthase kinase-3 regulates the phosphorylation of severe acute respiratory syndrome coronavirus nucleocapsid protein and viral replication
Identification of two ATR-dependent phosphorylation sites on coronavirus nucleocapsid protein with nonessential functions in viral replication and infectivity in cultured cells
Intrinsically unstructured proteins and their functions
Understanding protein non-folding
Function and structure of inherently disordered proteins
Natively unfolded proteins An overview
Analysis of multimerization of the SARS coronavirus nucleocapsid protein
Shell disorder analysis predicts greater resilience of the SARS-CoV-2 (COVID-19) outside the body and in body fluids
Protein disorder - a breakthrough invention of evolution?
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Bioinformatic analysis reveals conservation of intrinsic disorder in the linker sequences of prokaryotic dual-family immunophilin chaperones
PrDOS: prediction of disordered protein regions from amino acid sequence
Prediction of disordered regions in proteins based on the meta approach
PONDR-FIT: a meta-predictor of intrinsically disordered amino acids
comprehensive overview of computational protein disorder prediction methods
MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins
Disorder prediction methods, their applicability to different protein targets and their usefulness for guiding experimental studies
An overview of predictors for intrinsically disordered proteins over 2010-2014
Comprehensive review of methods for prediction of intrinsic disorder and its molecular functions
Prediction of boundaries between intrinsically ordered and disordered protein regions
Identification of a novel coronavirus from a beluga whale by using a panviral microarray
Discovery of a novel bottlenose dolphin coronavirus reveals a distinct species of marine mammal coronavirus in Gammacoronavirus
RNA virus mutations and fitness for survival
Biochemical characterization of SARS-CoV-2 nucleocapsid protein
SR proteins: a conserved family of pre-mRNA splicing factors
The SR protein family of splicing factors: master regulators of gene expression

• Nucleocapsid protein (N) is an essential structural protein of the coronavirus.
• N protein binds RNA and also interacts with several proteins in its multitasking role.
• The central region of N contains segmented intrinsically disordered regions (IDRs).
• The IDR segments exhibit coronavirus genus-specific arrangements