key: cord-0324834-mmibfhik
authors: Lal, Avantika; Ferrarini, Mariana Galvao; Gruber, Andreas J.
title: Investigating the human host - ssRNA virus interaction landscape using the SMEAGOL toolbox
date: 2021-12-03
journal: bioRxiv
DOI: 10.1101/2021.12.02.470930
sha: 1b3f492fe4a6c4bda068ef9f22af75fd12e6b9c4
doc_id: 324834
cord_uid: mmibfhik

Viruses are intracellular parasites that need their host cell to reproduce. Consequently, they have evolved numerous mechanisms to exploit the molecular machinery of their host cells, including the broad spectrum of host RNA-binding proteins (RBPs). However, the RBP interactome of viral genomes and the consequences of these interactions for infection are still to be mapped for most RNA viruses. To facilitate these efforts we have developed SMEAGOL, a fast and user-friendly toolbox to analyze the enrichment or depletion of RBP binding motifs across RNA sequences (https://github.com/gruber-sciencelab/SMEAGOL). To shed light on the interaction landscape of RNA viruses with human host cell RBPs at a large scale, we applied SMEAGOL to 197 single-stranded RNA (ssRNA) viral genome sequences. We find that the majority of ssRNA virus genomes are significantly enriched or depleted in binding motifs for human RBPs, suggesting selection pressure on these interactions. Our analysis provides an overview of potential virus - RBP interactions, covering the majority of ssRNA viral genomes fully sequenced to date, and represents a rich resource for studying host interactions vital to the virulence of ssRNA viruses. Our resource and the SMEAGOL toolbox will support future studies of virus / host interactions, ultimately feeding into better treatments.

According to Baltimore's classification, class IV and class V viruses have single-stranded RNA (ssRNA) genomes [1].
Whereas (+)ssRNA class IV viruses package the positive-sense genome that can be directly translated into protein by the translational machinery of the host cell, the (-)ssRNA class V viruses contain a negative-sense genome that needs to be transcribed into a positive-sense message before translation. ssRNA viruses interact with many host factors in the infected cells in order to facilitate viral replication, subgenomic RNA transcription, and translation of viral proteins. At the same time, host cellular factors detect viral RNA and activate intracellular signaling pathways leading to antiviral responses. Interactions between viral RNAs and host RNA-binding proteins (RBPs) are key to these processes.

ssRNA viruses such as the Hepatitis C virus (HCV), the Ebola virus, the Influenza virus, and the SARS-CoV-2 virus responsible for the ongoing COVID-19 pandemic are of high epidemiologic relevance. Understanding how these viruses interact with and impact host cells is key for designing means to combat these infections. A currently prominent example is the SARS-CoV-2 genome, which is bound by hundreds of human proteins [2,3]. More broadly, coronaviruses are known to co-opt human RBPs to promote their stability, translation and replication [4]. Furthermore, viral RNAs may also sequester RBPs to influence gene expression in the host. For instance, the Sindbis virus was found to "sponge" ELAVL1 RBP molecules via uridine (U)-rich elements (UREs) in its 3′ untranslated region (UTR), causing changes in splicing, polyadenylation and stability of host messenger RNAs (mRNAs) [5]. Although studies on RBPs and viral genomes point to the importance of RBP interaction networks in viral infections, genome-scale experimental and functional studies are relatively sparse and are cell type and condition specific.

In order to bridge this gap using computational predictions, we have developed SMEAGOL (Sequence Motif Enrichment And Genome annOtation Library), a Python library to analyze RBP binding motifs in nucleic acid sequences. SMEAGOL further identifies proteins whose binding motifs are significantly enriched or depleted in a sequence, thus highlighting the interactions that are most likely under evolutionary selection and therefore functionally significant. By applying SMEAGOL to 197 Group IV and Group V viral genomes we have constructed the first comprehensive resource for studying ssRNA virus/RBP interactions.

Identification of sequence motif enrichment / depletion using SMEAGOL

SMEAGOL (https://github.com/gruber-sciencelab/SMEAGOL) is a Python library designed for comprehensive motif occurrence analysis in nucleic acid sequences using position weight matrices (PWMs), which can represent the binding specificity of a variety of nucleic acid-interacting regulators. SMEAGOL can directly load PWMs from the ATtRACT and RBPDB databases of RBP binding specificities [6,7]. As curated databases of RBP binding motifs typically contain a mixture of high and low confidence PWMs, SMEAGOL also includes modules to analyze, filter, compare, cluster, and visualize PWMs. Moreover, SMEAGOL enables scanning of sequences with the curated PWMs and filtering of the results. Post-processing modules enable the calculation and visualization of statistical enrichment or depletion of sequence motifs, as well as predicted effects of sequence variants on PWM sites. An overview of the SMEAGOL functionalities is provided in Fig. 1.

[...] cluster-representative PWMs is provided as Supplementary Fig. 3.

Having observed significant differences between viral families, we next examined prominent families individually. Coronaviruses, which have the longest genomes of all viruses in our dataset (Supplementary Fig. 4, one-sided Wilcoxon rank-sum test U statistic = 1536, p = 8.5 × 10⁻⁷), also show more enrichment and depletion of binding motifs than any other family (Fig. 2a). Strikingly, the number of RBP motifs depleted on the plus strand is much higher than the number of enriched motifs. It is conceivable that given their long genomes, coronaviruses have to actively prevent being bound by non-beneficial host RBPs. On examination, we found that the striking number of depleted motifs in these genomes reflects depletion of U-rich elements bound by RBPs such as HNRNPC, RALY, CELF2, TIA1, ELAVL1 and PPIE (Fig. 3). This is despite coronaviruses being the most U-rich of all viruses in our dataset ([...]

Like cellular mRNAs, the genomes of (+)ssRNA viruses also contain 5′ and 3′ UTRs, which have been shown to bind host RBPs. While the 5′ UTR contains elements that regulate the efficiency and timing of translation initiation and viral replication, host factors binding to the 3′ UTR can be critical to many aspects of the life cycle of a virus, including but not limited to RNA replication and stability. Host RBPs also mediate 5′ UTR - 3′ UTR interactions, resulting in "circularization" of the viral genome [14,15].

Since the UTR regions have distinct regulatory functions from the remaining genome and their sequences are not constrained to code for proteins, we reasoned that they may be enriched for binding sites of specific RBPs relevant to their functions. These enrichments may not be detectable over the whole genome, and indeed may be canceled out since it may be detrimental for some UTR-specific proteins to bind elsewhere in the genome. We therefore repeated the analysis specifically for 5′ and 3′ UTR sequences of 89 (+)ssRNA viruses whose UTR positions were annotated.

Among the top ten significant RBPs whose motifs are enriched / depleted in the HCV genome, four interactions have already been experimentally validated. The RBP that is most significantly enriched in binding sites in the HCV genome is the ELAV-like RNA binding protein 1 (ELAVL1), also called HuR (Fig. 4b) [...]

[...] happen with the positive sense molecule. In the genome, we found motif depletion to be more common than enrichment (motifs for 32 RBPs were depleted while motifs for 18 RBPs were enriched), suggesting that SARS-CoV-2 has more antiviral interactions with human RBPs than pro-viral interactions. This prediction is consistent with experimental observations from CRISPR screens [2].

To place our predictions for specific RBPs in the context of experimental data, we collected a list of proteins that have been experimentally validated to bind to SARS-CoV-2 RNA in infected human or monkey cell lines in three studies [3,47,48]. PWMs for 41 of these were included in our study, and we computationally predicted binding sites for 40 of these 41 in the SARS-CoV-2 genome. We found motifs for 8 of these RBPs to be enriched while 13 were depleted, indicating that while some interacting RBPs bind to longer regions or an abundance of locations in the viral genome, others are overall depleted in binding sites in order to guarantee highly specific binding to well defined genomic loci.

We also compiled a list of experimentally validated antiviral and pro-viral RBPs from CRISPR or siRNA screens in SARS-CoV-2 infected cells [47,49]. PWMs for 17 known antiviral and 4 known pro-viral RBPs were included in our dataset. While we did not observe motif enrichment for the pro-viral proteins, motifs for 4 of the 17 antiviral proteins (RALY, ELAVL1, FUBP3, PCBP2) were depleted in the SARS-CoV-2 genome, suggesting that the viral genome may have evolved to avoid interaction with these defensive host proteins.
Motifs for an additional 4 antiviral RBPs (HNRNPA2B1, DAZAP1, TARDBP, PPIE) were also depleted at a more permissive FDR-adjusted p-value threshold of 0.1. Out of these, RALY, ELAVL1, FUBP3 and PPIE bind to UREs. Interestingly, although the predicted binding sites for numerous URE-binding RBPs are strongly depleted overall in the SARS-CoV-2 genome (Fig. 4e), the few binding sites that are predicted are significantly concentrated within a region in the NSP6 gene. In particular, a URE at position 11074 contains 3 of 5 predicted binding sites for the antiviral RBPs RALY and ELAVL1 (Fig. 4f).

Computational studies offer an opportunity to predict novel interactions that may not have been covered in the limited range of cell types and conditions that were studied experimentally. We identified strong (FDR-adjusted p-value < 0.05 and fold change >= 2) enrichment of motifs for 5 RBPs (SART3, PABPC1, NUPL2, SRSF2, ZRANB2) and strong depletion (FDR-adjusted p-value < 0.05 and fold change <= 0. [...]

[...] Fig. 7) to the URE at position 11074 (Fig. 4f). As discussed above, this URE is one of very few regions predicted to bind to the known antiviral RBPs RALY and ELAVL1, as well as CPEB4 which is validated to bind to the SARS-CoV-2 genome. Interestingly, the much less common T>C mutation at the same position is predicted to have a lesser effect on RBP binding (Supplementary Fig. 7 [...]

We found numerous RBP-binding motifs to be enriched or depleted in ssRNA viruses, including motifs that were enriched or depleted globally as well as in a family- or species-specific manner. The RBPs bound by these motifs include host splicing factors as well as RBPs that are known to regulate RNA stability. We report differences in predicted host interactions between viral families, with coronaviruses showing the highest levels of motif enrichment and depletion in their genomes.
Coronaviruses may have evolved to avoid being bound by specific RBPs as, given their length, most RBPs will bind the viral genome by chance in the absence of active selection against it. Further, we find an interesting pattern in the occurrence of UREs which bind numerous RBPs that regulate viral infection. These UREs are depleted in the genomes of coronaviruses (which are highly U-rich overall) and enriched in a few flaviviruses including HCV (which are overall depleted in uridine). Consistent with our findings, the ELAVL1 RBP that binds to these elements has been experimentally found to be antiviral in SARS-CoV-2 infection but pro-viral in HCV infection.

HCV causes chronic infections of the liver, which further lead to diseases such as liver cirrhosis and hepatocellular carcinoma. About 70 million people are chronically infected by HCV and a significant number of these will develop very severe diseases, such as liver cirrhosis and malignancies [57]. Previous studies confirm that at least four of the top ten RBPs enriched / depleted in binding sites within HCV do indeed bind to the viral genome. [...]

[...] indicating an antiviral effect of this RBP against SARS-CoV-2, though the mechanism is unclear.

We previously published an analysis of motif enrichment in the SARS-CoV-2 genome using a similar procedure [60]. Here, we improve upon the previous findings with a more rigorous statistical procedure including dinucleotide shuffling, using an expanded dataset of PWMs, and by placing the results in the context of other Coronavirus genomes and more recent functional studies. Our new analysis supports the observation that SARS-CoV-2 is more likely to form antiviral interactions with RBPs than pro-viral ones. Further, we find depletion of motifs for several known antiviral RBPs on the SARS-CoV-2 genome.

The "scan_sequence" function of SMEAGOL scans a given nucleic acid sequence by calculating the PWM match score for each position in the sequence. Specifically, at each position in the sequence, the subsequence of length k (where k is the length of the PWM) starting at the given position is taken, and the PWM match score is obtained by summing over the PWM log-likelihood ratios at each of the k positions, each time selecting the PWM element that corresponds to the nucleotide in the sequence. The score is then divided by the maximum possible score that could be obtained using that PWM [55]. We used this function to scan the downloaded ssRNA virus genome sequences, as well as their reverse complement sequences, with the 362 selected RBP PWMs, and identified putative binding sites with a score threshold of 0.8. We used the "enrich_in_genome" function in SMEAGOL to calculate a p-value for enrichment or depletion of each PWM in each viral genome.

The p-value is calculated as follows. For each genome and PWM combination, SMEAGOL counts the number of predicted binding sites. It then generates 1000 background sequences that have the same nucleotide and dinucleotide frequency as the genome, scans each background sequence, and counts the number of predicted binding sites in the background sequences to generate a background distribution. This is used to calculate the expected probability of finding a binding site in the query sequence based on its sequence composition alone. A two-sided binomial test is used to calculate the p-value, which is then adjusted for multiple testing using the Benjamini-Hochberg correction. For multi-segmented viral genomes, SMEAGOL calculates a single enrichment score across all segments.

PWMs with FDR-adjusted p-value < 0.05 were considered to be significantly enriched / depleted.
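
As a rough illustration of the statistical step described above, the following sketch computes a two-sided binomial p-value from an observed site count and a set of background site counts, and then applies the Benjamini-Hochberg adjustment. It is not SMEAGOL's implementation. The function names, the choice of the mean background count divided by the number of scanned positions as the per-position site probability, and the toy numbers are all assumptions made for this example, and the shuffling step that produces the background sequences is not shown. The hypothetical helper also returns the ratio of observed to expected sites.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def binom_test_two_sided(k, n, p):
    """Two-sided p-value: total probability of outcomes no more likely than k."""
    log_obs = binom_logpmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_i = binom_logpmf(i, n, p)
        if log_i <= log_obs + 1e-9:  # small tolerance for floating-point ties
            total += math.exp(log_i)
    return min(1.0, total)


def site_enrichment(observed_sites, background_site_counts, n_positions):
    """Hypothetical helper: p-value and observed/expected ratio for one PWM.

    Assumes at least one site was found across the background sequences.
    """
    expected = sum(background_site_counts) / len(background_site_counts)
    p_site = expected / n_positions  # assumed per-position site probability
    p_value = binom_test_two_sided(observed_sites, n_positions, p_site)
    return p_value, observed_sites / expected


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers (invented for illustration): 3 sites observed in a sequence with
# 1000 scanned positions, versus about 10 sites in each shuffled background.
p_value, ratio = site_enrichment(3, [9, 10, 11, 10], 1000)
```

In a full analysis, one p-value per PWM and genome would be collected and passed to `benjamini_hochberg` together, so that the 0.05 cutoff applies to the adjusted values. The two-sided convention used here (summing outcomes no more probable than the observed count) is one common choice and may differ from the one used by the toolbox.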
The ratio of the observed to the expected number of binding sites in the query sequence was used as a measure of effect size.

The local window enrichment plots in Fig. 4c and Fig. 4f were generated using the `enrich_in_sliding_windows` function of SMEAGOL. This function creates windows tiling over the entire genome (for the figures here, non-overlapping windows of 500 bp were used) and tests whether the number of predicted binding sites for an RBP in each window is significantly higher / lower than the expected number based on a model in which binding sites for the RBP are uniformly distributed across the genome. P-values are calculated using a two-sided Fisher's exact test and adjusted using the Benjamini-Hochberg procedure.

We downloaded information on 36,688 SARS-CoV-2 mutations from the GESS database (https://wan-bioinfo.shinyapps.io/GESS/) on September 14, 2021. We used the "variant_effect_on_sites" function in SMEAGOL to select mutations that intersect with the predicted binding sites of 10 selected RBPs, and calculate the PWM match score of each predicted binding site with and without the variant. We selected variants that reduce the score of a binding site to less than 0.5 as site-disrupting variants.
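
To make the scoring and the variant filter concrete, here is a minimal sketch of a normalized PWM match score, a scan with the 0.8 threshold, and a re-scoring of predicted sites that overlap a single-base substitution using the 0.5 cutoff. It is an illustration under stated assumptions rather than the SMEAGOL code: the function names, the representation of a PWM as a list of per-position weight dictionaries, the 0-based coordinates, the inclusive `>=` comparison at the site threshold, and the toy weights are all hypothetical.

```python
def match_score(pwm, subseq):
    """Sum of per-position PWM weights, divided by the best attainable sum."""
    raw = sum(column[base] for column, base in zip(pwm, subseq))
    best = sum(max(column.values()) for column in pwm)
    return raw / best


def scan_sites(pwm, seq, threshold=0.8):
    """Return (start, score) for every window scoring at or above threshold."""
    k = len(pwm)
    hits = []
    for start in range(len(seq) - k + 1):
        score = match_score(pwm, seq[start:start + k])
        if score >= threshold:
            hits.append((start, score))
    return hits


def score_variant(pwm, seq, pos, alt, site_threshold=0.8, disrupt_below=0.5):
    """Re-score predicted sites overlapping a single-base substitution."""
    k = len(pwm)
    mutated = seq[:pos] + alt + seq[pos + 1:]
    results = []
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    for start in range(first, last + 1):
        ref_score = match_score(pwm, seq[start:start + k])
        if ref_score < site_threshold:
            continue  # not a predicted site in the reference sequence
        alt_score = match_score(pwm, mutated[start:start + k])
        results.append((start, ref_score, alt_score, alt_score < disrupt_below))
    return results


# Toy U-rich PWM of length 4 (weights invented for illustration).
toy_pwm = [{"A": -2.0, "C": -2.0, "G": -2.0, "U": 1.0} for _ in range(4)]
genome = "ACUUUUGA"

sites = scan_sites(toy_pwm, genome)
effects = score_variant(toy_pwm, genome, pos=3, alt="C")
```

With this toy PWM, `sites` contains the single window starting at index 2, and the U-to-C substitution at index 3 lowers that window's score from 1.0 to 0.25, so it is flagged as site-disrupting.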

Expression of animal virus genomes
Discovery and functional interrogation of SARS-CoV-2 RNA-host protein interactions
The SARS-CoV-2 RNA-protein interactome in infected human cells
The interface between coronaviruses and host cell RNA biology: Novel potential insights for future therapeutic intervention
Cellular mRNA Stability, Splicing, and Polyadenylation through HuR Protein
RNA-binding specificities
RNA-binding proteins and associated motifs
A large-scale binding and functional map of human RNA-binding proteins
Complete Coding Genome Sequence for Mogiana Tick Virus, a Jingmenvirus Isolated from Ticks in Brazil
Temporally Distinct Host Factor Requirements
Proteins Bound to Dengue Viral RNA In Vivo Reveals New Host Proteins Important for
Circularization of flavivirus genomic RNA inhibits de novo translation initiation
Post-transcriptional Regulatory Functions of Mammalian Pumilio Proteins
Structures and Functions of the 3′ Untranslated Regions of Positive-Sense Single-Stranded RNA Viruses Infecting Humans and Animals
Diverse roles of host RNA binding proteins in RNA virus replication
HuR, a protein implicated in oncogene and growth factor mRNA decay, binds to the 3′ ends of hepatitis C virus RNA of both polarities
Identification of cellular factors associated with the 3′-nontranslated region of the hepatitis C virus genome
A Human Proteome Microarray Identifies that the Heterogeneous Nuclear Ribonucleoprotein K (hnRNP K) Recognizes the 5′ Terminal Sequence of the Hepatitis C Virus RNA
Inhibition of hepatitis C virus translation and subgenomic replication by siRNAs directed against highly conserved HCV sequence and cellular HCV cofactors
Cellular cofactors affecting hepatitis C virus infection and replication
The Elav-like protein HuR exerts translational control of viral internal ribosome entry sites
HuR Displaces Polypyrimidine Tract Binding Protein To Facilitate La Untranslated Region and Enhances Hepatitis C Virus Replication
Heterogeneous nuclear ribonucleoprotein I (hnRNP-I/PTB) terminus of hepatitis C viral RNA
hnRNP C and polypyrimidine tract-binding protein specifically interact with the pyrimidine-rich region within the 3′NTR of the HCV RNA genome
An internal polypyrimidine-tract-binding protein-binding site in the hepatitis C virus RNA attenuates translation, which is relieved by the 3′-untranslated sequence
The polypyrimidine tract-binding protein (PTB) is required for efficient replication of hepatitis C virus (HCV) RNA
Specific interaction of polypyrimidine tract-binding protein with the extreme 3′-terminal structure of the hepatitis C virus genome, the 3′X
Interaction of polypyrimidine tract-binding protein with the 5′ noncoding region of the hepatitis C virus RNA genome and its functional requirement in internal initiation of translation
The internal ribosome entry site (IRES) of hepatitis C virus visualized by electron microscopy
Mechanism of translation initiation on hepatitis C virus RNA
Down-regulation of viral replication by adenoviral-mediated expression of siRNA against cellular cofactors for hepatitis C virus
Role of La autoantigen and polypyrimidine tract-binding protein in HCV replication
Polypyrimidine tract-binding protein binds to the complementary strand of the mouse hepatitis virus 3′ untranslated region, thereby altering RNA conformation
A role for polypyrimidine tract binding protein in the establishment of focal adhesions
The determinants of RNA-binding specificity of the heterogeneous nuclear ribonucleoprotein C proteins
Mutational definition of RNA-binding and protein-protein interaction domains of heterogeneous nuclear RNP C1
iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution
The yin and yang of hepatitis C: synthesis and decay of hepatitis C virus RNA
The eIF-2 alpha protein kinases, regulators of translation in eukaryotes from yeasts to humans
Eukaryotic stress granules: the ins and outs of translation
Hepatitis C virus (HCV) induces formation of stress granules whose proteins regulate HCV RNA replication and virus assembly and egress
FUSE Binding Protein 1 Facilitates Persistent Hepatitis C Virus Replication in Hepatoma Cells by Regulating Tumor Suppressor p53
A host YB-1 ribonucleoprotein complex is hijacked by hepatitis C virus for the control of NS3-dependent particle production
Role of RNA-binding proteins during the late stages of Flavivirus replication cycle
YB-1 functions as a porter to lead influenza virus ribonucleoprotein complexes to microtubules
Investigation of function and regulation of the YB-1 cellular factor in HIV replication
The SARS-CoV-2 RNA interactome
Global analysis of protein-RNA interactions in SARS-CoV-2-infected cells reveals key regulators of infection
Genome-wide CRISPR Screens Reveal Host Factors Critical for SARS-CoV-2 Infection
GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences
A Potential SARS-CoV-2 Variant of Interest (VOI) Harboring Mutation E484K in the Spike Protein Was Identified within Lineage B.1.1.33 Circulating in Brazil
Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities
MOODS: fast search for position weight matrix matches in DNA sequences
TFBSTools: an R/bioconductor package for transcription factor binding site analysis
Applied bioinformatics for the identification of regulatory elements
Untreated Infected Population Could Be the Key to Elimination
The RNA-binding protein HuR/ELAVL1 regulates IFN-β mRNA abundance and the type I IFN response
HuR keeps interferon-β mRNA stable
Genome-wide bioinformatic analyses predict key host and viral factors in SARS-CoV-2 pathogenesis
Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk
DeepCLIP: predicting the effect of mutations on protein-RNA binding with deep learning
Technical Note on Transcription Factor Motif Discovery from Importance Scores (TF-MoDISco)
Database and Analytical Resources for
ViralZone: a knowledge resource to understand virus diversity
RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections

Supplementary Data Legends

Supplementary Data 1: Strains and genotypes included in this study
Supplementary Data 2: Locations of predicted RBP binding sites (sequences that match the PWM) on all viral genomes for all PWMs
Supplementary Data 3: Enrichment / depletion results for all viral genomes and PWMs
Supplementary Data 4: Enrichment / depletion results for all UTR sequences and PWMs
Supplementary Data 5: List of experimentally validated RBP effectors for SARS-CoV-2
Supplementary Data 6: Selected mutations in the SARS-CoV-2 genome

The authors thank Mihaela Zavolan for many constructive discussions and ideas that greatly improved this work.