key: cord-0846464-3j26iv5i authors: Farr, Elias B; Sattler, Julia M; Frischknecht, Friedrich title: SPOT: a web-tool enabling swift profiling of transcriptomes date: 2021-07-21 journal: Bioinformatics DOI: 10.1093/bioinformatics/btab541 sha: 46df44f53054bbd0d7d23ef5e319127e6d5d0e50 doc_id: 846464 cord_uid: 3j26iv5i

The increasing number of single cell and bulk RNAseq datasets describing complex gene expression profiles in different organisms, organs or cell types calls for an intuitive tool allowing rapid comparative analysis. Here, we present Swift Profiling Of Transcriptomes (SPOT) as a web tool that allows not only differential expression analysis but also fast ranking of genes fitting transcription profiles of interest. Based on a heuristic approach, the spot algorithm ranks the genes according to their proximity to the user-defined gene expression profile of interest. The best hits are visualized as a table, bar chart or dot plot and can be exported as an Excel file. While the tool is generally applicable, we tested it on RNAseq data from malaria parasites that undergo multiple stage transformations during their complex life cycle as well as on data from multiple human organs during development and cell lines infected by SARS-CoV-2. SPOT should enable non-bioinformaticians to easily analyse their own and any available dataset.

AVAILABILITY AND IMPLEMENTATION: SPOT is freely available for (academic) use at: https://frischknechtlab.shinyapps.io/SPOT/ and https://github.com/EliasFarr/SPOT.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Data pre-processing

SPOT comes preloaded with three datasets: a single cell RNAseq dataset describing the whole Plasmodium life cycle, a bulk RNAseq dataset containing information about human organs in several developmental stages and a bulk RNAseq dataset from a SARS-CoV-2 infected cell line (Cardoso-Moreira, et al., 2019; Howick, et al., 2019; Wyler, et al., 2021).
For spot analysis of the Plasmodium life cycle, TMMlog normalized single cells were assigned by ShortenedLifestage4 (Howick, et al., 2019) and averaged to obtain a single expression value for every developmental stage. Human organ TPM values were also condensed by averaging across replicates. Since the prenatal developmental stages are highly similar, only three prominent stages (4, 10 and 20 weeks post conception) were kept for further analysis. TPM values of the SARS-CoV-2 infected cell line were calculated using the gene lengths specified by the authors (Wyler, et al., 2021). The pre-loaded samples show one duplicate of every timepoint per series.

For spot and correlation ranking, data are scaled by standardization to make the datasets comparable. To lower the influence of outliers, every standardized value above 4 was set to 4 and every standardized value below -1 was set to -1. In contrast to spot, differential expression analysis used raw counts, filtered for cells having expression in more than 3 features.

The spot algorithm consists of two factors: (i) the difference between the weighted mean of the selected entities (slider values > 0) and the mean of the unselected entities (slider values = 0) and (ii) the difference between 1 and the mean of the unselected entities. For an optimal result, both factors should be as high as possible. The first factor measures the distance between the selected and unselected entities by calculating the difference of their mean expression values. To let users adapt the results to their needs, the mean value of the selected entities can be weighted. For this purpose, individual entities can be assigned to the mean several times by adjusting the slider values, thus increasing the influence of the respective entity. The higher the slider value, the more frequently the entity is assigned to the mean; the lower the slider value, the less frequently it is assigned.
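The pre-processing and the two factors described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the SPOT implementation: standardization is taken to be a per-gene z-score using the sample standard deviation, the two factors are assumed to be combined by multiplication, and all function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def standardize_and_clip(values, low=-1.0, high=4.0):
    """Z-score one gene across entities, then clip to [low, high].

    Assumption: per-gene standardization with the sample standard deviation.
    The clipping bounds -1 and 4 are the cutoffs stated in the text.
    """
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A gene with constant expression carries no profile information.
        return [0.0 for _ in values]
    return [min(high, max(low, (v - mu) / sd)) for v in values]


def spot_score(scaled, sliders):
    """Score one gene against a slider profile.

    Factor (i): weighted mean of selected entities minus mean of unselected.
    Factor (ii): 1 minus the mean of unselected entities.
    Assumption: the two factors are multiplied.
    """
    selected = [(x, w) for x, w in zip(scaled, sliders) if w > 0]
    unselected = [x for x, w in zip(scaled, sliders) if w == 0]
    if not selected or not unselected:
        raise ValueError("need at least one selected and one unselected entity")
    weighted_mean = sum(x * w for x, w in selected) / sum(w for _, w in selected)
    unselected_mean = mean(unselected)
    distance_factor = weighted_mean - unselected_mean
    background_factor = 1.0 - unselected_mean
    return distance_factor * background_factor


def rank_genes(expression, sliders):
    """Return (gene, score) pairs sorted from best to worst score."""
    scores = {
        gene: spot_score(standardize_and_clip(values), sliders)
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three genes measured in four entities (stages).
toy_expression = {
    "geneA": [0, 0, 0, 10],   # expressed only in the last entity
    "geneB": [5, 5, 5, 5],    # constant expression
    "geneC": [10, 0, 0, 0],   # expressed only in the first entity
}
toy_sliders = [0, 0, 0, 1]    # profile of interest: last entity only
toy_ranking = rank_genes(toy_expression, toy_sliders)
```

With this toy profile, geneA ranks first with a score of 3.0, the constant geneB scores 0, and geneC receives a negative score. Raising a slider above 1 simply gives that entity more weight in the mean of the selected entities.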
An example of how this weighting can improve the results is shown in Figure S1. The second factor measures how close the expression values of the unselected entities are to zero (desired in this approach). The closer they are, the less is subtracted from one and therefore the second factor increases. Due to the scaling, the mean can also take negative values. Taken together, both factors result in the spot score, which is therefore a measure of the proximity to the user-defined input. According to the definition and data pre-processing, spot scores can lie between -1 and 10; however, SPOT displays only positive ones. The highest values obtained in the pre-loaded datasets are around 8. With as the number of entities or columns, as the weights from the slider values and as the respective expression values, spot is defined as:

To enable complex profiles and medium values between high and low expression, we implemented an option in the user interface that simply calculates the Pearson correlation between the user-defined profile and the genes in the dataset. As a result of the cutoff (see chapter 1.1), the highest values in the datasets are equal to 4, while the lowest generally lie close to 0. The profile of interest (POI) defined by the slider values is matched with these normalized values, which allows searching for complex profiles with several subcategories between very low and high expression. The results of the correlation ranking when only one slider is moved can be seen in Figure S2. With correlation ranking, the mean expression values in unselected columns are higher than with the spot or DEA approaches (Figure S4).

Figure S1: Profile tuning highlights stage specifically expressed genes. (a) A SPOT search for genes highly expressed in ookinetes, oocysts, sporozoites and liver stage parasites, with all sliders of the respective stages set to 1 (all stages count equally for the weighted sum).
Consistent with this POI, the top genes, shown as a dot plot on the right, show high expression in ookinetes and oocysts and slightly lower expression in sporozoites and liver stages. Blood stages show almost no expression, as entered in the POI. (b) Searching for the same stages as in (a), but with different weights for each stage. Expression in liver stage parasites counts double in the weighted sum, while expression in sporozoites counts once and in ookinetes and oocysts only half. As a result, the top genes have high expression in liver stages, medium expression in sporozoites, ookinetes and oocysts and again almost no expression in blood stages. (c) A search similar to (b) but setting the slider for expression in sporozoites to two and the slider for liver stages to one. This POI yields top genes with strong expression in sporozoites and medium-high values in liver stages, ookinetes and oocysts. Again, blood stages show almost no expression. The circle size in the dot plot corresponds to the number of single cells in which the respective gene is detected; the color of the dot represents the average expression as indicated by the heat bar on the right.

Single cell datasets such as the data derived from the Malaria Cell Atlas are loaded into a Seurat object and identities are specified according to the ShortenedLifestage4 clustering (Howick, et al., 2019). The Seurat function FindMarkers then compares two groups of entities entered by the user with different test methods (see Section 2). The output table displays the Bonferroni-corrected p-value and the log fold change. If desired by users, implementation of further analysis methods is readily feasible.

Since there are several approaches to detect significantly up-regulated genes, we compared the spot and correlation algorithms with state-of-the-art methods of differential expression analysis in terms of speed and accuracy.
The speed was tested on 3 example predictions, which are described in the following sections. We measured the time between input of the profile and output of the results on several CPUs. The results in Figure S3 show that the simpler the ranking method, the faster it runs. The spot and correlation rankings finished the analysis within 3 seconds, and among the DEA methods the Wilcoxon Rank Sum test outperformed the others, such as DESeq2 and MAST. Please note that performance may vary between different operating system/browser combinations.

Accuracy tests were performed by measuring the mean expression of the best ranked candidates in selected and unselected entities of the example predictions (Figure S4). In general, the more entities are selected, the closer the mean expression values of selected and unselected entities tend to be in the example predictions. The results in the three predictions show similar accuracy of all methods with only subtle differences. Since the Wilcoxon Rank Sum test is the fastest DEA method (Figure S3), has the best accuracy amongst unselected entities (Figure S4) and has also shown satisfactory results in the literature (Soneson and Robinson, 2018), it is used as the default in the web tool.

Figure S4: Comparison of selected and unselected entities reveals best results for spot. Ranking accuracy in example predictions (EP) was determined by calculating the mean expression of selected (SE) and unselected (UE) entities. Overall accuracy is similar, with MAST seemingly the best algorithm for finding genes with high expression in selected entities; spot and the Wilcoxon test score best in unselected entities.

Since there are multiple tools available for DEA (Ge, et al., 2018; Reyes, et al., 2019), we would like to point out the strengths and weaknesses of the approaches (Table S1).
While both approaches show satisfactory results in ranking of genes (Figure S4), there are differences in speed, visualization and interface design. Although the multitude of available visualization methods gives iDEP and GENAVi an advantage over SPOT, easier handling and faster calculations make SPOT a viable alternative. SPOT might well be the better choice for users more interested in a brief overview than in an in-depth analysis. Thus, both approaches have their advantages and disadvantages and give users the opportunity to choose according to their needs.

There is still no efficient malaria vaccine available (2015; Duffy and Patrick Gorres, 2020). Different approaches are currently being explored that aim to target parasites in the disease-causing blood stages or during transmission to and from the mosquito. Transmission blocking vaccines focus, for example, on generating antibodies against proteins present on the surface of gametes, which can block the transmission of Plasmodium to the mosquito (Carter and Chen, 1976; Gwadz, 1976), while blood stage vaccines aim, for example, at blocking parasite entry into red blood cells. In addition to such subunit vaccines, attenuated parasites are also being explored. Attenuated parasites are most often generated by genetic modification of genes functional during liver stage development, which can lead to a developmental arrest in hepatocytes (Kumar, et al., 2016; Mueller, et al., 2005). The specific genes are usually determined by differential expression analysis (Kaiser, et al., 2004; Matuschewski, et al., 2002). However, none of the approaches has yet succeeded in inducing sufficient protective immune responses (Duffy and Patrick Gorres, 2020). Here we present the results of a SPOT search for genes highly expressed exclusively in the liver stages. Since the genes with such expression profiles are well known (Caldelari, et al., 2019; Stanway, et al., 2019), this prediction can serve as proof-of-concept.
The SPOT search for genes with high expression during liver stage development yielded the top ten genes shown in Table S2. Liver specific proteins 1 and 2 appear among the first 3 ranks, while 3 genes with unknown functions appear in the top 6. For all known genes except for the gametocyte specific protein and pyruvate dehydrogenase E1 component subunit beta, the literature suggests strong liver stage expression (Ishino, et al., 2009; Orito, et al., 2013; Stanway, et al., 2019; Vaughan, et al., 2009). For PBANKA1003900, previously annotated as gametocyte-specific protein, recent analysis led to the suggestion of renaming it liver-specific protein due to its liver specific expression profile (Caldelari, et al., 2019; Deligianni, et al., 2018).

To check the quality of the spot ranking, we compared the best performing genes of the spot ranking with those of the DEA methods. The first 100 genes in the spot ranking are significantly upregulated according to the Wilcoxon test, and 14 of the top 20 genes overlap between the two rankings (Figure S5). The best performing candidates of the spot and Wilcoxon rankings are displayed in Figure S6. They reveal strong expression in the Plasmodium liver stage as well as weaker expression in trophozoites and oocysts. While the genes performing best in the spot ranking show higher expression values in oocysts, the genes ranked by the Wilcoxon test show more expression in trophozoites. As shown in Figure S6, there is less overall expression in unselected columns of the Wilcoxon ranking, while the spot ranked genes have higher expression in the liver stage.

Drug development against malaria focuses mainly on the blood stages of malaria, as these cause the symptoms of the disease and are easily accessible in the laboratory.
Since the eradication of malaria has come into the focus of drug design, drugs are also being developed against sexual stages to prevent transmission to the mosquito, or against liver stages to prevent development of disease-causing blood stages (Kappe, et al., 2010). Drugs targeting the liver stages are routinely used to cure Plasmodium vivax infections (Flannery, et al., 2018). However, there are no specific drugs available that target gametocytes. Identifying additional target proteins is therefore of interest (The malERA Consultative Group on Drugs, 2011). Here we present a SPOT search for genes expressed only in sexual blood stages, which makes a function in these stages highly probable.

Sexual stage specific spot ranking revealed 3 unknown proteins as well as a wide variety of enzymes, surface proteins and proteins involved in osmiophilic body formation. P. falciparum plasmepsin VIII is known to be active specifically in gametocytes, while the p48/45 protein (PBANKA_1359600) is a well characterized surface protein and still under consideration as a target for transmission blocking vaccines (Acquah, et al., 2019; Jiang, et al., 2020; Lee, et al., 2020; van Dijk, et al., 2001; Weißbach, et al., 2017). Two other interesting candidates are associated with the emergence of osmiophilic bodies, specialized vesicles essential for parasite egress from blood cells, and another one was shown to be important in the cell cycle in ookinetes and oocysts (Bushell, et al., 2009; Kehrer, et al., 2016; Ponzi, et al., 2009). It is therefore hard to predict the function or localization of the 3 unknown genes. However, the interesting phenotypes of proteins with a similar transcription profile suggest that these proteins may have a functional role in mosquito infection as well.
5 Example prediction C: Genes expressed in asexual and sexual blood stages of Plasmodium

Asexual and sexual blood stages share the red cell as a host, yet their biology differs dramatically, as has also been shown by shifting expression profiles. Here we present a prediction of genes only active in asexual or sexual blood stages but not in other life cycle stages. This should counter-select genes involved in general replication mechanisms.

The results of the third example prediction shown in Table S4 display a multitude of membrane proteins or proteins associated with membrane trafficking and virulence. While the membrane associated histidine-rich protein 1a (MAHRP1a) and the skeleton-binding protein 1 (SBP1) are known to play a role in the transport of parasite proteins to the surface (Blisnick, et al., 2000; De Niz, et al., 2016; Maier, et al., 2007; Spycher, et al., 2003), the function of the ETRAMP protein family is more diverse and needs further evaluation (MacKellar, et al., 2011; Spielmann, et al., 2003). This also applies to the fam-b protein and the p1/s1 nuclease, about which little is known, as well as to the two unknown proteins that are candidates for further analysis.

In contrast to the previous predictions, a substantially lower number of genes was significantly upregulated according to the Wilcoxon test (Figure S9a). This might reflect the fact that many genes needed for proliferation will also be expressed in oocysts and liver stages. Nevertheless, the large majority of genes derived from spot ranking still overlapped with these genes. The overlap between the top 20 genes from Wilcoxon and spot ranking is 50%.
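The overlap figures reported for the example predictions (for instance, 50% of the top 20 genes here) amount to a set intersection of the two ranked gene lists. The following is a small illustrative sketch; the function name and the gene identifiers are hypothetical and not taken from SPOT.

```python
def top_k_overlap(ranking_a, ranking_b, k=20):
    """Return the genes shared by the top k of two rankings and their fraction.

    Both rankings are lists of gene identifiers ordered from best to worst.
    """
    shared = set(ranking_a[:k]) & set(ranking_b[:k])
    return sorted(shared), len(shared) / k


# Hypothetical rankings: the first shares its positions 11-20 with the
# top 10 of the second.
spot_ranking = [f"gene{i:02d}" for i in range(1, 31)]
wilcoxon_ranking = [f"gene{i:02d}" for i in range(11, 41)]

shared_genes, overlap_fraction = top_k_overlap(spot_ranking, wilcoxon_ranking)
```

Comparing sets rather than positions ignores the order within the top k, which matches how the overlaps are reported in the text.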
Transmission-Blocking Vaccines: Old Friends and New Prospects
Pfsbp1, a Maurer's cleft Plasmodium falciparum protein, is associated with the erythrocyte skeleton
Paternal Effect of the Nuclear Formin-like Protein MISFIT on Plasmodium Development in the Mosquito Vector
Transcriptome analysis of Plasmodium berghei during exo-erythrocytic development
Gene expression across mammalian organ development
Malaria transmission blocked by immunisation with gametes of the malaria parasite
The machinery underlying malaria parasite virulence is conserved between rodent and human malaria parasites
Sequence and functional divergence of gametocyte-specific parasitophorous vacuole membrane proteins in Plasmodium parasites
Malaria vaccines since 2000: progress, priorities, products
Assessing drug efficacy against Plasmodium falciparum liver stages in vivo
iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data
Successful immunization against the sexual stages of Plasmodium gallinaceum
The Malaria Cell Atlas: Single parasite transcriptomes across the complete Plasmodium life cycle
LISP1 is important for the egress of Plasmodium berghei parasites from liver cells
An intracellular membrane protein GEP1 regulates xanthurenic acid induced gametogenesis of malaria parasites
Differential transcriptome profiling identifies Plasmodium genes encoding pre-erythrocytic stage-specific proteins
That Was Then But This Is Now: Malaria Research in the Time of an Eradication Agenda
Proteomic Analysis of the Plasmodium berghei Gametocyte Egressome and Vesicular bioID of Osmiophilic Body Proteins Identifies Merozoite TRAP-like Protein (MTRAP) as an Essential Factor for Parasite Transmission
Protective efficacy and safety of liver stage attenuated malaria parasites
A C-terminal Pfs48/45 malaria transmission-blocking vaccine candidate produced in the baculovirus expression system
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
A systematic analysis of the early transcribed membrane protein family throughout the life cycle of Plasmodium yoelii
Skeleton-binding protein 1 functions at the parasitophorous vacuole membrane to traffic PfEMP1 to the Plasmodium falciparum-infected erythrocyte surface
Infectivity-associated changes in the transcriptional repertoire of the malaria parasite sporozoite stage
Genetically modified Plasmodium parasites as a protective experimental malaria vaccine
Liver-specific protein 2: a Plasmodium protein exported to the hepatocyte cytoplasm and required for merozoite formation
GENAVi: a shiny web application for gene expression normalization, analysis and visualization
Egress of Plasmodium berghei gametes from their host erythrocyte is mediated by the MDV-1/PEG3 protein
Efficacy and safety of RTS,S/AS01 malaria vaccine with or without a booster dose in infants and children in Africa: final results of a phase 3, individually randomised, controlled trial
Bias, robustness and scalability in single-cell differential expression analysis
etramps, a new Plasmodium falciparum gene family coding for developmentally regulated and highly charged membrane proteins located at the parasite-host cell interface
MAHRP-1, a novel Plasmodium falciparum histidine-rich protein, binds ferriprotoporphyrin IX and localizes to the Maurer's clefts
Genome-Scale Identification of Essential Metabolic Processes for Targeting the Plasmodium Liver Stage
A central role for P48/45 in malaria parasite male gamete fertility
Type II fatty acid synthesis is essential only for malaria parasite late liver stage development
Transcript and protein expression analysis of proteases in the blood stages of Plasmodium falciparum
Transcriptomic profiling of SARS-CoV-2 infected human cell lines identifies HSP90 as target for COVID-19 therapy