key: cord-0897565-52h6fahn
authors: Fujisawa, Kota; Shimo, Mamoru; Taguchi, Y.-H.; Ikematsu, Shinya; Miyata, Ryota
title: PCA-based unsupervised feature extraction for gene expression analysis of COVID-19 patients
date: 2021-08-30
journal: Sci Rep
DOI: 10.1038/s41598-021-95698-w
sha: cc21af0b0c02ab1e9e18622f66408973757653fb
doc_id: 897565
cord_uid: 52h6fahn

Coronavirus disease 2019 (COVID-19) is raging worldwide. This potentially fatal infectious disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, the complete mechanism of COVID-19 is not well understood. Therefore, we analyzed gene expression profiles of COVID-19 patients to identify disease-related genes through an innovative machine learning method that enables a data-driven strategy for gene selection from a data set with a small number of samples and many candidates. Principal-component-analysis-based unsupervised feature extraction (PCAUFE) was applied to the RNA expression profiles of 16 COVID-19 patients and 18 healthy control subjects. The results identified 123 genes as critical for COVID-19 progression from 60,683 candidate probes, including immune-related genes. The 123 genes were enriched in binding sites for transcription factors NFKB1 and RELA, which are involved in various biological phenomena such as immune response and cell survival: the primary mediator of canonical nuclear factor-kappa B (NF-κB) activity is the heterodimer RelA-p50. The genes were also enriched in histone modification H3K36me3, and they largely overlapped the target genes of NFKB1 and RELA. We found that the overlapping genes were downregulated in COVID-19 patients. These results suggest that canonical NF-κB activity was suppressed by H3K36me3 in COVID-19 patient blood.

Application of PCAUFE to gene expression in COVID-19 patients. This section describes how we used PCAUFE to analyze the gene expression patterns of multiple COVID-19 patients.

The first example (data set 1, GSE152418) was obtained by Arunachalam et al 28 . It includes five severity categories: ICU patients (IP), severe patients (SP), moderate patients (MP), convalescent patients (CP), and healthy controls (HC). By investigating the principal component (PC) loadings that statistically differentiated the group of IP + SP + MP (16 patients) from the group of CP + HC (18 non-patients), we found the second and third PCs (PC2 and PC3). The P values computed with a t-test rejected the null hypothesizes that the mean loadings within the group of IP+SP+MP and within the group of CP+HC were identical: 9.69 × 10 −5 for PC2 and 3.67 × 10 −3 for PC3. Although the PC1 loadings were the significantly different between patients and non-patients, the P value (1.83 × 10 −2 ) was larger than those of PCs 2 and 3. As also shown in Fig. S1 , the 2nd and 3rd PCs more clearly separated samples into patients and non-patients than the first one. This is the reason why we chose PCs 2 and 3, but not 1. On this plane, we selected 141 probes embedded in the PC scores as outliers according to a χ 2 test with the P values adjusted by the Benjamini and Hochberg (BH) criterion 29 . Table 1 lists all 123 genes associated with the 141 probes.

To confirm that we successfully selected critical genes representing the relationship between samples, we built the model to predict the COVID-19 patients or not from only the 123 genes selected with PCAUFE. We used data set 2, which consisted of 100 COVID-19 patients and 26 non-COVID-19 ones, to calculate the area under the curve (AUC) 30 . We used logistic regression (LR) 31 , support vector machine (SVM) 32, 33 , and random forest (RF) 34 as classification models. Table S1 shows each hyperparameter of the three models. We performed 5-fold cross-validation by randomly shuffling the samples of data set 2 for the three classifiers. Figure 2 (a) shows the receiver operating characteristic (ROC) curves 35 ,36 of each model. As shown in this figure, the average AUC for each model was derived to be above 0.9. From these results, we could use the 123 selected genes for the prediction of the COVID-19 outcome.

Comparison with other gene selection methodologies. To confirm the robustness of our results, we performed gene selection with two other classical methods: significance analysis of microarrays (SAM) 37 and linear models for microarray data (LIMMA) 38 . By applying both SAM and LIMMA to data set 1 (GSE152418), we identified genes associated with adjusted P-values below 0.01. We confirmed that the majority of the genes selected by PCAUFE were included among those selected by SAM and LIMMA. Every time SAM was applied, the selected genes changed. Thus, the results of SAM are not shown in this report. As for LIMMA, the selected Figure 1 . Outline of this study. First, we select genes related to the disease by using unsupervised machine learning; then, we enrich them by applying biological knowledge to identify the transcription factors and histone modifications of the selected genes. www.nature.com/scientificreports/ probes included all of the genes selected by PCAUFE. It was noteworthy that PCAUFE could limit the candidate genes to a much smaller number than the common gene expression analysis tools could; for example, 18,458 probes were selected by LIMMA. By further limiting the 18,458 probes to almost the same number as in PCAUFE from the smallest adjusted P-values, we also performed the patient/non-patient classification from the genes selected by LIMMA using data set 2. The results using LR, SVM and RF are shown in Fig. 2 (b). We confirmed that the classification performances of each model were comparable to those in PCAUFE.

To increase the robustness of our results, we also selected genes using some more recent R packages: edgeR 39 and DESeq2 40 . As with SAM and LIMMA, by applying both edgeR and DESeq2 to data set 1, we identified genes associated with adjusted P-values below 0.01. The numbers of probes selected by edgeR and DESeq2 were 4452 and 5696, respectively. Thus, these methods selected much more genes than PCAUFE. The genes selected by edgeR and DESeq2 contained the 59 and 64 genes selected by PCAUFE, respectively. The common genes selected among the three methods were 57. For further comparison with PCAUFE, we conducted the classification analysis to predict whether the sample was the COVID-19 patient or not based on the genes selected by edgeR and DEseq2. In the classification analysis, we limited the same number of probes selected by each of edgeR and DEseq2 with the smallest adjusted P-values as in PCAUFE (i.e., 141), since both methods selected too many genes to use them for explanatory variables of the prediction models (i.e., 4452 and 5696). The numbers of genes associated with those probes were 111 for edgeR and 113 for DEseq2, respectively. Figures 2(c) and (d) shows the ROC curves of the patient-prediction models using the limited probes. Their AUCs were approximately equal to those in PCAUFE. As described above, we found that a smaller number of genes selected by PCAUFE than the other methods were significant in predicting the COVID-19 patients or not.

As the last part of this subsection, we conducted a weighted gene co-expression network analysis (WGCNA). Following the analytic procedure used in 41 and 42 , we applied the WGCNA R package 43 to data set 1. We here selected power of β = 7 as the soft threshold for constructing a scale-free network ( Supplementary Fig. S2(A) ). We then obtained 99 modules in the co-expression network as shown in Supplementary Fig. S2 (B). These modules included 18882 probes, which were much more than those selected by PCAUFE (i.e., 141). Moreover, almost half (i.e., 58) of the genes selected by PCAUFE were contained in those by WGCNA. For reference, the proteinprotein interaction (PPI) networks consisting of the genes that belonged to the top 3 modules with the smallest P values and the results of enrichment analyses for these genes are also shown in Supplementary Fig. S3 and Table S2 , respectively.

As demonstrated above, we verified our approach, in which PCAUFE was adopted for gene selection, could narrow down the candidate genes more effectively than ordinary methods such as WGCNA, edgeR, and DEseq2.

In this study, we first identified 123 genes related to COVID-19 patients using PCAUFE. We justified the use of the linear dimensional reduction method by the following supplementary analysis: To compare with PCAUFE, we also applied t-distributed stochastic neighborhood embedding (t-SNE) 44 and uniform manifold approximation and projection (UMAP) [45] [46] [47] , two typical nonlinear dimension reduction methods, to dataset 1 for gene Table 1 . One hundred and twenty-three genes selected by PCAUFE. All of these genes were also selected by LIMMA. SUN2  TAGAP  TLN1  TMEM66  TNFAIP3  TNFRSF1B   TSC22D3  TUBA1A  TYMP www.nature.com/scientificreports/ selection. In the same manner with the PCAUFE algorithm, we tried to search the probes which passed the χ 2 test with the P value adjusted by the BH criterion. As shown in Fig. S4 , however, we found no probes with the adjusted P values less than 0.01 on the planes. The first algorithm of PCAUFE searches for principal components (i.e., axes) whose loadings statistically separate two groups such as patients and non-patients. Therefore, the outliers through the χ 2 test in the second algorithm of PCAUFE could be regarded as the probes that were abnormally up-or down-regulated in some patients compared to non-patients. On the other hand, t-SNE and UMAP emphasize preserving the similarity of the probes as a distance when reducing a high-dimensional data set to two dimensions, the axes in the low-dimensional space do not always correspond to the two groups. Thus, it is not suitable for gene selection to naively use these nonlinear dimensional reduction methods because it does not mean that the probes located far from the origin in the low-dimensional space can serve as the biomarkers for the diagnosis of COVID-19. Furthermore, it should be noted that the use of state-of-the-art deep learning techniques is not always successful in the gene expression analysis of a new type of disease such as COVID-19: We also applied a recently , respectively. Note that since the number of probes respectively selected by LIMMA, edgeR, and DESeq2 were all above 4000, too much more than the samples, the probes with the smallest adjusted. P values were further restricted to be almost the same number as in PCAUFE, 141, for the explanatory variables. Using each classification model, we respectively performed fivefold cross-validation and reported the averages and standard deviations of the 5 runs. 48 , that was described as reliably extracting representative information of each cell, to data set 1 for comparison to PCAUFE. Supplementary Figure S5 displays the scatter plot of samples of data set 1 in the dimension reduction space of scDHA. As shown in this figure, scDHA could not separate the COVID-19 patients from non-patients based on gene expression profiles of PBMCs at all. The reason is probably that this data set has too few samples and too many variables for training the autoencoder. PCAUFE is computationally less expensive than other methods because it only requires one application of PCA to a gene expression matrix in a peculiar way. Therefore, it has been successfully used to tackle a variety of gene selection problems (for detail, see e.g., 49 ). As described above, we consequently demonstrated the usefulness of this non-novel but powerful gene selection method for the data sets of gene expression profiles from COVID-19 patients and proposed a novel mechanism underlying the COVID-19 development.

We second implemented three independent models to classify COVID-19 patients and non-patients based on the 123 genes selected by PCAUFE: LR, SVM, and RF, and confirmed that all the models archived the high AUCs of over 90%. The reason why we did not use the state-of-art deep learning techniques for the classification model was that the sample size in the dataset used for the cross-validation (i.e., 126) was not large enough for the number of explanatory variables (i.e., 123). For example, it may be possible to build the classification model with deep learning by virtually increasing the number of samples as in 50 . Liu et al. 51 used convolutional neural networks (CNNs) to predict Alzheimer's patients based on the fMRI images of their hippocampus, but we did not use them because the explanatory variables in our patient prediction model were gene expression levels, in which the similarities could not be assumed between elements close in location as in images. However, as mentioned above, the statistical relevance of the genes selected by PCAUFE is already guaranteed because we found the conventional machine learning models had sufficient prediction accuracies.

To show the further robustness of our results, we also perform gene selection using data set 2 and clustering analysis to validate the separability by the selected genes using data set 1. The numbers of genes selected from data set 2 by PCAUFE, LIMMA, edgeR, and DESeq2 were 145, 7360, 4809, and 5018, respectively. LIMMA, edgeR, and DESeq2 respectively included the 79, 82, and 82 of 145 genes selected by PCAUFE using data set 2. On the other hand, the number of genes overlapping between the 123 and 145 genes selected by PCAUFE was 38. For the clustering analysis, we adopted an unsupervised learning model, UMAP, since data set 1 included only 34 samples, too few to train and test the supervised learning models using it. The results shown in Supplementary Fig. S6 indicate that the genes selected by PCAUFE could classify the COVID-19 patients or not as well as LIMMA, edgeR, and DESeq2. In the end, we got similar results even if we switched the data sets for gene selection and patient/non-patient classification.

Because we successfully confirmed the robustness of our results, we next investigated the biological reliability of the 123 selected genes. First, we uploaded the 123 genes to three enrichment analysis servers, GeneSetDB 52 , Metascape 53 , and TargetMine 54 , to compensate for the bias introduced by each individual enrichment. Multiple immune-related enrichments were detected. For example, Gene Ontology (GO) biological process (BP) terms GO:0019221 (cytokine-mediated signaling pathway), GO:0060333 (interferon-gamma-mediated signaling pathway), and GO:0060337 (type I interferon-mediated signaling pathway) were identified by all three servers. GO cellular component (CC) term GO:0042613 (MHC class II protein complex) was identified by GeneSetDB and TargetMine. Reactome pathways R-HSA-877300 (interferon gamma signaling), R-HSA-6785807 (Interleukin-4 and Interleukin-13 signaling), R-HSA-449147 (signaling by interleukins), and R-HSA-1280218 (adaptive immune system) were identified by Metascape and TargetMine (for more details, see supplemental S1_File).

Second, we confirmed biological validation of the identified genes by examining the interactions between them. Tight relationships between the genes would indicate that the gene selection was reliable, because single proteins rarely function without collaboration with other proteins. Thus, we uploaded the 123 genes to the STRING server 55 , which detected 659 protein-protein interactions among the products of these genes. Therefore, the 123 genes were also enriched for protein-protein interactions because of the functional collaborations between their products. These enrichment analyses suggested that PCAUFE could successfully identify a biologically feasible set of genes related to COVID-19.

To investigate the upstream transcription factors (TFs) that regulate the 123 genes selected by PCAUFE, we also uploaded them to Enrichr 56,57 , a multi-functional enrichment analysis server. Among the results given by Enrichr, NFKB1 and RELA had smaller adjusted P-values for "TRRUST Transcription Factors 2019, " as shown in Fig. 3 . We also noticed the three highest-ranked TF bindings for "ENCODE TF ChIP-seq 2015": NELFE, RELA, and KAT2A (for more details, see Fig. S7 in the supplemental S2_File).

The 123 genes were also enriched for multiple histone modifications, and the results are listed in Table 2 . Furthermore, as shown in Fig. 4 , the genes associated with the histone modifications largely overlapped the TF target genes.

The nuclear factor-kappa B (NF-κB) TFs play an evolutionarily conserved and critical role in the triggering and coordination of both innate and adaptive immune responses 58 . The NF-κB family of transcription factors consists of five members: p50, p52, p65 (RelA), c-Rel, and RelB, which are encoded by NFKB1, NFKB2, RELA, REL, and RELB, respectively 59 . The primary mediator of canonical NF-κB activity is the heterodimer RelA-p50, which consists of the RelA transcriptional activator and the nfkb1 protein p50 60,61 .

Nakshatri et al. 62 suggested that NF-κB activity is suppressed by H3K36me3, which is consistent with the observed enrichment of NFKB1-and RELA-binding sites in these 123 genes. Many studies have also reported that the expression levels of genes associated with immune signaling are downregulated in naso/oropharyngeal swabs and peripheral blood mononuclear cells (PBMCs) in patients with COVID-19. Mick et al. 63 showed that COVID-19 is characterized by a diminished innate immune response, with reduced expression of genes involved in toll-like receptor and interleukin signaling, chemokine binding, neutrophil degranulation, and interactions with lymphoid cells, as compared to other viral acute respiratory illnesses. Meckiff et al. 64 65 reported that the genes that are underexpressed in severe cases mainly involve Th17-cell differentiation, cytokine-mediated signaling pathways, and T-cell activation. Li et al. 66 reported that proteins mediating T-cell receptor signaling are downregulated in severe COVID-19 PBMCs.

To investigate the expression variation of the five overlapping genes in Fig. 4 , we confirmed the PC2 and PC3 scores for data set 1 via the scatter plot shown in Fig. 5 . The probes that were negatively located for both PC2 and PC3 were mainly upregulated for COVID-19. On the other hand, the probes that were positively located for both PCs were mainly downregulated for COVID-19. As shown by the red squares in Fig. 5 , the overlapping genes in Fig. 4 were positively located for both PCs. Therefore, those overlapping genes were downregulated for COVID-19. These analysis results are consistent with the above references [62] [63] [64] [65] [66] . www.nature.com/scientificreports/ NF-κB is the subject of much active research among pharmaceutical companies as a target for anti-cancer therapy 67 . Abnormal expressions of NFKB1 and RELA are mediated through mRNA modifications 68 . The recent progress in N4-Acetylcytidine (N4A) on RNA expression is also playing key role on the cancer development 69 . Duan J et al. 70 reported that N4A, a nucleoside metabolite, activated microglia and sustained NLRP3 inflammasome activation by inducing HMGB1 signaling. Released HMGB1 through N4A activated NF-κB and induced NLRP3 expression 70 . NLRP3 inflammasome appropriately activated and enabled to release mature IL-1β [71] [72] [73] . IL-1β is an important mediator of the inflammatory response, and is involved in a variety of cellular activities, including cell proliferation, differentiation, and apoptosis 74, 75 . www.nature.com/scientificreports/ Although we identified the 123 genes and the upstream TFs related to COVID-19 by using PCAUFE and enrichment analyses, we have yet assessed whether these genes have the potential causal effects on the COVID-19 development. Mendelian randomization (MR) [76] [77] [78] [79] [80] approach has been widely used to investigate causality between genes and disease outcomes. For example, Zhang et al. 79 investigated the causal relationships between PTSD and the depressive phenotypes using an MR approach. In another of their studies 80 , The results of MR analysis indicate that genetic variation mediates the causal influences of neuroticism on mental health and cardiovascular diseases. This method uses genetic polymorphism information as an operating variable, but unfortunately, the data sets we used do not include that information. Moreover, polymorphisms may have several phenotypic effects associated with the disease. Thus, we leave the application of the MR approach to the gene expression profiles from COVID-19 patients as future work.

In conclusion, we selected 123 COVID-19-related genes by applying PCAUFE to the gene expression levels of PBMCs from COVID-19 patients and healthy subjects. Then, by enrichment analysis, we identified the transcription factors and histone modifications that regulate the expression of these genes. Four transcription factors, NELFE, RELA, KAT2A and NFKB1, and a histone modification, H3K36me3, may be involved in the expression of the 123 genes. NFKB1, RELA, and H3K36me3 were found to overlap in the genes regulating expression. These two transcription factors are associated with NF-κB, and H3K36me3 may repress it. In fact, when we compared the expression levels of the genes duplicated in NFKB1, RELA, and H3K36me3 in GSE152418 between the COVID-19 patients and healthy subjects, we observed a decrease in expression levels in the COVID-19 patients. These results suggest that canonical NF-κB activity is suppressed by H3K36me3 in the PBMCs of COVID-19 patients.

Gene expression profiles. Two in vivo gene expression data sets, GSE152418 28 and GSE157103 81 , were downloaded from Gene Expression Omnibus 82 . Hereafter, we denote these as data sets 1 and 2, respectively. PCAUFE was applied to data set 1, which described the expression level of each kind of mRNA in each subject's PBMCs. The number of probes was 60,683. Data set 2 was then used to confirm the statistical validity of the genes selected by PCAUFE. This data set also described the expression level of each gene in each subject's PBMCs. The data included both COVID-19 patients and non-COVID-19 patients who suffered from acute respiratory distress syndrome (ARDS) that was not associated with SARS-CoV-2. The number of genes was 19,472. The expression level of each gene i (= 1, 2,…, N) was standardized for PCAUFE, i.e., we set 1

For the details of the samples included in these gene expression profiles, see Table S3 in the supplemental S2_File.

The following briefly explains the PCAUFE procedure used in this study (for more details, see 9, 12, 17, 27, 49 ) . Let x i j be the expression of the i-th mRNA probe of the jth sample, and let 1 N i x ij = 0 and 1 N i x 2 ij = 0 , where N is the number of m-RNA probes. First, we applied PCA to the dataset whose rows and columns were genes and samples, respectively. In contrast to the usual use of PCA, where samples are embedded, the genes were embedded in this implementation. By using a t-test, we specified two principal components (PCs) whose loadings statistically differentiated the patients from healthy control samples in order from the smallest P-value. Note that, unlike ordinary PCA, this operation does not guarantee that the first two PCs will be selected. Second, by using a χ 2 test with the P values adjusted by the BH criterion 29 , we identified outlier PC scores (i.e., genes associated with the adjusted P-values less than 0.01) along with the specified PCs as candidates for the disease-related genes. Note that the PC scores in PCAUFE were associated with features (i.e., mRNA probes), not with samples, in contrast to the ordinary usage of PCA.

Patient/non-patient classification models. To verify the genes selected by PCAUFE were useful for the diagnosis of COVID-19 patients, we performed the patient/non-patient classification based on the selected genes using three standard prediction models: logistic regression (LR 31 ), support vector machine (SVM 32, 33 ) and random forest (RF 34 ). For the details of each hyperparameter of the three models, see Supplementary Table S1 . The objective variable was given the value 0 or 1 for each sample depending on a non-COVID-19 or a COVID-19 patient, respectively. The explanatory variables were given the gene expressions of the probes associated with the genes selected by PCAUFE, edgeR, or DESeq2. We randomly allocated 80% of data set 2 to the training set and the remains to the test one. Receiver operating characteristic (ROC) curves of each model were drawn to calculate the area under the curves (AUCs). 

A new coronavirus associated with human respiratory disease in China

A pneumonia outbreak associated with a new coronavirus of probable bat origin

A novel coronavirus from patients with pneumonia in China

Clinical characteristics of 2019 novel coronavirus infection in China

Temporal dynamics in viral shedding and transmissibility of covid-19

Presymptomatic transmission of sars-cov-2-Singapore

Comparison of clinical characteristics of patients with asymptomatic vs symptomatic coron-avirus disease 2019 in Wuhan

Principal component analysis based unsupervised feature extraction applied to publicly available gene expression profiles provides new insights into the mechanisms of action of histone deacetylase inhibitors

Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression

Sfrp1 is a possible candidate for epigenetic therapy in non-small cell lung cancer

Identification of more feasible MicroRNA-mRNA interactions within multiple cancers using principal component analysis based unsupervised feature extraction

Identification of aberrant gene expression associated with aberrant promoter methylation in primordial germ cells between E13 and E16 rat F3 generation vinclozolin lineage

Integrative analysis of gene expression and promoter methylation during reprogramming of a non-small-cell lung cancer cell line using principal component analysis-based unsupervised feature extraction

Heuristic principal component analysis-based unsupervised feature extraction and its application to bioinformatics

Heuristic principal component analysis-based unsupervised feature extraction and its application to gene expression analysis of amyotrophic lateral sclerosis data sets

Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease

TINAGL1 and B3GALNT1 are potential therapy target genes to suppress metastasis in non-small cell lung cancer

Comprehensive analysis of transcriptome and metabolome analysis in Intrahepatic Cholangiocarcinoma and Hepatocellular Carcinoma

Comparison of hepatocellular carcinoma miRNA expression profiling as evaluated by next generation sequencing and microarray

Comprehensive miRNA expression analysis in peripheral blood can diagnose liver disease

The aberrantly expressed miR-193b-3p contributes to preeclampsia through regulating transforming growth factor-β signaling

Principal component analysis based feature extraction approach to identify circulating microRNA biomarkers

Genes associated with genotype-specific DNA methylation in squamous cell carcinoma as candidate drug targets

Bioinformatic screening of autoimmune disease genes and protein structure prediction with FAMS for drug discovery

Principal component analysis for bacterial proteomic analysis

Principal components analysis based unsupervised feature extraction applied to gene expression analysis of blood from dengue haemorrhagic fever patients

Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans

Controlling the false discovery rate: a practical and powerful approach to multiple testing

The use of the area under the ROC curve in the evaluation of machine learning algorithms

The regression analysis of binary sequences (with discussion)

Pattern recognition using generalized portrait method

Support-vector networks

Associations between maternal vitamin D status during three trimesters and cord blood 25(OH)D concentrations in newborns: a prospective Shanghai birth cohort study

LEPR hypomethylation is significantly associated with gastric cancer in males

Significance analysis of microarrays applied to the ionizing radiation response

limma powers differential expression analyses for RNA-sequencing and microarray studies

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Co-expression network analysis identified hub genes critical to triglyceride and free fatty acid metabolism as key regulators of age-related vascular dysfunction in mice

Genetic regulatory subnetworks and key regulating genes in rat hippocampus perturbed by prenatal malnutrition: implications for major brain disorders

WGCNA: an R package for weighted correlation network analysis

Visualizing Data using t-SNE

UMAP: Uniform Manifold Approximation and Projection

Dimensionality reduction for visualizing single-cell data using UMAP

UMAP: uniform manifold approximation and projection for dimension reduction

Fast and precise single-cell data analysis using a hierarchical autoencoder

Unsupervised Feature Extraction Applied to Bioinformatics: A PCA Based and TD Based Approach

Gene expression data based deep learning model for accurate prediction of drug-induced liver injury in advance

A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer's disease

Genesetdb: A comprehensive meta-database, statistical and visualisation framework for gene set analysis

Metascape provides a biologist-oriented resource for the analysis of systems-level datasets

TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery

STRING v10: protein-protein interaction networks, integrated over the tree of life

Enrichr: interactive and collaborative html5 gene list enrichment analysis tool

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update

Missing pieces in the NF-kappaB puzzle

Shared principles in NF-κB signaling

The 20S proteasome processes NF-κB1 p105 into p50 in a translation independent manner

Generation and activation of multiple dimeric transcription factors within the NF-kappaB signaling system

NF-κB-dependent and -independent epigenetic modulation using the novel anti-cancer agent DMAPT

Upper airway gene expression differentiates COVID-19 from other acute respiratory illnesses and reveals suppression of innate immune responses by SARS-CoV-2. medRxiv 4, e1608

Imbalance of regulatory and cytotoxic SARS-CoV-2-reactive CD4 + T cells in COVID-19

Downregulated gene expression spectrum and immune responses changed during the disease progression in patients with COVID-19

Virus-host interactome and proteomic survey reveal potential virulence factors influencing SARS-CoV-2 pathogenesis

The transcription factor nuclear factorkappa B and cancer

NLRP3 inflammasome-mediated inflam-mation in acute pancreatitis

The processing, gene regulation, biological functions, and clinical relevance of N4-acetylcytidine on RNA: a systematic review

N4-acetylcytidine is required for sustained NLRP3 inflammasome activation via HMGB1 pathway in microglia

Targeting the NLRP3 inflammasome in inflammatory diseases

The Nlrp3 inflammasome: contributions to intestinal homeostasis

Immunodeficiency promotes adaptive alterations of host gut microbiome: an observational metagenomic study in mice

The role of IL-1B in breast cancer bone metastasis

Effects of early-life malnutrition on neurodevelopment and neuropsychiatric disorders and the potential mechanisms

Apolipoprotein E isoforms, serum cholesterol, and cancer

Multi-trait analysis for genome-wide association study of five psychiatric disorders

Genetic support of a causal relationship between iron status and type 2 diabetes: a Mendelian randomization study

Causal influences of neuroticism on mental health and cardiovascular disease

Genetic evidence suggests posttraumatic stress disorder as a subtype of major depressive disorder

Large-scale multi-omic analysis of COVID-19 severity

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

This study was supported by Okinawa Prefecture's Project to Promote the Use of Information Technology in the Health and Medical Industries (20G1000012).

K.F., Y.T., and R.M. designed the research; K.F., M.S., and R.M. wrote the manuscript; K.F. and M.S. performed the analysis; K.F., Y.T., S.I., and R.M. interpreted the results; S.I. and R.M. supervised the research; All of the authors reviewed the manuscript.

The authors declare no competing interests.

The online version contains supplementary material available at https:// doi. org/ 10. 1038/ s41598-021-95698-w.Correspondence and requests for materials should be addressed to K.F. or R.M.Reprints and permissions information is available at www.nature.com/reprints.Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.