key: cord-0285152-ti1nl5fu authors: Jackson, Heather; Menikou, Stephanie; Hamilton, Shea; McArdle, Andrew; Shimizu, Chisato; Galassini, Rachel; Huang, Honglei; Kim, Jihoon; Tremoulet, Adriana; de Jonge, Marien; Kuijpers, Taco; Wright, Victoria; Burns, Jane; Casals-Pascual, Climent; Herberg, Jethro; Levin, Mike; Kaforou, Myrsini title: Kawasaki Disease patient stratification and pathway analysis based on host transcriptomic and proteomic profiles date: 2021-03-19 journal: bioRxiv DOI: 10.1101/2021.03.18.435948 sha: eef2ed5853e7113b229a72b5844f6238e8a9ae76 doc_id: 285152 cord_uid: ti1nl5fu The aetiology of Kawasaki Disease (KD), an acute inflammatory disorder of childhood, remains unknown despite various triggers of KD having been proposed. Host ‘omic profiles offer insights into the host response to infection and inflammation, with the interrogation of multiple ‘omic levels in parallel providing a more comprehensive picture. We used differential abundance analysis, pathway analysis, clustering and classification techniques to explore whether the host response in KD is more similar to the response to bacterial or viral infection at the transcriptomic and proteomic levels through comparison of ‘omic profiles from children with KD to those with bacterial and viral infections. Pathways activated in patients with KD included those involved in anti-viral and anti-bacterial responses. Unsupervised clustering showed that the majority of KD patients clustered with bacterial patients on both ‘omic levels, whilst application of diagnostic signatures specific for bacterial and viral infections revealed that many transcriptomic KD samples had low probabilities of having bacterial or viral infections, suggesting that KD may be triggered by a different process not typical of either common bacterial or viral infections. Clustering based on the transcriptomic and proteomic responses during KD revealed three clusters of KD patients on both ‘omic levels, suggesting heterogeneity within the inflammatory response during KD. The observed heterogeneity may reflect differences in the host response to a common trigger, or variation dependent on different triggers of the condition. 157 as the optimal number of clusters for downstream analyses. In the transcriptomic analysis, 3 clusters 170 were identified as optimal, whereas on the protein level, 2 clusters were identified as optimal. We assessed the proportion of KD, bacterial and viral patients in each of the clusters for the 173 transcriptomic (A) and proteomic (B) datasets (Fig. 2) . In the transcriptomic analysis, an over- representation of viral patients was observed in cluster 1 and, to a lesser extent, cluster 2. An over- pathways, such as interferon and cytokine signalling (Fig. 3) . In cluster 2, although the majority of 206 patients were viral, their proportion was not quite as high as it was in cluster 1 (Fig. 2) . The majority 207 of the RSV (15/27) patients were in cluster 2. In the KD patients in cluster 2, various pathways 208 associated with the anti-viral response were downregulated (Fig. 3) . Cluster 3 had the highest 209 proportion of bacterial patients and KD patients (Fig. 2) . Similarly to cluster 2, the top pathways 210 downregulated for KD patients in cluster 3 were associated with the anti-viral response, while the 211 inflammatory response pathway was strongly upregulated suggesting that the KD patients in this 212 cluster were different to those in cluster 1 and that their response was not as viral-like as those in 213 cluster 1 (Fig. 3 ). Three pathways -response to biotic stimulus (i.e. a stimulus caused or produced by a living transcriptomic samples (Fig. 1a) and also in the KD samples in the viral-like cluster 1 (Fig. 3 ). Furthermore, four pathways, including two associated with interferon signaling, were upregulated 218 in viral transcriptomic samples (Fig. 1a) upregulated in KD transcriptomic samples in cluster 1 (Fig. 3) , including two related to cytokine 221 signaling. For the proteomic dataset, two proteins were SDA between cluster 1 and 2: serum amyloid A1 (SAA1) and retinol binding protein 4 (RBP4). Both of these proteins have been identified previously elevated in KD [28] . The two KD patients in cluster 2 displayed the opposite pattern, with higher 226 RBP4 and lower SAA1 abundance than the other KD patients. Through using two independent classifiers, the possibility of a patient being neither bacterial nor viral was allowed. The classifiers were trained using the 'omic data that was corrected for age, sex and, for the transcriptomic dataset, immune cell proportions. The Lasso model selected 38 genes for the bacterial classifier, of which 26 had increased 241 abundance and 12 had decreased abundance in bacterial patients compared to viral patients and 242 healthy controls (Table S4) . The viral classifier included 32 genes, of which 13 had increased 243 abundance and 19 had decreased abundance in viral patients compared to bacterial patients and 244 healthy controls (Table S5) The Lasso model selected 26 proteins for the bacterial classifier, of which 12 had increased 249 abundance and 14 had decreased abundance in bacterial patients compared to viral patients and 250 healthy controls (Table S6) . The viral classifier included 20 proteins, of which 11 had increased 251 abundance and 9 had decreased abundance in viral patients compared to bacterial patients and 252 healthy controls (Table S7 ). When testing the classifiers trained in the proteomic discovery dataset For both 'omic levels, the 90% sensitivity of the classifiers in classifying these samples was used to 256 determine the DRS threshold above which a sample would be classified as bacterial or viral. The classifiers were applied to KD patients from the discovery and validation datasets for both To further examine the 'omic profiles of the KD patients with DB-DRS and DV-DRS too low for 270 them to be classified as either bacterial or viral, we performed pathway analysis on the genes or 271 proteins SDA between these KD patients and healthy controls. Amongst the pathways upregulated 272 on the transcriptomic level, were 'defense response to fungus' (p-value: 6.6e-08) and and 'response 273 to fungus' (p-value: 6.7e-07). The associations Boxplots are shown for each disease group. • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0 (cluster KD1-T), 23 (30%) were in cluster 2 (cluster KD2-T), and 22 (29%) were in cluster 3 (cluster 307 KD3-T). Of the 26 proteomic KD samples, 4 (15%) were in cluster 1 (cluster KD1-P), 7 (27%) were in 308 cluster 2 (cluster KD2-P), and 15 (58%) were in cluster 3 (cluster KD3-P). There was high overlap between the samples in cluster KD1-T and those in the transcriptomic all KD samples except two clustered together in cluster 1, however the two remaining samples that 314 were previously in cluster 2, were not assigned to the same cluster. The association between cluster membership and various clinical variables was tested. CRP 316 levels were significantly associated with cluster membership for both 'omic layers (transcriptomics In the transcriptomic analysis (Fig. 6a) , cluster KD2-T had features in common with an anticharacterised by the upregulation of viral pathways, including those associated with cytokine 342 signaling. The response to biotic stimulus and type I interferon signaling pathways were previously 344 identified as being upregulated in viral transcriptomic samples (Fig. 1a) and in KD samples in the 345 viral-like cluster 1 when K-Means was applied to KD, DB and DV (Fig. 3) . These pathways were 346 downregulated in clusters KD1-T and KD3-T (Fig. 6a) , indicating that the transcriptomic response 347 in these samples was less viral-like than samples in cluster KD2-T. Four pathways previously identified as being downregulated in bacterial transcriptomic 349 samples (Fig. 1a) , including two pathways associated with cytokine signaling, were upregulated in 350 cluster KD2-T. In addition, six pathways upregulated in cluster KD2-T had already been identified 351 as being upregulated in the viral-like cluster 1 identified previously (Fig. 3) . Five pathways of the 9 352 top upregulated pathways in cluster KD3-T (Fig. 6a) were also upregulated in cluster 2 when K- Means was applied to KD, DB and DV (Fig. 3) . These were blood coagulation, hydrogen peroxide 354 catabolic process, antibiotic catabolic processes, hydrogen peroxide metabolic process and 355 regulation of biological quality. In the proteomic analysis (Fig. 6b) proteomic KD samples (Fig. 6b) , 21 were previously identified as enriched in proteomic samples 361 (Fig. 1b) . Of these, 6 were enriched in proteomic viral samples (Fig. 1b) and samples in cluster KD2-362 P (Fig. 6b) with concordant directions. Furthermore, 7 pathways enriched in cluster KD1-P (Fig. 6b) were also enriched in bacterial proteomic samples (Fig. 1b) with concordant directions. These results suggest that cluster KD1-P is a more bacterial-like cluster, whereas cluster KD2-P is a more viral-like cluster. Some pathways were enriched on both 'omic levels, including those associated 366 with blood coagulation, the response to stress and immune effector processes. Within the host response profiles, some elements of KD appeared more viral-like and some 388 elements appeared more bacterial-like. This is shown through overlapping pathways that were The heterogeneity of the host response during KD was also apparent when K-Means was applied to 408 KD patients alone. Three distinct clusters were identified on both 'omic levels, and, in each cluster, A two-way classifier approach highlighted that it is not a simple dichotomous question as to 429 whether the response during KD more closely resembles the responses to bacterial or viral 430 infections, when focusing on key discriminatory features. We found that 145 of the 178 431 transcriptomic KD samples were not assigned DRS high enough for them to be classified as either bacterial or viral, and amongst pathways upregulated in these KD patients compared to healthy controls were two pathways associated with the fungal response. This finding is intriguing, given the evidence suggesting that KD could be caused by a fungal trigger that has been reported elsewhere [36, 37] . The heterogeneity and the different clusters of responses to KD which have elements shared 437 with bacterial, viral or fungal responses, could indicate multiple microbial triggers of KD, as has been suggested by Rypdal et al. [11] . An alternative explanation for the heterogeneity observed here 439 in the response during KD could be that a single pathogen that causes KD leads to heterogeneous 440 responses in different hosts, as has been observed in children infected with SARS-CoV-2, where develop PIMS-TS/MIS-C [13, 14, 40] . Variations in the host condition, such as epigenetic differences 443 and differences in prior pathogen exposure, could cause the spectrum of host responses to KD 444 observed here. Differences in host genetics could also be responsible for the heterogeneity in host 445 response during KD as the severity of KD, including the formation of CAAs, is already known to be 446 impacted by the host's genetic background [41] . This study has certain limitations. The proteomic discovery dataset was a lower dimensional 448 dataset (n= 867) than the transcriptomic discovery dataset (n=47,323) with high rates of missingness, 449 as is common in quantitative proteomics. Only proteins with no missingness were used for the 450 clustering and classification, so key proteins for distinguishing KD could be absent from the 451 analysis. On the proteomic level, many pathways were enriched in multiple disease groups ( Fig. 452 1b), making it difficult to identify a disease-specific pathway signature. This could be caused by 453 plasma samples, which were used in this dataset, capturing a noisy signal due to the release of 454 substances from various tissues into the bloodstream. The proteomic response during KD shared 455 more similarities with the proteomic response to bacterial infection, with more pathways 456 overlapping between KD and bacterial infections (Fig. 1b) and all but two KD proteomic samples 457 clustering with bacterial proteomic samples (Fig. 2) . This follows observations of striking clinical 458 similarities between KD and bacterial streptococcal and staphylococcal toxic shock syndromes 459 [42, 43] , and could reflect the hypothesis that the proteome is closer to the observed phenotype than 460 the transcriptome [44] . Bacterial samples with known viral coinfections were removed from the analysis at this stage to 531 ensure that the signal from the bacterial samples was not diluted. KD samples that had been 532 administered IVIG treatment were also removed at this stage, but their inclusion was irrespective of 533 coincident viral or bacterial detection, for which data was not available. A KD sample previously 534 identified as an outlier [18] was removed. As mentioned, two microarray gene-expression datasets were merged to form the validation 536 dataset. Background subtraction and RSN normalisation were applied to these two datasets 537 independently, using the R package lumi [47] , prior to using ComBat [48] To explore the effects of sex and age on clustering in the proteomic dataset, the contribution of 576 these variables was removed by regressing out their effects on every protein and taking the residual values as the 'corrected' abundance. This process was also followed in the transcriptomic dataset but the contributions of the immune cell proportions listed in 4.3.3.1 were also removed. Prior to clustering, and after correction, features were removed if their variance was lower than 0.25. To [65], Ptbiserial [66,67] and Ratkowsky [68] . The number of clusters tested by NbClust ranged 584 between 2-10 clusters. The most frequently selected k by the 12 indices was used for downstream 585 analyses. The lowest k selected the most frequently was taken in cases where there were multiple 586 values of k selected the most frequently. Once clusters were identified, features that were SDA (5% FDR) between KD samples in the 588 different clusters were identified. Pathway analysis was done using these lists of SDA features to 589 determine the pathways upregulated and downregulated in KD samples in the different clusters. The R package g:Profiler2 [24] was used for pathway analysis, with pathways with p-values < 0.01 591 considered significant. Redundancy in the pathways identified was removed using REVIGO [54] . The association between cluster membership and various clinical variables was tested. For 593 categorical variables, Fisher's Exact test was used. For continuous variables, One-Way ANOVA was then tested on the KD patients from the discovery and validation datasets and the thresholds identified by pROC were used to determine if the KD patients were classified as DB or DV. If 627 patients were classified as neither bacterial nor viral according to their DRS, differential abundance 628 analysis followed by pathway analysis (as described in 4.3.3.2) was done to identify the pathways was the same as outlined in 4.3.3.3. The association between cluster membership and clinical 634 variables was tested. The clinical variables and the statistical tests used were the same as outlined in Kawasaki disease: a comprehensive review Treatment Response in Kawasaki Disease Is Associated with 677 Sialylation Levels of Endogenous but Not Therapeutic Intravenous Immunoglobulin G Lifetime cardiovascular management of patients with previous 684 Kawasaki disease The epidemiology of Kawasaki disease: A global update Causes of Kawasaki Disease-From Past to Present Dissecting 690 Kawasaki disease: a state-of-the-art review Aetiological significance of infectious stimuli in Kawasaki 692 disease. Front. Pediatr Kawasaki disease with tropospheric wind patterns Clustering and climate associations of Kawasaki Disease Childhood Multisystem Inflammatory Syndrome -A New Challenge in the Pandemic Clinical Characteristics of 58 Children with a Pediatric Inflammatory 702 Multisystem Syndrome Temporally Associated with SARS-CoV-2 Multisystem Inflammatory Syndrome in Children in 15 SARS-CoV-2-Related Inflammatory Multisystem Syndrome in Children African Adults Using Whole Blood RNA Expression Signatures: A Case-Control Study Diagnosis of Bacterial Infection Using a 18 Whole-Blood Gene Expression Signature Global gene expression profiling identifies new therapeutic targets in acute 723 Kawasaki disease Aptamer-based multiplexed proteomic technology for biomarker discovery limma powers differential 736 expression analyses for RNA-sequencing and microarray studies Profiler: a web server 739 for functional enrichment analysis and conversions of gene lists (2019 update) Determining cell type abundance and expression from 743 bulk tissues with digital cytometry Nbclust: An R package for determining the relevant 745 number of clusters in a data set Identification of candidate diagnostic serum biomarkers for Kawasaki 748 disease using proteomic analysis Regression Shrinkage and Selection via the Lasso Nonimmune cells in inflammatory bowel disease: from victim to villain Role of the Major Histocompatibility Complex Class I 33 Sustained activation of neutrophils in the course of Kawasaki disease: an association 762 with matrix metalloproteinases Expression of IL-8 in Kawasaki disease Discrimination of Acute Kawasaki Disease From Infections in Childhood. Front Environmental epidemiology of Kawasaki disease: Linking disease 772 etiology, pathogenesis and global distribution Tropospheric winds from northeastern China carry the 775 etiologic agent of Kawasaki disease from its source to Japan SARS-CoV-2 Infection in Children COVID-19 in children Europe: a multinational, multicentre cohort study Intensive care admissions of children with paediatric 785 inflammatory multisystem syndrome temporally associated with SARS-CoV-2 (PIMS-TS) in the UK: a 786 multicentre observational study A genome-wide association study identifies novel and functionally related 789 susceptibility Loci for Kawasaki disease Evidence for a superantigen mediated process in Kawasaki 43 Antibiotic use in children with Kawasaki disease Proteomics in evolutionary ecology: Linking the 45 Accurate proteome-wide label-free Cell COCONUT: COmbat CO-Normalization Using conTrols (COCONUT) Controlling the False Discovery Rate: A Practical and Powerful Approach 818 to Multiple Testing REVIGO Summarizes and Visualizes Long Lists of Gene 820 Ontology Terms Algorithm AS 136: A K-Means Clustering Algorithm A Criterion for Determining the Number of Groups in a Data Set Using 57. Caliñski, T.; Harabasz, J. A Dendrite Method For Cluster Analysis A Program to Test for the Quality of Clustering of a Set of Objects Well-separated clusters and optimal fuzzy partitions Quality scheme assessment in the clustering process Clustering validity assessment: Finding the optimal partitioning of a data 63 A general statistical framework for assessing categorical clustering in free 64 Silhouettes: A graphical aid to the interpretation and validation of cluster analysis An examination of the effect of six types of error perturbation on fifteen clustering 845 algorithms A monte carlo study of thirty internal criterion measures for cluster analysis A Criterion for Determining the Number of Groups in a Classification Many thanks to the patients and their families who took part in the studies that the data 665 presented here originated from.