key: cord-0291570-oqg0lww0
authors: Astore, C. A.; Zhou, H.; Skolnick, J.
title: LeMeDISCO: A computational method for large-scale prediction & molecular interpretation of disease comorbidity
date: 2021-07-03
journal: nan
DOI: 10.1101/2021.06.28.21259559
sha: fabd2c47d57175eabecb8156c177349c6ce16dac
doc_id: 291570
cord_uid: oqg0lww0

Often different diseases tend to co-occur (i.e., they are comorbid), which raises the question: what is the molecular basis of their coincidence? Perhaps common proteins are comorbid disease drivers. To understand the origin of disease comorbidity and to identify the essential proteins and pathways underlying comorbid diseases, we developed LeMeDISCO (Large-Scale Molecular Interpretation of Disease Comorbidity), an algorithm that predicts disease comorbidities from shared mode of action (MOA) proteins predicted by the AI-based MEDICASCY algorithm. LeMeDISCO was applied to predict the general occurrence of comorbid diseases for 3608 distinct diseases. To illustrate the power of LeMeDISCO, we elucidate the possible etiology of coronary artery disease and ovarian cancer by determining the comorbidity enriched MOA proteins and pathways and suggest hypotheses for subsequent scientific investigation. The LeMeDISCO web server is available for academic users at: http://sites.gatech.edu/cssb/LeMeDISCO.

Disease comorbidity, the co-occurrence of distinct diseases, is an interesting medical phenomenon. For example, individuals having one autoimmune condition are likely to develop another. Interestingly, rheumatoid arthritis, autoimmune thyroiditis, and insulin-dependent diabetes mellitus co-occur, but rheumatoid arthritis and multiple sclerosis do not [1]. Previously, there have been several efforts to investigate the molecular features responsible for human disease comorbidities [2-8]. Some studies focused on particular subsets of diseases [3] or ethnic groups, while others investigated the entire human disease network [4-7].
For example, Zhou et al. [5] applied text mining to search the literature for disease-symptom associations. They then predicted the entire human disease-disease network based on a calculated symptom similarity score. While this approach covers almost all human diseases, it only explains one phenotype (disease) by another phenotype (symptom). Menche et al. [6] utilized known disease-gene associations from GWAS [9] and OMIM combined with a protein-protein interaction network to identify connected disease gene clusters or modules. Another study also utilized known disease-gene associations and protein-protein interaction networks to characterize disease-disease relationships without requiring gene clusters [7]; thus, its disease coverage is better than in ref. 6. A deeper analysis of these studies, which have low recall rates, demonstrates that focusing entirely on shared genes is insufficient to predict disease comorbidity or identify its origins. They miss collective effects arising from both direct and indirect protein-protein interactions and pathway correlations. Existing studies that use known disease-gene associations are limited by data availability. Indeed, only a small fraction of diseases have known associated genes. For example, ref. 7 only covers 1,022 of the 8,043 diseases in the Disease Ontology database [10], with just 6,594 pairs of diseases having a non-zero number of shared genes. Similarly, ref. 6 found that most (59%) of their 44,551 disease pairs do not share any genes. To address these limitations, we developed LeMeDISCO, which extends our recently developed MEDICASCY machine learning approach [11] for predicting disease indications and mode of action (MOA) proteins (as well as small molecule drug side effects and efficacy) to predict disease comorbidities and the proteins and pathways responsible for their comorbidity. We then show that LeMeDISCO covers a broader spectrum of comorbid diseases than existing approaches.
Assuming that the most enriched comorbid proteins are responsible for disease comorbidity, we determine the most frequent comorbidity enriched MOA proteins. These proteins are then employed in pathway analysis [12]. As examples, we predict the comorbid diseases, comorbidity enriched MOA proteins, and pathways associated with coronary artery disease (CAD) and ovarian cancer (OC).

To assess its relative performance, we compared the results of LeMeDISCO to three other methods: the XD score [7], the S_AB score and the symptom similarity score [5]. Table 1 summarizes the results. We define a comorbidity pair as positive when log(RR) > 0, XD score > 0, S_AB score < 0, or the symptom similarity score > 0.1 [13], and p-value < 0.05 for the J-score. The relative risk RR is defined in eq. 4a and is the probability that two diseases occur in a single individual relative to random. The φ-score is the Pearson's correlation for binary variables and is defined in eq. 4b. Mapping the DOIDs from the Human Disease Ontology database to the ICD-9 IDs of ref. 14, we obtain 198,149 disease pairs for use in LeMeDISCO benchmarking. All correlations of the J-score with the log(RR) score and φ-score are statistically significant (p-value < 0.05). To compare LeMeDISCO to the XD score, we mapped their ICD-9 disease codes to the DOIDs and obtained a subset of 29,783 pairs from their dataset of 97,665 pairs [7]. As shown in Figure S1, their NG score (the number of shared genes) has essentially no significant correlation with log(RR) and only shows a correlation with the φ-score for unbinned data. When the data are binned, both the XD score and NG score lack significant correlations. The J-score has much better correlations and a much better recall rate than the XD score.

This preprint (medRxiv, version posted July 3, 2021; https://doi.org/10.1101/2021.06.28.21259559), which was not certified by peer review, is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

For comparison with the S_AB score [6], the MeSH [15] disease names [16] were mapped to their DOIDs. A consensus set of 947 disease pairs from their dataset and ours was obtained. As shown in Figure S2 and Table 1, for the 947 disease pairs, LeMeDISCO's J-score is better than the S_AB score [6] for both unbinned and binned data, and it has a much better recall rate than the S_AB score. The S_AB score has no significant correlation for the binned φ-score data. The reason for the worse performance of S_AB for binned data is as follows: unbinned data for S_AB are dominated by cases where no disease comorbidity is predicted (S_AB > 0), whereas for binned data, both S_AB < 0 and S_AB > 0 are important. Next, a common dataset of 2,630 disease pairs was obtained for comparison with the symptom similarity score. As shown in Figure S3, the symptom similarity score has a better correlation than LeMeDISCO for the log(RR) and an almost identical correlation for the φ-score for unbinned data. However, the symptom similarity score only explains the relationship of one phenotype (symptom) to another phenotype (disease). Nevertheless, all correlations of the J-score are statistically significant, and its recall rate is close to 70%. The advantage of the J-score over the symptom similarity score is that it has a clear molecular interpretation. Moreover, LeMeDISCO does not rely on prior knowledge or symptomatic information. Hence, it provides a much larger coverage of comorbidity predictions for the 198,149 disease pairs, each ranked by its J-score and the corresponding p-value to reflect the statistical significance. The correlation of the J-score and the log(RR) and φ-score for this large set of disease pairs is shown in Figure S4. The ICD-10 main classification coverage of the 3,608 diseases is shown in Figure 1A.
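To make the ranking quantities concrete, the following is a minimal sketch of how a J-score and an overlap p-value could be computed for one disease pair from two sets of predicted MOA proteins. It is an illustration under stated assumptions, not the authors' implementation: the function names, the protein identifiers, the `universe_size` parameter (the number of proteins screened) and the choice of a one-sided test are all hypothetical, since the source does not reproduce its contingency table.

```python
from math import comb


def jaccard(moa_a, moa_b):
    """Jaccard index of two protein sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(moa_a), set(moa_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fisher_overlap_pvalue(moa_a, moa_b, universe_size):
    """One-sided Fisher's exact test for set overlap.

    Returns the hypergeometric probability of observing at least the
    actual overlap if the second set were drawn at random from a universe
    of `universe_size` proteins (an assumed background, not from the paper).
    """
    a, b = set(moa_a), set(moa_b)
    n_a, n_b, k = len(a), len(b), len(a & b)
    if max(n_a, n_b) > universe_size:
        raise ValueError("universe_size must be at least as large as each set")
    total = comb(universe_size, n_b)
    # Sum the upper tail: overlaps of size k, k+1, ..., min(n_a, n_b).
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical protein identifiers for two diseases.
    disease_1 = {"P1", "P2", "P3", "P4"}
    disease_2 = {"P3", "P4", "P5"}
    print(jaccard(disease_1, disease_2))
    print(fisher_overlap_pvalue(disease_1, disease_2, universe_size=10))
```

Exact integer binomials are used so that the tail probability stays accurate for large protein universes; a library routine for Fisher's exact test would serve equally well.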
We first examine the number of predicted MOA proteins per indication from MEDICASCY [11]. We next examine the overall characteristics of the predicted comorbidity network of 3,608 diseases. There are a total of 6,508,832 possible pairwise disease associations. Of these, there are 5,987,682 significant pairwise disease associations excluding the diagonals and 3,009,095 significant non-redundant pairwise disease associations given by LeMeDISCO. Only one disease, esophageal atresia, did not have any significant comorbidities predicted. Thus, 3,607 diseases had significant comorbidities (p-value < 0.05). The density and frequency of the J-score for the significant pairs are shown in Figure 1C, and the density and frequency of the degree (number of edges) for each node (disease) are shown in Figure 1D. Using a p-value cutoff of 0.05, the average (median) number of comorbidities per disease is 1,650.7 (1,806). The largest (smallest) number of comorbidities is 2,958 for gastric mucosal hypertrophy (17 for Canavan disease). Thus, the disease network is very dense. The cumulative distributions of the J-score and p-values for all of the comorbidities and for the top 100 are shown in Figures S5 and S6, respectively. The summary statistics of the scores for these thresholds are shown in Table S1. What is clear from these figures and Table S1, particularly for the top 100 ranked comorbidities, is that 98.5% of the top-ranked 100 comorbidities have a p-value < 0.005. In other words, while a p-value threshold of 0.05 is used, the actual p-values employed for subsequent analysis are far more significant. Around 46% of the disease pairs have a p-value < 0.05. This result is consistent with the ~50% recall of large-scale benchmarking (see Table 1) and with the observed comorbidity from Medicare insurance claim data, in which 78.8% of the total 3,634,744 disease pairs have an RR > 1 [14].
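Network summaries of this kind (the number of comorbidities per disease, and the size of the largest connected component at a given J-score cutoff) can be sketched as follows. This is an illustrative sketch only: the tuple layout `(disease_a, disease_b, j_score, p_value)`, the function names and the toy data are assumptions, and each unordered pair is assumed to appear once.

```python
from collections import Counter
from statistics import mean, median


def significant_edges(pairs, j_cutoff=0.0, p_cutoff=0.05):
    """Keep pairs whose J-score is >= j_cutoff and whose p-value is < p_cutoff."""
    return [
        (a, b)
        for a, b, j, p in pairs
        if a != b and j >= j_cutoff and p < p_cutoff
    ]


def degree_summary(edges):
    """Mean and median degree over diseases that have at least one edge."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    values = list(degree.values())
    return mean(values), median(values)


def giant_component_size(nodes, edges):
    """Size of the largest connected component, found with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) if sizes else 0


if __name__ == "__main__":
    # Toy data; disease names and scores are made up for illustration.
    pairs = [
        ("A", "B", 0.50, 1e-3),
        ("B", "C", 0.20, 1e-2),
        ("C", "D", 0.05, 0.2),   # not significant, dropped
        ("D", "E", 0.60, 1e-9),
    ]
    nodes = ["A", "B", "C", "D", "E"]
    for cutoff in (0.0, 0.3):
        edges = significant_edges(pairs, j_cutoff=cutoff)
        print(cutoff, degree_summary(edges), giant_component_size(nodes, edges))
```

Raising `j_cutoff` step by step and recording `giant_component_size` gives a curve of component size against cutoff.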
As shown in Figure 1E, the giant component (GP) of the disease-disease network covers the entire network when the J-score cutoff is < 0.1 and the p-value < 0.05, i.e., starting from any disease, one can walk to any other disease on the network. As the J-score cutoff increases, the number of diseases in the giant component decreases; however, the decrease is very slow. The rapid decrease only happens around a J-score of 0.45, corresponding to an average p-value of ~3.6×10^-30. Thus, the disease network is not only dense, but it is also strongly and highly significantly connected. In addition to the comorbidity predictions, LeMeDISCO also identifies comorbidity enriched MOA proteins and pathways (see Table S1). For the comorbidity enriched MOA proteins ranked by their CoMOAenrich score, 58% have a p-value < 0.05. However, if one only assesses the top 100 comorbidity enriched MOA proteins, which are the proteins used for the global pathway analysis, 94% have a p-value < 0.05. Of the top 100 proteins used for pathway analysis, 82% have a p-value < 0.005. The cumulative distribution of the p-values for the pathways and the top 100 are shown in Figure S9, and the summary statistics are provided in Table S1. As shown in Figure S9, 62% of the pathways have a p-value < 0.015. We further note that there are some MOA proteins (e.g., SF3B1, BTAF1, and FAM160A1) and pathways (e.g., the nuclear receptor transcription pathway, SUMOylation of intracellular receptors, and PP2A-mediated dephosphorylation of key metabolic factors) that are enriched in approximately a third of the diseases in our library.
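The CoMOAenrich ranking mentioned above is defined in the methods as a p-value weighted frequency, where a shared MOA protein with p-value P contributes a weight of min(1.0, -α log P). A minimal sketch of that accumulation is given below. The value of `alpha`, the use of a base-10 logarithm, the absence of any normalization, and all names and data are assumptions made for illustration; the source does not specify them.

```python
from math import log10


def comoa_enrich(input_moas, comorbid_moa_pvalues, alpha=0.1):
    """Rank input MOA proteins by p-value weighted frequency.

    input_moas: proteins of the query disease.
    comorbid_moa_pvalues: one dict per comorbid disease, mapping each of
        that disease's MOA proteins to its p-value.
    alpha: scaling factor in min(1.0, -alpha * log P); the default here
        is a placeholder, not a value taken from the paper.
    """
    scores = {protein: 0.0 for protein in input_moas}
    for moa_pvalues in comorbid_moa_pvalues:
        for protein in input_moas:
            p = moa_pvalues.get(protein)
            if p is None:
                continue  # protein is not shared with this comorbid disease
            # Cap the weight at 1.0; treat p == 0 as maximally significant.
            weight = 1.0 if p <= 0 else min(1.0, -alpha * log10(p))
            scores[protein] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical proteins and p-values for two comorbid diseases.
    query = {"T1", "T2"}
    comorbid = [{"T1": 1e-4, "T2": 0.01}, {"T1": 1e-20}]
    for protein, score in comoa_enrich(query, comorbid):
        print(protein, round(score, 3))
```

Capping each contribution at 1.0 means a single extremely small p-value cannot outweigh a protein that is shared across many comorbid diseases.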
This implies that there are homogeneous molecular features across a subset of complex diseases. This has significant implications for disease interrelationships that will be explored elsewhere.

By way of illustration, we applied LeMeDISCO to two disparate diseases, coronary artery disease (CAD) and ovarian cancer (OC). CAD, a leading cause of death worldwide, is caused by narrowed or blocked arteries due to plaques composed of cholesterol or other fatty deposits lining the inner wall of the artery. These plaques result in decreased blood supply to the heart [17]. We find 2,747 significant comorbid diseases (p-value < 0.05), Table 2. There are several cardiovascular-related significant comorbidities such as cardiovascular system disease and myocardial infarction. Kidney disease, diabetes, obstructive lung disease and Alzheimer's disease are also in the top ten and have known comorbidity with CAD. Furthermore, a meta-analysis indicates an association between CAD and asthma, particularly in females with adult-onset asthma [19]. Prostanoid ligand receptors is the third most significant pathway found for CAD, which may be due to the number of COX-related comorbidity enriched proteins found. COX are involved in the synthesis of prostanoids. Prostanoids are structurally like lipids and are involved in thrombosis and other undesirable cardiovascular events [20]. CAD is also known to be comorbid with proteinuria, Alport syndrome, glomerulonephritis, liver disease and mitral valve insufficiency. The above results were obtained without any extrinsic knowledge of CAD. Next, we show how additional information can be used.
A GWAS study identified 155 CAD associated genes [21]. These GWAS genes associated with CAD were then used as input to GWAS-driven LeMeDISCO. The top 20 disease comorbidities, top 20 comorbidity enriched MOA proteins, and top 20 pathways are shown in Table 3. There were 136 significant comorbidities (p-value < 0.05) predicted by LeMeDISCO. There were 3,039 comorbidity enriched MOA proteins (score > 0.01) and 57 significant pathways (p-value < 0.05) found from global pathway analysis of the top 100 comorbidity enriched MOA proteins. The top comorbidities are anuria and renal artery disease, both associated with dysfunction of the kidneys. Anuria is attributed to failure of the kidneys to produce urine, and renal artery disease occurs when the arteries that supply blood and oxygen to the kidneys narrow. A study found an increase in renal artery stenosis in patients with …

We next input into LeMeDISCO a set of 11 genes associated with OC risk, taken from a study that assessed multiple-gene germline sequences in 95,561 women with OC [27]. The results for the top 20 comorbidities, MOA proteins, and associated pathways are shown in Table 5. There were 207 significant comorbidities (p-value < 0.05) predicted, 2,895 comorbidity enriched MOA proteins and 5 significant pathways associated with the top ranked 100 proteins (p-value < 0.05).
The top hit comorbidity associated with OC was Sertoli-Leydig cell tumor, a rare cancer of the ovaries, which can yield an increase in the male sex hormone, testosterone [28]. Sex cord-gonadal stromal tumor is a rare type of ovarian cancer. There is some evidence for the comorbidity of diffuse scleroderma, severe acute respiratory syndrome, hyperuricemia, coronary stenosis, lymphatic system disease, germinoma, embryonal cell carcinoma and OC. Germinoma, another comorbidity predicted to be associated with OC, is a tumor often found in the brain that typically forms due to dysfunctional localization of germ cells to their respective locations. Furthermore, hemoglobinopathy, a disease of the blood, was also a comorbidity associated with OC. A study found a relationship between hemoglobin levels and interleukin-6 levels in individuals with untreated epithelial ovarian cancer, which indicated an inflammatory role in cancer-associated anemia [29]. There is scant literature evidence for the comorbidity of adenosquamous carcinoma or acinar cell carcinoma with OC. One of the top comorbidity enriched MOA proteins found is GMPR, which was also found to be upregulated in metastatic serous papillary ovarian tumors in a differential gene expression analysis [30]. RAB-related pathways are provided by the pathway analysis. Rab35, a protein associated with modification of actin remodeling [31], is a top 20 comorbidity enriched MOA protein that has been shown to be upregulated in individuals with OC under androgen treatment. Notably, there was a significant overlap between the GWAS-driven LeMeDISCO results and the MEDICASCY MOA protein driven LeMeDISCO results for the predicted comorbidities (p-value = 5.3×10^-12) and MOA proteins (p-value < 0.0001).
The LeMeDISCO web service allows researchers to query our library of 3,608 diseases or input a set of pathogenic human genes/proteins and compute their predicted comorbidities, MOA proteins, and associated pathways. The web service is freely available for academic users at http://sites.gatech.edu/cssb/LeMeDISCO. LeMeDISCO is a systematic approach for studying and analyzing possible features of the common proteins underlying a set of comorbid diseases. The resulting predicted driver proteins and pathways for each disease or input gene set can allow researchers to generate new diagnostic and treatment options and hypotheses. Interestingly, there were some MOA proteins and pathways present across approximately a third of the diseases, implying common disease drivers. The implications of this observation and its relationship to disease origins will be pursued in future work. We do note that the current comorbid disease analysis strongly suggests that the "one target-one disease-one molecule" approach often used in developing disease therapeutics [32] is likely too simplistic. To fully understand the complexities of a disease, one must trace the origin of its pathogenesis, which may be due to a variant that is somehow related to the condition. However, such variants may also be associated with a disease not previously known to be associated with that condition. Such interrelations can be further investigated by identifying high confidence comorbidity predictions from LeMeDISCO. LeMeDISCO not only has applications to the study of the underlying etiology behind a disease but may also be used during the early stages of drug discovery to identify efficacious drugs.
Rather than starting with a small molecule or protein target of choice, LeMeDISCO allows one to begin at the level of disease biology, often termed phenotypic drug discovery. In future work, we shall demonstrate the utility of LeMeDISCO in identifying efficacious drugs to treat a given disease. Overall, the results of the current analysis and preliminary applications to drug discovery suggest that LeMeDISCO provides a set of tools for elucidating disease etiology and interrelationships and that a more systems-wide, comprehensive approach to both personalized medicine and drug discovery is required.

CA, HY and JS conceived of the method; CA and HY implemented the method; CA, HY and JS analyzed the data and wrote the paper.

The web service is freely available for academic users at http://sites.gatech.edu/cssb/LeMeDISCO.

A flowchart of LeMeDISCO is shown in Figure 2. LeMeDISCO employs MEDICASCY [11] to predict possible disease MOA proteins. Here, MEDICASCY is applied to a set of 2,095 FDA-approved drugs [32] in prediction mode (i.e., any training drug having a Tanimoto coefficient of 1 to a given input drug is excluded from training) to avoid a strong bias towards drugs in the training set. For each of the 3,608 indications, we rank the 2,095 probe drugs according to their Z-scores, Z_d, defined using the raw score computed by MEDICASCY from: … To predict a drug as having the given indication, we applied a Z …
This latter probability serves as the background probability that an arbitrary drug will bind to T. When no drug is predicted to bind to protein T, RR(D,T) is set to zero. RR(D,T) = F1/F2 > 1 means that a drug having indication D is more likely to bind to T than an arbitrary drug not having the predicted indication D. We then compute the statistical significance of RR(D,T) by calculating a p-value using Fisher's exact test [34,35] on the following contingency table: … We define a protein target T as predicted to be a possible MOA target for indication D if its p-value < 0.05 because it is more likely to be targeted by efficacious drugs than arbitrary drugs. Thus, for each of the 3,608 indications, there is a list of predicted possible MOA proteins. To reduce false positive MOAs, we utilized the Human Protein Atlas database (https://www.proteinatlas.org/about/download, normal_tissue.tsv) of expression profiles for proteins in normal human tissues based on immunohistochemistry using tissue microarrays [36] to filter out those proteins that are "not detected" and not "uncertain" in all tested tissues related to an indication. To determine the tissues related to an indication, tissues are mapped to their ICD-10 main codes, and indications having the same main codes are related to the tissue. Using the input of two sets of putative MOA proteins having a p-value < 0.05 calculated by Fisher's exact test [34], we calculate their Jaccard index [37] J(D_1,D_2) (J-score), defined in eq. 3a as

$$J(D_1, D_2) = \frac{|M(D_1) \cap M(D_2)|}{|M(D_1) \cup M(D_2)|} \qquad (3a)$$

where M(D) denotes the set of putative MOA proteins of indication D. We then calculate the p-value for significance by Fisher's exact test for the contingency …
… can be calculated using Fisher's exact test on the table in eq. 3b [35]. We will use the J-score for predicting comorbidity and compare it with the observed comorbidity. In large-scale disease-disease comorbidity calculations, we use the MOAs predicted by MEDICASCY [11]. In addition, MOA targets between disease pairs can also be derived from experimental data; examples include differential gene expression (GE), Mendelian or somatic mutation profiles comparing disease vs. control normal samples, better vs. worse prognosis samples, or drug treated vs. control untreated samples [39]. We validated LeMeDISCO's J-score by correlating it with the observed comorbidity as quantified by (a) the logarithm of relative risk, the log(RR) score, and (b) the φ-score (Pearson's correlation for binary variables) [14]. The relative risk (RR) is the probability that two diseases co-occur in a single individual relative to random. Since RR scales exponentially with respect to the strength of two interacting diseases, we use log(RR) for correlation analysis. The log(RR) and φ-score are computed from US Medicare insurance claim data using [14]:

$$RR = \frac{n_{AB}\, n_{tot}}{n_A\, n_B} \qquad (4a)$$

$$\phi = \frac{n_{AB}\, n_{tot} - n_A\, n_B}{\sqrt{n_A\, n_B\, (n_{tot} - n_A)(n_{tot} - n_B)}} \qquad (4b)$$

where n_tot = total number of patients; n_A, n_B = number of patients diagnosed with disease A and B, respectively; and n_AB = number of patients diagnosed with both diseases A and B.
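As a worked illustration of the two observed-comorbidity measures, the sketch below computes RR and φ from patient counts, using the standard forms RR = n_AB·n_tot / (n_A·n_B) and φ = (n_AB·n_tot − n_A·n_B) / sqrt(n_A·n_B·(n_tot − n_A)·(n_tot − n_B)). The function names and the counts are hypothetical, and the natural logarithm is an assumption, since the base of log(RR) is not stated; the sign of log(RR) does not depend on the base.

```python
from math import log, sqrt


def relative_risk(n_ab, n_a, n_b, n_tot):
    """Observed co-occurrence relative to the expectation under independence."""
    return (n_ab * n_tot) / (n_a * n_b)


def phi_score(n_ab, n_a, n_b, n_tot):
    """Pearson correlation between two binary diagnosis indicators.

    Raises ZeroDivisionError if a disease affects no patients or all patients.
    """
    numerator = n_ab * n_tot - n_a * n_b
    denominator = sqrt(n_a * n_b * (n_tot - n_a) * (n_tot - n_b))
    return numerator / denominator


if __name__ == "__main__":
    # Made-up counts: 1,000 patients, 100 with A, 50 with B, 20 with both.
    rr = relative_risk(n_ab=20, n_a=100, n_b=50, n_tot=1000)
    print(rr, log(rr))
    print(phi_score(n_ab=20, n_a=100, n_b=50, n_tot=1000))
```

With these counts, 5 shared patients would be expected under independence, so 20 observed gives RR = 4 and a positive φ; exactly 5 shared patients would give RR = 1 and φ = 0.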
After determining the significant comorbidities for each disease, the p-value weighted frequency of shared MOA proteins across the top 100 predicted comorbidities is calculated. We define the p-value weighted frequency of an input MOA (i.e., the CoMOAenrich score) as follows: if MOA protein T is shared by a comorbid indication D and the p-value of T associated with D is P, then the weight defined by min(1.0, -α log P) is counted as T's frequency. In practice, we used 10 cancer cell line data [40] … (Figure 2B). The LeMeDISCO web service allows users to query the LeMeDISCO database as well as input their own set of pathogenic genes to assess the associated comorbidities, MOA proteins, and pathways.

Tables

Table 1. Comparison of LeMeDISCO's J-score with the XD score, NG, S_AB score and symptom similarity for correlations with comorbidity quantified by the log(RR) score, φ-score and recall (a).
a Numbers in parentheses are the p-values of the corresponding correlation. Bold indicates the best results for the given data set.
b Unbinned means raw data; each pair is a data point. 10 bins: partitioning the prediction scores into 10 equal size bins. In each bin, the log(RR) & φ-score are averaged over data points in the bin. This gives equal weight to the rare prediction scores in the correlation analysis.
c Mapping the DOIDs from the human DO database to the ICD-9 IDs of Ref. 14 gives a set of 198,149 disease pairs.
d The ICD-9 disease codes were mapped to our DOIDs of DO, giving a consensus subset of 29,783 disease pairs from the Table 1 dataset of 97,665 disease pairs in Ref. 7.
e NG is the number of shared genes between disease pairs in Ref. 7.
f Consensus set of 947 disease pairs from the dataset of Ref. 6.
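The binning described in footnote b can be sketched as follows: prediction scores are partitioned into bins, the observed value (log(RR) or φ-score) is averaged within each bin, and a Pearson correlation is computed over the bin averages. This is a rough sketch under assumptions that the source does not settle: "equal size" is read here as equal width in score, empty bins are skipped, the bin's mean score is used as the x-value, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def binned_means(scores, observed, n_bins=10):
    """Average score and observed value within equal-width score bins."""
    low, high = min(scores), max(scores)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for s, o in zip(scores, observed):
        # The maximum score falls into the last bin rather than past it.
        index = min(int((s - low) / width), n_bins - 1) if width > 0 else 0
        bins[index].append((s, o))
    xs, ys = [], []
    for members in bins:
        if members:  # skip empty bins
            xs.append(sum(s for s, _ in members) / len(members))
            ys.append(sum(o for _, o in members) / len(members))
    return xs, ys


if __name__ == "__main__":
    # Toy prediction scores and observed comorbidity values.
    scores = [0.0, 0.05, 0.5, 1.0]
    observed = [1.0, 3.0, 5.0, 9.0]
    xs, ys = binned_means(scores, observed, n_bins=2)
    print(xs, ys, pearson(xs, ys))
```

Because every non-empty bin contributes one point regardless of how many pairs it holds, sparsely populated score ranges count as much as crowded ones in the binned correlation.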
Are individuals with an autoimmune disease at higher risk of a second autoimmune disorder?
The Behavioral and brain sciences
Genetic similarity between cancers and comorbid Mendelian diseases identifies candidate driver genes
The implications of human metabolic network topology for disease comorbidity
Human symptoms-disease network
Disease networks. Uncovering disease-disease relationships through the incomplete interactome
Identification of disease comorbidity through hidden molecular mechanisms
Analysis of disease comorbidity patterns in a large-scale China population
Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources
Disease Ontology: a backbone for disease semantic integration
MEDICASCY: A Machine Learning Approach for Predicting Small Molecule Drug Side Effects, Indications, Efficacy and Mode of Action
The reactome pathway knowledgebase
An Elementary Mathematical Theory of Classification and Prediction
A dynamic network approach for the study of human phenotypes
Medical subject headings
Human Disease Ontology 2018 update: classification, content and workflow expansion
The Pathogenesis of Coronary Artery Disease and the Acute Coronary Syndromes
Cardiac disease in chronic obstructive pulmonary disease
Association of asthma with coronary heart disease: A meta analysis of 11 trials
Cardiovascular Biology of Prostanoids and Drug Discovery
Identification of 64 Novel Genetic Loci Provides an Expanded View on the Genetic Architecture of Coronary Artery Disease
… from Contingency Tables, and the Calculation of P
FEXACT: a FORTRAN subroutine for Fisher's exact test on unordered r×c contingency tables
Tissue-based map of the human proteome
Analyzing gene expression data in terms of gene sets: methodological issues
A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
NCI-60 Human Tumor Cell Lines Screen