key: cord-0955988-pw0byt3i authors: Xu, X.; Kawakami, J.; Indika Millagaha Gedara, N.; Riviere, J.; Meyer, E.; Wyckoff, G. J.; Jaberi-Douraki, M. title: Data-driven methodology for discovery and response to pulmonary symptomology in hypertension through AI and machine learning: Application to COVID-19 related pharmacovigilance date: 2021-06-12 journal: nan DOI: 10.1101/2021.06.07.21258497 sha: 9aa485a1dc21f0dad8b6e29a7f6395c1a0400d15 doc_id: 955988 cord_uid: pw0byt3i Potential therapy and confounding factors including typical co-administered medications, patient's disease states, disease prevalence, patient demographics, medical histories, and reasons for prescribing a drug often are incomplete, conflicting, missing, or uncharacterized in spontaneous adverse drug event (ADE) reporting systems. These missing or incomplete features can affect and limit the application of quantitative methods in pharmacovigilance for meta-analyses of data during randomized clinical trials. In this study, we implemented adaptive signal detection approaches to correct spurious association, hidden factors, and confounder misclassification when the covariates are unknown or unmeasured on medications affecting the renin-angiotensin system (RAS), potentially creating an increased risk of life-threatening outcomes in high-risk patients. We consider pulmonary ADE (pADE) profiles in a long-standing group of therapeutics, RAS-acting agents, in patients with hypertension associated with high-risk for COVID-19. Using these techniques, we confirmed our hypothesis that drugs from the same drug class could have very different pADE profiles affecting outcomes in acute respiratory illness. 
Following multiple filtering stages to exclude insignificant and noise-driven reports, we found that drugs from the antihypertensive agents, urologicals, and antithrombotic agents (macitentan, bosentan, epoprostenol, selexipag, sildenafil, tadalafil, and beraprost) form a similar class with a significantly higher incidence of pADEs. Macitentan and bosentan were associated with 64% and 56% of pADEs, respectively. Because these two medications are prescribed in diseases affecting pulmonary function, they are likely to emerge among the drugs with the most reported pADEs and, in fact, serve to validate the methods utilized here. Conversely, doxazosin and rilmenidine were found to have the fewest pADEs among the selected drugs from patients with hypertension. Nifedipine and candesartan were also found by our signal detection methods to form a drug cluster; several studies have shown that this combination is effective in lowering blood pressure and appears to have an improved side effect profile in comparison with single-agent monotherapy.

The coronavirus disease 2019 (COVID-19) pandemic continues with 115,094,614 confirmed cases and over 2.6 million deaths as of March 5, 2021 (1, 2). Surprisingly, it is estimated that as many as 45% of infected individuals may remain asymptomatic, contributing to disease transmission and underlying the disparity in symptomology (3). A commonality of severe clinical course and mortality is comorbid conditions such as diabetes, heart disease, obesity, and hypertension (4). Hypertension was recognized early on as being a prevalent risk factor (5), possibly due to its pervasiveness. Hypertension affects 23% of adults in China, where the original study was conducted, but affects 45% of US adults. Moreover, specific antihypertensive medications, namely angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin-II receptor blockers (ARBs), target proteins of the renin-angiotensin system (RAS) (6).
The RAS is intricately linked to initial infection and possibly the progression of COVID-19 through a RAS receptor, angiotensin-converting enzyme 2 (ACE2), which acts as the viral entry point of coronavirus SARS-CoV-2 (7, 8). In recent years, data science has emerged as a new and important discipline in medicine and healthcare. Different quantitative therapeutic efforts in drug repurposing or repositioning combined with adverse drug event (ADE) identification have led to more efficient therapies while improving the clinical course, lowering fatality, and decreasing cost burden (9). Our previous work focused on the incidence of pulmonary ADEs associated with ACEI and ARB use in patients with hypertension and other comorbidities (10, 11). Our findings indicate that specific drugs, rather than entire classes, have higher incidences of pulmonary ADEs, which may have implications for treating patients diagnosed with COVID-19. Most epidemiological studies are not this granular, as they do not analyze drug effects at the individual drug level but rather compare pharmacological classes. The current study examines additional drugs that more broadly target hypertension, including pulmonary hypertension, to describe methods used to identify clinically important patterns of ADE data. We utilized the Anatomical Therapeutic Chemical (ATC) classification system from the World Health Organization (WHO) Collaborating Center for Drug Statistics Methodology (https://www.whocc.no/). The ATC system classifies drugs based on site of action in addition to chemical, pharmacological, and therapeutic properties (12).

medRxiv preprint, doi: https://doi.org/10.1101/2021.06.07.21258497; this version posted June 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Here we identify clear signals that distinguish individual drugs in patients with hypertension as an underlying medical condition, using disproportionality analysis to quantify how anomalous and unexpected a reported ADE is for a given drug. For this purpose, we proceeded with a specific pairwise analysis of individual drugs compared to the drug classes using a modified empirical Bayes method to identify any distinctions between drugs within a class and compared to other classes. In our previous work, thirteen different pulmonary ADEs were selected based on clinical importance, and because they were prevalent among the top reported symptoms in patients with COVID-19, to assess the related variation due to adverse event differences (10, 11). In the present work, we include 25 pulmonary, infectious disease, or cardiac-associated ADEs. Our novel method identifies extraneous causes of differential reporting, including sampling variance and selection biases, by reducing the effect of covariates. This method is both adaptive (it removes different covariates for different drugs) and appropriate for systematic application and routine analysis (13). We hypothesize that drugs from the same class based on the ATC classification system could have different ADE profiles. For this purpose, a penalized regression method is used to detect clusters of drugs, which may differ from the ATC classification, and the clusters are validated by the Friedman test (14) (15) (16).
Safety signals for a specific drug and associated adverse events are then identified and evaluated through different methods, such as the proportional reporting ratio (PRR) (14), the relative reporting ratio (RR) (17), the information component (IC) (18), and the empirical Bayes geometric mean (EBGM) (17). These methods calculate the ratio of an ADE for a drug compared to the same event occurring with other drugs; however, PRR and RR are more liberal when the incidence of an event is small (19).

Here we briefly explain the data preprocessing and cleansing that will be used in different subsections. The focus of each subsection is given by the amount of data that will be used. A total of 480,236 spontaneous ADE reports for patients with hypertension were retrieved from our 1DATA databank of the FAERS database from the first quarter of 2004 until the first quarter of 2020. Alternatively, ADEs can be categorized by drug for a total of 612,733 reports (Table 1) arising from patients taking more than one drug. For example, a single ADE reported for a patient taking two different drugs will generate one ADE report for each drug. This hypertension dataset was aggregated to 1520 ADEs in HLT codes corresponding to 1131 drugs with unique active substances. Next, drugs that were reported fewer than 500 times were excluded, each accounting for less than approximately 0.1% of the data. Furthermore, 98.8% of the data corresponded to 134 of the 1131 drugs (Table 1 [...] with the column excluded from the analysis since we did not have any reports for these ADEs: congenital lower respiratory tract disorders, lower respiratory tract radiation disorders, parasitic lower respiratory tract infections, respiratory tract neoplasms NEC, and viral lower respiratory tract infections.
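The two preprocessing steps described above (one record per drug for each report, then exclusion of rarely reported drugs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the 1DATA pipeline: the function names, the `(drugs, ade)` tuple layout, and the toy data are hypothetical; only the 500-report threshold comes from the text.

```python
from collections import Counter


def explode_reports(reports):
    """Turn each (drugs, ade) report into one (drug, ade) record per drug.

    A report listing two drugs therefore yields two records, as in the text.
    """
    return [(drug, ade) for drugs, ade in reports for drug in drugs]


def filter_rare_drugs(records, min_reports=500):
    """Keep only records whose drug appears in at least `min_reports` records."""
    counts = Counter(drug for drug, _ in records)
    return [(drug, ade) for drug, ade in records if counts[drug] >= min_reports]


# Toy data (hypothetical); a small threshold is used so the effect is visible.
toy_reports = [
    (["drug_a", "drug_b"], "cough"),
    (["drug_a"], "pleural effusion"),
]
toy_records = explode_reports(toy_reports)
toy_kept = filter_rare_drugs(toy_records, min_reports=2)
```

With the toy data, the first report contributes one record for each of its two drugs, and `drug_b` is then dropped because it appears only once.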
One of the frequentist methods, the relative reporting ratio (RR), which is a disproportionality measure of a drug-ADE occurrence compared to other drug-event combinations, was applied to evaluate the weighting of drugs. To start our first analysis, we constructed a large contingency table for the entire data from the 134 selected drugs based on their frequencies with respect to all 1520 reported ADEs in HLT codes from MedDRA. We imposed the assumption that an ADE is selected when RR > 2 for a specific drug, in order to assess drug disproportionality in pharmacovigilance data by observed-expected ratios prior to the EBGM analysis, which is a more conservative and accurate way of evaluating disproportionality. Taking into account only the 25 pulmonary ADEs in HLT codes, we then obtain the results of Table 2, which displays the top 22 drugs with their corresponding number of pulmonary ADEs when RR > 2. The drugs are ordered according to the EBGM results after GLASSO elimination and the clustering given in Table 1, which will be explained below. RR is also utilized to calculate the baseline frequency for EBGM and to construct the PCA, as explained below. [...] Table S1 in Supporting Information and will be reviewed in the discussion.
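As an illustration of the screening step above, the following sketch computes RR as an observed-to-expected ratio from a drug-by-ADE contingency table and flags pairs with RR > 2. It assumes the expected count is taken under independence of drug and ADE (row total times column total divided by the grand total), which is a common choice but is not spelled out in the text; the function names and toy counts are hypothetical.

```python
def relative_reporting_ratio(counts):
    """Return RR = N / E for every (drug, ade) pair in `counts`.

    `counts` maps (drug, ade) -> observed count N. The expected count E is
    computed under independence: E = drug_total * ade_total / grand_total.
    """
    grand_total = sum(counts.values())
    drug_total, ade_total = {}, {}
    for (drug, ade), n in counts.items():
        drug_total[drug] = drug_total.get(drug, 0) + n
        ade_total[ade] = ade_total.get(ade, 0) + n
    rr = {}
    for (drug, ade), n in counts.items():
        expected = drug_total[drug] * ade_total[ade] / grand_total
        rr[(drug, ade)] = n / expected
    return rr


def flag_signals(rr, threshold=2.0):
    """Return the set of (drug, ade) pairs whose RR exceeds the threshold."""
    return {pair for pair, value in rr.items() if value > threshold}


# Toy contingency table (hypothetical counts).
toy_counts = {
    ("drug_a", "ade_x"): 8,
    ("drug_a", "ade_y"): 2,
    ("drug_b", "ade_x"): 10,
    ("drug_b", "ade_y"): 80,
}
toy_rr = relative_reporting_ratio(toy_counts)
toy_flags = flag_signals(toy_rr)
```

In the toy table, `drug_a` accounts for 10 of 100 reports and `ade_x` for 18 of 100, so the expected count for that pair is 1.8 and the observed count of 8 gives an RR of about 4.4, the only pair above the threshold.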
While the RR method is widely utilized due to its simplicity and user-friendly processing, it is difficult to dismiss its high variability for infrequent occurrences. The assessment of drugs or ADEs based on RR is variable because of information that the RR methodology does not include, such as underreported or overreported events. To assess the effect of the RR methodology when a small number of ADE occurrences is compared to the whole database, the 5th percentile of EBGM (EB05), the lower confidence limit, was used as a very conservative alternative, and the results were compared to RR. This assessment, performed using EBGM, is reported in the same way as the prevalence evaluation using RR values above. The frequencies of a single drug having multiple ADEs in HLT groups or a single HLT ADE occurring with multiple drugs were calculated. We then found that the top ten drugs with pulmonary ADEs consisted of AHAs, ATAs, and UAs. Bosentan, tadalafil, treprostinil, and beraprost were ranked substantially differently by EBGM than by RR with respect to pulmonary ADEs.

[Figure 1 caption fragment] (2) quinapril, (3) trandolapril, (4) nilvadipine, (5) azosemide, (6) azelnidipine, and (7) treprostinil. An interactive figure can be found on the 1DATA home page: https://1data.life/pages/publication/figure1B.html.

This suggests that the conservative EBGM method with a 5th percentile cut-off will allow for the examination of large datasets of ADEs when high variability is present in the number of ADEs
across drugs or drug classes, and still allow for a robust reporting methodology compared to the RR methodology. This allows analysis of very large sets of drugs and ADEs (such as the approximately 500,000 × 134 matrix here) without loss of sensitivity or over-emphasis on ADEs from infrequently prescribed drugs.

The total number of distinct drugs used by patients with hypertension was 134 after filtering out drugs with very low frequency (<0.001) in the PCA section. EBGM data were used to construct the new feature matrix for different drug classes. Then 44 drugs were selected based on two conditions (Table 1): (1) the lower confidence limit of EBGM, EB05, of the drug was larger than one, and (2) a minimum of two different pulmonary ADEs was associated with the drug. We found that few drugs among the ACEIs, diuretics, and combinations tended to cause pulmonary issues. More than half of the drugs were ARBs, AHAs, ATAs, and CCBs when considering two different pulmonary ADEs at the HLT level. After these two filtering steps, the 44 drugs were set as the input for the penalized regression GLASSO. To have an adequate number of correlated drugs, the tuning parameter of GLASSO was adjusted to shrink the less associated drugs to 0, which accounted for 50% of the selected drugs. The drug-drug correlation matrix with shrinkage is displayed in a circular layout, depicting drug class and associations between drugs from different classes (Fig 2). For drugs in ACEIs, ARBs, AHAs, and BBAs, no association was observed between drugs within the same class.
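The two selection conditions above can be read as "keep a drug if at least two distinct pulmonary ADEs have EB05 > 1 for it". The sketch below implements that reading; it is one interpretation of the stated criteria, and the function name, dictionary layout, and toy values are hypothetical.

```python
def select_drugs(eb05, pulmonary_ades, min_ades=2):
    """Select drugs with at least `min_ades` pulmonary ADEs whose EB05 > 1.

    `eb05` maps (drug, ade) -> EB05 (lower limit of EBGM).
    `pulmonary_ades` is the set of HLT terms treated as pulmonary.
    """
    hits = {}
    for (drug, ade), value in eb05.items():
        if ade in pulmonary_ades and value > 1.0:
            hits.setdefault(drug, set()).add(ade)
    return sorted(drug for drug, ades in hits.items() if len(ades) >= min_ades)


# Toy EB05 values (hypothetical).
toy_eb05 = {
    ("drug_a", "pleural effusion"): 2.4,
    ("drug_a", "pneumothorax"): 1.3,
    ("drug_b", "pleural effusion"): 3.0,
    ("drug_b", "headache"): 5.0,       # not a pulmonary term
    ("drug_c", "pneumothorax"): 0.8,   # EB05 does not exceed 1
}
toy_selected = select_drugs(toy_eb05, {"pleural effusion", "pneumothorax"})
```

Only `drug_a` passes in the toy data: `drug_b` has a single qualifying pulmonary ADE and `drug_c` has none.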
More within-class associations were depicted in AHAs, CCBs, and combinations. Fig 2A shows the association between the remaining 22 drugs after the elimination process from the penalized regression GLASSO. After these stringent filtering methods, drug classes exhibit very low significant [...] of drugs listed in Table 2 is calculated based on the original 44 drugs from the EBGM scores, and here we only show the arrangement for the remaining 22 drugs out of the 44. Based on the estimated RR, beraprost showed 13 pulmonary ADEs, more than any other drug used for patients with hypertension. Macitentan and selexipag tied as the second most commonly reported drugs, each with 10 pulmonary ADEs. In contrast, EBGM corrected the position of beraprost, moving it from the top drug with the most pulmonary issues down to tenth place. The assessment of bosentan and tadalafil also changed radically depending on whether the comparative analysis was done using RR or EBGM.
From GLASSO and Table 2, we can now obtain the ADE profiles in HLT groups for each drug in the newly identified groups, which we call GLASSO (GL) Clusters. The ADEs together with the drug classes from ATC and GL Clusters based on EB05 > 1 are arranged in Table 3 and depicted by an arc diagram in Fig S3, Supporting Information. It is apparent from Fig 2 and Table 3 that GL Cluster 1 consists of the most strongly associated drugs with the most pulmonary ADEs assessed by EBGM.

To test for significant differences between drugs grouped by the original ATC classes and by the GL Clusters, which were derived from a shrinkage correlation matrix, a non-parametric Friedman test was applied separately to compare the magnitude of differences among drugs in the same group for the ATC classes and for the GL Clusters. Table 4 summarizes the p-values for the different comparative analyses in the ATC classes and the GL Clusters. A p-value of 0.199 indicates no differences in EBGM of pulmonary ADEs among the drugs in GL Cluster 1 when tadalafil is excluded. Similarly, GL Clusters 2, 3, 4, 5, and 6 each showed no significant differences in EBGM (Table 4). However, when drugs were grouped by their original ATC class, the Friedman test did show significant differences in six of the ATC classes before GLASSO. When the same test was applied to the 22 drugs selected by GLASSO, only drugs in UAs showed no significant differences in EBGM of pulmonary ADEs. This shows that the isolated groups from GLASSO, rather than groups of drugs from the same ATC class, were homogeneous. [...] ACEIs, ARBs, AHAs, ATAs, BBAs, CCBs, COMBs, TDAs, and UAs) in Table S5-A in Supporting Information [...]
= 0.0072, Fig 3). Pairwise comparisons showed no significant differences between any two ATC classes based on the adjusted p-value (Table S5-A in Supporting Information). However, using the drug classes determined by GLASSO, the Wilcoxon signed-rank test between groups revealed significant differences in EBGM of pulmonary ADEs between GL Cluster 1 and GL Clusters 3, 4, and 5, respectively, in contrast to the pairwise comparisons between ATC groups, Table S6 [...]

The future of large-scale biomedical science is data-driven decision-making and AI knowledge-based development and validation. AI-enabled technologies can help in better understanding disease indication occurrence and disease determinants or patterns. Quantitative methods have been applied extensively in various medical fields of study, e.g. measurement of disease frequency, prevalence, or incidence; evaluation of sources of bias and variation in observational studies; multivariate data analysis of risk factors, such as applied logistic regression analysis; machine learning for survival analysis or analysis of time-at-risk (survival) data; and boosting power for clinical trials using AI-assisted analysis. In our study, we aimed to apply AI-driven methodologies involving EBGM and GLASSO techniques in predicting SARS-CoV-2 comorbidity for high-risk populations with hypertension.
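For readers unfamiliar with the Friedman test used in the group comparisons reported above, the following sketch computes its test statistic. It assumes that each row (block) is one pulmonary ADE and each column is one drug in the group, with EBGM as the value; that layout, the function names, and the toy numbers are assumptions, and the sketch omits the tie correction and the p-value. Under the null hypothesis the statistic is approximately chi-square distributed with k − 1 degrees of freedom for k drugs.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    `blocks` is a list of n rows, each holding one value per group (k values).
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Toy EBGM values (hypothetical): 3 ADEs (rows) by 3 drugs (columns).
toy_blocks = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [0.5, 1.0, 4.0],
]
toy_q = friedman_statistic(toy_blocks)
```

In the toy data every ADE ranks the three drugs in the same order, so the rank sums are 3, 6, and 9 and the statistic is 6, the largest value possible for three blocks and three groups.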
Quantitative methods, i.e., PRR, RR, ROR, and EBGM, have been used to detect signals in spontaneously reported data. After filtering data by quantitative methods, we proposed that selecting drug-ADE pairs based on the drug association mechanism would be a valuable procedure for clinical review and comparison of similar drugs with similar ADE profiles. In this study, we demonstrated a systematic way of filtering and selecting data that addresses the noise inherent to such data. None of these methods is free from false positive and false negative signals; however, EBGM and the Information Component (IC) are recommended over other quantitative methods when evaluated by mean average precision (16). This helped us to build a model to understand the bias-variance tradeoff and achieve a balance between the two desirable but incompatible features. Given the absence of a gold standard, no available method is overwhelmingly better than the others (18). The confirmatory methods proposed in this study (GLASSO and the Friedman test) for assessing quantitative methods could reveal the strengths and drawbacks of the methods.

Drugs from different branches in the 3D plot represent distinctive effects of pulmonary ADEs on the separation. For example, PC3 is dominated by fungal, PC2 by more pleural and vascular, and PC1 by respiratory tract effects (see Table S1 in Supporting Information). PCs were constructed using the expected counts of a drug and a pulmonary ADE through a linear combination. The spatial separation of drugs indicated that drugs at the perimeter of each branch (numbered) performed disparately regarding pulmonary ADE profiles, suggesting they may not best be managed as having ADE profiles defined by their class.
This figure shows the optimal representation of three active variables in biplots acquired by PCA by diminishing the effect of supplementary variables that have no or little influence on the pulmonary ADEs.

Using the Friedman test, we found that these separated drugs differ significantly from the other drugs in their class and from other drug classes. The consistency of the Friedman test and GLASSO in capturing EBGM signals of drugs used in small and large populations could make them a beneficial tool for comparative drug analysis. Xu et al. (20) and Stafford et al. (10) have already applied two methods in pharmacovigilance to animal and human data separately. This study proposed and successfully combined penalized regression with the non-parametric Friedman test to better visualize drug-drug and drug-ADE associations. The RR method is widely utilized due to its simplicity and user-friendly processing. RR, however, may be highly variable for small occurrences of an event. Our assessment of drugs or ADEs based on RR showed unstable performance, especially for hidden information. The estimates for events with small occurrences compared to the whole database were also inflated. To correct these issues, we introduced the 5th percentile of EBGM (EB05), the lower confidence limit, as a conservative alternative to RR. EBGM detected that 16 out of 25 pulmonary ADEs in MedDRA were associated with macitentan, followed by bosentan with 14 pulmonary ADEs. Both of these drugs belong to the endothelin receptor antagonist class of drugs and are utilized in pulmonary arterial hypertension to prevent vasoconstriction, fibrosis, and inflammation in vascular endothelium and smooth muscle (32).
Both drugs are proposed to curb pulmonary vascular resistance to prevent right heart failure and death; however, the pulmonary ADEs of both drugs can be of major concern compared to the outcomes of several other antihypertensive agents we utilized in this study. At the same time, because these two medications are used in a disease affecting pulmonary function and commonly reported ADEs include therapeutic failure, it is not surprising that these drugs emerge among those with the most reported pulmonary ADEs; in fact, they serve to validate the methods utilized in this paper. Conversely, doxazosin and rilmenidine were found to have the fewest pulmonary ADEs among the selected drugs from patients with hypertension, since only two ADE signals were detected based on EBGM. Although it can be used in hypertension, doxazosin is primarily utilized for men with benign prostatic hyperplasia and works by blocking alpha-adrenergic receptors in the vascular smooth muscle, resulting in vasodilation (33). Additionally, studies in countries outside of the US suggest that rilmenidine, a sympatholytic, has a favorable ADE profile for patients with hypertension and diabetes; it is not approved in the US (34). After excluding GL Cluster 1, we saw almost the same results for the remaining GL clusters. These results are shown in Tables S2, S3, S4, S5-B, and S6-B as well as Figs S1 and S2 in Supporting Information.
The second group found by EBGM and GL clustering consisted of two drugs, one from the CCBs (nifedipine) and one from the ARBs (candesartan), grouped in combination (Fig 2), which showed four similar pulmonary ADEs: parenchymal lung disorders NEC, pneumothorax and pleural effusions NEC, lower respiratory tract inflammatory and immunologic conditions, and fungal lower respiratory tract infections. Several studies of these drugs showed that the combination is effective in lowering blood pressure in patients with hypertension and appears to have an improved side effect profile in comparison with single-agent monotherapy (35) (36) (37) (38). This is an interesting finding from our EBGM analysis, and it demonstrates how these two drugs can be combined and investigated for pharmacokinetic assessment in drug development, including bioavailability and bioequivalence, drug safety pharmacovigilance, and efficacy and comparative tolerability of the combination of nifedipine and candesartan (39, 40).

Our previous work showed that quinapril and trandolapril were significantly different from other drugs in the ACEI and ARB classes (11). Their separation from their drug class was initially observed in Fig 1 when the PCA biplot was produced. However, these two drugs are no longer present when more precautionary methods are applied, for several reasons: (1) the dataset is no longer the same as before, when it contained only ACEIs or ARBs. [...] (5) The whole purpose of this study was to use EBGM as a much more accurate method than RR, and RR estimation is in turn better than the PRR method used before.
(6) The implementation of the filtering process of penalized regression GLASSO helps eliminate the insignificant and noise-driven reports.

Two drugs, tadalafil and sildenafil, are also used for the modulation of dopaminergic pathways and modifying risk factors to prevent and treat erectile dysfunction. When curating the data in our database for the medicinal products containing tadalafil and sildenafil as active ingredients, the top products were found to be Adcirca (n=32446) and Revatio (n=21358), respectively, marketed for the treatment of pulmonary arterial hypertension, and Cialis (n=15623) and Viagra (n=20820), respectively, marketed to treat erectile dysfunction. We also assessed whether these drugs only show up at high doses. This confirmed that the dose has an insignificant effect on the outcome of ADEs; data are given in Table S7 in Supporting Information.

This study aimed to reveal the potential risk of pulmonary issues in patients using antihypertensive drugs. As part of our future work, our database will be updated with MedDRA 24.0, which contains the new COVID-19 terms introduced in response to the outbreak. This has encouraged us to include terms related to viral infections, which will facilitate the capture of ADEs caused by COVID-19 in patients with hypertension in the near future. In addition, the pulmonary ADEs in HLT codes in this study were filtered by the highest level, system organ class (SOC), with the focus on respiratory, thoracic, and mediastinal disorders (n=28) and the infection class containing viral infection (n=2).
We plan to include ADEs from the class of blood and lymphatic system disorders, such as thrombosis, coagulation, or platelet disorders. In the big data era, spontaneous reports from different data sources, including the FDA FAERS database (21), the Vaccine Adverse Event Reporting System (VAERS) (41, 42), and the WHO International Database, are increasing in size. Drug profiles based on ADEs can be established using quantitative methods, and retrieving known signals or detecting new signals in large numbers of reports by different methods, combined with clinical review, is needed for pharmacovigilance.

To derive the desired information from datasets, there are a few main methodological steps in this study. In the following, we briefly illustrate procedures in our workflow integrated by machine [...] steps in the preparation and analysis of the ADE database to make a decision and interpret our results; each step is detailed in the following subsections: 1. Working hypothesis: drugs from the same drug class could have different pulmonary ADE profiles affecting outcomes in acute respiratory illness, with potential implications in SARS-CoV-2 infection. 2. Designing error correction techniques for data scrubbing and retrieval. 3. Implementing a data exploration technique for initial data analysis to visually explore and understand the characteristics of the data from post-marketing drug safety surveillance. 4.
Data curation and annotation to organize and integrate data collected from various sources, including the FDA, MedDRA, and the ATC classification. This phase entails annotation, organization, clustering, and presentation of the assorted data types from the 1DATA databank. 5. ADE-associated information retrieval for patients with hypertension, which provides massive collections of reports to investigate adverse drug events based on comparative population data analysis. 6. Integration of machine learning models. 7. Acquiring results after data preprocessing and cleansing that significantly reduces the size of the data and eliminates insignificant and noise-driven reports. 8. and 9. Enhancing decision and interpretation, respectively, via data-driven machine learning to help identify incidences of pulmonary ADEs for potential therapy and confounding factors that may have implications for treating patients diagnosed with COVID-19.

As a part of data cleaning, we were also challenged by multiple technical issues when combining drugs: (i) there were many drug names that did not follow a specific standard. (ii) Formulations of [...] The data were integrated into the 1DATA databank (www. [...] Level Term (LLT)] for coding ADE reports (22). This study aggregates raw ADE reports to terms from the HLT and SOC levels. ATC classification is likewise an internationally applied hierarchical system for active drug substances based on site of action (organ or system) and mechanistic properties (therapeutic, pharmacological, and chemical). Drugs in this study were grouped according to ATC classification. Data integration into 1DATA occurred through the [...]
(20, 23). ADEs cause approximately 30 billion dollars a year in added health care expenses, along with negative, including fatal, health outcomes (20). The practice of prescribing drugs based on information from drug preapproval labeling may misrepresent or understate the incidence and prevalence of specific ADEs. The FDA defines the term 'adverse event' as: "any untoward medical occurrence associated with the use of a drug in humans, whether or not considered drug related, including the following: an adverse event occurring in the course of the use of a drug product in professional practice; an adverse event occurring from drug overdose whether accidental or intentional; an adverse event occurring from drug abuse; an adverse event occurring from drug withdrawal; and any failure of expected pharmacological action" (24, 25).
The main method used in this study, Bayesian shrinkage, is based on a baseline frequency, which is the relative risk or relative reporting ratio (RR). It compares a drug-ADE count, $N$, to its expected count, $E$, so that $RR_{ij} = N_{ij}/E_{ij}$ for drug $i$ and ADE $j$. For instance, when $N_{ij}/E_{ij}$ equals 100, drug $i$ and ADE $j$ were reported together 100 times as frequently as the baseline frequency predicts. Two drug-ADE pairs with very different numbers of occurrences can have similar, or even statistically identical, RR values because $E$ is in the denominator, although the underlying counts reflect different degrees of sampling variation. When ADE $j$ is reported with drug $i$ more often than expected from the frequency of the same ADE across the database, $RR_{ij} > 1$.
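The ratio described above can be illustrated with a minimal Python sketch. It assumes that the expected count is the one obtained if drug and ADE were reported independently, $E_{ij}$ = (row total × column total) / grand total; the text does not spell out how $E$ is computed, so this is an assumption. The function name, drug names, ADE names, and counts are all hypothetical and are not taken from FAERS.

```python
from collections import defaultdict


def relative_reporting_ratios(counts):
    """Return {(drug, ade): (N, E, RR)} for a dict of report counts.

    Assumption (not stated in the paper): E_ij is the count expected if drug
    and ADE were reported independently, i.e.
    E_ij = (row total * column total) / grand total.
    """
    drug_totals = defaultdict(int)
    ade_totals = defaultdict(int)
    for (drug, ade), n in counts.items():
        drug_totals[drug] += n
        ade_totals[ade] += n
    grand_total = sum(counts.values())

    result = {}
    for (drug, ade), n in counts.items():
        expected = drug_totals[drug] * ade_totals[ade] / grand_total
        result[(drug, ade)] = (n, expected, n / expected)
    return result


# Hypothetical report counts for two drugs and two ADEs.
counts = {
    ("drug_A", "dyspnoea"): 40,
    ("drug_A", "headache"): 10,
    ("drug_B", "dyspnoea"): 10,
    ("drug_B", "headache"): 40,
}
rr = relative_reporting_ratios(counts)
for pair, (n, expected, ratio) in sorted(rr.items()):
    print(pair, n, expected, round(ratio, 2))
```

In this toy table every pair has the same expected count, so pairs reported more often than expected have RR above 1 and the others fall below 1.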
Drug-ADE surveillance should be triggered when large RR scores show up for specific drug-ADE pairs. However, RR is unreliable for drug-ADE pairs with small counts, where a high value of RR might be accidental.
Principal component analysis (PCA) was performed on the log expected value of RR, log(E), to analyze ADEs for different drugs and to reduce the features of the drug-ADE matrix. The distinct clusters from PCA plots were used to compare the similarities of drugs based on . PCA was conducted using the built-in function PCA in R (version 3.6.3, R Core Team, GNU GPL v2); PCA biplots were produced using the R package factoextra, and 3D PC plots were produced using the R package plotly.
DuMouchel (17) proposed an empirical Bayes approach based on the Gamma-Poisson Shrinker (GPS) algorithm to bring down the inflated value of RR due to small counts without impacting RR associated with large counts. Thus, the drug profile based on ADEs could be reconstructed with reduced variation in RR. GPS redefines $RR_{ij}$ as $\lambda_{ij} = \mu_{ij}/E_{ij}$, drawn from a prior distribution that is a mixture of two gamma distributions, where $\mu_{ij}$ is the mean of the Poisson distribution of counts for drug $i$ and ADE $j$. The shrinkage abates vagueness by reducing RR scores to a conservative level, which helps to alleviate false-positive signals and avoids arbitrary drug-ADE assessment. The R package openEBGM was used to implement the GPS method (26).
The profile of each drug comprises the empirical Bayes geometric mean (EBGM) of all ADEs. The Pearson correlation matrix was constructed based on the EBGM between pairs of drugs. The vector $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iq})$ for $i \in \{1, 2, \ldots, n\}$ denotes the EBGM values corresponding to drug $i$.
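To make the effect of the shrinkage concrete, the following Python sketch computes an EBGM score, exp(E[log λ | N]), for a single drug-ADE pair under a prior that is a mixture of two gamma distributions. It is an illustration under stated assumptions, not the openEBGM implementation used in the study: in GPS the five hyperparameters are estimated from the data, whereas the values in `HYPER` below are made up by hand, and the function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward, using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_marginal(n, e, alpha, beta):
    """Log probability of count n when lambda ~ Gamma(alpha, rate=beta)
    and n | lambda ~ Poisson(lambda * e) (a negative binomial)."""
    return (math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
            + alpha * math.log(beta / (beta + e))
            + n * math.log(e / (beta + e)))


def ebgm(n, e, alpha1, beta1, alpha2, beta2, p):
    """exp(E[log lambda | n]) for the prior
    p * Gamma(alpha1, beta1) + (1 - p) * Gamma(alpha2, beta2)."""
    log_w1 = math.log(p) + log_marginal(n, e, alpha1, beta1)
    log_w2 = math.log(1.0 - p) + log_marginal(n, e, alpha2, beta2)
    # Posterior weight of the first component, normalised in log space.
    top = max(log_w1, log_w2)
    w1 = math.exp(log_w1 - top)
    w2 = math.exp(log_w2 - top)
    q = w1 / (w1 + w2)
    # Each posterior component is Gamma(alpha + n, beta + e).
    mean_log = (q * (digamma(alpha1 + n) - math.log(beta1 + e))
                + (1.0 - q) * (digamma(alpha2 + n) - math.log(beta2 + e)))
    return math.exp(mean_log)


# Hypothetical hyperparameters, chosen by hand rather than estimated.
HYPER = dict(alpha1=0.2, beta1=0.1, alpha2=2.0, beta2=4.0, p=1.0 / 3.0)

# Two pairs with the same raw RR = N / E = 10 but very different counts.
small = ebgm(1, 0.1, **HYPER)
large = ebgm(100, 10.0, **HYPER)
print(round(small, 2), round(large, 2))
```

Both pairs have a raw RR of 10, but with these made-up hyperparameters the pair backed by a single report is pulled much closer to 1 than the pair backed by 100 reports, which is the behavior the shrinkage is meant to provide.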
The Pearson correlation method determines the associations between pairwise vectors of reported drugs, which are the elements of the correlation matrix. This adjacency matrix was highly dense ($n \times n$), and it is difficult to graph the network when too many drugs (1131) are present. A penalized regression method, the graphical least absolute shrinkage and selection operator (GLASSO), was then introduced to encourage sparsity in the adjacency matrix in order to plot high-dimensional graphs from the correlation matrix (27). The R package huge was used to perform GLASSO (28).
The MedDRA hierarchy is multi-axial; for example, "influenza" is a PT-level term and is encompassed within two SOCs, "Respiratory, thoracic and mediastinal disorders" and "Infections and infestations". Therefore, the columns of EBGM calculations in the drug-ADE matrix involve HLTs from the "Respiratory, thoracic and mediastinal disorders" and "Infections and infestations" SOCs. For better visualization, ADE columns of one drug were put in a block with other rows being zeros. The dimension of a drug-ADE matrix was expanded from (m×q) to (m×mq) where m(