Bayesian Nonparametric Dimensionality Reduction of Categorical Data for Predicting Severity of COVID-19 in Pregnant Women

Marzieh Ajirak, Cassandra Heiselman, Anna Fuchs, Mia Heiligenstein, Kimberly Herrera, Diana Garretto, Petar Djuric

November 7, 2020

Abstract: The coronavirus disease (COVID-19) has rapidly spread throughout the world, and while pregnant women present the same adverse outcome rates, they are underrepresented in clinical research. We collected clinical data of 155 test-positive COVID-19 pregnant women at Stony Brook University Hospital. Many of the collected data are of multivariate categorical type, where the number of possible outcomes grows exponentially as the dimension of the data increases. We modeled the data within an unsupervised Bayesian framework and mapped them into a lower-dimensional space using latent Gaussian processes. The latent features in the lower-dimensional space were then used to predict whether a pregnant woman would be admitted to a hospital due to COVID-19 or would remain with mild symptoms. We compared the prediction accuracy with that obtained from dummy/one-hot encoding of the categorical data and found that the latent Gaussian process achieved better accuracy.

The coronavirus disease 2019 (COVID-19) has become an unprecedented public health crisis. Around the world, many governments issued a call to researchers in machine learning (ML) and artificial intelligence (AI) to address high-priority questions related to COVID-19. This call was not unusual because ML methods are finding many uses in medical diagnosis applications. The ML field is rich with examples where, based on predictive models, one can estimate disease severity [1] and, consequently, the state of a patient's health [2]-[4]. These models employ data-driven algorithms that can extract features and discover complicated patterns that could not have been recognized or interpreted by humans. Pregnant women are a particularly important patient population to study because of their vulnerability to disease and their frequent underrepresentation in clinical research [5]. Despite studies in this field [6]-[10], data remain relatively sparse with regard to COVID-19 and its effect on pregnancy. Utilizing ML techniques to study this population during the pandemic can help build pregnancy-specific evidence to guide clinical recommendations [11]. (The authors thank the NIH for its support under Award RO1HD097188-01.)

Much of the medical data are of multivariate categorical type, typically representing patients' demographics, maternal comorbidities, pregnancy complications, and disease symptoms. As a result, one has to work with long vectors of categorical variables, which in turn leads to a huge number of possible realizations. This creates very sparse spaces when we deal with a limited amount of data [12]. Moreover, the data include random errors and systematic biases, and sometimes they are missing [13].
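To make the sparsity argument above concrete, the following back-of-the-envelope calculation (in Python) compares the number of possible categorical outcome vectors with the size of the cohort; the variable count and number of levels per variable are illustrative assumptions, not the actual dimensions of the dataset.

```python
# Illustrative only: assume D categorical variables, each with K possible levels.
D, K = 20, 3
n_patients = 155          # size of the cohort in the study

possible_outcomes = K ** D
print(f"Possible categorical outcome vectors: {possible_outcomes:,}")   # 3,486,784,401
print(f"Observed patients: {n_patients}")
print(f"Fraction of the space covered at most: {n_patients / possible_outcomes:.2e}")
```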
By overcoming the challenges that clinical data introduce, one can lay the groundwork for developing more accurate models and efficient algorithms for inference. The success of predictive methods largely depends on feature selection and data representation. A common approach is to have a clinical doctor specify the variables and label the clinical data to be used as training sets. Then the ML method finds mappings and features from the data, which are subsequently tested on new data sets. Although appropriate in many situations, a supervised definition of the features forgoes an opportunity to learn latent patterns and features [4]. To counter the subjectivity of defining the features, an unsupervised learning approach can be used to extract useful information from the data. Another advantage of unsupervised learning is that abstract features of patients can often be represented in low-dimensional spaces and thus can efficiently summarize the information available in the data. This also allows for easy visualization of the cohort of patients under consideration. In the ML literature, categorical latent Gaussian processes provide a data-efficient and powerful Bayesian framework for learning latent functions or patterns [14].

In this paper, we model the categorical data from pregnant women as generated non-linearly from a latent space. More specifically, we map the categorical variables, including maternal comorbidities, pregnancy complications, ABO blood types, etc., into a continuous lower-dimensional space. Then we use these learned features, along with the remaining numerical data (maternal age, BMI, etc.), to predict whether (a) the patient will develop severe symptoms and will return to the hospital due to COVID-19, days after testing positive, or (b) the patient will remain asymptomatic or symptomatic with only mild symptoms. We compared the performance obtained by direct, non-linear dimensionality reduction of the categorical data with that of the one-hot encoding methodology, which is commonly applied in machine learning when dealing with categorical data.

The remainder of this paper is organized as follows. In the next section, we explain the categorical latent Gaussian model first introduced in [14]. Then we introduce an alternative pipeline that deals with categorical data. We test the proposed approach first on synthetic data and then on the original COVID-19 data.

Gaussian process latent variable models (GPLVMs) are Bayesian nonparametric frameworks that allow for unsupervised learning [15]. GPLVMs can be seen as multi-output Gaussian process regressions in which the inputs are unobserved. To be more specific, let $f_n \in \mathbb{R}^K$ be the $n$-th observed data vector of dimension $K$, $n = 1, 2, \ldots, N$. Further, let these data be associated with inputs $x_n \in \mathbb{R}^Q$ through $K$ different functions. If we assume that these functions are independent, then for $f_n$ we can write
$$p\big(f_{nk}(x_n)\big) = \mathcal{N}\big(f_{nk}; 0, k(x_n, x_n)\big), \quad k = 1, 2, \ldots, K,$$
where $f_{nk}(x_n)$ represents the $k$th dimension of $f_n(x_n)$ and where the notation $\mathcal{N}(f_{nk}; 0, k(x_n, x_n))$ means that the random variable $f_{nk}(x_n)$ is Gaussian with mean zero and variance defined by the covariance function $k(x_n, x_n)$. In order to learn the dimensionality of the latent space automatically, we use the concept known as Automatic Relevance Determination (ARD) with the kernel
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{q=1}^{Q} \alpha_q \big(x_q - x'_q\big)^2\right),$$
where the inverse length-scales $\alpha_q$ determine the relevance of each latent dimension. In GPLVMs, $X \in \mathbb{R}^{N \times Q}$ is a matrix of latent variables, and therefore we assign it a prior density. A typical approach is to use the standard Gaussian distribution, and thus we have
$$p(X) = \prod_{n=1}^{N} \mathcal{N}\big(x_n; 0, I_Q\big),$$
where the $x_n$'s are the rows of $X$.
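As a minimal illustration of the GPLVM prior just described, the NumPy sketch below samples latent inputs from the standard Gaussian prior and draws the $K$ output functions from independent zero-mean GPs with an ARD squared-exponential kernel; the specific kernel variance, inverse length-scales, and dimensions are assumptions for illustration only, not the settings used in the paper.

```python
import numpy as np

def ard_rbf_kernel(X1, X2, variance=1.0, inv_lengthscales=None):
    """ARD squared-exponential kernel:
    k(x, x') = variance * exp(-0.5 * sum_q alpha_q (x_q - x'_q)^2),
    where a small alpha_q effectively switches off latent dimension q."""
    Q = X1.shape[1]
    alpha = np.ones(Q) if inv_lengthscales is None else np.asarray(inv_lengthscales)
    diff = X1[:, None, :] - X2[None, :, :]               # (N1, N2, Q) pairwise differences
    sqdist = np.sum(alpha * diff ** 2, axis=-1)          # (N1, N2) weighted squared distances
    return variance * np.exp(-0.5 * sqdist)

rng = np.random.default_rng(0)
N, Q, K = 50, 2, 3                                        # observations, latent dims, outputs

# Prior on latent inputs: x_n ~ N(0, I_Q), stacked as the rows of X.
X = rng.standard_normal((N, Q))

# Independent zero-mean GP prior over each of the K output dimensions.
Kxx = ard_rbf_kernel(X, X, variance=1.0, inv_lengthscales=[1.0, 0.1])
Kxx += 1e-6 * np.eye(N)                                   # jitter for numerical stability
F = rng.multivariate_normal(np.zeros(N), Kxx, size=K).T   # (N, K) latent function values

print(F.shape)                                            # (50, 3): one column per output
```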
By defining the matrix of observations $F \in \mathbb{R}^{N \times K}$, where the rows represent the multiple outputs $f_n$, we wish to compute the marginal likelihood of the data,
$$p(F) = \int p(F \mid X)\, p(X)\, dX.$$
The authors in [15] developed a variational Bayesian approach for the marginalization of the latent variables $X$, allowing them to optimize the resulting lower bound on the marginal likelihood with respect to the hyperparameters. They further used the lower bound for model comparison and automatic selection of the latent dimensionality.

We now consider the discrete version of the GPLVM, where for each input $x_n$ we observe a discrete variable $y_n$ that can take values $1, \ldots, K$ with probabilities
$$p(y_n = k \mid f_n) = \frac{\exp\big(f_{nk}(x_n)\big)}{\sum_{k'=1}^{K} \exp\big(f_{nk'}(x_n)\big)}. \qquad (6)$$
In the multivariate case, we have $y_n \in \mathbb{R}^D$. Next, we consider a generative model for a dataset $Y \in \mathbb{R}^{N \times D}$ with $N$ observations and $D$ categorical variables. We denote the $d$-th variable in the $n$-th observation by $y_{nd}$. Now we express (6) as
$$p(y_{nd} = k \mid f_{nd}) = \frac{\exp(f_{ndk})}{\sum_{k'=1}^{K} \exp(f_{ndk'})}, \qquad (7)$$
where $f_{ndk}$ is a function of the input variable $x_n \in \mathbb{R}^Q$, i.e., $f_{ndk} = F_{dk}(x_n)$.

Next, we summarize the generative model (the indices below have the following meaning: $n$ refers to an observation, $d$ to the dimension of the output, $m$ to an inducing point (defined below), and $k$ to a category):
$$x_{nq} \sim \mathcal{N}\big(0, \sigma_x^2\big), \qquad (8)$$
$$F_{dk} \sim \mathcal{GP}\big(0, k_d(\cdot, \cdot)\big), \qquad (9)$$
$$f_{ndk} = F_{dk}(x_n), \qquad (10)$$
$$u_{mdk} = F_{dk}(z_m), \qquad (11)$$
$$y_{nd} \sim \mathrm{Softmax}\big(f_{nd1}, \ldots, f_{ndK}\big), \qquad (12)$$
where $x_{nq}$ and $F_{dk}$ are latent variables with prior distributions given by (8) and (9), respectively, with $\mathcal{GP}$ signifying a Gaussian process (GP). Further, the $z_m$'s are inducing inputs, $m = 1, 2, \ldots, M$, and the $u_{mdk}$'s are inducing outputs whose role is explained below. We note that we assume a Gaussian prior with variance $\sigma_x^2$ for $x_{nq}$ and a GP prior for each of the functions $F_{dk}$. We reiterate that for each vector of latent function values $f_{dk}$, we introduce a separate set of $M$ variational inducing variables $u_{dk}$, evaluated at a set of inducing input locations from the set $Z = \{z_1, z_2, \ldots, z_M\}$. It is assumed that all $u_{dk}$'s are computed at the same inducing locations. The inducing variables are function points drawn from the GP prior and lie in the same space as the $F$ variables (Fig. 1). A pictorial description of the generative model is displayed in Fig. 2.

The marginal log-likelihood is intractable because of the covariance function of the GP and the nonlinear Softmax likelihood in (12). We consider a variational approximation to the posterior distribution of $X$, $F$, and $U \in \mathbb{R}^{M \times D \times K}$ factorized as
$$q(X, F, U) = q(X)\, q(U)\, p(F \mid X, U).$$
By applying Jensen's inequality, we can write a lower bound of the log-evidence (ELBO) as
$$\log p(Y) \geq \mathcal{L} = \sum_{n=1}^{N} \sum_{d=1}^{D} \mathbb{E}_{q}\big[\log p(y_{nd} \mid f_{nd})\big] - \mathrm{KL}\big(q(X)\,\|\,p(X)\big) - \mathrm{KL}\big(q(U)\,\|\,p(U)\big).$$
The lower bound is still intractable because of the softmax likelihood $\log p(y_{nd} \mid f_{nd})$. Therefore, we compute the lower bound $\mathcal{L}$ and its derivatives with the Monte Carlo method. We draw samples of $x_n$, $U_d \in \mathbb{R}^{M \times K}$ (see Fig. 1), and $f_{nd}$ from $q(x_n)$, $q(U_d)$, and $p(f_{nd} \mid x_n, U_d)$, respectively, and estimate $\mathcal{L}$ with the sample average. We consider a mean-field variational approximation of the latent points $q(X)$ and a joint Gaussian distribution for $q(U)$,
$$q(X) = \prod_{n=1}^{N} \prod_{q=1}^{Q} \mathcal{N}\big(x_{nq}; m_{nq}, \sigma_{nq}^2\big), \qquad q(U) = \prod_{d=1}^{D} \prod_{k=1}^{K} \mathcal{N}\big(u_{dk}; \mu_{dk}, \Sigma_d\big),$$
where the covariance matrix $\Sigma_d$ is shared across the categories of the same categorical variable $d$. The KL divergence terms in $\mathcal{L}$ can be computed analytically with the given variational distributions. We need to optimize the hyperparameters of each GP (the parameters of $K_d$), the parameters $\mu_{dk}$ and $\Sigma_d$ of the variational distributions of the $u_{dk}$'s, and the means $m_{nq}$ and variances $\sigma_{nq}^2$ of the latent inputs.
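The Monte Carlo handling of the softmax term can be made concrete with the short sketch below, which estimates $\mathbb{E}_q[\log p(y_{nd} \mid f_{nd})]$ by averaging a numerically stable log-softmax over Gaussian samples of $f_{nd}$; the mean, variance, and sample count are illustrative assumptions, and in the full model the samples of $f_{nd}$ would come from $p(f_{nd} \mid x_n, U_d)$ with $x_n$ and $U_d$ drawn from their variational distributions.

```python
import numpy as np

def log_softmax_likelihood(f_nd, k_observed):
    """log p(y_nd = k | f_nd) for the softmax link in (7),
    computed in a numerically stable way."""
    f = f_nd - np.max(f_nd)                      # stabilize the exponentials
    return f[k_observed] - np.log(np.sum(np.exp(f)))

rng = np.random.default_rng(1)
K = 3                                            # number of categories
k_obs = 1                                        # observed category of y_nd

# Stand-in for the conditional p(f_nd | x_n, U_d): here a diagonal Gaussian
# with an assumed mean and standard deviation (purely illustrative).
mean_f = np.array([0.2, 1.5, -0.3])
std_f = np.array([0.5, 0.5, 0.5])

# Monte Carlo estimate of E[ log p(y_nd | f_nd) ] with S samples.
S = 2000
samples = mean_f + std_f * rng.standard_normal((S, K))
mc_estimate = np.mean([log_softmax_likelihood(s, k_obs) for s in samples])
print(f"Monte Carlo estimate of the expected log-likelihood: {mc_estimate:.3f}")
```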
We first test the approach on synthetic data. Consider a categorical variable $y$ that can take the values 001, 010, and 100 (or blue, red, and green). The input variable $x$ comes from a space of patients. Further, let three functions $F_{11}$, $F_{12}$, and $F_{13}$ model the $f_{ndk}$'s that are used for computing the probability of each category (the first index of the functions refers to the dimension, which in this example equals one). For instance, $F_{11}(x_n)$ is proportional to the probability of $y$ for patient $n$ with input $x_n$ being 001. Similarly, we define $f_{n12}$ and $f_{n13}$ for the categories 010 and 100, respectively. We perform the inference using the introduced method (Fig. 3(a)) and compare it with the one-hot encoding of the categorical variables (Fig. 3(b)). We observe that with one-hot encoding followed by the GPLVM, the structure of the latent space is distorted. The first two dimensions of $x$ are shown in Fig. 4(c). Although a one-dimensional manifold is detected, the points at the boundary of the two clusters are clearly distorted.

We used data on 155 test-positive COVID-19 pregnant women collected at Stony Brook University Hospital (SBUH). The dataset is composed of categorical variables including patients' symptoms, maternal comorbidities, pregnancy complications, race, employer type, insurance, known sick contact, and ABO blood type. It also contains numerical data including age, BMI, gravidity, parity, and admission lab values. The lists of categorical and numerical variables are given in Tables I and II.

We first reduced the dimension of the categorical data by mapping them into a lower-dimensional space using the discrete GPLVM. Next, we used the extracted latent features combined with the numerical variables for the supervised task of binary classification. For comparison, we converted the categorical variables to one-hot features and then applied the standard GPLVM. For classification we employed Random Forest, Naïve Bayes, AdaBoost, k-Nearest Neighbours (kNN), Support Vector Machine (SVM), and Logistic Regression. We compared the performances of the methods by the Area Under the ROC Curve (AUC) and the classification accuracy, computed from the numbers of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) predictions, e.g.,
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The results are summarized in Tables III and IV. They suggest that the performance of almost all classifiers improved by using the discrete GPLVM. The best performance among all classifiers was achieved by Random Forest. It appears that with dimensionality reduction using the discrete GPLVM we compress information better than with the GPLVM applied to one-hot encodings.

We also mapped the data for the task of visualization of the cohort. Figure 5 shows the visualization of the patients using the discrete GPLVM with the latent dimension set to $Q = 2$. We observe that the latent features of the asymptomatic patients or patients with mild symptoms (blue circles) are well clustered and somewhat separated from the patients who were hospitalized or who were admitted to the ICU (red circles and red crosses).

Figure 5. Visualization of the patients. Blue circles represent asymptomatic patients or patients with mild symptoms, red circles represent patients who were hospitalized for COVID-19, and red crosses represent patients who were admitted to the ICU.
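The classification stage described above can be reproduced in outline with a scikit-learn pipeline such as the one below; the feature matrices, labels, classifier settings, and cross-validation protocol are placeholders, since the paper's exact training setup is not given here, and the one-hot branch is used directly as features (without the intermediate GPLVM step) to keep the sketch short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
N = 155                                   # cohort size in the study

# Placeholder data: in the actual pipeline, Z_latent would be the discrete-GPLVM
# latent features, X_cat the raw categorical variables, X_num the numerical
# variables, and y the binary label (admitted to hospital for COVID-19 or not).
Z_latent = rng.standard_normal((N, 2))
X_cat = rng.integers(0, 3, size=(N, 8))
X_num = rng.standard_normal((N, 5))
y = rng.integers(0, 2, size=N)

# Feature set 1: discrete-GPLVM latent features combined with numerical variables.
X_gplvm = np.hstack([Z_latent, X_num])

# Feature set 2: one-hot encoded categorical variables combined with numerical variables.
onehot = OneHotEncoder().fit_transform(X_cat).toarray()
X_onehot = np.hstack([onehot, X_num])

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for clf_name, clf in classifiers.items():
    for feat_name, X in [("discrete-GPLVM", X_gplvm), ("one-hot", X_onehot)]:
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{clf_name:20s} {feat_name:15s} AUC = {auc:.3f}")
```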
In this paper, we modeled multivariate categorical data using Gaussian process latent variable models to predict whether a pregnant woman would be admitted to the hospital due to COVID-19. In our approach, we used a data-efficient Bayesian framework for reducing the dimension of high-dimensional categorical data. Our tests with synthetic data showed that the method is capable of finding latent structures in the data. Further, the results on test-positive COVID-19 pregnant women suggest that the method discovered latent structures that were useful for further classification of the data.

References

[1] Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity.
[2] Severity detection for the coronavirus disease 2019 (COVID-19) patients using a machine learning model based on the blood and urine tests.
[3] Artificial intelligence and machine learning to fight COVID-19.
[4] Deep patient: An unsupervised representation to predict the future of patients from the electronic health records.
[5] Coronavirus disease 2019 during pregnancy: Do not underestimate the risk of maternal adverse outcomes.
[6] COVID-19 infection among asymptomatic and symptomatic pregnant women: Two weeks of confirmed presentations to an affiliated pair of New York City hospitals.
[7] Perinatal depressive and anxiety symptoms of pregnant women during the coronavirus disease 2019 outbreak in China.
[8] Clinical characteristics and intrauterine vertical transmission potential of COVID-19 infection in nine pregnant women: A retrospective review of medical records.
[9] Coronavirus disease 2019 (COVID-19) pandemic and pregnancy.
[10] Clinical course of severe and critical coronavirus disease 2019 in hospitalized pregnancies: A United States cohort study.
[11] Early prediction of mortality risk among severe COVID-19 patients using machine learning.
[12] An Introduction to Categorical Data Analysis.
[13] Statistical Analysis with Missing Data.
[14] Latent Gaussian processes for distribution estimation of multivariate categorical data.
[15] Bayesian Gaussian process latent variable model.