key: cord-0218361-atpqk4k0 authors: Donnat, Claire; Miolane, Nina; Bunbury, Frederick de St Pierre; Kreindler, Jack title: A Bayesian Hierarchical Network for Combining Heterogeneous Data Sources in Medical Diagnoses date: 2020-07-27 journal: nan DOI: nan sha: a8f2d3218e7025d47807574a8359f0887826826b doc_id: 218361 cord_uid: atpqk4k0 Computer-Aided Diagnosis has shown stellar performance in providing accurate medical diagnoses across multiple testing modalities (medical images, electrophysiological signals, etc.). While this field has typically focused on fully harvesting the signal provided by a single (and generally extremely reliable) modality, fewer efforts have utilized imprecise data lacking reliable ground truth labels. In this unsupervised, noisy setting, the robustification and quantification of the diagnosis uncertainty become paramount, thus posing a new challenge: how can we combine multiple sources of information -- often themselves with vastly varying levels of precision and uncertainty -- to provide a diagnosis estimate with confidence bounds? Motivated by a concrete application in antibody testing, we devise a Stochastic Expectation-Maximization algorithm that allows the principled integration of heterogeneous, and potentially unreliable, data types. Our Bayesian formalism is essential in (a) flexibly combining these heterogeneous data sources and their corresponding levels of uncertainty, (b) quantifying the degree of confidence associated with a given diagnostic, and (c) dealing with the missing values that typically plague medical data. We quantify the potential of this approach on simulated data, and showcase its practicality by deploying it on a real COVID-19 immunity study. Current medical diagnoses are most often based on the combination of several data inputs by medical experts, typically including (i) clinical history, interviews, and physical exams, (ii) laboratory tests, (iii) electrophysiological signals, and medical images. 
Advances in the machine learning (ML) community have highlighted the potential of ML to contribute to the field of Computer-Aided Diagnosis (CAD), for which we distinguish two main classes of methods. Single modality analysis. Most recent efforts in the ML community have focused on analyzing a single medical data source -called a "modality". For instance, the parsing of electronic health records (EHR) for clinical history, patients' interviews, and physical exams has shown huge potential for the diagnosis of a broad range of diseases, from coronary artery disease to rheumatoid arthritis [1] [2] [3] [4] [5] . ML algorithms have been developed to process various types of laboratory results, from urine steroid profiles to gene expression data [6] [7] [8] [9] . Others have shown success in automatically processing and classifying medical signals such as electrocardiograms (ECG) [10] [11] [12] [13] , or electroencephalograms (EEG) [14] . Finally, a large body of work has focused on medical imaging analysis encompassing a vast number of tasks, such as automatic extraction of diagnostic features [15, 16] , segmentation of anatomical structures [17] [18] [19] [20] [21] [22] , or direct diagnosis through image classification [23, 24] . While these works achieve record-breaking diagnostic accuracy, they often rely on supervised learning approaches -requiring the diagnosis ground-truth to be available during training -and on the acquisition of large datasets. Furthermore, they focus solely on a single type of data input (EEGs, scans, etc.) -often acquired by clinicians using specialized, highly accurate equipment -and do not harvest the potentially rich and complementary sources of information provided by alternative medical modalities. 
Yet, with the development of at-home diagnostic tests (lateral flow assays, questionnaire data for disease screening, mobile health apps, etc.), this paradigm shifts, and diagnoses have to be established through the combination of multiple sources of cheaper, yet often noisier and more imprecise data. In light of the uncertainty exhibited by these various inputs, it also becomes indispensable to pair the provision of a diagnosis with a notion of a confidence interval, thereby thoroughly characterizing our state of knowledge given the available data. Multiple modality integration. Interestingly, diagnosis studies fusing different data sources seemed to be more prominent in the first years of CAD. Bayesian networks [25] -also called belief networks -have been used as a crucial decision tool for automatic diagnosis. Such networks provide a biologically-grounded and interpretable statistical framework that integrates the probabilities of contracting the disease given a patient's clinical history, and the probability of developing symptoms or observing positive test results in the presence or absence of the disease. The advantages of these Bayesian networks are that they are (a) fully unsupervised and do not require ground-truth information, and (b) able to provide meaningful results even in low sample regimes. Bayesian networks have thus been implemented in a variety of contexts to integrate clinical data and laboratory results, and diagnose conditions ranging from pyloric stenosis to acute appendicitis [26] [27] [28] [29] [30] . They have also been used to combine clinical data with medical images, and subsequently applied to assess venous thromboembolism [31] , community-acquired pneumonia [32] , head injury prognostics [33] , or to predict tumors [34] [35] [36] . As early as 1994, full integration of clinical data, laboratory results and imaging features was performed to diagnose gallbladder disease [37] . 
However, contrary to our proposed setting, the uncertainty that these Bayesian networks integrate is typically known and controlled. The medical conditions that they study are well characterized by a set of specific questions and physiological exams (e.g. projectile vomiting, potassium levels and ultrasounds in the case of pyloric stenosis). Not only do these inputs provide a very strong signal for the diagnosis, but the uncertainty arising from the different modalities can often be reliably informed by a test manufacturer, extensive medical research or prior ML studies (for diagnostic inputs obtained through classification algorithms). The uncertainty of a diagnostic method, however, is not always well specified: whether it be (i) a physician's assessment of the symptoms associated with a novel or rare disease, (ii) predictions of an ML algorithm whose accuracy has not been fully characterized outside of a curated research environment or reference datasets (such as MNIST, CIFAR), or simply (iii) biological tests whose sensitivity still has to be determined. Multiple noisy modality integration: a COVID-19 case study. To motivate this paper and further understand the limitations of these previous approaches, let us consider a particular use case: COVID-19 antibody testing. Lateral flow immunoassay (LFA) antibody tests for COVID-19 are among the most manageable, affordable and easily deployable options for enabling at-home testing of large populations and providing assessments of immunity. Yet, studies have shown that the sensitivity of these tests remains highly variable and highly contingent on the time of testing. The successful deployment of LFAs thus depends on their augmentation with additional data inputs, such as user-specific risk factors and self-reported symptoms.
The provision of confidence scores is essential to flag potential false negative or positive tests (requiring re-testing or closer scrutiny) and to assess local prevalence levels - both pivotal for researchers and policymakers in the context of a pandemic. Contributions and outline. This applications paper is geared towards the practical integration of noisy sources of information for CAD. Our contribution is two-fold. From a methods perspective, we account for the variability of the inputs by devising a two-level Bayesian hierarchical model. In contrast to existing Bayesian methods for CAD, our model is deeper, and trained using a Stochastic Expectation Maximization (EM) algorithm [38] [39] [40] . The Stochastic EM allows us to overcome the limitations of its non-stochastic counterpart [38] , namely (a) its sensitivity to the starting position, (b) its potentially slow convergence rate, and (c) its possible convergence to a saddle point instead of a local maximum. From an applications perspective, we gear this algorithm to enhance at-home LFA testing in the context of the COVID-19 pandemic. In particular, we wish to (a) quantify the benefit of multimodal data integration when the diagnostics are uncertain, and (b) show how our method can benefit medical experts or researchers in real life. Section 3 presents the Bayesian model for multimodal integration and the Stochastic Expectation-Maximization algorithm that performs principled and scalable inference. Section 5 presents extensive tests of our model on simulated datasets. Section 6 details the results obtained on the COVID-19 dataset and the impact of our method for affordable and reliable at-home test kits. By way of clarifying the challenges that our algorithm proposes to overcome, we present here the COVID Clearance Remote Recovery & Antibody Monitoring Platform study 1 , which motivated our approach.
The purpose of this study is to track the evolution of the immunity of a cohort of adult participants in the UK with various COVID-19 exposure risks. This paper focuses on a subset of 117 healthcare workers, for which we have both questionnaire information and LFA test results. Home-testing Immunoassay Data. Participants are issued packs of home-testing kits with written instructions. These kits identify Immunoglobulin M (IgM) and/or Immunoglobulin G (IgG) specific to SARS-CoV2 in blood samples using three gold-standard methods 2 3 4 . Pictures and typical examples of LFA test results are provided in the supplementary materials. For the LFAs, participants self-report a positive result if IgM and/or IgG are detected by the test in addition to a positive control band. Through the collection of this data, the COV-CLEAR study aims to address questions relating to (a) the quantification of the robustness of the antibody response, and (b) the durability of this response. However, while affordable, we highlight that the LFA tests suffer from vastly varying levels of sensitivity -registering a sensitivity as low as 70% on asymptomatic individuals. Clinical Data: Questionnaire. Additionally, participants are asked to answer a questionnaire, ideally before knowing the result of their LFA test. The form consists of questions related to K = 14 exhibited symptoms (fever, cough, runny nose, etc.) and M = 2 subject-specific risk factors (household size and proportion of members with a suspected or confirmed history of COVID-19). The empirical distributions of the symptoms and the risk factors in this cohort are provided in the supplementary materials. In this section, we describe our principled integration of noisy diagnostic test results, with additional clinical data such as symptom data and subject-specific risk factors. Our approach applies to general heterogeneous medical data where the outputs are binary. 
The latter could be either self-reported answers to questionnaires, clinician-reported physiological exams, outputs of a diagnostic based on an image, abnormalities of laboratory results, etc., making this a widespread and general setting. Thus, while we implement and showcase the method for the particular purpose of applying it to antibody testing, this could be relevant to any medical diagnostic with binary inputs. Denote D the true diagnosis of an individual (healthy/sick), T the outcome of the noisy diagnostic test (positive/negative), S the symptomaticity (symptomatic/asymptomatic), X the symptoms exhibited and Y the subject-specific risk factors. The underlying assumption is that given a true diagnosis D, the symptoms X and the diagnostic test outcome T are independent. In other words, the probability of the diagnostic test being a false negative P[T = 0|D = 1] is independent of the symptoms of a truly infected individual P[X|D = 1]. Similarly, given a diagnosis D, the test outcome T and the exhibited symptoms X are independent of the risk factors Y . We define:
• P(D = 1|Y ) = π β (Y ) the probability of contracting the disease given risk factors Y ,
• P[T = 1|D = 1] = x the sensitivity of the diagnostic test,
• P[T = 0|D = 0] = 1 − y the specificity of the diagnostic test, i.e. y = P[T = 1|D = 0],
• P(S = 1|D = 0) = p 0 the probability of being symptomatic whilst not having been infected by that specific disease (the symptoms could be due to another illness for instance),
• P(S = 1|D = 1) = p 1 the probability of having been symptomatic upon infection,
• P(X k = 1|S = 1, D = 0) = s 0k the probability of exhibiting symptom k when not infected,
• P(X k = 1|S = 1, D = 1) = s 1k the probability of exhibiting symptom k upon infection.
In the above, the diagnostic test may be any biological test (e.g. LFA antibody tests), or the output of a medical imaging classification algorithm provided by an ML research team.
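As a concrete illustration of the conditional-independence structure just described, the generative process can be simulated directly. The sketch below draws one synthetic subject; the function name and all numeric parameter values are our own illustrative choices, not estimates from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_subject(beta, x=0.9, y=0.05, p0=0.2, p1=0.8, M=2, K=14):
    """Draw (Y, D, T, S, X) from the generative model described above.

    x = P(T=1|D=1), y = P(T=1|D=0), p0/p1 = P(S=1|D=0/1);
    the per-symptom probabilities s0k, s1k are held constant for brevity."""
    s0 = np.full(K, 0.1)                       # P(X_k=1 | S=1, D=0), illustrative
    s1 = np.full(K, 0.6)                       # P(X_k=1 | S=1, D=1), illustrative
    Y = rng.normal(size=M)                     # subject-specific risk factors
    pi = 1.0 / (1.0 + np.exp(-Y @ beta))       # P(D=1 | Y), logistic link
    D = rng.random() < pi                      # true (hidden) diagnosis
    T = rng.random() < (x if D else y)         # noisy test, independent of X given D
    S = rng.random() < (p1 if D else p0)       # symptomaticity
    X = ((rng.random(K) < (s1 if D else s0)) & S).astype(int)  # symptoms require S=1
    return Y, bool(D), bool(T), bool(S), X

Y, D, T, S, X = simulate_subject(beta=np.array([1.0, -0.5]))
```

A data-generation step of this kind is what the synthetic validation experiments later in the paper rely on.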
We have here split our binary variables into two classes according to their variability: (a) the test T , for which we have provisional estimates of the sensitivity and specificity (given by the manufacturer) but about which we are still uncertain, and (b) the symptoms X, which carry additional uncertainty in that none of them is extremely specific to COVID-19, nor is their prevalence amongst COVID cohorts very well established. Values for (x, y), p 0 , p 1 , s 0 , s 1 or β may be published, but challenged by complementary research or field experience, or may be completely unavailable. We thus need to account for these inputs' variability and for the relative importance of the different risk factors β. We propose using a hierarchical Bayesian network (see Fig. 9 ) that is consequently deeper than the ones traditionally implemented in the CAD literature [26] [27] [28] [29] [30] . The uncertainty in the test sensitivity x and specificity 1−y is expressed through Beta priors on x and y with parameters (α x , β x ) and (α y , β y ), see Fig. 9 . Similarly, the uncertainty on the probabilities of symptomaticity and of the different symptoms is expressed via Beta priors with respective parameters (α p0 , β p0 ), (α p1 , β p1 ), {(α s 0k , β s 0k ), (α s 1k , β s 1k )} k , see Fig. 9 . When published estimates exist for x, y, p 0 , p 1 , s 0 and s 1 (e.g. as provided by the LFA test manufacturer), we match the moments of the Beta priors with those of the published distributions. If this information is unavailable, we use the non-informative Beta prior with parameters (0.5, 0.5). In the COVID-19 example, we implement a non-informative prior for the probabilities of symptomaticity and symptoms, as the etiology of the disease remains unknown. These priors are then updated during learning as we aggregate the information from the observed and imputed variables.
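The moment-matching step mentioned above has a closed form: a Beta(α, β) distribution with mean m and variance v satisfies m = α/(α+β) and v = αβ/((α+β)²(α+β+1)), which inverts to α = mν and β = (1−m)ν with ν = m(1−m)/v − 1. A minimal sketch, in which the 0.90 ± 0.05 sensitivity figure is purely illustrative:

```python
def beta_from_moments(mean, var):
    """Match Beta(alpha, beta) prior parameters to a published mean and variance.

    Requires 0 < var < mean * (1 - mean); otherwise no Beta distribution fits."""
    assert 0.0 < mean < 1.0 and 0.0 < var < mean * (1.0 - mean)
    nu = mean * (1.0 - mean) / var - 1.0   # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu

# e.g. a manufacturer-reported sensitivity of 0.90 with standard deviation 0.05:
alpha_x, beta_x = beta_from_moments(0.90, 0.05 ** 2)
```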
Lastly, we model the probability π β (Y ) of contracting the disease with a logistic regression on the M risk factors Y . In the case of the COVID-19 dataset, these include county data, profession, size of household, etc. The coefficients β = {β m } m weight the relative importance of each factor Y m , m = 1...M . We express our uncertainty on the possible impact of a given factor m by introducing a Gaussian prior: β ∼ N (0, σ 2 β ), see Fig. 9 . At the subject level, we seek to compute the posterior distribution of the true diagnosis D i , informed by the integration of the observed variables T i , S i , X i , Y i . This posterior yields an estimated diagnosis, together with a credible interval that expresses the confidence associated with our prediction. In the case of COVID-19, this provides individual estimates of each citizen's immunity that may inform their social behavior as lockdown is eased. At the global level, we wish to learn: (i) the distributions of the sensitivity x and specificity y of the diagnostic test as observed in the field; (ii) the distributions of the probabilities p 0 , p 1 of being symptomatic, and s 0 , s 1 of exhibiting specific symptoms, within a population of interest; and (iii) the distribution of the impact β of the risk factors for contracting the disease. In the context of the COVID-19 pandemic, such aggregated figures are pivotal to understand the dynamics of the disease and implement appropriate crisis policies. Since the true diagnoses are hidden variables, the Expectation-Maximization algorithm [41] is an appealing method to perform inference in our Bayesian model, and hence to achieve the aforementioned objectives. However, the EM algorithm requires the computation of an expectation over the posterior of the hidden variables, which may be intractable depending on the probability distributions defined in the model.
To allow for flexibility in the model's design within our multimodal data integration framework, we propose to rely on the Stochastic EM algorithm (StEM) [42]. StEM effectively estimates the conditional expectation in the EM using the "Stochastic Imputation Principle", i.e. approximating the expectation by sampling once from the underlying distribution. This method allows us to carry out inference with priors that are not necessarily the Beta distributions implemented in our experiments, thus providing us with additional flexibility in modeling real data. Furthermore, StEM is more robust, being less dependent on the parameters' initialization than the EM - a definite advantage given our very uncertain framework. Finally, StEM shows better asymptotic behavior: unlike the EM, StEM always leads to maximization of a complete data log-likelihood in the M-step [43]. At the start of iteration j + 1, the current estimates of the model's parameters are θ (j) . The stochastic E-step computes the posterior distribution P(D i |T i , S i , X i , Y i , θ (j) ) of the hidden variable D i . We then sample a diagnosis D̂ i from this posterior. Proposition 4.1. The odds of the posterior of the hidden variable D i at iteration (j + 1) write:

$\frac{P(D_i = 1 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})}{P(D_i = 0 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})} = \frac{\pi_\beta(Y_i)}{1 - \pi_\beta(Y_i)} \cdot \frac{x^{T_i}(1-x)^{1-T_i}}{y^{T_i}(1-y)^{1-T_i}} \cdot \frac{p_1^{S_i}(1-p_1)^{1-S_i}}{p_0^{S_i}(1-p_0)^{1-S_i}} \cdot \prod_{k=1}^{K} \left( \frac{s_{1k}^{X_{ik}}(1-s_{1k})^{1-X_{ik}}}{s_{0k}^{X_{ik}}(1-s_{0k})^{1-X_{ik}}} \right)^{S_i}. \quad (1)$

The following proposition shows the updates of the model's parameters at iteration (j + 1), performed in the M-step. Proposition 4.2. The parameter updates write: x (j+1) and y (j+1) are the modes (or, when the mode is undefined, the means) of the Beta posteriors Beta(α x + Σ i D̂ i T i , β x + Σ i D̂ i (1 − T i )) and Beta(α y + Σ i (1 − D̂ i )T i , β y + Σ i (1 − D̂ i )(1 − T i )); the updates of p 0 , p 1 and {s 0k , s 1k } k are the analogous Beta-posterior modes given the imputed diagnoses D̂ i ; and β (j+1) minimizes the ℓ2-regularized logistic loss, obtained through stochastic gradient descent (full expressions are derived in the supplementary materials). Since our approach is unsupervised, we begin by validating it on synthetic datasets where the ground truth is known and controlled - thus allowing us to characterize the performance of the algorithm, and to showcase the strength of combining multiple noisy modalities. We assume the same generative model as in Fig. 9 and generate synthetic data for various pairs of values of sensitivity and specificity ranging from 60% to 99%. In each case, we simulate N = 100 replicates of the variables (Y i , X i , T i ), i = 1...n, for n ∈ {100, 200, 500, 1000} participants.
We also vary the noise level σ ∈ {0.1, 0.5, 1.0} for the prior on D in Fig. 9 . To mimic our COVID data, we simulate K = 14 symptoms and M = 2 risk factors, for which we randomly select the parameters 5 . Improving upon the sole test T . Fig. 10 (A) and (B) show the Stochastic EM's raw accuracy and accuracy improvement over the sole test result T . Note that this difference is always positive, highlighting that our method consistently improves upon the single diagnostic input - even when one input is more reliable than any of the others. As expected, the most substantial improvements upon T are observed when the test specificity and/or sensitivity are low. For instance, for a sensitivity and specificity of 70%, our method provides an 86.6±(2.5)% accuracy for the diagnosis - or equivalently, a 16% gain over T . We further quantify the algorithm's performance in Table 1 of the supplementary materials, for values of the specificity close to those observed in the field. Benchmarking against other models. We now benchmark our algorithm against other approaches:
• Vanilla Classifier: using both the context Y and symptoms X as inputs, we fit a logistic regression to the test labels T . We choose the regularization parameter using 10-fold cross-validation, and compute confidence intervals for the log-probability of (D|X, T ) by bootstrapping.
• Data-Agnostic EM: we implement a deterministic version of the EM, providing uninformative priors for the parameters (thereby reflecting our absence of knowledge of the truth), which are not updated - i.e., an "uninformed" equivalent of the belief networks found in the literature.
• Data-Informed EM: similar to the Data-Agnostic EM, but choosing the priors (which then remain fixed) based on the empirical data.
The results are shown in Fig. 4 (B), and further completed in the supplementary materials.
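The Vanilla Classifier baseline above can be reproduced with standard tooling. The sketch below uses scikit-learn; the function name, bootstrap count, and array shapes are our own illustrative choices rather than the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def vanilla_classifier(Y, X, T, n_boot=100, seed=0):
    """Logistic regression of the noisy test labels T on risk factors Y and
    symptoms X, with the regularization strength chosen by 10-fold CV and
    bootstrap confidence intervals on the predicted log-probabilities."""
    rng = np.random.default_rng(seed)
    Z = np.hstack([Y, X])                     # concatenate both modalities
    clf = LogisticRegressionCV(cv=10, max_iter=1000).fit(Z, T)
    logp = np.empty((n_boot, len(T)))
    for b in range(n_boot):                   # resample subjects with replacement
        idx = rng.integers(0, len(T), len(T))
        m = LogisticRegressionCV(cv=10, max_iter=1000).fit(Z[idx], T[idx])
        logp[b] = m.predict_log_proba(Z)[:, 1]
    lo, hi = np.percentile(logp, [2.5, 97.5], axis=0)
    return clf, lo, hi
```

Note that, unlike the Bayesian model, this baseline is supervised by the noisy labels T and therefore inherits their errors.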
We highlight the superiority of our deeper and more adaptive hierarchical model, yielding improvements (for a reasonable setting of 80% specificity and sensitivity) of up to 4% over the Data-Informed EM, 8% over the Data-Agnostic one, and 9% over the Vanilla classifier. Assessing the robustness of the Stochastic EM. For a fixed specificity of 80%, Figure 3 shows the accuracy of StEM for different values of σ (A) or as the number of samples increases (B). These axes of variation appear to have little impact on the model's performance, providing evidence that our algorithm is fairly robust, especially with respect to low sample sizes. Convergence. Finally, we assess the convergence properties of our algorithm. We examine the distribution of the relative difference between recovered coefficients and ground truth (expressed as a percentage of the ground-truth value). The plots, provided in the supplementary materials, show deviations that are within a few percent of the true value of the coefficients - thus highlighting the ability of the model to converge to the ground-truth parameters, and making it relevant from a medical perspective to characterize the disease's etiology. Illustrations of the behavior of the number of required EM steps are also provided in the supplementary materials. We turn to the real dataset of n = 117 participants described in Section 2. The purpose of this section is to show how our algorithm can be practically applied to process heterogeneous data types and inform participants and researchers alike. At the individual level, we provide each user with (a) the confirmation of the diagnostic, and (b) confidence intervals that shed light on the uncertainty associated with the diagnostic. At the global level, we provide policymakers with (c) an aggregated analysis of herd immunity, and associated measures of uncertainty. At the subject-level.
We present two examples, where our algorithm either confirms or contradicts the result of the test - thereby allowing for the potential flagging of false negatives. The first example ( Fig. 5 A) is a user who registers a negative test while being asymptomatic and with a limited number of risk factors. In this case, we expect our model to confirm the result of the test, and provide a narrower confidence interval for the probability of immunity - as confirmed by Fig. 5 (A). The second example showcases an instance where the questionnaire and the test disagree. Subject 108 is a user who registers a negative test while exhibiting a large number of known COVID symptoms (dry cough, shortness of breath, fever), but who took the test less than 10 days after their illness. While the posterior does not reclassify the subject's diagnosis, the confidence interval associated with the prediction of immunity reflects the uncertainty associated with this case, and flags it as a potential false negative. At the global level. This framework allows us to perform principled inference for the disease and the population of interest. For this cohort of n = 117 healthcare workers in the UK, at-home tests predict 35.8% immunity. The posterior distributions of the model's parameters shed light on the actual accuracy of the LFA tests on our population. Fig. 6 (A) compares the posteriors of the sensitivity and specificity to the priors built from values reported by manufacturers, for asymptomatic subjects. Furthermore, our results provide information regarding the most prevalent symptoms for COVID-19. For instance, among symptomatic participants, the probability of exhibiting fever is higher for infected subjects (see Fig. 6 (B)), while cough with sputum does not seem to be associated with COVID-19, probably being the sign of another infection (see Fig. 6 (C)). Similar plots for additional symptoms are provided in the supplementary materials.
This applications paper provides a statistical framework for multimodal integration of noisy diagnostic test results, self-reported symptoms, and demographic variables. Compared to previous approaches, our work is more amenable to the handling of the different inputs' uncertainty. While we have focused here on binary symptoms, this adaptive Bayesian framework, together with the flexibility provided by the Stochastic EM algorithm, paves the way for the integration of other continuous-valued inputs (index of the severity of the disease, pain, etc.) as well as interactions - an extension that we are currently in the process of implementing. The robustness of the Stochastic EM is crucial in ensuring the reliability of the parameters given the medical stakes. Finally, we note that the generative model allows for principled handling of missing data, thus allowing us to make diagnoses even in the presence of incomplete data. This section provides additional details on the COV-CLEAR dataset 6 . Figure 7 shows the sampling distributions of selected symptoms and risk factors in the population of interest - grouped by the result of the lateral flow immunoassay (LFA) test. Figure 8 illustrates the LFA test results as observed by a given subject. Participants self-report a positive result if IgM and/or IgG are detected by the test. The marker "C" stands for "Control".
The uncertainty in the test sensitivity and specificity is expressed by putting a prior on x and y, which will be updated during training: The uncertainty on the symptomaticity, as well as on the appearance of specific symptoms for infected and healthy individuals, is expressed with a prior on s 0 and s 1 , updated as we aggregate more information: Lastly, we model the probability for an individual to contract the disease depending on their risk factors, by modeling the log-odds: where the parameters β weight the importance of the different components of the risk factors Y (county data, profession, size of household) in contracting the disease. We express our uncertainty on whether a given factor has an impact by putting a prior on β: In this model:
• ζ = α x , β x , α y , β y , α p0 , β p0 , α p1 , β p1 , {α s 0k , β s 0k } k , {α s 1k , β s 1k } k , {σ βm } m are the hyper-parameters, considered fixed and known,
• θ = x, y, p 0 , p 1 , {s 0k , s 1k } k , {β m } m are the parameters,
• D is a hidden random variable,
• Y, S, X, T are the observed variables.
For simplicity of notation, we write O = (S, X, T ) for some of the observed variables. The Bayesian model is represented in plate notation in Figure 9 . Our objective is two-fold. At the patient level, we compute the posterior distribution of the true diagnosis D i , informed by the integration of the observed variables O i . This distribution gives us an estimate of the true diagnosis, through Maximum a Posteriori, as well as a credible interval that expresses our confidence in this diagnosis. At the global level, we learn the posterior distributions of the parameters x, y, i.e. the sensitivity and specificity of the test - to be compared to the values given by the providers. We also learn the posterior distributions of the parameters s 0 , s 1 , i.e. the probabilities of symptoms with or without the disease - to be compared to the values given by the medical specialists and the CDC.
We also learn the posterior distribution of the parameter β, which weights the impact of each risk factor for contracting the disease. To fulfil this objective, we perform inference in the Bayesian model described in Figure 9 and in the main paper. Since the D i 's are hidden variables, we proceed with the Expectation-Maximization (EM) algorithm, specifically its stochastic version. We seek the Maximum a Posteriori (MAP) of the parameters θ. Given the parameters' estimates, we can compute the posterior distributions of the diagnoses D i 's. The stochastic EM algorithm allows us to jointly compute the posterior distribution of the parameters and the hidden variables D i 's, through an iterative procedure described below. We want to maximize the posterior distribution of the parameters θ: which translates into maximizing the expression: In the above computations, the lower bound is obtained using Jensen's inequality. It represents a tangent lower bound of the log-posterior θ ↦ ℓ(θ) at the given set of parameters θ̃. On the last line, the expectation is taken for D i distributed according to P(D i |O i , θ̃, Y i ). Following the literature on the stochastic EM algorithm, we compute this expectation by replacing it with its Monte-Carlo estimate, sampling one D̂ i according to P(D i |O i , θ̃, Y i ). For simplicity of notation, we denote: The approximate lower bound of the posterior of the parameters writes, after sampling: and we only need to maximize the following function M in its parameters θ: Since D i = D̂ i is now fixed, we further decompose the right-hand side of the above inequality, using the conditional independence of (S i , X i ) and T i given D i .
Using the dependency structure of the variables given by the Bayesian network from Figure 9 , we get: log P(D i = D̂ i |β, Y i ) + log P(x, y) + log P(p 0 , p 1 ) + log P(s 0 , s 1 ) + log P(β) As a result, we find the maximum a posteriori of x, y, p 0 , p 1 , s 0 , s 1 and β separately, by maximizing respectively: To summarize, at iteration (j + 1) of the stochastic EM algorithm:
• Stochastic E-step: For each patient i: - Compute the posterior distribution of their diagnosis P(D i |O i , θ̃, Y i ), - Sample D̂ i from the posterior.
• M-step: Maximize the approximate lower-bound of the parameters' posteriors in θ = (x, y, p 0 , p 1 , s 0 , s 1 , β) by maximizing M (x, y), M (p 0 , p 1 ), M (s 0 , s 1 ) and M (β). This updates the parameters to: until convergence in θ = (x, y, p 0 , p 1 , s 0 , s 1 , β).
At each iteration, this algorithm maximizes a (stochastic approximation of a) tangent lower-bound of the posterior distribution of the parameters. Therefore, each iteration increases the posterior probability of the parameters. As an auxiliary computation, we provide the formula for the joint log-likelihood under our model, which we will use in the E- and M-steps. The joint log-likelihood writes: Using the conditional independence of (S, X) and T given D, we write: We separately compute the probabilities in the above formula, using the probability distributions provided by the generative model. We get, for the term involving the immunoassay test T : For the term involving the symptoms X, we get: For the term involving the symptomatic variable S, we get: For the terms involving the diagnosis D and the parameter β of the logistic regression, we get: where we use the notations:
• π(Y, β) = g(Y β) with g the sigmoid function from the logistic regression,
• n(β; σ 2 β ) the probability density function of the Gaussian N (0, σ 2 β ).
We plug the expressions of the probabilities into the joint log-likelihood, and get: where we omit the normalizing constants from the beta distributions.
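Numerically, the posterior odds implied by this log-likelihood factor across the conditionally independent blocks, which makes the stochastic E-step a few lines of code. The sketch below is a numpy illustration for one subject; the function names and all parameter values are our own placeholders:

```python
import numpy as np

def posterior_odds(T, S, X, pi, x, y, p0, p1, s0, s1):
    """Odds P(D=1 | T,S,X,Y) / P(D=0 | T,S,X,Y): the prior odds from the risk
    factors times the likelihood ratio of each conditionally independent block."""
    odds = pi / (1 - pi)                               # prior odds, pi = P(D=1|Y)
    odds *= (x / y) if T else ((1 - x) / (1 - y))      # diagnostic-test block
    odds *= (p1 / p0) if S else ((1 - p1) / (1 - p0))  # symptomaticity block
    if S:                                              # symptoms only enter if S=1
        odds *= np.where(X == 1, s1 / s0, (1 - s1) / (1 - s0)).prod()
    return odds

def stochastic_e_step(odds, rng):
    """Sample a diagnosis from its posterior (stochastic imputation)."""
    return rng.random() < odds / (1.0 + odds)
```

For instance, with an uninformative prior (pi = 0.5), an uninformative symptomaticity block (p0 = p1), and a positive test with x = 0.9 and y = 0.1, the odds reduce to x/y = 9, i.e. a posterior probability of 0.9.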
In the stochastic E-step, we aim to sample D̂ i from the posterior distribution of the hidden variable D i . Given the current estimate θ (j) of the parameters, we compute the posterior of the diagnosis D i for each patient i: where we plug the expression of the joint log-likelihood from Equation 51. Omitting the coefficients that are shared in both formulae for the probability below, we have: by identification in the expression of the joint log-likelihood from Equation 51. This shows the proposition: Proposition B.1. The odds of the posterior of the hidden variable D i at iteration (j + 1) write:

$\frac{P(D_i = 1 \mid O_i, Y_i, \theta^{(j)})}{P(D_i = 0 \mid O_i, Y_i, \theta^{(j)})} = \frac{\pi_\beta(Y_i)}{1 - \pi_\beta(Y_i)} \cdot \frac{x^{T_i}(1-x)^{1-T_i}}{y^{T_i}(1-y)^{1-T_i}} \cdot \frac{p_1^{S_i}(1-p_1)^{1-S_i}}{p_0^{S_i}(1-p_0)^{1-S_i}} \cdot \prod_{k=1}^{K} \left( \frac{s_{1k}^{X_{ik}}(1-s_{1k})^{1-X_{ik}}}{s_{0k}^{X_{ik}}(1-s_{0k})^{1-X_{ik}}} \right)^{S_i}.$

Following the methodology of the stochastic EM algorithm, we sample from this posterior to obtain D̂ i . In the M-step, we update the parameters of the generative model by maximizing the lower-bound of their posterior distribution. We update each set of parameters separately, using the expressions in Equation 49. Given the D̂ i 's, we update x, y by maximizing: We recognize in this expression the logarithm of the probability distribution of a product of beta distributions in x and in y, omitting the normalization constants: such that M (x, y) writes: The updated sensitivity and specificity, x (j+1) , y (j+1) , maximize M (x, y). They are the modes of the beta distributions: If α x , β x > 1 and α y , β y > 1, the expressions for the modes are: We can have the case β x , β y < 1 if the sensitivity or the specificity is very close to 1. In this case, we update the parameters using the expectation of the Beta distribution, instead of the mode. Given the D̂ i 's, we update p 0 , p 1 by maximizing: We recognize in this expression the logarithm of the probability distribution of a product of beta distributions in p 0 and in p 1 , omitting the normalization constants: The updated probabilities of being symptomatic, in the absence or presence of the disease, p 0 (j+1) , p 1 (j+1) , maximize M (p 0 , p 1 ).
They are the modes of the beta distributions. If α_p0, β_p0 > 1 and α_p1, β_p1 > 1, the modes are p_0^(j+1) = (α_p0 − 1)/(α_p0 + β_p0 − 2) and p_1^(j+1) = (α_p1 − 1)/(α_p1 + β_p1 − 2). We can have β_p0 < 1 or β_p1 < 1 if the probabilities of being symptomatic are very close to 1; in this case, we update the parameter using the expectation of the beta distribution instead of the mode. Given the D̂_i's, we update s_0, s_1 by maximizing M(s_0, s_1). We recognize in this expression the logarithm of the probability distribution of a product of beta distributions in s_0 and in s_1, omitting the normalization constants. The updated probabilities of exhibiting symptoms, in absence or presence of the disease, s_0^(j+1), s_1^(j+1), maximize M(s_0, s_1). They are the modes of the beta distributions. If α_s0, β_s0 > 1 and α_s1, β_s1 > 1, the modes are s_0^(j+1) = (α_s0 − 1)/(α_s0 + β_s0 − 2) and s_1^(j+1) = (α_s1 − 1)/(α_s1 + β_s1 − 2). We can have β_s0 < 1 or β_s1 < 1 if the probabilities of symptoms are very close to 1; in this case, we update the parameter using the expectation of the beta distribution instead of the mode. Given the D̂_i's, we update β by maximizing M(β), where we omit the normalization constant of the Gaussian distribution. We recognize the loss function of the logistic regression, which we optimize using stochastic gradient descent. This gives the update in β. The following proposition summarizes the parameter updates. Proposition B.2. The parameter updates write as above, where the minimization on β is performed through stochastic gradient descent. We apply the stochastic EM algorithm in the Bayesian model truncated at T. In this model: • … are considered fixed and known, • θ = (p_0, p_1, {s_0k, s_1k}_k, {β_m}_m) are the parameters, • D is a hidden random variable, • Y, S, X are the observed variables. We now write O = (S, X). We still wish to maximize the same expression under the new notations stated above. 
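The β update above minimizes the logistic loss penalized by the Gaussian prior N(0, σ²_β). One (full-batch) gradient step can be sketched as follows; the learning rate, function name, and batching strategy are illustrative assumptions, since the paper optimizes with stochastic gradient descent:

```python
import numpy as np

def sgd_update_beta(beta, Y, D, lr=0.05, sigma2=10.0):
    """One gradient step on the negative log-posterior of beta:
    -sum_i [D_i log g(Y_i beta) + (1 - D_i) log(1 - g(Y_i beta))]
    + ||beta||^2 / (2 * sigma2),
    where g is the sigmoid and the quadratic penalty comes from the
    N(0, sigma2) prior on beta."""
    p = 1.0 / (1.0 + np.exp(-(Y @ beta)))  # g(Y beta)
    grad = Y.T @ (p - D) + beta / sigma2   # gradient of the negative log-posterior
    return beta - lr * grad
```

Iterating this step drives β toward the penalized maximum a posteriori estimate given the sampled D̂_i's.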
The approximate lower bound of the posterior of the parameters still writes the same way after sampling one D̂_i according to P(D_i | O_i, θ̂, Y_i). Again, we only need to maximize the following function M in its parameters θ. Since D_i = D̂_i is now fixed, we further decompose the right-hand side of the above inequality, where the component functions are the same as in the case with observed T_i. The only difference is that D̂_i is sampled from the posterior of the truncated model. The joint log-likelihood in the truncated Bayesian model gives the same result as in the non-truncated case, but without the term P(T, x, y | D). We use the computations from subsection B.3.1 to get the final expression of the log-likelihood: we plug the expressions of the probabilities into the joint log-likelihood, where we omit the normalizing constants from the beta distributions. In the stochastic E-step, we aim to sample D̂_i from the posterior distribution of the hidden variable D_i. Given the current estimate θ^(j) of the parameters, we compute the posterior of the diagnosis D_i for each patient i, where we plug in the expression of the joint log-likelihood of the truncated model; omitting the coefficients shared by both formulae for the probability, the posterior odds follow by identification. We now derive the Stochastic EM algorithm in the full Bayesian model, taking into account the missing variables T_i (missing for i = r + 1, ..., n) and the hidden variables D_i. We want to maximize the posterior distribution of the parameters θ, where we write O_i = (S_i, X_i, T_i) when T_i is available and O^M_i = (S_i, X_i) when T_i is missing. This translates into maximizing the corresponding expression, where we use Jensen's inequality to compute the tangent lower bounds independently. This is still a valid tangent lower bound at θ̂, as the sum of tangent lower bounds is a tangent lower bound of the sum. 
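In this full model, the E-step must impute each missing T_i jointly with D_i, drawing from P(D, T | O) ∝ P(D) P(O | D) P(T | D), which reduces to a draw over four joint states. A sketch with illustrative names; `lik_O_given_D` stands for the precomputed symptom likelihoods P(O_i | D_i = 0) and P(O_i | D_i = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_D_T_missing(pi, x, y, lik_O_given_D):
    """Sample (D_i, T_i) jointly when T_i is missing, from
    P(D, T | O) proportional to P(D) * P(O | D) * P(T | D).
    pi: prior P(D_i = 1); x, y: sensitivity/specificity of the
    (imputed) test class; lik_O_given_D = (P(O|D=0), P(O|D=1))."""
    # Unnormalized weights over the four (D, T) states.
    w = np.array([
        (1 - pi) * lik_O_given_D[0] * y,        # D=0, T=0
        (1 - pi) * lik_O_given_D[0] * (1 - y),  # D=0, T=1
        pi * lik_O_given_D[1] * (1 - x),        # D=1, T=0
        pi * lik_O_given_D[1] * x,              # D=1, T=1
    ])
    k = rng.choice(4, p=w / w.sum())
    return k // 2, k % 2                        # (D, T)
```

With uninformative symptom likelihoods, the draws simply reflect the prior on D and the test's error rates.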
After sampling, we need to maximize the following function of θ, where M(θ, n = r) is the cost function in the case without missing T. We compute the second term of the lower bound, which we denote M^NEW; it is M(θ, n = (n − r + 1), T_i → T̂_i). Therefore, M^MIS(θ) is M(θ) where the missing T_i's have been replaced by their imputed values. We proceed with the Stochastic EM algorithm as follows: • E-step: Compute the tangent lower bound. For i = 1, ..., r, sampling D̂_i is performed as usual. For i = r + 1, ..., n, we compute the joint posterior of (D_i, T_i) and sample (D̂_i, T̂_i) according to it. • M-step: Maximize the lower bound in θ. The updates are the same, except that the missing T_i's are replaced by their imputed values. In addition, we can consider that we have a different prior on the imputed T_i's. Therefore, we create a new class of sensitivity/specificity pairs, designed for the imputed T_i. The priors are initialized as non-informative Beta distributions with parameters (2, 2), and updated at training time. This section provides additional details on the validation of the StEM algorithm on synthetic data. Improvement upon T. Benchmarks. We complete here our discussion of the improvement that our method brings upon standard Computer-Aided Diagnosis (CAD) methods. Fig. 10 compares, for various pairs of sensitivity and specificity, the diagnosis accuracies of the Stochastic EM algorithm (StEM) and two variants of the EM algorithm: one variant where the parameters' priors are agnostic, or uninformative, and another variant where these priors are computed from the available data, as described in Section 5. This figure demonstrates that learning the parameters' posterior distributions while estimating the diagnosis yields higher prediction accuracy. Convergence. Fig. 
11 shows the distribution of the relative difference between recovered coefficients and ground truth (as a percentage of the ground-truth value), showing deviations within a few percent of the true coefficient values, thus highlighting the ability of the model to converge to the ground-truth parameters. Fig. 12 shows the average number of convergence steps for different values of the sensitivity, specificity and sample size. Interestingly, we note that for high values of the sensitivity/specificity, convergence is much faster. To illustrate the complexity of the algorithm, we also display in Fig. 13 the time required per iteration as a function of the number of samples. At the subject level. Fig. 14 presents four examples where our algorithm either confirms or contradicts the result of the test, thereby allowing for the potential flagging of false negatives or false positives. The first example (Fig. 14 A) is a user who registers a positive test, together with a significant number of symptoms and risk factors. The second user (Fig. 14 B) registers a negative test, while being asymptomatic and having a limited number of risk factors. In both cases, we expect our model to confirm the result of the test and provide a narrower confidence interval for the probability of immunity, as confirmed by Fig. 14. The third and fourth examples showcase instances where our diagnostic posterior and the test disagree. Subject 108 is a user who registers a negative test while exhibiting a wide range of known COVID symptoms (dry cough, shortness of breath, fever), but took the test less than 10 days after falling ill. Similarly, subject 92 exhibited fewer symptoms (in particular, no shortness of breath), but lives in a household of three, where all the other members have also fallen sick. 
While the posteriors reclassify the subjects' diagnoses in each case, the confidence intervals associated with the prediction of immunity reflect the uncertainty associated with these cases, and flag them as potential false negatives. At the global level: LFA sensitivity and specificity. The posterior distributions of the model's parameters shed light on the actual accuracy of the LFA tests on our population. Fig. 15 compares the posteriors of the sensitivity and specificity to the priors built from values reported by manufacturers, for: (A) asymptomatic subjects, (B) symptomatic subjects who took the test between 2 and 10 days after their first symptoms, and (C) symptomatic subjects who took the test between 11 and 20 days after their first symptoms. We observe that the StEM has updated the prior distributions of sensitivity and specificity, with the most significant update observed for the asymptomatic subjects. At the global level: COVID-19 symptoms in the population of interest. Furthermore, our results provide information regarding the most prevalent symptoms of COVID-19. The posterior probability of exhibiting specific symptoms, among symptomatic subjects with a positive or negative predicted diagnosis, is shown in Fig. 16. We emphasize that these probability distributions are relevant to our population of n = 117 healthcare workers, and may vary for studies considering another population.
References
[1] Regression modelling strategies for improved prognostic prediction
[2] Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease
[3] Naïve Electronic Health Record phenotype identification for Rheumatoid arthritis. AMIA
[4] Predicting risk of emergency admission to hospital using primary care data: Derivation and validation of QAdmissions score
[5] Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records
[6] Novel neural network application for bacterial colony classification
[7] Using machine learning to aid the interpretation of urine steroid profiles
[8] Artificial neural network approach in laboratory test reporting: Learning algorithms
[9] Applications of Bayesian network models in predicting types of hematological malignancies
[10] Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network
[11] Real-Time Patient-Specific ECG Classification by 1-D Convolutional Neural Networks
[12] ECG-Based classification of resuscitation cardiac rhythms for retrospective data analysis
[13] Novel methodology of cardiac health recognition based on ECG signals and evolutionary-neural system
[14] EEG signal classification using universum support vector machine
[15] Feature extraction using convolutional neural network for classifying breast density in mammographic images
[16] Chest pathology identification using deep feature selection with non-medical training
[17] Automatic Segmentation of MR Brain Images with a Convolutional Neural Network
[18] Automatic Segmentation of Liver Tumor in CT Images with Deep Convolutional Neural Networks
[19] Fully Convolutional Networks for Semantic Segmentation
[20] V-Net: Fully convolutional neural networks for volumetric medical image segmentation
[21] 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study
[22] Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields
[23] Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network
[24] Classification of teeth in cone-beam CT using deep convolutional neural network