Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data PAPER Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data Zi-Hang Wen,1 Jeremy L. Langsam,2 Lu Zhang,3 Wenjun Shen4, ∗ and Xin Zhou2, 5, 6 ∗ 1School of Optical and Electronic Information, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, Hubei, China, 2Department of Biomedical Engineering, Vanderbilt University, 2301 Vanderbilt Place, 37235, Nashville, USA, 3Department of Computer Science, Hong Kong Baptist University, Room R708, Sir Run Run Shaw Building, Kowloon Tong, Hong Kong, 4Department of Bioinformatics, Shantou University Medical College, No. 22 Xinling Road, Shantou, 515041, Guangdong, China, 5Department of Computer Science, Vanderbilt University, 2301 Vanderbilt Place, 37235, Nashville, USA and 6Data Science Institute, Vanderbilt University, Sony Building, 1400 18th Ave S Building, Suite 2000, 37212, Nashville, USA ∗Corresponding authors: maizie.zhou@vanderbilt.edu; wjshen@stu.edu.cn FOR PUBLISHER ONLY Received on Date Month Year; revised on Date Month Year; accepted on Date Month Year Abstract Single-cell RNA-seq (scRNA-seq) offers opportunities to study gene expression of tens of thousands of single cells simultaneously, to investigate cell-to-cell variation, and to reconstruct cell-type-specific gene regulatory networks. Recovering dropout events in a sparse gene expression matrix for scRNA-seq data is a long-standing matrix completion problem. We introduce Bfimpute, a Bayesian factorization imputation algorithm that reconstructs two latent gene and cell matrices to impute final gene expression matrix within each cell group, with or without the aid of cell type labels or bulk data. Bfimpute achieves better accuracy than other six publicly notable scRNA-seq imputation methods on simulated and real scRNA-seq data, as measured by several different evaluation metrics. Bfimpute can also flexibly integrate any gene or cell related information that users provide to increase the performance. Availability: Bfimpute is implemented in R and is freely available at https://github.com/maiziezhoulab/Bfimpute. Key words: single cell; RNA-seq; imputation; Bayesian factorization Introduction Single-cell RNA-seq (scRNA-seq) has been widely used to study genome-wide transcriptomes in single cell resolution. The cellular resolution made possible by scRNA-seq data distinguishes it from bulk RNA-seq and makes it advantageous in investigating cell-to-cell variation [1]. Today, different commercial platforms are available to perform scRNA-seq, including Fluidigm C1, Wafergen ICELL8 and 10X Genomics Chromium. Droplet-based methods via 10X Genomics Chromium can process tens of thousands of cells; microwell- based, microfluidic-based methods via Fluidigm C1 and Wafergen ICELL8 process fewer cells but with a higher sequencing depth. For all these platforms, missing values make up a large proportion of scRNA-seq data, ranging from 40% - 90% in the gene expression count matrix [2, 3, 4, 5, 6]. In scRNA-seq data, this large percentage of missing events is defined as the so-called ‘dropout’ phenomenon [7]. Gene ‘dropout’ means a gene is observed at a moderate expression level in one cell but it is not detected in another cell of the same type. Analyses of scRNA-seq data, including dimensionality reduction, clustering, and Differential Expression (DE) analysis have shown that effective imputations for dropout events improve downstream analyses and assist biological interpretations [8, 9, 10, 11]. To date, several notable imputation methods have been proposed: scImpute [12], DrImpute [13], MAGIC [14], SAVER [15], VIPER [16] and SCRABBLE [17]. scImpute first performs clustering to identify cell subpopulations and further identifies dropout events through a Gamma-Normal mixture model, finally imputes dropout events by a non-negative least squares regression [12]. DrImpute optimizes the step of identifying cell subpopluations to impute dropout events by averaging the imputation from multiple clustering results [13]. MAGIC builds a Markov affinity-based graph for imputation relying on cell to cell interactions [14]. SAVER uses a Bayesian- based model by various prior probability, and alters all gene expression values [15]. VIPER imputes dropout events relying on local neighborhood cells via non-negative sparse regression models [16]. SCRABBLE has been recently introduced to impute dropout events by adopting the bulk RNA-seq data [17]. Even though a lot of efforts have been taken into analyzing and imputing real dropout events, imputation of dropout events is still a difficult problem because of the high dropout rate and complex cellular heterogeneities for different scRNA-seq datasets. Relying on matrix completion to 1 .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint email:email-id.com https://github.com/maiziezhoulab/Bfimpute https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ 2 Wen et al. Latent vectors True values Dropouts Training Imputing Gene1 Gene2 GeneN Cell1 CellM G1 T G2 T GN T C1 CM Expression Matrix E Cell Latent Matrix C p(E|G,C,α) = N i=1 M j=1 N(Ei j |Gi TCj,α −1) Iij Ê = GTC · · · = × · · · · · · · · · · · · · · · · · · · · · · · · · · · Gi TCj Gene Latent Matrix GT Fig. 1. A brief illustration blueprinting the architecture of Bfimpute method. In each group, Bfimpute borrows information from true values and factorizes the expression matrix into two latent matrices using MCMC. After training, Bfimpute imputes dropouts by performing product of the latent matrices. The details are shown in Methods section. impute missing values is a long-standing question and has been investigated in biological sciences, including gene expression prediction, miRNA–disease, protein-protein interaction [18] etc. Even though similar mathematical models could be applied to different biological problems, to solve matrix completion problem in scRNA-seq (recovering the dropout events), it is crucial to take the features of scRNA-seq into consideration. Most of existing scRNA-seq imputation methods have shown it is advantageous for imputation to borrow and leverage information from similar cells. In recent years, researchers also start to integrate additional gene or cell related information (e.g. bulk data for SCRABBLE) to assist imputation which is important in matrix completion problem. In this study, we present Bfimpute, a powerful imputation tool for scRNA-seq data that recovers dropout events by factorizing the count matrix into the product of gene-specific and cell-specific feature matrices [19, 20]. Bfimpute uses full Bayesian inference to describe the latent information for genes and cells and carries out a Markov chain Monte Carlo scheme which is able to easily incorporate any gene or cell related information to train the model and perform the imputation [18] (Figure 1). We demonstrate that Bfimpute performs better than the six other notable published imputation methods mentioned above (scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE) in both simulated and real scRNA-seq datasets on improving clustering and differential gene expression analyses and recovering gene expression temporal dynamics (pseudotime analysis) [21]. Methods Cell clustering and dropout detection Bfimpute first provides an optional normalization step to smooth the gene expression values (counts per million, followed by logarithm base 10 with bias 1.01). Bfimpute then performs a local imputation within each cell group. We adopt the same approach as scImpute [12] to detect cell clusters, which applies spectral clustering methods on the result of Principal Component Analysis (PCA) to reduce the impact of dropout events. We integrate spectral clustering by using the ’Spectrum’ function of the Spectrum R package [22] or the ’specc’ function of the kernlab R package [23]. Bfimpute also adopts the Gamma-Normal mixture distribution model from scImpute to determine dropout events [12]. Probabilistic model for scRNA-seq expression matrix imputation After above-mentioned steps, we adapted a multi-variate priors model from Bayesian Probabilistic Matrix Factorization (BPMF) [20] to recover dropouts for scRNA-seq datasets. Since every cell group is mathematically equivalent, we arbitrarily choose one to demonstrate local imputation in Bfimpute. Suppose we have N genes and M cells in one cell group, and the expression matrix is E ∈ RN×M. Each entity Eij represents the expression level of gene i in cell j. Bfimpute factorizes E into G ∈ RD×N and C ∈ RD×M which are defined as gene and cell latent matrix, respectively, where D is the dimension of the latent factor. Column vector Gi and Cj represent the gene-specific and cell-specific latent vector, respectively. The imputed matrix to recover E will be given as Ê = GTC. .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Bfimpute 3 We introduce the Gaussian noise model for the gene expression profile E with precision α, which was firstly proposed by Probabilistic Matrix Factorization (PMF) [19]: p(E|G,C,α) = N∏ i=1 M∏ j=1 [ N(Eij|Gi T Cj,α −1 ) ]Iij (1) where Iij is the indicator function that is 0 if the Eij is a dropout and equal to 1 otherwise. To get use of gene or cell related information such as bulk data or other data user provided, we add entity features SG ∈ RFG×N and SC ∈ RFC×M as gene and cell feature matrix, respectively, where FG and FC are the dimentionalities of these additional features. The Gaussian model for the prior distributions over genes and cells latent vectors adapted from Macau [18] will be given by: p(Gi|SGi ,µG, ΛG,βG) = N(Gi|µG + βG TSGi , Λ −1 G ) p(Cj|SCj ,µC, ΛC,βC) = N(Cj|µC + βC TSCj , Λ −1 C ) (2) where {µG,µC} and {ΛG, ΛC} are the means and precisions, and βG ∈ RFG×D and βC ∈ RFC×D are the weight matrices for the entity features. Weight initialization by a zero mean normal distribution is used and they will be updated iteratively by the Bayesian inference steps (details described later). Also, direct imputation of single cell RNA-seq data could be applied by initiating zeros into feature vectors SG and SC(where FG = FC = 1) if no additional information is given. To perform Bayesian inference, we introduce the priors referring to BPMF [20] for {µG, ΛG} and {µC, ΛC}. p(µG, ΛG|µ0,β0,ν0,W0) = N(µG|µ0, (β0ΛG) −1 ) ×W(ΛG|W0,ν0) p(µC, ΛC|µ0,β0,ν0,W0) = N(µC|µ0, (β0ΛC) −1 ) ×W(ΛC|W0,ν0) (3) where W is the Wishart Distribution with ν0 as the degrees of freedom and W0 as the scale matrix. We also set a zero mean normal distribution as βG and βC’s priors and a gamma distribution as the problem dependent αG and αC’s hyperpriors adapted from Macau [18]: p(βG|ΛG,αG) = N(vec(βG)|0, ΛG−1 ⊗ (αGI)−1) p(βC|ΛC,αC) = N(vec(βC)|0, ΛC−1 ⊗ (αCI)−1) (4) p(αG|k,θ) = G(αG|k/2, 2θ/k) p(αC|k,θ) = G(αC|k/2, 2θ/k) (5) where vec(βX) is the vectorization of βX, ⊗ represents the Kronecker product and αX is the precision (X ∈ {G,C}). k/2 and 2θ/k are shape and scale, respectively. k and θ are hyperparameters which are set to 1. Gibbs sampler to impute dropout events We use Markov Chain Monte Carlo (MCMC) algorithm to train Bfimpute, which is a sampling based approach to tackle the Bayesian inference problem. Bfimpute constructs a Markov Chain from a random initial value and after running the chain for K̃ steps, it will eventually converge to its stationary distribution. Bfimpute then uses the average of (K − K̃) stationary stages to approximate the real distribution of E and gain the estimated values Êij for dropouts: p(Êij|E,G,C) ≈ 1 K − K̃ K∑ k=K̃+1 p(Êij|Gi (k) ,Ci (k) ,α) (6) More specifically, Bfimpute chooses Gibbs sampler to achieve Bayesian matrix factorization. In every cycle, we sample the conditional distribution from the posterior distribution in Bayes’ theorem. Since the probabilistic models of genes and cells are symmetric, the conditional distributions over genes and the conditional distribution over cells have the same form. In particular, based on (1) and (2), the conditional probability for Gi is: p(Gi|E,C,α,S G i ,µG, ΛG,βG) = N(Gi|µ (G)′ i , Λ (G)′ i ) (7) ∝ M∏ j=1 [ N(Eij|Gi T Cj,α −1 ) ]Iij ×p(Gi|S G i ,µG, ΛG,βG) where  Λ (G)′ i = ΛG + α ∑ j ( SjSj T )Iij µ (G)′ i = ( [Λ (G)′ i ] −1 )[ ΛG ( µG + βG Tx (G) i ) + α ∑ j (EijCj) Iij ] According to (2) and (3), we can derive the conditional probability for µG and ΛG: p(µG, ΛG|G,S G ,βG,αG,µ0,β0,ν0,W0) = N(µG|µ0 ′ , ( β0 ′ ΛG )−1 )W(ΛG|W0 ′ ,ν0 ′ ) (8) ∝ p(Gi|S G i ,µG, ΛG,βG) ×p(µG, ΛG|µ0,β0,ν0,W0) where  µ0 ′ = β0µ0+NḠ β0+N β0 ′ = β0 + N ν0 ′ = ν0 + N + FG W0 ′ = [W0 −1 + NH̄ + β0µ0µ0 T −β0′µ0′µ0′ T + αGβG TβG] −1 Ḡ = 1 N ∑ N i=1 ( Gi −βGTSGi ) H̄ = 1 N ∑ N i=1 ( Gi −βGTSGi )( Gi −βGTSGi )T Considering (4) and (5), we get the conditional probability for αG: p(αG|βG, ΛG,k,θ) = G(αG|k ′ /2, 2θ ′ /k ′ ) (9) ∝ p(βG|ΛG,αG) ×p(αG|k,θ) (10) where { k′ = (FGD+θ)k θ+θ·tr(βGTβGΛG) θ′ = FGD + θ From (2) and (4), we are able to know the conditional probability for βG: p(βG|ΛG,αG,G,S G ,µG) = N(µβG, ΛβG ) (11) ∝ p(βG|ΛG,αG) × ∏ i p(Gi|S G i ,µG, ΛG,βG) Because the size of the precision matrix ΛβG is too large to compute, we consider to do this part in an alternative way which is firstly proposed by Macau [18] by calculating: β̃G = ( S GT S G + αGI )−1 ( S GT ( G̃ + E1 ) + √ αGE2 ) (12) where G̃ = (G−µG)T , and each row of E1 ∈ RN×D and E2 ∈ RFG×D is sampled from N(0, ΛG−1). .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ 4 Wen et al. Algorithm 1 Gibbs sampling in Bfimpute 1. Initialize {G0,C0,βG(0),βC(0),αG(0),αC(0)} 2. For k = 1, 2, . . . ,K a. Sample the means {µG,µC} and precisions {ΛG, ΛG} of gene and cell latent matrices: µG (k) , ΛG (k) ∼ p(µG, ΛG|G (k−1) ,S G ,βG (k−1) ,αG (k−1) ) µC (k) , ΛC (k) ∼ p(µC, ΛC|C (k−1) ,S C ,βC (k−1) ,αC (k−1) ) b. Sample gene and cell latent matrices {G,C}: • For each i = 1, . . . ,N sample gene latent vectors in parallel: Gi (k) ∼ p(Gi|E,C (k−1) ,S G i ,µG (k) , ΛG (k) ,βG (k−1) ) • For each i = 1, . . . ,M sample cell latent vectors in parallel: Ci (k) ∼ p(Ci|E,G (k) ,S G i ,µG (k) , ΛG (k) ,βG (k−1) ) c. Sample the precisions {αG,αC} of weight matrices: αG (k) ∼ p(αG|βG (k−1) , ΛG (k) ) αC (k) ∼ p(αC|βC (k−1) , ΛC (k) ) d. Sample weight matrices {βG,βC}: βG (k) = ( S GT S G + αG (k) I )−1 ( S GT ( G̃ (k) + E1 ) + √ αG(k)E2 ) βC (k) = ( S CT S C + αC (k) I )−1 ( S CT ( C̃ (k) + E1 ) + √ αC(k)E2 ) The Gibbs sampling steps of Bfimpute are shown in Algorithm 1: Generation of simulated data We first simulated a single cell RNA-seq count matrix with 20000 genes and 500 cells evenly split into 5 groups using the scater(v1.16.2) [24] package and Splatter(v1.12.0) [25] package. The parameter which controls the probability that a gene will be selected as DE was set to 0.08 while the location and scale factor were set to 0.3 and 0.5, respectively. We used ’experiment’ to add the global dropout for every cell. In order to show the universal applicability of Bfimpute, we further generated 6, 7, 8 groups of cells with 600, 700, 800 as total cell numbers and 10 runs for each data with different seeds using the same parameters mentioned above. Quality Control for real datasets We did quality control (QC) (https://github.com/gongx030/ scDatasets) for all real datasets to ensure fairness for all methods before imputation except for PBMCs dataset (see details in Github). As the PBMCs dataset is based on 10X Genomics platform with an extremely high dropout rate, the QC step for PBMCs datasets could remove and lose nearly 80% genes. Evaluation metrics of clustering results We used four evaluation methods: adjusted Rand index [26], Jaccard index [27], normalized mutual information (nmi) [28], and purity score, to analyse the agreement between true cluster labels and the spectral clustering [22] results on the first two Principle Components (PCs) of imputed matrix. Most of these four measurements vary from 0 to 1, with 1 indicating perfect match between them, except the adjusted Rand index which could yield negative values when agreement is less than expected by chance. The adjusted Rand index is an adjusted version of Rand’s statistic [29] which is the probability that a randomly selected pair is classified in agreement. The Jaccard index is similar to Rand Index, but disregards the pairs of elements that are in different clusters for both clusterings [30]. The normalized mutual information combines multiple clusterings into a single one without accessing the original features or algorithms that determine these clusterings. The purity score shows the rate of the total number of cells that are classified correctly. Results We demonstrated the performance of Bfimpute in gene expression recovering, data visualization, cell subpopulation clustering, pseudotime and DE analysis on five publicly available scRNA-seq datasets (Supplementary Table 1), and we compared Bfimpute with six state-of-the-art imputation methods: scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE in the following sections. .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://github.com/gongx030/scDatasets https://github.com/gongx030/scDatasets https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Bfimpute 5 a Raw Bfimpute scImpute SAVER VIPER DrImpute MAGIC Jaccard index nmi purity Adjusted Rand index Jaccard index nmi purity Adjusted Rand index Jaccard index nmi purity Adjusted Rand index Jaccard index nmi purity Adjusted Rand index 0.00 0.25 0.50 0.75 1.00 V a lu e b k = 5 k = 6 k = 7 k = 8 −10 0 10 20 −10 10 −8 −4 0 4 −4 −10 0 10 20 −20 −10 100 −20 −10 0 10 20 −10 100 Group1 Group2 Group3 Group4 Group5 −5 0 5 −5 0 5 −4 0 4 8 −5 100 5 −10 0 10 20 −10 100 −20 −10 0 10 −20 −10 20100 Group1 Group2 Group3 Group4 Group5 a VIPER DrImpute MAGICSAVER Raw Bfimpute scImputeComplete TSNE 1 TSNE 1 TSNE 1 TSNE 1 T S N E 2 T S N E 2 0 40 Fig. 2. Bfimpute recovers dropout values and improves cell type identification in the simulated data. a. The scatter plots show the first two dimensions of the t-SNE results calculated from the complete data, the raw data, and the imputed data by Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. k represents the number of cell clusters in simuated data. The adjusted Rand index, Jaccard index, nmi, and purity scores of clustering results are based on the raw and imputed data. Bfimpute improves both visualization and cell type identification PCA and t-distributed stochastic neighbor embedding (t-SNE) [31, 24] are two popular dimensionality reduction techniques often used to visualize high-dimensional scRNA-seq datasets. Since the dropout values were unknown in real datasets, we first tested accuracy of all different imputation methods using a simulated dataset where the ground truth was known. We applied the Splatter method to generate simulated datasets, which simulated many features observed in the scRNA-seq data, including zero-inflation, gene-wise dispersion, and differing sequencing depths between cells. To test the strength and robustness of different imputation methods, we simulated a wide range of datasets to include 5, 6, 7 and 8 different cell types (Methods section). Bfimpute achieved the most compact and well separated clusters on the simulation, followed by scImpute and DrImpute (Figure 2). For all different cell types simulations, we also evaluated the clustering performances by the evaluation metrics, where Bfimpute achieved the best scores for adjusted Rand index, Jaccard index, normalized mutual information and purity score compared to the raw data and other five imputation methods (Methods section). We further used two real datasets for this analysis and the first two principal components (PCs) from PCA were plotted to compare every dataset across seven different conditions: raw dataset, and six imputed ones through the Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC methods. We first applied all imputation methods to a real scRNA-seq dataset from a human embryonic stem (ES) cell differentiation study [2] to demonstrate the capacity of Bfimpute for improving the performance of data visualization. The dataset contains 1018 single cells from seven cell groups: Neuronal progenitor cells (NPCs), definitive endoderm cell (DEC), endothelial cells (ECs) and trophoblast-like cells (TBs) are progenitors differentiated from H1 human ES cells. H9 human ES cells and human foreskin fibroblasts (HFFs) were used as controls cells. The raw dataset (i.e. without imputation) clearly identified the cluster of HFF cells, however five other cell types were clustered very closely. After imputation by Bfimpute, the homogeneous subpopulations of H1 and H9 human ES cells were observed to substantially overlap and well separated from the rest of the progenitors. The DECs, ECs, HFFs, NPCs and TBs were also compactly clustered and well separated on the PCA plot (Figure 3a). Compared with the raw dataset, SAVER, VIPER and DrImpute had no significant improvement for cell groups identification. scImpute was the second best and generated similar compact cell groups as Bfimpute. We then compared clustering results of the spectral clustering algorithms [22] on the first two PCs to demonstrate the capability of Bfimpute to improve clustering accuracy in cell type identifications. For .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ 6 Wen et al. −40 0 40 −50 0 50 100 PCA 1 (28%) P C A 2 ( 1 0 % ) Raw −50 −25 0 25 PCA 1 (35%) P C A 2 ( 1 7 % ) Bfimpute −20 0 20 40 60 PCA 1 (41%) P C A 2 ( 1 5 % ) DEC EC H1 H9 HFF NPC TB scImpute −30 0 30 60 PCA 1 (30%) P C A 2 ( 9 % ) DEC EC H1 H9 HFF NPC TB −25 0 25 PCA 1 (32%) P C A 2 ( 9 % ) VIPER −30 0 30 60 PCA 1 (35%) P C A 2 ( 1 2 % ) DrImpute −60 −40 −20 0 20 PCA 1 (58%) P C A 2 ( 1 6 % ) DEC EC H1 H9 HFF NPC TB 0.00 0.25 0.50 0.75 1.00 V a lu e Cell types -100 -50 0 50 -50 0 50 100 -25 0 25 50 75 -50 0 50 100 -40 0 40 80 -100 -50 0 50 SAVER MAGIC Jaccard index nmi purity Adjusted Rand index Raw Bfimpute ScImput SAVER VIPER DrImpute MAGIC same different Raw Bfimpute scImpute SAVER VIPER DrImpute MAGIC 1.0 0.9 0.8 0.7 0.6 0.5 P e a rs o n C o rr e la ti o n a b c Fig. 3. Bfimpute improves PCA visualization and cell type identification. a. The first two PCs calculated from the raw data, and the imputed data by Bfimpute, scImpute, VIPER, DrImpute, MAGIC, and SAVER. b. The adjusted Rand index, Jaccard index, nmi, and purity scores of clustering results based on the raw and imputed data. c. Average Pearson correlations between any two cells from same type and different type. the true labels, we had seven cell types for this dataset, and we evaluated the clustering results by four different metrics: adjusted Rand index, Jaccard index, normalized mutual information (nmi), and purity (Methods section). All four metrics suggested Bfimpute achieved the best clustering accuracy compared with raw and other five imputation methods (Figure 3b). We also showed the comparison of visualization performance through t-SNE. t-SNE on the raw dataset can better identify the seven cell types comparing to PCA. Bfimpute, DrImpute and SAVER can further separate different cell groups and improve the visualization, however the other four imputation methods demonstrated worse t-SNE results than raw data (Supplementary Figure 1). To illustrate the recovering of dropouts in individual cells by imputation, we calculated the Pearson correlation from log10-transformed read counts between every pair of cells in the same type and from different cell types. This result indicated imputation did recover the zero counts in every cell and the Pearson correlation increased from 0.70 to 0.87 for Bfimpute, 0.85 for scImpute, 0.72 for SAVER, 0.73 for VIPER, 0.78 for DrImpute, and 0.97 for MAGIC (Figure 3c, blue bars). One scatter plot of correlations between two randomly selected stem cells of the same cell type was demonstrated in Supplementary Figure 2. As we expected, imputation methods usually increased the Pearson correlation between any two cells in the same cell type. Imputation should not increase the correlation between cells in different cell types by disregarding the biological variation between them. Among all imputation methods, MAGIC achieved the highest correlation in the same cell type, but the correlation between different cell types was also the highest (Figure 3c, red bars). Bfimpute demonstrated the best balance, by maximizing the difference between correlation for the same over different cell types. We further investigated Bfimpute’s performance of visuali- zation and cell type identification on another zebrafish [3] scRNA-seq dataset. This dataset contains 246 single cells from six cell groups, and Hematopoietic stem and progenitor cells (HSPCs) and HSPCs/thrombocytes among them come from one defined cell type with expected heterogeneity. After the QC step, the zebrafish dataset was still sparse with zeros .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Bfimpute 7 composing over 87.5% of the total counts. The comparison of visualization performance via PCA on the raw and six imputed datasets is shown in Supplementary Figure 3. The raw dataset only roughly identified the cluster for neutrophil cells, whereas cells from other cell types were mixed and spread wildly. After imputation by Bfimpute, four distinct immune cell subpopulations can be identified for neutrophils, T, Natural Killer (NK) and B cells, where the cluster members were much more compact compared to those of the raw dataset. Neutrophils, T, NK and B cells were distantly positioned on the PCA plot. HSPCs and HSPCs/thrombocytes were from one defined cell type with expected heterogeneity, so after Bfimpute’s imputation, they were still spatially closer than other cells (Supplementary Figure 3a). The raw data and the imputed data by other five imputation methods did not correctly identify the four immune cell subpopulations. Clustering accuracy results from the four metrics for Bfimpute were better than the other five imputation methods, and Bfimpute achieved a better correlation for the same cell type without loosing variation between different cells types (Supplementary Figure 3b,c). Bfimpute improves DE and pseudotime analysis DE analysis is widely used in bulk RNA-seq data. Performing DE analysis for scRNA-seq data to reveal the stochastic nature of gene expression in single cells is challenging since scRNA- seq data suffers from high dropout events. However, it has been proven that good imputation methods could lead to a better agreement between scRNA-seq and bulk RNA-seq data of the same biological condition on genes known to have little cell-to-cell heterogeneity. We utilized a real dataset by Chu et al [2] with both bulk and scRNA-seq data available on human embryonic stem cells and definitive endoderm cells (DEC) [32, 33], to compare Bfimpute with the raw dataset and other five imputation methods for DE analysis. This dataset contained six samples of bulk RNA-seq (four in H1 ES cells and two in DEC) and 350 samples of scRNA-seq (212 in H1 ES cells and 138 in DEC). The percentages of zero entries were 8.8% in bulk data and 44.9% in scRNA-seq data, respectively. We first performed DE analysis in the bulk data and identified the top 200 DE genes by DESeq2 [10]. We then plotted these 200 genes’ expression profiles in scRNA-seq data for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. We found these top 200 genes’ expression profiles after Bfimpute’s imputation demonstrated better concordance with those in bulk data (Figure 4a). To further evaluate whether imputation improves DE analysis in scRNA-seq data, we first used DESeq2 to identify DE genes for raw scRNA-seq dataset and scRNA-seq datasets after six different imputations. We then generated different lists of DE genes for the bulk data by applying different thresholds for false discovery rates of genes. Finally for every threshold, we compared the DE genes for the bulk data and scRNA-seq data of those seven different conditions and calculated the AUC values for each condition. The AUC values suggested all imputation methods improved DE analysis. Bfimpute generated DE genes most consistent with the bulk data (AUC values raw: 0.568, Bfimpute: 0.670, scImpute: 0.665, SAVER: 0.624, VIPER: 0.639, DrImpute: 0.657 and MAGIC: 0.668). Bulk data for the same biological condition was provided and could be used as a gold standard to compare the average gene expression level with the scRNA-seq data, even though the scRNA-seq data presented more cell-to-cell variation. We expected that average gene expression level in the scRNA- seq data was highly correlated with bulk RNA-seq data. To investigate this, we plotted correlations between gene expression in single-cell and bulk data and found that all imputation methods did improve the correlation between bulk and scRNA-seq data, and Bfimpute, MAGIC and scImpute had the best improvement (Supplementary Figure 4). We further selected several genes (e.g., ANGPT1,GDF3, BMP4, EPB41L5) of DECs from different time points to plot their average gene expression levels in both bulk and scRNA-seq data. These genes were annotated with the GO term “endoderm development”, and they were likely to be affected by dropout events [13, 34]. Imputed read counts for these genes by Bfimpute showed higher gene expression correlation and better consistency with the bulk data (Figure 4b and Supplementary Figure 5, 6). In addition to the DE analysis, we also used the time course scRNA-seq data [2] from the same Chu et al study to show Bfimpute improved gene expressions temporal dynamics through pseudotime analysis. In this dataset, a total of 758 single cells were captured and profiled by scRNA-seq at 0, 12, 24, 36, 72, and 96 h of differentiation. We first applied Bfimpute, scImpute and Drimpute to the raw scRNA-seq data with true cell type labels, and then study how the time-course expression patterns change in the imputed data. The PCA results showed that imputed read counts by Bfimpute better distinguished cells of different time points and the six time points cell groups were compact (Supplementary Figure 7a), and the first principle component from PCA indicated that imputed read counts from Bfimpute reflected more accurate transcriptome dynamics along the different time course (Figure 4d). Bfimpute could better differentiate the last two time points (72h and 96h). In the next section, we will discuss impuation with the aid of cell type labels more in details. Bfimpute improves performance with the aid of additional experimental information Imputation methods including Bfimpute, scImpute and DrImpute all first identified similar cells based on clustering, and imputation was then performed by leveraging the expression values from similar cells. Being able to first identify the appropriate cell groups enhanced the ability of imputing the dropout events. A substantial number of scRNA-seq studies have identified cell types from experimental design or marker genes. We applied Bfimpute, scImpute and DrImpute to the raw scRNA-seq data with true cell type labels in three real datasets we have used before, and two more new real datasets. In this study, SAVER, VIPER, DrImpute, and MAGIC were excluded since they were not applicable to use cell labels. We then investigated again the PCA and t-SNE visualizations for cell subpopulations identification. Our results showed Bfimpute outperformed the other two methods and clearly differentiated almost every cell group in different datasets (Figure 5 and Supplementary Figure 7, 8). For the human embryonic stem cell dataset, Bfimpute further correctly identified three outlier cells into correct groups compared to the previous imputation without cell labels (see Figure 5a versus Figure 3a: one EC (orange point), one DEC (blue point), and one NPC (yellow point) cell were brought back to the corresponding EC, DEC and NPC cell groups, respectively). H9 cells were also further apart from H1 cells in the vertical dimension. For the zebrafish dataset, even the most mixed B, NK, T cells (blue, green, and yellow colors) from the raw dataset were separated from each other after Bfimpute’s imputation, and HSPCs and .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ 8 Wen et al. Cell type DEC H1 0 1 2 3 4 5 00h 12h 24h 36h 72h 96h T im e p o in t Bfimpute scImpute 00h 12h 24h 36h 72h 96h DrImputeRaw First principal component First principal component First principal component First principal component -60 -30 0 30 -30 0 30 -50 -25 0 25 50 -60 -30 0 30 a c Bulk Raw Bfimpute scImpute SAVER VIPER MAGICDrImpute Cell type DEC H1 0 1 2 3 4 5 L o g 1 0 (E x p re s s io n + 1 ) ANGPT1 Bfimpute scImpute SAVERRaw 00h 12h 24h 36h 72h 96h 00h 12h 24h 36h 72h 96h 00h 12h 24h 36h 72h 96h 00h 12h 24h 36h 72h 96h 2 1 0 b Fig. 4. Bfimpute improves DE and pseudotime analysis. a. The expression profiles of the top 200 DE genes detected in the bulk data by DESeq2 for seven conditions: raw dataset, Bfimpute, scImpute, SAVER, VIPER, DrImpute, and MAGIC. b. Time-course expression patterns of the example gene ANGPT1 that is annotated with GO term “endoderm development”. The small black triangles marks the average bulk data for each time point. c. The first principal component is plotted to show cells of different time points along the differentiation. HSPCs/thrombocytes cells were spatially close, but split into two cell groups (Figure 5b and supplementary Figure 7b). To test Bfimpute with another kind of cell-label information, we used a human preimplantation embryonic development dataset (t-SNE and pseudotime analyses are shown in Figure 5c). The Petropoulos dataset [4] included single cells from five stages of human preimplantation embryonic development, ranging from developmental day (E) 3 to 7. The five different stages were clearly distinguished from each other after Bfimpute’s imputation. We also applied three imputation methods to a large 10X dataset generated by the high-throughput droplet-based system. To generate this dataset, we randomly selected 500 cells from nine immune cell types, so it contained a total of 4500 peripheral blood mononuclear cells (PBMCs) [12, 5]. In the raw data, 98.3% read counts are exactly zeros. Our PCA and t-SNE results indicated that Bfimpute’s imputation identified nine immune cell types from raw data (5d). In summary, these results suggested that Bfimpute with the aid of labels always further improved visualization and identification of cell subpopulations, and the downstream analysis. SCRABBLE is another recent approach integrating bulk data to impute dropout events in scRNA-seq data. Since Bfimpute can easily adopt bulk data as additional information into the gene latent matrix, we have also tested if bulk data can further improve performance. In the scRNA-seq dataset of .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Bfimpute 9 −40 0 40 PCA 1 (28%) P C A 2 ( 1 0 % ) −25 0 25 50 PCA 1 (35%) P C A 2 ( 1 7 % ) Bfimpute −60 −40 −20 0 20 40 PCA 1 (41%) P C A 2 ( 1 5 % ) scImpute −60 −30 0 30 PCA 1 (35%) P C A 2 ( 1 2 % ) DEC EC H1 H9 HFF NPC TB DrImpute −4 0 4 TSNE 1 T S N E 2 −10 −5 0 5 TSNE 1 T S N E 2 Bfimpute −8 −4 0 4 TSNE 1 T S N E 2 scImpute −3 0 3 TSNE 1 T S N E 2 B cells HSPCs HSPCs/thrombocytes neutrophils NK cells T cells DrImpute −20 −10 0 10 20 TSNE 1 T S N E 2 −30 −20 −10 0 10 20 TSNE 1 T S N E 2 Bfimpute −20 −10 0 10 20 TSNE 1 T S N E 2 scImpute −30 −20 −10 0 10 20 TSNE 1 T S N E 2 E3 E4 E5 E6 E7 DrImpute Raw Raw Raw -50 0 50 100 -50 0 50 100 -50 0 50 100 -50 0 50 100 -5 0 5 -10 -5 0 5 10 -10 -5 0 5 10 -20 0 20 40 -20 -10 0 10 20 30 -20 0 20 40 -20 0 20 a b c d -10 -5 0 5 10 Zebrafish Human embryonic stem cell differentiation Human preimplantation embryonic developement −20 −10 0 10 20 T S N E 2 30 -20 0 20 T S N E 2 40 -20 0 20 -40 -20 0 20 -40 -40 -20 0 20 -20 0 20 40 -20 0 20 40 -20 0 20 B Cytotoxic T Helper T Memory T Monocyte Naive cytotoxic T Naive T Natural killer Regulatory TT S N E 2 T S N E 2 TSNE 1 TSNE 1 TSNE 1 TSNE 1 Bfimpute scImpute DrImputeRaw Peripheral blood mononuclear cells (PBMCs) Fig. 5. Bfimpute with labels improves PCA and t-SNE visualizations and cell type identification. a. The first two PCs calculated from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human embryonic stem cell differentiation study. b. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the zebrafish data. c. The first two dimensions from the raw data, and the imputed data by Bfimpute, scImpute, and DrImpute for the human preimplantation embryonic development. d. The first principal component is plotted to show cells of different time points along the embryonic development. human embryonic stem cells with bulk data, we did not observe significant differences between Bfimpute and Bfimpute with bulk data as additional information (Supplementary Figure 8 versus Figure 5a). The reason could be that similar gene level information has less effect than similar cell level information for the imputation of dropout events. We also found that in these scRNA-seq datasets, SCRABLE’s performance after integrating cell labels information with bulk data, was not better than Bfimpute (Supplementary Figure 8). Discussion and Conclusion ScRNA-seq has become an indispensable tool in recent years, as it has made it possible to study genome-wide transcriptomes in single cell resolution. Due to sequencing technical issues, a large proportion of dropout events exist in scRNA-seq data, which limit its usefulness. Several approaches have been proposed to solve this problem, with modest results. In this study, we introduced Bfimpute to recover dropout events in scRNA-seq data. We have shown that Bfimpute can improve performance in recovering gene expression detected by bulk RNA-seq, as well .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ 10 Wen et al. as in downstream analyses, including identification of cell sub- populations, differential expressed genes and gene expressions temporal dynamics. Bfimpute uses a fully Bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing Gibbs sampling for the approximate inference. The advantage of this Bayesian model is that it provides a predictive distribution instead of just a single number during recovering each dropout event, and the confidence in the prediction can be quantified and considered into the model. The use of a full Bayesian model proved to be a considerable advantage for Bfimpute to outperform other imputation methods. Bfimpute imputes two latent cell and gene matrices for each cell group through a Gibbs sampling process, and reaches a stationary state to generate the final cell-gene expression matrix, in which the dropout events will be recovered. Another advantage of Bfimpute is being able to integrate any gene or cell related information of scRNA-seq data into these two latent gene and cell matrices to impute missing values. Information from both similar cells or/and bulk data can be easily integrated into our model. Even though scImpute and DrImpute have a similar functionality in this respect, that allows them to impute dropout events with the aid of number of cell types or cell labels, they fail to achieve as good performance as Bfimpute for most of scRNA-seq data that we tested. Any resource provided by the users from the cell level and gene level could be used as additional information to improve dropout events imputation in scRNA-seq data in the future. Key Points • Imputation to recover dropout events for scRNA- seq data is important for determining genome-wide transcriptomes in single cell resolution. • Bfimpute uses a fully Bayesian probabilistic matrix factorization by substituting hyperparameters with hyperpriors and performing Gibbs sampling for approximate inference. • The advantage of this Bayesian model is that it provides a predictive distribution instead of just a single number during recovering each dropout event, and the confidence in the prediction can be quantified and considered into the model. • Bfimpute is able to integrate any gene or cell related information of scRNA-seq data into these two latent gene and cell matrices to impute missing values. • Bfimpute achieves better accuracy than other six widely used scRNA-seq imputation methods on simulated and real scRNA-seq data, as measured by several different evaluation metrics. Competing interests There is NO Competing Interest. Author contributions X.Z. conceived and led this work. Z.H.W. and X.Z. designed the model and implemented the Bfimpute software. Z.H.W., J.L.L, W.S and X.Z led the data analysis. Z.H.W, W.S and X.Z wrote the paper with feedback from J.L.L and L.Z. Funding This work was supported by Vanderbilt University Development Funds (FF 300033). L.Z. is partially supported by Research Grant Council Early Career Scheme (HKBU 22201419). References 1. Fuchou Tang, Catalin Barbacioru, Yangzhou Wang, Ellen Nordman, Clarence Lee, Nanlan Xu, Xiaohui Wang, John Bodeau, Brian B Tuch, Asim Siddiqui, et al. mrna- seq whole-transcriptome analysis of a single cell. Nature methods, 6(5):377–382, 2009. 2. Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. Single- cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17(1):173, 2016. 3. Qin Tang, Sowmya Iyer, Riadh Lobbardi, John C Moore, Huidong Chen, Caleb Lareau, Christine Hebert, McKenzie L Shaw, Cyril Neftel, Mario L Suva, et al. Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using rna sequencing. Journal of Experimental Medicine, 214(10):2875–2887, 2017. 4. Sophie Petropoulos, Daniel Edsgärd, Björn Reinius, Qiaolin Deng, Sarita Pauliina Panula, Simone Codeluppi, Alvaro Plaza Reyes, Sten Linnarsson, Rickard Sandberg, and Fredrik Lanner. Single-cell rna-seq reveals lineage and x chromosome dynamics in human preimplantation embryos. Cell, 165(4):1012–1026, 2016. 5. Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):1–12, 2017. 6. Peng Qiu. Embracing the dropouts in single-cell rna-seq analysis. Nature communications, 11(1):1–9, 2020. 7. Peter V Kharchenko, Lev Silberstein, and David T Scadden. Bayesian approach to single-cell differential expression analysis. Nature methods, 11(7):740–742, 2014. 8. Ingrid Lönnstedt and Terry Speed. Replicated microarray data. Statistica sinica, pages 31–46, 2002. 9. Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Nature Precedings, pages 1–1, 2010. 10. Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):550, 2014. 11. Oliver Stegle, Sarah A Teichmann, and John C Marioni. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3):133–145, 2015. 12. Wei Vivian Li and Jingyi Jessica Li. An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature communications, 9(1):1–9, 2018. 13. Wuming Gong, Il-Youp Kwak, Pruthvi Pota, Naoko Koyano-Nakagawa, and Daniel J Garry. Drimpute: imputing dropout events in single cell rna sequencing data. BMC bioinformatics, 19(1):1–10, 2018. 14. David Van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J Carr, Cassandra Burdziak, .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Bfimpute 11 Kevin R Moon, Christine L Chaffer, Diwakar Pattabiraman, et al. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018. 15. Mo Huang, Jingshu Wang, Eduardo Torre, Hannah Dueck, Sydney Shaffer, Roberto Bonasio, John I Murray, Arjun Raj, Mingyao Li, and Nancy R Zhang. Saver: gene expression recovery for single-cell rna sequencing. Nature methods, 15(7):539–542, 2018. 16. Mengjie Chen and Xiang Zhou. Viper: variability- preserving imputation for accurate gene expression recovery in single-cell rna sequencing studies. Genome biology, 19(1):1–15, 2018. 17. Tao Peng, Qin Zhu, Penghang Yin, and Kai Tan. Scrabble: single-cell rna-seq imputation constrained by bulk rna-seq data. Genome biology, 20(1):88, 2019. 18. Jaak Simm, Adam Arany, Pooya Zakeri, T Haber, Jörg K Wegner, V Chupakhin, Hugo Ceulemans, and Yves Moreau. Macau: Scalable bayesian factorization with high- dimensional side information using mcmc. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017. 19. Andriy Mnih and Russ R Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008. 20. Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using markov chain monte carlo. In Proceedings of the 25th international conference on Machine learning, pages 880–887, 2008. 21. Robrecht Cannoodt, Wouter Saelens, and Yvan Saeys. Computational methods for trajectory inference from single-cell transcriptomics. European journal of immunology, 46(11):2496–2506, 2016. 22. Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, and Myles J Lewis. Spectrum: Fast density-aware spectral clustering for single and multi-omic data. Bioinformatics, 36(4):1159–1166, 2020. 23. Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002. 24. Davis J McCarthy, Kieran R Campbell, Aaron TL Lun, and Quin F Wills. Scater: pre-processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics, 33(8):1179–1186, 2017. 25. Luke Zappia, Belinda Phipson, and Alicia Oshlack. Splatter: simulation of single-cell rna sequencing data. Genome biology, 18(1):1–15, 2017. 26. Leslie C Morey and Alan Agresti. The measurement of classification agreement: An adjustment to the rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33–37, 1984. 27. Paul Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist, 11(2):37–50, 1912. 28. Alexander Strehl and Joydeep Ghosh. Cluster ensembles— a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617, 2002. 29. William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971. 30. Silke Wagner and Dorothea Wagner. Comparing clusterings: an overview. Universität Karlsruhe, Fakultät für Informatik Karlsruhe, 2007. 31. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. 32. Pei Wang, Ryan T Rodriguez, Jing Wang, Amar Ghodasara, and Seung K Kim. Targeting sox17 in human embryonic stem cells creates unique strategies for isolating and analyzing developing endoderm. Cell stem cell, 8(3):335–346, 2011. 33. Pei Wang, Kristen D McKnight, David J Wong, Ryan T Rodriguez, Takuya Sugiyama, Xueying Gu, Amar Ghodasara, Kun Qu, Howard Y Chang, and Seung K Kim. A molecular signature for purified definitive endoderm guides differentiation and isolation of endoderm from mouse and human embryonic stem cells. Stem cells and development, 21(12):2273–2287, 2012. 34. Judith A Blake, Janan T Eppig, James A Kadin, Joel E Richardson, Cynthia L Smith, Carol J Bult, and Mouse Genome Database Group. Mouse genome database (mgd)- 2017: community knowledge resource for the laboratory mouse. Nucleic acids research, 45(D1):D723–D729, 2017. .CC-BY-NC-ND 4.0 International licenseperpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.10.430649doi: bioRxiv preprint https://doi.org/10.1101/2021.02.10.430649 http://creativecommons.org/licenses/by-nc-nd/4.0/ Introduction Methods Cell clustering and dropout detection Probabilistic model for scRNA-seq expression matrix imputation Gibbs sampler to impute dropout events Generation of simulated data Quality Control for real datasets Evaluation metrics of clustering results Results Bfimpute improves both visualization and cell type identification Bfimpute improves DE and pseudotime analysis Bfimpute improves performance with the aid of additional experimental information Discussion and Conclusion Competing interests Author contributions Funding