key: cord-1035151-ka9rum7o authors: Xiong, Lei; Tian, Kang; Li, Yuzhe; Zhang, Qiangfeng Cliff title: Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space date: 2021-10-11 journal: bioRxiv DOI: 10.1101/2021.04.06.438536 sha: 3be1dd73f04421743568dfcc9edd143ca511a8d0 doc_id: 1035151 cord_uid: ka9rum7o Computational tools for integrative analyses of diverse single-cell experiments are facing formidable new challenges including dramatic increases in data scale, sample heterogeneity, and the need to informatively cross-reference new data with foundational datasets. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner (i.e., without retraining the model). SCALEX substantially outperforms online iNMF and other state-of-the-art non-online integration methods on benchmark single-cell datasets of diverse modalities, (e.g., scRNA-seq, scATAC-seq), especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We showcase SCALEX’s advantages by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19 patients, each assembled from diverse data sources and growing with every new data. The online data integration capacity and superior performance makes SCALEX particularly appropriate for large-scale single-cell applications to build-upon previously hard-won scientific insights. Single-cell experiments enable the decomposition of samples into their constituent, 31 diverse cell-types and cell states 1-4 . Many computational tools have been developed for 32 integrative analysis of single-cell datasets, all seeking to separate biological variations 33 from non-biological noise, such as batch effects of different donors, conditions, and/or 34 analytical platforms 5,6 . The scope of the integration task is expanding rapidly with 35 technical advances for single-cell studies, which continue to grow larger and larger in 36 scale, now exceeding 1 million cells in some cases 7, 8 . Moreover, the range of 37 examined sample types is also increasing, and datasets now often include highly 38 heterogenous cell subsets 9,10 . Most importantly, as single-cell studies become more 39 routine, new studies should be informatively cross-referenced to foundational 40 research stuides 7, 8, [11] [12] [13] [14] [15] . Thus, there is a growing need for integration tools that can 41 manage single-cell data of large-scale and complex cell-type compositions while also 42 supporting accurate alignment to and exploration within existing datasets. 43 Most current single-cell data integration methods (e.g., Seurat [16] [17] [18] MNN 19 , 44 Harmony 20 , Conos 21 , Scanorama 22 , BBKNN 23 , etc.) are based on the searching across 45 batches for cell-correspondence, for instance similar individual cells or cell 46 anchors/clusters. These methods suffer from three limitations. First, they are prone to 47 mixing cell populations that only exist in some batches, which becomes a severe 48 problem for the integration of complex datasets that contain non-overlapping cell 49 populations in each batch (i.e., partially overlapping data) 16, 17 . Second, they require 50 computational resources that increase dramatically as the number of cells and of 51 batches increase, making these methods increasingly unsuitable for today's large-scale single-cell datasets 7, 8, [11] [12] [13] [14] [15] . Finally, these methods can only remove batch effects from 53 the current dataset being assessed. Each time a new dataset is added, it requires an 54 entirely new integration process that changes the existing integration results of 55 previous studies. This requirement severely limits a tool's ability to continuously 56 integrate arriving new single-cell data without recalculating existing integrations from 57 scratch, a capacity referred to as "online" data integration 24 . 58 Online data integration ability is becoming increasingly crucial with today's 59 single-cell experiments. The recently developed tool, online iNMF 24 , an online version 60 of LIGER 25 , iteratively applies integrative non-negative matrix factorization (iNMF) to 61 decouple the shared and dataset-specific factors related to cell identities, and thus is 62 able to incorporate new data with existing datasets on-the-fly. Another recently 63 developed package, scvi-tools 26 , combining scVI 27 with scArches 28 , applies a 64 conditional variational autoencoder (VAE) 29 framework to model the inherent 65 distribution of the input single-cell data for data integration. However, the conditional 66 VAE design of scVI requires model augmentation and retraining when integrating new 67 data, meaning that scVI is not an online method. 68 Here, we developed SCALEX as a method for online integration of heterogeneous 69 single-cell data based on a VAE framework. The encoder of SCALEX is designed to be 70 a data projection function that only preserves batch-invariant biological data 71 components when projecting single-cells. Importantly, the projection function is a 72 generalized one that requires no retraining on new data, thus allowing SCALEX to 73 integrate single-cell data in an online manner. Working with an extensive collection of 74 benchmark datasets, we demonstrate that SCALEX substantially outperforms online 75 iNMF as well as non-online single-cell data integration tools, in terms of integration 76 accuracy, scalability, and computationally efficiency. The advantages make SCALEX 77 particularly appropriate for the integration and research utilization of today's 78 Overview of the SCALEX model. SCALEX applies a variational autoencoder (VAE) 636 to project the different batches of datasets into the same batch-invariant 637 low-dimensional embeddings by learning a batch-free encoder and a batch-specific 638 decoder simultaneously. Since the encoder and decoder are coupled to learn a 639 batch-free encoder, a batch label (b) is only exposed to the decoder within the 640 domain-specific batch normalization (DSBN), thus the decoder captures the batch 641 information while the encoder learns the domain-invariant features. In the encoder, 642 SCALEX takes the input expression profile ( ) across all the batches as a whole 643 mixture distribution without distinguishing their batch sources and extracts their mean 644 ( ) and variance ( ) of the latent representations (z) in a 10-dimension embedding 645 space to learn their global data structure. A standard multivariate Gaussian prior is 646 used for z, while the approximated distribution of z is re-parameterized by = + 647 * , where is sampled from ℕ(0, ). In the decoder, SCALEX maps the latent 648 representations with batch label (b) back to their original profile. To enable the 649 decoder to capture the batch-specific variations, a DSBN layer is applied to learn a 650 batch-specific normalization for each batch label (b), before transforming them back to their original profile with the new batch variations. To learn the global distribution 652 to avoid overcorrection on partially overlapping datasets, within each mini-batch in 653 the training process, SCALEX randomly samples data from all batches and trains on 654 them together with Batch Normalization to smooth the batch-specific shifts and align 655 to the global distribution. Once trained, the encoder of SCALEX is generalized to any 656 batches and serves as a universal function for globally mapping different batches of 657 datasets into the same batch-invariant space. 658 Training SCALEX is to maximize the log-likelihood of the observed single-cell 659 sequencing data : 660 Then the loss function is transformed into the evidence lower bound (ELBO). While 661 the ELBO can be further decomposed into two terms: 662 The first term is the reconstruction term, which minimizes the distance between the 664 generated output data ′ and the original input data , calculated as the binary cross 665 entropy between ′ and . The second term is the regularization term, which 666 minimizes the Kullback-Leibeler divergence between posterior distribution 667 ℕ( , ) and prior distribution ℕ(0, ) of latent representations . To enable a more 668 flexible alignment under the latent space, we adjusted the coefficient of the second 669 term to 0.5 after hyper-parameter optimization via a grid search; thus, the final loss 670 function is: 671 The overall network architecture of SCALEX consists of an encoder and a decoder. 673 The encoder is a two-layer neural network (fully connected [1024]-BN-ReLU-fully 674 connected[10]) for mean ( ) and variance ( ) of the 10-dimension latent 675 representations z using a reparameterization to obtain latent representations z, and the 676 decoder has only one layer (no hidden layer), directly connecting latent 677 representations to the output (fully connected-DSBN-Sigmoid) with 678 domain-specific batch normalization, where the latent representations z and batch ID 679 are provided as input, and a Sigmoid activation function. We used the Adam 75 680 optimizer with a 5e-4 weight decay and betas (0.9, 0.999, the exponential decay rate 681 for the first and second moment parameters) to optimize the model under the learning 682 rate 2e-4. We adopted mini-batch strategy to iteratively optimize the model, in each 683 mini-batch, we randomly sampled data from all batches instead of from the same 684 batch, and the mini-batch size for training input is 64. The maximum number of 685 training iterations is 30,000 and an early stopping is triggered when there has been no 686 improvement for 10 epochs. The hyper-parameters are chosen after a grid search. 687 SCALEX is very robust with all of these hyper-parameters, all of the results in this 688 manuscript are produced under the same parameters. 689 Preprocessing for scRNA-seq. We downloaded gene expression matrices and 709 preprocessed them using the following procedure: i). Cells with fewer than 600 genes 710 and genes present in fewer than 3 cells were filtered out. ii). Total counts of each cell 711 were normalized to 10,000. iii). Values of each gene were subjected to log 712 transformation with an offset of 1. iv). The top 2,000 highly variable genes were 713 identified. v). Values of each gene were normalized to the range of 0-1 within each 714 batch by the MaxAbsScaler function in the scikit-learn package in Python. The 715 processed matrix was used as input for the SCALEX model for downstream 716 differential gene expression analysis. 717 For the human fetal atlas dataset, we collected two batches (batch GSE156793, 718 which contains 4,062,980 cells by sciRNA-seq3, and batch GSE134355, which 719 contains 254,266 cells by Microwell-seq). We then selected the cells from the 720 common tissues (1,369,619 cells) for integration and computational efficiency 721 benchmarking (down-sampled from different data sizes including 10K, 50K, 250K, 722 1M, and 4M). 723 Preprocessing for scATAC-seq. We downloaded open chromatin profile matrices 724 (peaks or bins), merged them by peaks (or bins), and processed them using the 725 following procedure: i). The combined matrix was binarized and filter bins with fewer 726 than 3 cells. ii). The top 30,000 most variable peaks (or bins) were selected using the 727 select_var_feature function in the EpiScanpy 76 package. iii). Total counts of each cell 728 were normalized to the median of the total counts of all cells by using the 729 normalize_total function, with parameters target_sum="None" in the Scanpy 77 730 package. iv). Values of each peak (or bin) were normalized to the range of 0-1 within 731 each batch by the MaxAbsScaler function in the scikit-learn package in Python. The 732 processed matrix was used as input for the SCALEX model. 733 Preprocessing for cross-modality data (scRNA-seq and scATAC-seq). We first 734 created a gene activity matrix by the GeneActivity function in the Signac 78 R package 735 to quantify the activity of each gene from scATAC-seq data. We then combined gene 736 activity score matrix with scRNA-seq data matrix as two individual "batches" for 737 integration. The subsequent preprocessing followed the same preprocessing used for 738 the scRNA-seq data (above). Visualization. UMAP algorithm 46 was used for visualization. We applied the 740 neighbors function from the Python package Scanpy with the parameters 741 n_neighbors=30 and metric="Euclidean" for computing the neighbor graph, followed 742 by umap function with min_dist=0.1 to visualize cells in a two-dimensional space. 743 Adjusted Rand Index. The Rand Index (RI) computes a similarity score between two 744 clustering assignments by considering matched and unmatched assignment pairs, 745 independent of the number of clusters. The Adjusted Rand Index (ARI) score is 746 calculated by "adjust for chance" with RI as follows: 747 If given the contingency table, then ARI can also be represented by: 748 The ARI score is 0 for random prediction and 1 for perfectly matching. Where P and T are categorical distributions for the predicted and real clustering, I is 751 the mutual entropy, and H is the Shannon entropy. 752 Silhouette score. We used the silhouette score to assess the separation of biological 753 populations with the function silhouette_score in the scikit-learn package in Python. 754 The silhouette score was computed by combining the average intra-cluster distance (a) 755 and the average nearest-cluster (b) for each cell. 756 Here, we took UMAP embeddings as input to calculate silhouette score. Batch entropy mixing score. Batch entropy mixing score (adapted from "entropy of 758 batch mixing" 19 ) was used to access the regional mixing of cells from different 759 batches, with a high score suggesting that cells from different batches are well mixed 760 together. 761 The batch entropy mixing score was computed as follows: 762 (1) Calculated the proportion Pi of cell numbers in each batch to the total cell 763 numbers. 764 (2) Randomly chose 30 cells from all batches. 765 (3) Calculated the 30 nearest neighbors for each randomly chosen cell. 766 (4) The regional mixing entropies for each cell were defined as: 767 where pi is the proportion of cells from batch i in a given region, such that ∑ = 768 1, pi' is a correction item to eliminate the deviation caused by the different cell 769 numbers in different batches. The total mixing entropy was then calculated as the sum 770 of the regional mixing entropies. 771 (5) Repeated (2)-(4) for 10 iterations with different randomly chosen cells and 772 calculated the average, E, as the final batch entropy mixing score. 773 Note that to mitigate the effect of misalignment of batch-specific cell-types, we 774 calculated the batch entropy mixing score only based on cells from cell-types that are 775 common in different batches. et al. 2019 20 to assess batch and cell-type mixing. We calculated integration LISI 778 (iLISI) and cell-type LISI (cLISI) values using the compute_lisi function in the lisi R 779 package. UMAP embeddings, batch labels, and cell-type labels were used as input in 780 calculation. Briefly, 781 where ∈ , , … , is the i-th cell's UMAP embeddings in the dataset of size 782 N, and Y is the set of unique values with respect to the type of LISI we are computing 783 (i.e., Y is the values of "batch label" for calculating iLISI and the value of "cell-type 784 label" for calculating cLISI). The probability ( | ) refers to the "relative 785 abundance" of the covariate y within KNN (k-nearest neighborhood) of . A 786 Gaussian kernel-based distribution of neighborhoods was used and the perplexity was 787 fixed to 30. 788 Over-correction score. We defined an over-correction score to assess the level of 789 over-correction problem, based on calculating the percentage of cells with 790 inconsistent cell-types in each cell's neighborhood. We calculated the over-correction 791 score over all cells, and for each cell i we averaged the frequency of the k-nearest 792 neighboring cells with distinct cell-types to the cell i (see the following equation). 793 where n is the total cell number, k represents the k-nearest neighbors of each cell, 794 the cell-type of the cell i is _ , the cell-type of the neighboring cell j is 795 , and is an indicative function defined as: 796 Formally, the over-correction score is a negative index, i.e., the higher the 797 over-correction score, the more severe the extent of inaccurate mixing of cell-types. 798 Comparison with other integration methods. We compared SCALEX to nine other 799 batch effect removal methods (see below for specific details of each method). For 800 each dataset as input for all methods, we performed the same filtration, followed by 801 method-specific normalization, batch correction and visualization. Note that for visual 802 comparison, we also included the embeddings of the raw input data, wherein we 803 performed dimensionality reduction by Principal Component Analysis (PCA) 79 804 followed by UMAP visualization to see the batch effects. No correction function was 805 used. All parameters were kept as default values. 806 Scanorama (v1.6). We performed the preprocessing pipelines as stated above (as the 807 same below), and used the Scanpy and scanorama Python packages for integration. 808 For the highly_variable_genes function, we set flavor="seurat", batch_key="batch", 809 and n_top_genes=2,000. After extracting highly variable genes, we divided the 810 datasets according to the batch labels and formed a new list of datasets as the input for 811 the correct_scanpy function. The integration matrix was kept for downstream analysis. 812 All other parameters were kept their default values. 813 BBKNN (v1.3.12) . We used Scanpy and bbknn Python packages and followed the 814 suggested pipelines for integration. For the highly_variable_genes function, we set 815 flavor="seurat", batch_key="batch", and n_top_genes=2,000. After selecting cell 816 neighbors at the low-dimensional space from the PCA analysis, we performed the 817 bbknn function with neighbors_within_batch=5, n_pcs=20, and trim=0. The basic idea of cell-type annotation by label transfer is based-on that the same 868 cell-types will occupy the same locations in the low-dimensional SCALEX 869 cell-embedding space, thus cell-type annotation in one data batch can be transferred to another data batch, for the cells positioned at the same locations. Technically, we used 871 the KNeighborsClassifier function from the scikit-learn package to train a prediction 872 model, using the representations (in the low-dimensional cell-embedding space) of the 873 single-cell data with known cell-type labels as input. We then used this model to make 874 cell-type predictions for cells without annotations using their representations (in the 875 low-dimensional cell-embedding space) as input. For comparison, label transfer for 876 online iNMF follows the same procedures as SCALEX by predicting the cell-type 877 based-on the projected locations. 878 Similarity matrix and confusion matrix. We used similarity matrix to evaluate the 879 congruence of two different batches for the same cell-types in the common 880 cell-embedding space. Technically, we merged all cells with the same cell-type label 881 and calculated an average representation (in the low-dimensional cell-embedding 882 space) for the cell-type. This was repeated for all cell-types. We then calculated the 883 similarity matrix S=[S ij ] for the cell-type similarities of the two batches, where S ij is 884 the Pearson correlation coefficient between the average representation of cell-type i in 885 data_batch_1 and the average representation of cell-type j in data_batch_2. 886 We used the confusion matrix to evaluate the accuracy of cell-type annotations 887 (prediction) when a gold-standard annotation is available, which is typical for 888 "cell-type annotation by label transfer" (see above). In cell-type annotation by label 889 transfer, we predict the cell-types for a single-cell data_batch_1, using the annotations 890 in another data_batch_2. When data_batch_1 was already annotated with cell-types, 891 we can calculate the confusion matrix C= [C ij ] to compare the cell-type predictions 892 with the existing cell-type annotations, where C ij equals the percentage of cells known 893 to be in cell-type i and predicted to be in cell-type j. 894 F1 score. F1 score is a weighted average of the precision and recall as a classification 895 metric, which is computed by: 1 = 2 * (precision * recall)/(precision + recall) Generation of partially overlapping datasets. To simulate partially overlapping 897 datasets from the pancreas dataset, we used the pancreas_celseq2 and 898 pancreas_smartseq2 data batches, and worked with only six cell-types (alpha, beta, 899 ductal, acinar, delta, gamma). For each simulated partially overlapping dataset, we 900 randomly selected three to six cell-types from each batch, and counted the number of 901 the common cell-types, which was used as the indicator for the overlapping level 902 (whole integers, 0 to 6). We required the union of cell-types in the newly simulated 903 partially overlapping dataset to cover all six cell-types. 904 For the PBMC dataset, we used both of the two data batches and worked with 905 twelve cell-types (B, CD4 T, CD4 naive T, CD8 T, CD8 naive T, DC, HSC, 906 Megakaryocyte, NK, monocyte-CD14, monocyte-FCGR3A, pDC). We used the same 907 down-sampling strategy as for the pancreas dataset (above). 908 Analysis of changes in cell-type frequency across multiple conditions. To identify 909 differences in cell-type frequency among the scRNA-seq data from the mild/moderate, 910 severe, convalescent COVID-19 patients, as well as the healthy and influenza patient 911 controls, we applied a Dirichlet-multinomial regression model. This model accounts 912 for the constraint that the cell frequencies in a scRNA-seq data are not independent of 913 each other. In detail, we normalized the regression coefficients to a standard normal 914 distribution and calculated a z score, and then conducted significance testing based on 915 the regression model generated by the DirichReg function in the R package 916 DirichletReg (v 0.7). 917 analysis. Differential gene expression analysis was performed on all expressed genes 919 using the rank_genes_groups function with method="t-test" in the Scanpy package, 920 for two certain cell-types in a COVID-19 single-cell atlas. A gene was considered 921 differentially expressed when a log2-fold change was >1 in the two conditions in 922 comparison, and the Benjamini-Hochberg adjusted P-value was < 0.01. The top 200 923 highly expressed genes sorted by scores (implemented in Scanpy) of each cell-type 924 were used as the input for GO analysis, and enriched GO terms were acquired for 925 each group of cells of the "GO_Biological_Process_2018" dataset using the Python 926 package GSEApy. Table 1 ). We then calculated 931 the cytokine and inflammatory scores from the raw gene expression profile using the 932 score_genes function implemented in the Scanpy. 933 We evaluate the contributions of each design element by the performance of different 935 ablation variant models on integration and projection tasks (Fig. S18, Fig. S19 ). 936 Baseline/full SCLAEX. The baseline or the full SCALEX model is described in 937 the paper, including an encoder without batch label, mini-batch sampling from all 938 batches with Batch Normalization, a decoder with DSBN, and beta=0.5. All other 939 variant models are compared to this baseline model. Human Fetal Atlas dataset) including runtime (left) and memory usage (right). Online 983 iNMF was not successfully tested on 4M data due to a HDF5 file conversion issue for 984 large data (Methods). f, UMAP embeddings of the mouse brain scATAC-seq dataset 985 before (left) and after integration (middle, right); colored by data batch or cell-types. g, 986 UMAP embeddings of the PBMC scRNA-seq and scATAC-seq cross-modality 987 dataset before (left) and after SCALEX integration (middle, right); colored by batch 988 or cell-type. 989 1/cLISI scores. Note that online iNMF was not successfully tested on the Human 1086 Fetal Atlas due to a HDF5 file conversion issue for large data. 1087 scATAC-seq dataset and other modality datasets. a, UMAP embeddings of the 1089 mouse brain scATAC-seq dataset before and after integration by indicated methods; 1090 colored by batch (left) and cell-type (right). b, Cell-type-specific peaks for the mouse 1091 brain scATAC-seq dataset. c, Scatter plot showing the ARI and NMI scores based-on 1092 the Leiden and Louvain clustering results, the Silhouette and batch entropy mixing 1093 scores, and the iLISI and 1/cLISI scores. d,e, UMAP embedding before and after 1094 SCALEX integration over CITE-seq (which measures abundance of both RNA and 1095 protein in single cells) spleen lymph dataset and cross-modality mouse brain dataset 1096 between spatial transcriptome (MERFISH) dataset and RNA-seq (10X) dataset. 1097 the full and different test-variants of SCALEX using the ARI score (y-axis) and the 1236 NMI score (x-axis), the Silhouette score (y-axis) and the batch entropy mixing score 1237 (x-axis), and the 1/cLISI score (y-axis) and the iLISI score (x-axis), across the 1238 indicated benchmark datasets. 1239 performance of SCALEX (beta=0.5) and SCALEX (beta=1) using the ARI score 1251 (y-axis) and the NMI score (x-axis), the Silhouette score (y-axis) and the batch 1252 entropy mixing score (x-axis), and the 1/cLISI score (y-axis) and the iLISI score 1253 (x-axis), across the indicated benchmark datasets. Cytokine score Frequency SC4 Atlas STAR CYP17A1 FDX1 CYP11A1 SULT2A1 HBG2 HBG1 SLC25A37 ANK1 TFRC ALB AFP APOA2 SERPINA1 HBG2 COL1A1 COL1A2 TNC COL3A1 PCDH9 COL1A2 COL3A1 COL1A1 DCN PIEZO2 MYL7 TPM1 TNNT2 ACTC1 MYL4 MYH3 NEB ACTC1 MYLPF TPM2 STMN2 RGS5 RBFOX1 CHGB KCNQ5 CADM2 RORB GPM6A GRIA4 NRG1 CHGA PCSK1N CADPS INS TTR DCC CTC-340A15.2 ERBB4 NRXN3 ZNF423 CD74 SRGN HLA-DRA HLA-DRB1 B2M PTPRC ARHGAP15 TCF7 TRBC2 TRAC CALCRL EMCN FLT1 COL4A1 KDR ADAM12 PAPPA FLT1 KRT8 CSH1 SPINK1 CLPS CTRB2 CPA2 CPA1 SFTPB SFTPC GPC5 DLG2 WIF1 RAB11FIP1 SPINK1 CLDN18 AGR2 KRT19 KRT18 KRT8 RBP2 FTL S100A10 chr14_29650001_29655000 chr6_59270001_59275000 chr7_122820001_122825000 chr11_50950001_50955000 chr7_113735001_113740000 chr4_16295001_16300000 chr13_23255001_23260000 chr3_82315001_82320000 chr4_39555001_39560000 chr12_92190001_92195000 chr12_46645001_46650000 chr3_42225001_42230000 chr16_24175001_24180000 chr12_106185001_106190000 chr8_114350001_114355000 chr4_27115001_27120000 chr2_123070001_123075000 chr6_42235001_42240000 chr13_115005001_115010000 chr12_50255001_50260000 chr14_16090001_16095000 chr18_60810001_60815000 chr19_33315001_33320000 chr13_112945001_112950000 chr17_89355001_89360000 chr1_15775001_15780000 chr3_101225001_101230000 chr17_52350001_52355000 chr1_82775001_82780000 chr8_103435001_103440000 chr3_50420001_50425000 chr1_182480001_182485000 chr16_18080001_18085000 chr2_110135001_110140000 chr14_8310001_8315000 chr2_71535001_71540000 chr18_40365001_40370000 chr6_6875001_6880000 chr11_106535001_106540000 chr4_49060001_49065000 chr2_71525001_71530000 chr2_158610001_158615000 chr2_71535001_71540000 chr1_190725001_190730000 chr13_59065001_59070000 chr2_158610001_158615000 chr2_22620001_22625000 chr2_70565001_70570000 chr2_161125001_161130000 chr9_111425001_111430000 chr2_158610001_158615000 chr18_70165001_70170000 chr2_71535001_71540000 chr2_22620001_22625000 chr6_6870001_6875000 chr2_71535001_71540000 chr2_71525001_71530000 chr2_158610001_158615000 chr6_6870001_6875000 chr8_110650001_110655000 chr18_61105001_61110000 chr4_47315001_47320000 chr14_114865001_114870000 chr3_59260001_59265000 chr5_35725001_35730000 chr3_31165001_31170000 chr16_91305001_91310000 chr17_10450001_10455000 chr17_25515001_25520000 chr2_127625001_127630000 chr10_86340001_86345000 chr10_86335001_86340000 chr5_116420001_116425000 chr3_95700001_95705000 chr8_72300001_72305000 chr1_172270001_172275000 chr4_57845001_57850000 chr19_8990001_8995000 chr5_119670001_119675000 chr9_21795001_21800000 chr14_122470001_122475000 chr9_87730001_87735000 chr12_33955001_33960000 chr4_114905001_114910000 chr3_122070001_122075000 chr18_47190001_47195000 chr14_12390001_12395000 chr11_33825001_33830000 chr15_71520001_71525000 chr6_100930001_100935000 chr13_76260001_76265000 chr8_103340001_103345000 chr9_79525001_79530000 chr19_15510001_15515000 chr4_98075001_98080000 chr9_113520001_113525000 chr7_126035001_126040000 chr11_4475001_4480000 chr7_99390001_99395000 IGKC MZB1 NKG7 GNLY KLRD1 NKG7 GNLY KLRD1 CD2 CD3D TRAC EPCAM KRT18 KRT7 TM4SF1 DNASE1L3 LIFR FLT1 HBB CA1 ALAS2 CXCL1 CYP2A13 GSTA1 SCD TAT-AS1 CD163 MAFB VSIG4 MS4A1 LTB CD37 CD79B ACTA2 COL1A1 MS4A1 CD79B CD79A BANK1 CD3D CD3E CD4 CCR7 MAL CD8B CD8A GZMA GZMH STMN1 HMGB2 MKI67 CD14 S100A12 FCGR3A CSF1R GATA2 CLC HDC CPA3 RUNX1 HBB HBA1 HBD HBA2 ALAS2 PPBP PF4 SPARC ITGA2B GNG11 CAVIN2 CLEC1B CTSA NKG7 GNLY GZMB CTSW KLRD1 NAMPT BASP1 BCL2A1 FCGR3B LCN2 LTF RETN S100P MZB1 ITM2C XBP1 IGHA1 IGLC2 CDK6 SOX4 EGFL7 CLEC10A FCER1A CD1C IRF8 TCF4 TSPAN13 LGMN To accomplish an accurate generalized encoder, the design of the full SCALEX included the following specific innovations: 1) an asymmetric autoencoder that inputs batch information only to the decoder (i.e., never to the encoder) (See diagram in Fig. S1 ); 2) a DSBN layer in the decoder to release the encoder from the burden of capturing the batch-specific variations; 3) a mini-batching strategy that samples data from all batches simultaneously (rather than single batches iteratively) and thus more tightly follows the same overall distribution of the full input dataset; this strategy includes a Batch Normalization layer in the encoder that adjusts the deviation of each mini-batch and aligns them to the overall input distribution. We conducted ablation studies to investigate the contributions of each design element of SCALEX. That is, we analyzed the performance (for integration and projection tasks) of the full SCALEX and four SCALEX "test-variants", each with a distinct network architecture (Fig. S19-21 ). These can be summarized as follows: Full SCALEX (referred as "Baseline" in the following). This Baseline model includes an encoder without batch labeling, sampling from all batches with Batch Normalization, decoder with DSBN, and beta=0.5. All other ablations are compared relative to this. Encoder with batch label. A test-variant with Baseline including an encoder with batch label as input. We found that the integration performance of this variant is similar with the full SCALEX, showing only a slight reduction in the evaluation scores (Fig. S19) . However, the real issue is this: this addition of batch information at the beginning precludes online integration of newly arriving data. Put another way, this SCALEX test-variant is not capable of integrating single-cell data in a truly online manner. A test-variant of Baseline removing the DSBN layer from decoder. The DSBN layer combines multiple batch-specific Batch Normalization layers to capture the batch-specific information; this approach has been demonstrated as effective for domain adaption, as it provides a weak alignment across different domains. We observed an obvious drop in the integration performance (in terms of all evaluation scores) and a slight drop in the projection performance of this SCALEX test-variant, based-on the UMAP embeddings (Fig. S19 ,20). Removing the Batch Normalization layer from the encoder, and each mini-batch is sampled by batch instead of from all the batches. The integration performance of this test-variant was obviously worse than full SCALEX (Fig. S19) . The projection performance also dropped obviously, with clear deviations from the common cell-embedding space (Fig. S20) . A test-variant that uses a regular autoencoder instead of a VAE framework. This variant performed the worst among all the variants we tested, for both the integration and projection tasks (Fig. S19,20) . Note that we also explored altering the Beta factor. Replacing the beta factor as 1 instead of 0.5. The integration performance of test variant of SCALEX for beta factor of 1 is worse than the SCALEX for beta factor of 1 (Fig. S21) . Murine single-cell RNA-seq reveals cell-identity-and 579 tissue-specific trajectories of aging Single-cell transcriptomes of the human skin reveal 582 age-related loss of fibroblast priming Single-cell transcriptome analysis of human skin identifies novel 585 fibroblast subpopulation and enrichment of immune subsets in atopic 586 dermatitis Severe COVID-19 Is Marked by a Dysregulated 589 Immunophenotyping of COVID-19 and influenza highlights 592 the role of type I interferons in development of severe COVID-19 A single-cell atlas of the peripheral immune response in 595 patients with severe COVID-19 Single-cell analysis of two severe COVID-19 patients reveals a 598 monocyte-associated and tocilizumab-responding cytokine storm Cell-Type-Specific Immune Dysregulation in Severely Ill 601 COVID-19 Patients Single-cell landscape of immunological responses in 604 patients with COVID-19 Single cell profiling of COVID-19 patients: an international 607 data resource from multiple tissues. medRxiv Longitudinal Multi-omics Analyses Identify Responses 610 of Megakaryocytes, Erythroid Cells, and Plasmablasts as Hallmarks of Severe 611 COVID-19 Clinical and immunological features of severe and moderate 614 coronavirus disease 2019 A Method for Stochastic Optimization Mgp Sparc Bgn Serpinh1 Gsn Dcn Col1a2 Col6a1 Col1a1 Sparc Mgp Bgn Gsn Igfbp7 Sod3 Col1a2 Dcn Col3a1 B_c01-TCL1A B_c02-MS4A1-CD27 B_c03-CD27-AIM2 B_c04-SOX5-TNFRSF1B B_c05-MZB1-XBP1 B_c06-MKI67 DC_c1-CLEC9A DC_c2-CD1C DC_c3-LAMP3 DC_c4-LILRA4 Epi-AT2 Epi-Ciliated Epi-Secretory Epi-Squamous Macro_c1-C1QC Macro_c2-CCL3L1 Macro_c3-EREG Macro_c4-DNAJB1 Macro_c5-WDR74 Macro_c6-VCAN Mast Mega Mono_c1-CD14-CCL3 Mono_c2-CD14-HLA-DPB1 Mono_c3-CD14-VCAN Mono_c4-CD14-CD16 Mono_c5-CD16 NK_c01-FCGR3A NK_c02-NCAM1 NK_c03-MKI67 Neu_c1-IL1B Neu_c2-CXCR4(low) Neu_c3-CST7 Neu_c4-RSAD2 Neu_c5-GSTP1(high)OASL(low) Neu_c4-RSAD2 Neu_c5-GSTP1(high)OASL(low)