key: cord-1035151-ka9rum7o
authors: Xiong, Lei; Tian, Kang; Li, Yuzhe; Zhang, Qiangfeng Cliff
title: Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space
date: 2021-10-11
journal: bioRxiv
DOI: 10.1101/2021.04.06.438536
sha: 3be1dd73f04421743568dfcc9edd143ca511a8d0
doc_id: 1035151
cord_uid: ka9rum7o

Computational tools for integrative analyses of diverse single-cell experiments are facing formidable new challenges including dramatic increases in data scale, sample heterogeneity, and the need to informatively cross-reference new data with foundational datasets. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner (i.e., without retraining the model). SCALEX substantially outperforms online iNMF and other state-of-the-art non-online integration methods on benchmark single-cell datasets of diverse modalities, (e.g., scRNA-seq, scATAC-seq), especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We showcase SCALEX’s advantages by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19 patients, each assembled from diverse data sources and growing with every new data. The online data integration capacity and superior performance makes SCALEX particularly appropriate for large-scale single-cell applications to build-upon previously hard-won scientific insights.

Single-cell experiments enable the decomposition of samples into their constituent, 31 diverse cell-types and cell states 1-4 . Many computational tools have been developed for 32 integrative analysis of single-cell datasets, all seeking to separate biological variations 33 from non-biological noise, such as batch effects of different donors, conditions, and/or 34 analytical platforms 5,6 . The scope of the integration task is expanding rapidly with 35 technical advances for single-cell studies, which continue to grow larger and larger in 36 scale, now exceeding 1 million cells in some cases 7, 8 . Moreover, the range of 37 examined sample types is also increasing, and datasets now often include highly 38 heterogenous cell subsets 9,10 . Most importantly, as single-cell studies become more 39 routine, new studies should be informatively cross-referenced to foundational 40 research stuides 7, 8, [11] [12] [13] [14] [15] . Thus, there is a growing need for integration tools that can 41 manage single-cell data of large-scale and complex cell-type compositions while also 42 supporting accurate alignment to and exploration within existing datasets. 43

Most current single-cell data integration methods (e.g., Seurat [16] [17] [18] MNN 19 , 44 Harmony 20 , Conos 21 , Scanorama 22 , BBKNN 23 , etc.) are based on the searching across 45 batches for cell-correspondence, for instance similar individual cells or cell 46 anchors/clusters. These methods suffer from three limitations. First, they are prone to 47 mixing cell populations that only exist in some batches, which becomes a severe 48 problem for the integration of complex datasets that contain non-overlapping cell 49 populations in each batch (i.e., partially overlapping data) 16, 17 . Second, they require 50 computational resources that increase dramatically as the number of cells and of 51 batches increase, making these methods increasingly unsuitable for today's large-scale single-cell datasets 7, 8, [11] [12] [13] [14] [15] . Finally, these methods can only remove batch effects from 53 the current dataset being assessed. Each time a new dataset is added, it requires an 54 entirely new integration process that changes the existing integration results of 55 previous studies. This requirement severely limits a tool's ability to continuously 56 integrate arriving new single-cell data without recalculating existing integrations from 57 scratch, a capacity referred to as "online" data integration 24 . 58 Online data integration ability is becoming increasingly crucial with today's 59 single-cell experiments. The recently developed tool, online iNMF 24 , an online version 60 of LIGER 25 , iteratively applies integrative non-negative matrix factorization (iNMF) to 61 decouple the shared and dataset-specific factors related to cell identities, and thus is 62 able to incorporate new data with existing datasets on-the-fly. Another recently 63 developed package, scvi-tools 26 , combining scVI 27 with scArches 28 , applies a 64 conditional variational autoencoder (VAE) 29 framework to model the inherent 65 distribution of the input single-cell data for data integration. However, the conditional 66 VAE design of scVI requires model augmentation and retraining when integrating new 67 data, meaning that scVI is not an online method. 68

Here, we developed SCALEX as a method for online integration of heterogeneous 69 single-cell data based on a VAE framework. The encoder of SCALEX is designed to be 70 a data projection function that only preserves batch-invariant biological data 71 components when projecting single-cells. Importantly, the projection function is a 72 generalized one that requires no retraining on new data, thus allowing SCALEX to 73 integrate single-cell data in an online manner. Working with an extensive collection of 74 benchmark datasets, we demonstrate that SCALEX substantially outperforms online 75 iNMF as well as non-online single-cell data integration tools, in terms of integration 76 accuracy, scalability, and computationally efficiency. The advantages make SCALEX 77 particularly appropriate for the integration and research utilization of today's 78 

Overview of the SCALEX model. SCALEX applies a variational autoencoder (VAE) 636 to project the different batches of datasets into the same batch-invariant 637 low-dimensional embeddings by learning a batch-free encoder and a batch-specific 638 decoder simultaneously. Since the encoder and decoder are coupled to learn a 639 batch-free encoder, a batch label (b) is only exposed to the decoder within the 640 domain-specific batch normalization (DSBN), thus the decoder captures the batch 641 information while the encoder learns the domain-invariant features. In the encoder, 642 SCALEX takes the input expression profile ( ) across all the batches as a whole 643 mixture distribution without distinguishing their batch sources and extracts their mean 644 ( ) and variance ( ) of the latent representations (z) in a 10-dimension embedding 645 space to learn their global data structure. A standard multivariate Gaussian prior is 646 used for z, while the approximated distribution of z is re-parameterized by = + 647 * , where is sampled from ℕ(0, ). In the decoder, SCALEX maps the latent 648 representations with batch label (b) back to their original profile. To enable the 649 decoder to capture the batch-specific variations, a DSBN layer is applied to learn a 650 batch-specific normalization for each batch label (b), before transforming them back to their original profile with the new batch variations. To learn the global distribution 652 to avoid overcorrection on partially overlapping datasets, within each mini-batch in 653 the training process, SCALEX randomly samples data from all batches and trains on 654 them together with Batch Normalization to smooth the batch-specific shifts and align 655 to the global distribution. Once trained, the encoder of SCALEX is generalized to any 656 batches and serves as a universal function for globally mapping different batches of 657 datasets into the same batch-invariant space. 658

Training SCALEX is to maximize the log-likelihood of the observed single-cell 659 sequencing data : 660

Then the loss function is transformed into the evidence lower bound (ELBO). While 661 the ELBO can be further decomposed into two terms: 662

The first term is the reconstruction term, which minimizes the distance between the 664 generated output data ′ and the original input data , calculated as the binary cross 665 entropy between ′ and . The second term is the regularization term, which 666 minimizes the Kullback-Leibeler divergence between posterior distribution 667 ℕ( , ) and prior distribution ℕ(0, ) of latent representations . To enable a more 668 flexible alignment under the latent space, we adjusted the coefficient of the second 669 term to 0.5 after hyper-parameter optimization via a grid search; thus, the final loss 670 function is: 671

The overall network architecture of SCALEX consists of an encoder and a decoder. 673

The encoder is a two-layer neural network (fully connected [1024]-BN-ReLU-fully 674 connected[10]) for mean ( ) and variance ( ) of the 10-dimension latent 675 representations z using a reparameterization to obtain latent representations z, and the 676 decoder has only one layer (no hidden layer), directly connecting latent 677 representations to the output (fully connected-DSBN-Sigmoid) with 678 domain-specific batch normalization, where the latent representations z and batch ID 679 are provided as input, and a Sigmoid activation function. We used the Adam 75 680 optimizer with a 5e-4 weight decay and betas (0.9, 0.999, the exponential decay rate 681 for the first and second moment parameters) to optimize the model under the learning 682 rate 2e-4. We adopted mini-batch strategy to iteratively optimize the model, in each 683 mini-batch, we randomly sampled data from all batches instead of from the same 684 batch, and the mini-batch size for training input is 64. The maximum number of 685 training iterations is 30,000 and an early stopping is triggered when there has been no 686 improvement for 10 epochs. The hyper-parameters are chosen after a grid search. 687 SCALEX is very robust with all of these hyper-parameters, all of the results in this 688 manuscript are produced under the same parameters. 689 Preprocessing for scRNA-seq. We downloaded gene expression matrices and 709 preprocessed them using the following procedure: i). Cells with fewer than 600 genes 710 and genes present in fewer than 3 cells were filtered out. ii). Total counts of each cell 711 were normalized to 10,000. iii). Values of each gene were subjected to log 712 transformation with an offset of 1. iv). The top 2,000 highly variable genes were 713 identified. v). Values of each gene were normalized to the range of 0-1 within each 714 batch by the MaxAbsScaler function in the scikit-learn package in Python. The 715 processed matrix was used as input for the SCALEX model for downstream 716 differential gene expression analysis. 717

For the human fetal atlas dataset, we collected two batches (batch GSE156793, 718 which contains 4,062,980 cells by sciRNA-seq3, and batch GSE134355, which 719 contains 254,266 cells by Microwell-seq). We then selected the cells from the 720 common tissues (1,369,619 cells) for integration and computational efficiency 721 benchmarking (down-sampled from different data sizes including 10K, 50K, 250K, 722 1M, and 4M). 723

Preprocessing for scATAC-seq. We downloaded open chromatin profile matrices 724 (peaks or bins), merged them by peaks (or bins), and processed them using the 725 following procedure: i). The combined matrix was binarized and filter bins with fewer 726 than 3 cells. ii). The top 30,000 most variable peaks (or bins) were selected using the 727 select_var_feature function in the EpiScanpy 76 package. iii). Total counts of each cell 728 were normalized to the median of the total counts of all cells by using the 729 normalize_total function, with parameters target_sum="None" in the Scanpy 77 730 package. iv). Values of each peak (or bin) were normalized to the range of 0-1 within 731 each batch by the MaxAbsScaler function in the scikit-learn package in Python. The 732 processed matrix was used as input for the SCALEX model. 733

Preprocessing for cross-modality data (scRNA-seq and scATAC-seq). We first 734 created a gene activity matrix by the GeneActivity function in the Signac 78 R package 735 to quantify the activity of each gene from scATAC-seq data. We then combined gene 736 activity score matrix with scRNA-seq data matrix as two individual "batches" for 737 integration. The subsequent preprocessing followed the same preprocessing used for 738 the scRNA-seq data (above).

Visualization. UMAP algorithm 46 was used for visualization. We applied the 740 neighbors function from the Python package Scanpy with the parameters 741 n_neighbors=30 and metric="Euclidean" for computing the neighbor graph, followed 742 by umap function with min_dist=0.1 to visualize cells in a two-dimensional space. 743

Adjusted Rand Index. The Rand Index (RI) computes a similarity score between two 744 clustering assignments by considering matched and unmatched assignment pairs, 745 independent of the number of clusters. The Adjusted Rand Index (ARI) score is 746 calculated by "adjust for chance" with RI as follows: 747

If given the contingency table, then ARI can also be represented by: 748

The ARI score is 0 for random prediction and 1 for perfectly matching. Where P and T are categorical distributions for the predicted and real clustering, I is 751 the mutual entropy, and H is the Shannon entropy. 752

Silhouette score. We used the silhouette score to assess the separation of biological 753 populations with the function silhouette_score in the scikit-learn package in Python. 754

The silhouette score was computed by combining the average intra-cluster distance (a) 755 and the average nearest-cluster (b) for each cell. 756

Here, we took UMAP embeddings as input to calculate silhouette score.

Batch entropy mixing score. Batch entropy mixing score (adapted from "entropy of 758 batch mixing" 19 ) was used to access the regional mixing of cells from different 759 batches, with a high score suggesting that cells from different batches are well mixed 760 together. 761

The batch entropy mixing score was computed as follows: 762

(1) Calculated the proportion Pi of cell numbers in each batch to the total cell 763 numbers. 764

(2) Randomly chose 30 cells from all batches. 765

(3) Calculated the 30 nearest neighbors for each randomly chosen cell. 766

(4) The regional mixing entropies for each cell were defined as: 767

where pi is the proportion of cells from batch i in a given region, such that ∑ = 768 1, pi' is a correction item to eliminate the deviation caused by the different cell 769 numbers in different batches. The total mixing entropy was then calculated as the sum 770 of the regional mixing entropies. 771 (5) Repeated (2)-(4) for 10 iterations with different randomly chosen cells and 772 calculated the average, E, as the final batch entropy mixing score. 773

Note that to mitigate the effect of misalignment of batch-specific cell-types, we 774 calculated the batch entropy mixing score only based on cells from cell-types that are 775 common in different batches.

et al. 2019 20 to assess batch and cell-type mixing. We calculated integration LISI 778 (iLISI) and cell-type LISI (cLISI) values using the compute_lisi function in the lisi R 779 package. UMAP embeddings, batch labels, and cell-type labels were used as input in 780 calculation. Briefly, 781

where ∈ , , … , is the i-th cell's UMAP embeddings in the dataset of size 782 N, and Y is the set of unique values with respect to the type of LISI we are computing 783 (i.e., Y is the values of "batch label" for calculating iLISI and the value of "cell-type 784 label" for calculating cLISI). The probability ( | ) refers to the "relative 785 abundance" of the covariate y within KNN (k-nearest neighborhood) of . A 786

Gaussian kernel-based distribution of neighborhoods was used and the perplexity was 787 fixed to 30. 788

Over-correction score. We defined an over-correction score to assess the level of 789 over-correction problem, based on calculating the percentage of cells with 790 inconsistent cell-types in each cell's neighborhood. We calculated the over-correction 791 score over all cells, and for each cell i we averaged the frequency of the k-nearest 792 neighboring cells with distinct cell-types to the cell i (see the following equation). 793

where n is the total cell number, k represents the k-nearest neighbors of each cell, 794 the cell-type of the cell i is _ , the cell-type of the neighboring cell j is 795

, and is an indicative function defined as: 796

Formally, the over-correction score is a negative index, i.e., the higher the 797 over-correction score, the more severe the extent of inaccurate mixing of cell-types. 798

Comparison with other integration methods. We compared SCALEX to nine other 799 batch effect removal methods (see below for specific details of each method). For 800 each dataset as input for all methods, we performed the same filtration, followed by 801 method-specific normalization, batch correction and visualization. Note that for visual 802 comparison, we also included the embeddings of the raw input data, wherein we 803 performed dimensionality reduction by Principal Component Analysis (PCA) 79 804 followed by UMAP visualization to see the batch effects. No correction function was 805 used. All parameters were kept as default values. 806 Scanorama (v1.6). We performed the preprocessing pipelines as stated above (as the 807 same below), and used the Scanpy and scanorama Python packages for integration. 808

For the highly_variable_genes function, we set flavor="seurat", batch_key="batch", 809 and n_top_genes=2,000. After extracting highly variable genes, we divided the 810 datasets according to the batch labels and formed a new list of datasets as the input for 811 the correct_scanpy function. The integration matrix was kept for downstream analysis. 812

All other parameters were kept their default values. 813 BBKNN (v1.3.12) . We used Scanpy and bbknn Python packages and followed the 814 suggested pipelines for integration. For the highly_variable_genes function, we set 815 flavor="seurat", batch_key="batch", and n_top_genes=2,000. After selecting cell 816 neighbors at the low-dimensional space from the PCA analysis, we performed the 817 bbknn function with neighbors_within_batch=5, n_pcs=20, and trim=0. The basic idea of cell-type annotation by label transfer is based-on that the same 868 cell-types will occupy the same locations in the low-dimensional SCALEX 869 cell-embedding space, thus cell-type annotation in one data batch can be transferred to another data batch, for the cells positioned at the same locations. Technically, we used 871 the KNeighborsClassifier function from the scikit-learn package to train a prediction 872 model, using the representations (in the low-dimensional cell-embedding space) of the 873 single-cell data with known cell-type labels as input. We then used this model to make 874 cell-type predictions for cells without annotations using their representations (in the 875 low-dimensional cell-embedding space) as input. For comparison, label transfer for 876 online iNMF follows the same procedures as SCALEX by predicting the cell-type 877 based-on the projected locations. 878

Similarity matrix and confusion matrix. We used similarity matrix to evaluate the 879 congruence of two different batches for the same cell-types in the common 880 cell-embedding space. Technically, we merged all cells with the same cell-type label 881 and calculated an average representation (in the low-dimensional cell-embedding 882 space) for the cell-type. This was repeated for all cell-types. We then calculated the 883 similarity matrix S=[S ij ] for the cell-type similarities of the two batches, where S ij is 884 the Pearson correlation coefficient between the average representation of cell-type i in 885 data_batch_1 and the average representation of cell-type j in data_batch_2. 886

We used the confusion matrix to evaluate the accuracy of cell-type annotations 887 (prediction) when a gold-standard annotation is available, which is typical for 888 "cell-type annotation by label transfer" (see above). In cell-type annotation by label 889 transfer, we predict the cell-types for a single-cell data_batch_1, using the annotations 890 in another data_batch_2. When data_batch_1 was already annotated with cell-types, 891 we can calculate the confusion matrix C= [C ij ] to compare the cell-type predictions 892 with the existing cell-type annotations, where C ij equals the percentage of cells known 893 to be in cell-type i and predicted to be in cell-type j. 894 F1 score. F1 score is a weighted average of the precision and recall as a classification 895 metric, which is computed by: 1 = 2 * (precision * recall)/(precision + recall) Generation of partially overlapping datasets. To simulate partially overlapping 897 datasets from the pancreas dataset, we used the pancreas_celseq2 and 898 pancreas_smartseq2 data batches, and worked with only six cell-types (alpha, beta, 899 ductal, acinar, delta, gamma). For each simulated partially overlapping dataset, we 900 randomly selected three to six cell-types from each batch, and counted the number of 901 the common cell-types, which was used as the indicator for the overlapping level 902 (whole integers, 0 to 6). We required the union of cell-types in the newly simulated 903 partially overlapping dataset to cover all six cell-types. 904

For the PBMC dataset, we used both of the two data batches and worked with 905 twelve cell-types (B, CD4 T, CD4 naive T, CD8 T, CD8 naive T, DC, HSC, 906

Megakaryocyte, NK, monocyte-CD14, monocyte-FCGR3A, pDC). We used the same 907 down-sampling strategy as for the pancreas dataset (above). 908

Analysis of changes in cell-type frequency across multiple conditions. To identify 909 differences in cell-type frequency among the scRNA-seq data from the mild/moderate, 910 severe, convalescent COVID-19 patients, as well as the healthy and influenza patient 911 controls, we applied a Dirichlet-multinomial regression model. This model accounts 912 for the constraint that the cell frequencies in a scRNA-seq data are not independent of 913 each other. In detail, we normalized the regression coefficients to a standard normal 914 distribution and calculated a z score, and then conducted significance testing based on 915 the regression model generated by the DirichReg function in the R package 916

DirichletReg (v 0.7). 917

analysis. Differential gene expression analysis was performed on all expressed genes 919 using the rank_genes_groups function with method="t-test" in the Scanpy package, 920

for two certain cell-types in a COVID-19 single-cell atlas. A gene was considered 921 differentially expressed when a log2-fold change was >1 in the two conditions in 922 comparison, and the Benjamini-Hochberg adjusted P-value was < 0.01. The top 200 923 highly expressed genes sorted by scores (implemented in Scanpy) of each cell-type 924 were used as the input for GO analysis, and enriched GO terms were acquired for 925 each group of cells of the "GO_Biological_Process_2018" dataset using the Python 926 package GSEApy. Table 1 ). We then calculated 931 the cytokine and inflammatory scores from the raw gene expression profile using the 932 score_genes function implemented in the Scanpy. 933

We evaluate the contributions of each design element by the performance of different 935 ablation variant models on integration and projection tasks (Fig. S18, Fig. S19 ). 936

Baseline/full SCLAEX. The baseline or the full SCALEX model is described in 937 the paper, including an encoder without batch label, mini-batch sampling from all 938 batches with Batch Normalization, a decoder with DSBN, and beta=0.5. All other 939 variant models are compared to this baseline model. Human Fetal Atlas dataset) including runtime (left) and memory usage (right). Online 983 iNMF was not successfully tested on 4M data due to a HDF5 file conversion issue for 984 large data (Methods). f, UMAP embeddings of the mouse brain scATAC-seq dataset 985 before (left) and after integration (middle, right); colored by data batch or cell-types. g, 986 UMAP embeddings of the PBMC scRNA-seq and scATAC-seq cross-modality 987 dataset before (left) and after SCALEX integration (middle, right); colored by batch 988 or cell-type. 989 1/cLISI scores. Note that online iNMF was not successfully tested on the Human 1086

Fetal Atlas due to a HDF5 file conversion issue for large data. 1087

scATAC-seq dataset and other modality datasets. a, UMAP embeddings of the 1089 mouse brain scATAC-seq dataset before and after integration by indicated methods; 1090 colored by batch (left) and cell-type (right). b, Cell-type-specific peaks for the mouse 1091 brain scATAC-seq dataset. c, Scatter plot showing the ARI and NMI scores based-on 1092 the Leiden and Louvain clustering results, the Silhouette and batch entropy mixing 1093 scores, and the iLISI and 1/cLISI scores. d,e, UMAP embedding before and after 1094 SCALEX integration over CITE-seq (which measures abundance of both RNA and 1095 protein in single cells) spleen lymph dataset and cross-modality mouse brain dataset 1096 between spatial transcriptome (MERFISH) dataset and RNA-seq (10X) dataset. 1097 the full and different test-variants of SCALEX using the ARI score (y-axis) and the 1236 NMI score (x-axis), the Silhouette score (y-axis) and the batch entropy mixing score 1237 (x-axis), and the 1/cLISI score (y-axis) and the iLISI score (x-axis), across the 1238 indicated benchmark datasets. 1239 performance of SCALEX (beta=0.5) and SCALEX (beta=1) using the ARI score 1251 (y-axis) and the NMI score (x-axis), the Silhouette score (y-axis) and the batch 1252 entropy mixing score (x-axis), and the 1/cLISI score (y-axis) and the iLISI score 1253 (x-axis), across the indicated benchmark datasets. Cytokine score Frequency SC4 Atlas STAR  CYP17A1  FDX1  CYP11A1  SULT2A1  HBG2  HBG1  SLC25A37  ANK1  TFRC  ALB  AFP  APOA2  SERPINA1  HBG2  COL1A1  COL1A2  TNC  COL3A1  PCDH9  COL1A2  COL3A1  COL1A1  DCN  PIEZO2  MYL7  TPM1  TNNT2  ACTC1  MYL4  MYH3  NEB  ACTC1  MYLPF  TPM2  STMN2  RGS5  RBFOX1  CHGB  KCNQ5  CADM2  RORB  GPM6A  GRIA4  NRG1  CHGA  PCSK1N  CADPS  INS  TTR  DCC  CTC-340A15.2  ERBB4  NRXN3  ZNF423  CD74  SRGN  HLA-DRA  HLA-DRB1  B2M  PTPRC  ARHGAP15  TCF7  TRBC2  TRAC  CALCRL  EMCN  FLT1  COL4A1  KDR  ADAM12  PAPPA  FLT1  KRT8  CSH1  SPINK1  CLPS  CTRB2  CPA2  CPA1  SFTPB  SFTPC  GPC5  DLG2  WIF1  RAB11FIP1  SPINK1  CLDN18  AGR2  KRT19  KRT18  KRT8  RBP2 FTL S100A10 chr14_29650001_29655000  chr6_59270001_59275000  chr7_122820001_122825000  chr11_50950001_50955000  chr7_113735001_113740000  chr4_16295001_16300000  chr13_23255001_23260000  chr3_82315001_82320000  chr4_39555001_39560000  chr12_92190001_92195000  chr12_46645001_46650000  chr3_42225001_42230000  chr16_24175001_24180000  chr12_106185001_106190000  chr8_114350001_114355000  chr4_27115001_27120000  chr2_123070001_123075000  chr6_42235001_42240000  chr13_115005001_115010000  chr12_50255001_50260000  chr14_16090001_16095000  chr18_60810001_60815000  chr19_33315001_33320000  chr13_112945001_112950000  chr17_89355001_89360000  chr1_15775001_15780000  chr3_101225001_101230000  chr17_52350001_52355000  chr1_82775001_82780000  chr8_103435001_103440000  chr3_50420001_50425000  chr1_182480001_182485000  chr16_18080001_18085000  chr2_110135001_110140000  chr14_8310001_8315000  chr2_71535001_71540000  chr18_40365001_40370000  chr6_6875001_6880000  chr11_106535001_106540000  chr4_49060001_49065000  chr2_71525001_71530000  chr2_158610001_158615000  chr2_71535001_71540000  chr1_190725001_190730000  chr13_59065001_59070000  chr2_158610001_158615000  chr2_22620001_22625000  chr2_70565001_70570000  chr2_161125001_161130000  chr9_111425001_111430000  chr2_158610001_158615000  chr18_70165001_70170000  chr2_71535001_71540000  chr2_22620001_22625000  chr6_6870001_6875000  chr2_71535001_71540000  chr2_71525001_71530000  chr2_158610001_158615000  chr6_6870001_6875000  chr8_110650001_110655000  chr18_61105001_61110000  chr4_47315001_47320000  chr14_114865001_114870000  chr3_59260001_59265000  chr5_35725001_35730000  chr3_31165001_31170000  chr16_91305001_91310000  chr17_10450001_10455000  chr17_25515001_25520000  chr2_127625001_127630000  chr10_86340001_86345000  chr10_86335001_86340000  chr5_116420001_116425000  chr3_95700001_95705000  chr8_72300001_72305000  chr1_172270001_172275000  chr4_57845001_57850000  chr19_8990001_8995000  chr5_119670001_119675000  chr9_21795001_21800000  chr14_122470001_122475000  chr9_87730001_87735000  chr12_33955001_33960000  chr4_114905001_114910000  chr3_122070001_122075000  chr18_47190001_47195000  chr14_12390001_12395000  chr11_33825001_33830000  chr15_71520001_71525000  chr6_100930001_100935000  chr13_76260001_76265000  chr8_103340001_103345000  chr9_79525001_79530000  chr19_15510001_15515000  chr4_98075001_98080000  chr9_113520001_113525000  chr7_126035001_126040000  chr11_4475001_4480000  chr7_99390001_99395000 IGKC  MZB1  NKG7  GNLY  KLRD1  NKG7  GNLY  KLRD1  CD2  CD3D  TRAC  EPCAM  KRT18  KRT7  TM4SF1  DNASE1L3  LIFR  FLT1  HBB  CA1  ALAS2  CXCL1  CYP2A13  GSTA1  SCD  TAT-AS1  CD163  MAFB  VSIG4  MS4A1  LTB  CD37  CD79B  ACTA2  COL1A1 MS4A1  CD79B  CD79A  BANK1  CD3D  CD3E  CD4  CCR7  MAL  CD8B  CD8A  GZMA  GZMH  STMN1  HMGB2  MKI67  CD14  S100A12  FCGR3A  CSF1R  GATA2  CLC  HDC  CPA3  RUNX1  HBB  HBA1  HBD  HBA2  ALAS2  PPBP  PF4  SPARC  ITGA2B  GNG11  CAVIN2  CLEC1B  CTSA  NKG7  GNLY  GZMB  CTSW  KLRD1  NAMPT  BASP1  BCL2A1  FCGR3B  LCN2  LTF  RETN  S100P  MZB1  ITM2C  XBP1  IGHA1  IGLC2  CDK6  SOX4  EGFL7  CLEC10A  FCER1A  CD1C  IRF8  TCF4  TSPAN13 LGMN 

To accomplish an accurate generalized encoder, the design of the full SCALEX included the following specific innovations: 1) an asymmetric autoencoder that inputs batch information only to the decoder (i.e., never to the encoder) (See diagram in Fig. S1 );

2) a DSBN layer in the decoder to release the encoder from the burden of capturing the batch-specific variations;

3) a mini-batching strategy that samples data from all batches simultaneously (rather than single batches iteratively) and thus more tightly follows the same overall distribution of the full input dataset; this strategy includes a Batch Normalization layer in the encoder that adjusts the deviation of each mini-batch and aligns them to the overall input distribution.

We conducted ablation studies to investigate the contributions of each design element of SCALEX. That is, we analyzed the performance (for integration and projection tasks) of the full SCALEX and four SCALEX "test-variants", each with a distinct network architecture (Fig. S19-21 ). These can be summarized as follows:

Full SCALEX (referred as "Baseline" in the following). This Baseline model includes an encoder without batch labeling, sampling from all batches with Batch Normalization, decoder with DSBN, and beta=0.5. All other ablations are compared relative to this.

Encoder with batch label. A test-variant with Baseline including an encoder with batch label as input. We found that the integration performance of this variant is similar with the full SCALEX, showing only a slight reduction in the evaluation scores (Fig. S19) . However, the real issue is this: this addition of batch information at the beginning precludes online integration of newly arriving data. Put another way, this SCALEX test-variant is not capable of integrating single-cell data in a truly online manner.

A test-variant of Baseline removing the DSBN layer from decoder.

The DSBN layer combines multiple batch-specific Batch Normalization layers to capture the batch-specific information; this approach has been demonstrated as effective for domain adaption, as it provides a weak alignment across different domains. We observed an obvious drop in the integration performance (in terms of all evaluation scores) and a slight drop in the projection performance of this SCALEX test-variant, based-on the UMAP embeddings (Fig.   S19 ,20).

Removing the Batch Normalization layer from the encoder, and each mini-batch is sampled by batch instead of from all the batches. The integration performance of this test-variant was obviously worse than full SCALEX (Fig. S19) . The projection performance also dropped obviously, with clear deviations from the common cell-embedding space (Fig. S20) .

A test-variant that uses a regular autoencoder instead of a VAE framework. This variant performed the worst among all the variants we tested, for both the integration and projection tasks (Fig. S19,20) .

Note that we also explored altering the Beta factor. Replacing the beta factor as 1 instead of 0.5. The integration performance of test variant of SCALEX for beta factor of 1 is worse than the SCALEX for beta factor of 1 (Fig. S21) .

Murine single-cell RNA-seq reveals cell-identity-and 579 tissue-specific trajectories of aging

Single-cell transcriptomes of the human skin reveal 582 age-related loss of fibroblast priming

Single-cell transcriptome analysis of human skin identifies novel 585 fibroblast subpopulation and enrichment of immune subsets in atopic 586 dermatitis

Severe COVID-19 Is Marked by a Dysregulated 589

Immunophenotyping of COVID-19 and influenza highlights 592 the role of type I interferons in development of severe COVID-19

A single-cell atlas of the peripheral immune response in 595 patients with severe COVID-19

Single-cell analysis of two severe COVID-19 patients reveals a 598 monocyte-associated and tocilizumab-responding cytokine storm

Cell-Type-Specific Immune Dysregulation in Severely Ill 601 COVID-19 Patients

Single-cell landscape of immunological responses in 604 patients with COVID-19

Single cell profiling of COVID-19 patients: an international 607 data resource from multiple tissues. medRxiv

Longitudinal Multi-omics Analyses Identify Responses 610 of Megakaryocytes, Erythroid Cells, and Plasmablasts as Hallmarks of Severe 611 COVID-19

Clinical and immunological features of severe and moderate 614 coronavirus disease 2019

A Method for Stochastic Optimization

Mgp Sparc Bgn Serpinh1 Gsn Dcn Col1a2 Col6a1 Col1a1 Sparc Mgp Bgn Gsn Igfbp7 Sod3 Col1a2 Dcn Col3a1

B_c01-TCL1A B_c02-MS4A1-CD27 B_c03-CD27-AIM2 B_c04-SOX5-TNFRSF1B B_c05-MZB1-XBP1 B_c06-MKI67 DC_c1-CLEC9A DC_c2-CD1C DC_c3-LAMP3 DC_c4-LILRA4 Epi-AT2 Epi-Ciliated Epi-Secretory Epi-Squamous Macro_c1-C1QC Macro_c2-CCL3L1

Macro_c3-EREG Macro_c4-DNAJB1 Macro_c5-WDR74 Macro_c6-VCAN Mast Mega Mono_c1-CD14-CCL3 Mono_c2-CD14-HLA-DPB1

Mono_c3-CD14-VCAN Mono_c4-CD14-CD16 Mono_c5-CD16 NK_c01-FCGR3A NK_c02-NCAM1 NK_c03-MKI67 Neu_c1-IL1B Neu_c2-CXCR4(low) Neu_c3-CST7 Neu_c4-RSAD2 Neu_c5-GSTP1(high)OASL(low)

Neu_c4-RSAD2 Neu_c5-GSTP1(high)OASL(low)