scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling


scPNMF: sparse gene encoding of single cells to facilitate
gene selection for targeted gene profiling

Dongyuan Song 1,†, Kexin Aileen Li 2,†, Zachary Hemminger 3, 4,
Roy Wollman 3, 4, 5, and Jingyi Jessica Li 2,6,7,∗

Abstract

Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of indi-
vidual cells. While scRNA-seq measures thousands of genes, researchers are often interested
in only dozens to hundreds of genes for a closer study. Then a question is how to select
those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling
technologies are gaining popularity for their low costs, high sensitivity, and extra (e.g., spatial)
information; however, they typically can only measure up to a few hundred genes. Then another
challenging question is how to select genes for targeted gene profiling based on existing
scRNA-seq data. Here we develop the single-cell Projective Non-negative Matrix Factorization
(scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way.
Compared with existing gene selection methods, scPNMF has two advantages. First, its
selected informative genes can better distinguish cell types. Second, it enables the alignment
of new targeted gene profiling data with reference data in a low-dimensional space to facilitate
the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm
for gene selection by changing the initialization and adding a basis selection step, which
selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms
the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we
show that scPNMF can guide the design of targeted gene profiling experiments and cell-type
annotation on targeted gene profiling data.

1Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246,
2Department of Statistics, University of California, Los Angeles, CA 90095-1554,
3Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095,
4Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239,
5Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095-1569,
6Department of Human Genetics, University of California, Los Angeles, CA 90095-7088,
7Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA.
† These authors contributed equally to this work.
∗ To whom correspondence should be addressed. Contact: jli@stat.ucla.edu

1

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


1 Introduction

The recent development of single-cell RNA sequencing (scRNA-seq) technologies provides un-
precedented opportunities to decipher transcriptome heterogeneity among individual cells [1–3].
A typical scRNA-seq dataset contains thousands to tens of thousands of genes; however, a subset
of genes, which we call informative genes, are usually sufficient for representing the underlying
biological variations of cells in the dataset for two reasons. First, variations of many genes are not
related to the biological variations of interest. For instance, fluctuations in the expression levels
of housekeeping genes are irrelevant to cell types [4, 5]. Second, many genes have strongly
correlated expression levels, suggesting that one gene may represent a group of genes without
much loss of information [6]. Therefore, for scRNA-seq data analysis, informative gene selection
has three advantages: (1) enhancing biological signals by removing unwanted technical variations,
(2) improving the interpretability of analysis results by focusing on informative genes, and (3)
reducing the number of genes to save computational resources.

Besides scRNA-seq data analysis, informative gene selection is also crucial for designing
single-cell targeted gene profiling experiments, which we define to include all technologies that
measure only a specific sets of genes’ expression levels in individual cells. Unlike scRNA-seq,
targeted gene profiling requires a limited number (often no more than hundreds) of genes to be
specified before sequencing. Examples of targeted gene profiling include spatial technologies
(e.g., smFISH [7] and MERFISH [8]) and non-spatial technologies (e.g., BART-Seq [9], HyPR-
seq [10] and 10x-Genomics Targeted Gene Expression). Compared with scRNA-seq, targeted
gene profiling technologies have advantages such as capturing spatial information (by smFISH
and MERFISH), having a lower cost per cell (by BART-Seq), and exhibiting a higher sensitivity for
detecting lowly expressed genes (by HyPR-seq). However, it remains an open and challenging
question to optimize the gene selection for targeted gene profiling under a gene number limitation.

Given the importance of informative gene selection, researchers have developed many gene
selection methods for scRNA-seq data. Most existing methods select genes based on the rela-
tionship between per-gene expression means and per-gene expression variances (with the mean
and variance of each gene calculated across cells). Popular example methods include variance
stabilization transformation (vst) [11] and mean-variance plot (mvp) in the R package Seurat [12],
as well as modelGeneVar in the R package scran [13]. These methods select highly variable
genes that have large expression variances in relation to their expression means. Other methods
use various metrics of gene importance instead of the per-gene expression variance. For example,
M3Drop selects the genes that have zero expression levels in many cells [14]; GiniClust selects
the genes with large Gini indices of expression levels [15]; SCMarker selects the genes that
have expression levels bi/multi-modally distributed and are co-expressed or mutually-exclusively
expressed with some other genes [16]. A common limitation of these existing methods is that they
are all designed to select a relatively large number of genes; thus, their performance in selecting
a small number of genes remains unclear. For instance, in Seurat, the default gene number is

2

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


2000; SCMarker selects 700-900 genes in its exemplar applications [16]. All these gene numbers
are much greater than 200, the maximum gene number allowed by multiple targeted gene profiling
technologies. Therefore, existing gene selection methods may not be suitable for selecting genes
for targeted gene profiling. Another drawback of these methods is that their selected genes lack
functional interpretability; that is, their selected genes are not categorized as functional gene
groups.

In addition to these gene selection methods, linear dimensionality reduction methods, such
as principal component analysis (PCA) and non-negative matrix factorization (NMF), can also
used for gene selection. Specifically, genes can be selected based on their contributions to the
projected low dimensions found by PCA or NMF [17–19]. Although many variants of PCA and
NMF algorithms have been developed for scRNA-seq data analysis, they are not designed for
gene selection [20–26].

Here we propose an unsupervised method scPNMF to simultaneously select informative genes
and project scRNA-seq data onto an interpretable low-dimensional space. Leveraging the Projec-
tive Non-negative Matrix Factorization (PNMF) algorithm [27], scPNMF combines the advantages
of PCA and NMF by outputting a non-negative sparse weight matrix that can project cells in a
high-dimensional scRNA-seq dataset onto a low-dimensional space. Unlike the weight matrix
(a.k.a., loading matrix) found by PCA, the non-negative sparse weight matrix output by scPNMF
correspond to bases that each correspond to a group of co-expressed genes. Compared with the
original PNMF, a unique feature of scPNMF is basis selection: scPNMF uses correlation screening
and multimodality testing to remove the bases that cannot reveal potential cell clusters in the input
scRNA-seq dataset. There are two functionalities of scPNMF: (1) given a pre-specified gene
number and a scRNA-seq dataset, scPNMF selects informative genes based on its weight matrix;
(2) given a targeted gene profiling dataset containing the informative genes, scPNMF projects this
dataset onto the same low-dimensional space of a reference scRNA-seq dataset containing cell
type labels, thus enabling cell type annotation on the targeted gene profiling dataset. Comprehen-
sive benchmark shows that scPNMF outperforms existing gene selection methods in two aspects.
First, the informative genes selected by scPNMF lead to the most accurate cell clustering. Second,
the informative genes and weight matrix of scPNMF lead to the best cell type prediction accuracy
for targeted gene profiling data. Therefore, scPNMF is a powerful gene selection method that can
guide the experimental design and data analysis of single-cell targeted gene profiling.

2 Methods

The core of scPNMF is to learn a low-dimensional embedding of cells so that the bases of
the low-dimensional space correspond to sparse and mutually exclusive gene groups, and that
genes in each group are co-expressed and thus functionally related. Fig.1 illustrates the work-
flow of scPNMF. The input of scPNMF is a log-transformed gene-by-cell count matrix measured
by scRNA-seq. There are two main steps in scPNMF: (I) it learns a low-dimensional sparse

3

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


min
! #$

||𝐗 − 𝐖𝐖𝐓 𝐗||

𝐾 Basis

𝑝
G

en
es

𝐖

Weight Matrix 𝐖

= [𝒘!,𝒘",…,𝒘#]

𝑛 Cells

𝑝
G
en
es

𝐗

2. Pearson Correlation (w/ Cell Library Size )
𝑅 = 0.9

C
el

l l
ib

 s
iz

e

𝒔! 𝒔#

…

𝑝-value = 0.9 𝑝-value = 0.0
3. Multimodality Test

…

1. Functional Annotations (Optional)

𝒘! 𝒘#
Housekeeping Genes … Cell-type Genes

Unselected Basis Selected Basis

𝒔! 𝒔#

D
en

si
ty

𝐖

𝑅 = 0.1

𝐖$

𝑝
G

en
es

Max Gene 
Weights
𝑤!
𝑤"

𝑤#

…
 …

𝐖$

𝑤(!)
𝑤(")

𝑤(#)

…
 …

Gene(!)
Gene(")

Gene(#)

…
 …

Order Genes by Weights 
𝑤(!) ≥ 𝑤(") ≥… ≥ 𝑤(#)

Gene(!)
Gene(")

Gene(&)

…
 …

Truncate by 𝑤(() and 
Keep First 𝑀 Genes

𝑀-Truncation 

Max Gene 
Weights

𝑀: User-defined 
Gene Number

Score Matrix 𝐒 = 𝐖𝐓𝐗

𝐾
B

as
is

= × 𝐗𝐖𝐓𝐒

𝑛 Cells

=	

𝒘!𝐓𝐗
𝒘"𝐓𝐗
⋮

𝒘#𝐓𝐗

=

𝒔!
𝒔"
⋮
𝒔#

𝐗(')

Gene(!)
Gene(")

Gene(&)

…

𝑛 Cells

Informative Gene Selection

Clustering

Visualization

……

Informative Genes:
{Gene * ,Gene + ,…,Gene(()}

𝐖/,(2)

New Data Projection

New Data Projection onto 
Reference Data Space

Reference Data Space

= ×
𝑛 Cells

𝐗(')
)*+= ×𝐒(')

)*+
Gene(!)
Gene(")

Gene(&)

…

Trained Model 𝑓(𝒔)

New Cells

Prediction 𝑓

Cell Type Prediction
Gene(!)
Gene(")

Gene(&)

…

𝑛 Cells

𝒔

𝐗(')𝐖$,(')
𝐓

𝐖$,(')
𝐓

Step I: PNMF Step II: Basis Selection

𝐾' Basis

Applications

𝐒(')
-*.

Figure 1: An overview of scPNMF. Taking a log-transformed gene-by-cell count matrix as the input, scPNMF first
learns a low-dimensional sparse weight matrix W and a low-dimensional cell embedding matrix S. Second, it remove
the bases irrelevant to cell type variations by examining bases’ functional annotations (optional), Pearson correlations
with cell library sizes, and multimodality. Given a user-defined gene number M, scPNMF performs M-truncation to
facilitate two main applications: (1) selecting the desired number of informative genes; (2) projecting new targeted gene
profiling data onto the low-dimensional space defined by reference scRNA-seq data. The details are in the ”Methods”
section.

4

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


weight matrix by PNMF; (II) it selects bases in the weight matrix based on functional annotations
(optional), correlation screening, and multimodality testing to remove uninformative bases that
cannot distinguish cell types. The output of scPNMF includes (1) the selected weight matrix,
a sparse and mutually exclusive encoding of genes as new, low dimensions, and (2) the score
matrix containing embeddings of input cells in the low dimensions. The selected weight matrix
has two main applications: extracting informative gene for downstream analyses, such as cell
clustering and new marker gene identification, and projecting new targeted gene profiling data for
data integration and cell type annotation.

2.1 scPNMF step I: PNMF

In this section, we review the PNMF algorithm [27, 28] as the foundation of scPNMF. We first
compare the formulation of PNMF with that of principal component analysis (PCA) and non-
negative matrix factorization (NMF), and we show that PNMF has the advantages of both PCA
and NMF so that it can be a useful tool for scRNA-seq data analysis. Next, we introduce our
PNMF implementation.

Given a log-transformed count matrix X ∈ Rp×n≥0 , whose p rows correspond to genes and whose
n columns represent cells, and a positive integer K ≤ p, PNMF aims to find a K-dimensional
space, whose dimensions correspond to non-negative, sparse and mutually exclusive linear com-
binations of the p genes, so that projecting the n cells onto the K-dimensional space does not
cause much information loss (i.e., projecting the K-dimensional embeddings of the n cells back to
the original p-dimensional space can largely restore the original n cells). PNMF tackles this task
by solving the optimization problem:

min
W∈Rp×K≥0

‖X−WWTX‖ , (2.1)

where ‖ · ‖ denotes the Frobenius matrix norm. The solution W is referred to as a weight matrix.
Each column of W is a basis, whose p entries are the weights of the p genes. PNMF requires all
weights to be non-negative, leading to a sparse W with most weights as zeros.

PCA is similar to PNMF but does not require all weights to be non-negative. We can write the
optimization problem of PCA as

min
W∈Rp×K,WTW=I

‖X−WWTX‖ , (2.2)

whose solution W is also a weight matrix but not sparse, and W is often referred to as the loading
matrix.

A common property of PNMF and PCA is that the transpose of their weight matrix, WT ∈ RK×p,
can be used to project a new cell with p gene measurements, x ∈ Rp, onto the K-dimensional
space as WTx.

In contrast to PMNF and PCA, NMF finds two non-negative matrices W and H so that their

5

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


product approximates the original matrix X. NMF solves the optimization problem:

min
W∈Rp×K≥0 ,H∈R

K×n
≥0

‖X−WH‖ , (2.3)

whose solution W still has K columns representing bases, and H has n columns as K-dimensional
embeddings of the n cells. Due to the non-negative constraint on W and H, W is a sparse
matrix [29]. However, the transpose WT cannot be used as a projection matrix from the original
p-dimensional space to a K-dimensional space. The reason is that, if WT is a projection matrix,
then by the definition of H we have WTX = H, which would converts the objective function (2.3) of
NMF to the objective function (2.1) of PNMF. In other words, PNMF is a constrained version of NMF
by requiring WT to be a projection matrix. Hence, PNMF inherits the property of NMF by having
non-negative, sparse bases that are mostly mutually exclusive (i.e., different bases correspond
to different gene groups). Moreover, based on the similarities of the objective functions of PNMF
(2.1) and PCA (2.2), we can see that PNMF also resembles PCA by finding a weight matrix whose
transpose can serve as a projection matrix and whose bases are largely orthogonal to each other.
Table 1 summarizes the properties of PNMF, PCA, and NMF.

Table 1: Comparison of the properties of PNMF, PCA and NMF

Optimization Problem Non- Sparsity Mutually New Data
negativity Exclusiveness Projection

PNMF min
W
‖X−WWTX‖ s.t. W ≥ 0 Yes Very high Very high Yes

PCA min
W
‖X−WWTX‖ s.t. WTW = I No Low Low Yes

NMF min
W,H

‖X−WH‖ s.t. W, H ≥ 0 Yes High High No

In the context of scRNA-seq data analysis, the above advantages of PNMF lead to an inter-
pretable and useful weight matrix W. First, the high sparsity of W makes each basis (column)
depend on only a small set of genes, which has been defined as a meta-gene for NMF [30].
Second, the mutual exclusiveness of W makes different bases correspond to different gene sets,
easing the interpretation of bases as meta-genes or functional units. Third, the projection matrix
WT allows the alignment of new data to reference data, thus facilitating cell type annotation on the
new data.

Algorithm 1 summarizes the key steps of PNMF implementation in scPNMF. Our implemen-
tation mainly follows the two papers that proposed the PNMF algorithm [27, 28], and we change
the initialization of W to the weight matrix found by PCA, WPCA, with the absolute value taken on
every entry. Our initialization is motivated by the desired orthogonality of bases (i.e., columns of
W).

With the weight matrix W ∈ Rp×K≥0 learned by PNMF, we obtain the score matrix S = W
TX ∈

RK×n≥0 , whose K rows correspond to the bases and whose n columns represent the cells. Specif-
ically, the j-th column of S is the K-dimensional embedding of the j-th cell; the k-th row of S,

6

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Algorithm 1 Pseudocode of PNMF implementation in scPNMF

Initialize: W = abs(WPCA) ∈ R
p×K
≥0

1: while not converge do
2: for i = 1, · · · , p; k = 1, · · · , K do

3: Wik ← Wik
2
(
XXTW

)
ik

(WWTXXTW)ik + (XX
TWWTW)ik

4: end for

5: W ←
1

‖W‖2
W

6: end while
Output: W ∈ Rp×K≥0 , S = W

TX ∈ RK×n≥0

denoted by sTk , contains the scores (i.e., coordinates) of all n cells in the k-th basis:

sk = w
T
kX , (2.4)

where wk is the k-th column of W, k = 1, . . . , K.
The low rank K needs to be pre-specified in PNMF, same as in PCA and NMF, A larger K

preserves more information in X but also removes less noise (technical variation of cells that is not
of biological interest), impedes the interpretation of W (more bases are more difficult to interpret),
and increases the computational burden. To choose K in a data-driven way, we propose an
orthogonality measure, which shows that K = 20 is a reasonable choice for multiple scRNA-seq
datasets (Section S1.1).

2.2 scPNMF step II: basis selection

The second key step of scPNMF is to select informative bases among the K bases found by
PNMF (i.e., columns of W and rows of S) to remove unwanted variations of cells (e.g., variations
irrelevant to cell types). The columns of W enjoy high sparsity and mutual exclusiveness; that
is, each column contains positive weights corresponding to a unique small set of genes, so it is
expected to reflect a certain biological function. However, some biological functions may not be
relevant to the cell heterogeneity of interest, e.g., cell type composition. Motivated by this, we
propose three strategies for selecting informative bases (columns of W and rows of S): functional
annotations (optional), correlations with cell library sizes, and tests of multimodality.

2.2.1 Strategy 1: examine bases by functional annotations (optional)

The first, optional strategy is to annotate the biological function(s) of each basis in the weight
matrix. For example, scPNMF may apply gene ontology (GO) analysis to the top 10% genes
with the highest weights in each basis (column of W) and record the enriched GO terms as
the basis’ functional annotation. Then, users with prior knowledge can interpret the functional

7

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


annotation on each basis and decide whether or not to remove the basis. For example, if the goal
is to delineate cell types in scRNA-seq data, a basis corresponding to cell-cycle genes should be
removed because they would obscure the distinction of cell types.

However, it is worth noting that filtering bases by biological annotations is optional in scPNMF.
Conservative users can keep all K bases output by PNMF and directly use data-driven basis
selection (Section 2.2.2). For our results in this paper, scPNMF removes the bases corresponding
to well-known housekeeping genes (Section S2).

2.2.2 Data-driven strategies

2.2.2.1 Strategy 2: examine bases by correlations with cell library sizes

Note that the input of scPNMF is a log-transformed unnormalized count matrix for users’ conve-
nience. Hence, scPNMF does not adjust for cell library sizes in the computation of W and S in
step I. Given that the variance of cell library sizes contributes to unwanted variations of cells [11],
it is necessary to remove the bases whose corresponding rows in S are strongly correlated with
cell library sizes.

We use the total log-transformed counts to approximate the library size of each cell, and we
calculate the Pearson correlation between each sk and the library sizes of n cells. The strategy is
to retain the bases whose Pearson correlations are under a pre-defined threshold, which we set
to 0.7 based on empirical observations (Section S1.2).

2.2.2.2 Strategy 3: examine bases by multimodality tests

Another data-driven strategy is to retain the bases whose corresponding scores are multi-modally
distributed. If a basis’ score vector (row in S) contains n scores with a multimodality pattern, then
it is likely to distinguish cell types and should be retained. To implement this strategy, we use
the ACR test [31] to check the multimodality of each basis’ score vector. The null hypothesis is
that the score vector contains n scores sampled from a unimodal distribution, and the alternative
hypothesis is that the distribution has more than one mode. After performing multiple multimodality
tests, one per basis, we use the Benjamini-Hochberg procedure to set a p-value threshold by
controlling the false discovery rate under 1%. The bases whose p-values are under this threshold
will be retained.

In summary, scPNMF step II allows users to use strategy 1 to filter out uninformative bases
based on functional annotations if available; then it implements data-driven strategies 2 and 3 to
further remove bases that have strong correlations with cell library sizes and exhibit unimodality
patterns. The retained bases will have their corresponding columns in W selected and stacked
into the selected weight matrix WS ∈ R

p×K0
≥0 , where K0 is the number of selected bases.

8

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


2.3 Applications of scPNMF output: informative gene selection and
new data projection

The selected weight matrix WS output by scPNMF has two main applications: selection of a
desired number of informative genes and projection of new targeted gene profiling data onto the
low-dimensional space defined by WS. Given a gene number M (e.g., 200), scPNMF uses M-
truncation, a step to select M rows in WS, resulting in M informative genes and a truncated,
selected weight matrix WS,(M) ∈ R

M×K0
≥0 for new data projection.

2.3.1 M-truncation and informative gene selection

We denote the desired number of informative genes by M ∈ N, with M ≤ # of non-zero rows in WS.
M-truncation has three steps.

1. For each gene i, calculate its largest weight wi across bases in WS:

wi = max
k=1,...,K0

(WS)ik, i = 1, 2, . . . , p . (2.5)

2. Order genes by their maximum weights w(1) ≥ w(2) ≥ ··· ≥ w(p) and set the truncation
threshold as w(M). Identify the first M genes as informative genes.

3. Construct the truncated, selected weight matrix WS,(M):

(1) Truncate the selected weight matrix WS by setting all (WS)ik < w(M) to be 0;

(2) Keep the M rows with non-zero entries; stack them by row into WS,(M) based on the
order of the informative genes.

In short, scPNMF selects informative genes based on their maximum weights in the selected
bases. The rationale is that a gene’s maximum weight reflects the gene’s contribution to the
establishment of the K0-dimensional space, which preserves the n cells’ biological variations of
interest. Hence, genes with larger maximum weights are more informative in the sense of encoding
cells’ biological variations. An important application of informative gene selection is to guide the
design of targeted gene profiling experiments.

2.3.2 New data projection

Given the selected M informative genes, once new cells are measured by targeted gene profiling
on these genes, WS,(M) can be used to project the new cells onto the K0-dimensional space
where the cells in the input scRNA-seq data are embedded in. If the input data has cell type
annotations, we refer to the input data as reference data, then we can predict the new cells’ types
from the types of the cells in the reference data. In detail, new data projection has the following
steps:

9

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


1. Apply scPNMF with M-truncation to input, reference data X ∈ Rp×n≥0 with n cells to obtain
the truncated, selected weight matrix WS,(M). Construct X(M) ∈ R

M×n
≥0 as a submatrix of X,

with rows corresponding to the rows of WS,(M), i.e., the M informative genes. Hence, the
K0-dimensional embeddings of the n cells in the reference data are the columns of

SRef(M) = W
T
S,(M) ×X(M) ∈ R

K0×n . (2.6)

2. Denote the targeted gene profiling data of n′ new cells with M informative genes measured
by XNew

(M)
∈ RM×n

′

≥0 . Note that X
New
(M)

contains log-transformed counts and has rows (genes)
corresponding to the rows of X(M). Project the n

′ cells to the K0-dimensional space by

SNew(M) = W
T
S,(M) ×X

New
(M) ∈ R

K0×n′ (2.7)

3. (Optional) Normalize SNew
(M)

and SRef
(M)

to remove batch effects, if existent, by using a single-cell
integration method such as Harmony [32].

Now the n reference cells and the n′ new cells are in the same K0-dimensional space with
biological variations preserved. Then a classifier can be trained on the n reference cells’ types
and SRef

(M)
for cell type prediction, and it can be used to predict the n′ cells’ types from SNew

(M)
.

3 Results

3.1 scPNMF outputs a sparse and functionally interpretable repre-
sentation of scRNA-seq data

We first demonstrate that scPNMF step I, PNMF, outputs a sparse and functionally interpretable
gene encoding of cells. We use the FregGold dataset [33], which consists of three cell types
(three human lung adenocarcinoma cell lines), and set the basis number K = 5 for demonstration
purpose. Both PCA and PNMF learn a weight matrix that can project the original scRNA-seq data
onto a 5-dimensional space. Unlike the weight matrix of PCA that has no zero entries, the weight
matrix of PNMF is non-negative, highly sparse, containing 42.6% of entries as zeros, and has
bases that are largely mutually exclusive (i.e., non-zero entries in different columns correspond to
different rows/genes) (Fig. 2a). GO enrichment analysis shows that high weight genes in each
PNMF basis are enriched with conceptually-similar GO terms, and high weight genes in different
PNMF bases are enriched with conceptually-different GO terms (Fig. 2b). This result indicates
that PNMF bases correspond to gene groups with distinct functions. On the contrary, the PCA
bases do not have good functional interpretations: the high weight genes in each PCA basis are
not enriched with conceptually-similar GO terms, and different PCA bases share many high weight
genes (Fig. S3).

10

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 2: Illustration of the sparse and interpretable projection found by scPNMF. We use the FregGold dataset
as an example. (a) Comparison of the weight matrices of PCA and PNMF. Heatmaps visualize the learned weight
matrices of PCA (top) and PNMF (bottom), where rows are genes and columns are bases. Red represents positive
weights while blue represents negative weights. The rows are ordered by gene-wise hierarchical clustering. Compared
to PCA, the weight matrix of PNMF is strictly non-negative, much more sparse and mutually exclusive between bases.
(b) GO analysis result of each basis in the weight matrix of PNMF. Texts in black boxes summarize the functions of
genes in each basis. The enriched GO terms are almost mutually exclusive, implying that each basis represents a
unique gene functional cluster. (c) Statistical tests on each basis in the score matrix of PNMF. Top row: scatter plots
of scores and total log-counts (cell library sizes). Each dot represents a cell. Cell scores in bases 1 and 4 are highly
correlated with cell library sizes. Bottom row: histograms of cell scores in each basis. Scores in bases 2 and 3 show
strong multimodality patterns (adjusted p-value ≤ 0.05). (d) UMAP visualizations of cells based on high weight genes in
the unselected bases 1 and 4 and those in the selected bases 2, 3, and 5. Genes in the unselected bases completely
fail to distinguish the three cell types, while genes in the selected bases lead to a clear separation of the three cell
types.

11

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


To further analyze the PNMF bases, we list the top 10 high weight genes in each basis (Table
S1), from which we identify many well-known genes with important functions. For instance, basis
1 contains classic housekeeping genes, such as GAPDH [34] and ribosomal protein genes (RPS-)
[35]; basis 3 contains well-known tumor-related genes, including EGFR [36] and CDK4 [37]. In
particular, the cells of the HCC827 cell line (one of the three cell types) have overall high scores in
basis 3 (Fig. S4), a reasonable result because the HCC827 cell line contains an EGFR activating
mutation [38]. In summary, scPNMF step I outputs bases representing sparse and functionally
interpretable gene sets.

3.2 Basis selection is an essential step in scPNMF

Here we explain why basis selection is an essential step in scPNMF. In the last section, we
show that each PNMF basis of the FregGold dataset approximately represents one functional
gene group. It is well known that housekeeping genes (basis 1) and cell-cycle genes (basis
4) are usually irrelevant to cell type distinctions. However, such biological knowledge is not
always available or certain. Therefore, scPNMF mainly relies on the two data-driven strategies:
correlations with cell library sizes and multimodality tests (Section 2.2.2) for selecting informative
bases.

Fig. 2c visualizes the two strategies: cell scores in bases 1 and 4 are highly correlated with cell
library sizes (Pearson correlations > 0.9); cell scores in bases 2 and 3 show strong evidence as
multi-modally distributed (adjusted p-value < 0.05). Hence, strategy 1 will not retain bases 1 and
4, and strategy 2 will not retain bases 1, 4, and 5; together, bases 1 and 4 will be removed, and
bases 2, 3, and 5 will be selected. To verify the effectiveness of basis selection, we use UMAP to
visualize cells based on the top 50 high weight genes in the unselected bases 1 and 4 vs. those in
the selected bases 2, 3, and 5 (Fig. 2d). We observe that the top genes in the unselected bases
completely fail to separate the three cell types, while the top genes in the selected bases perfectly
distinguish the three cell types. This result strongly supports that basis selection is a necessary
step of scPNMF.

3.3 scPNMF outperforms state-of-the-art gene-selection methods on
diverse scRNA-seq datasets

In this section, we demonstrate scPNMF’s capacity for informative gene selection. We compre-
hensively benchmark scPNMF against 11 other single cell informative selection methods (Table
S2) on seven scRNA-seq datasets (Table S3) using three clustering methods (Louvain clustering,
K-means clustering, and hierarchical clustering). For fair benchmarking, the seven scRNA-seq
datasets cover both unique molecule identifier (UMI) and non-UMI protocols and include various
biological samples. Using the adjusted Rank index (ARI) as the metric of clustering accuracy, we
calculate the ARI values of the three clustering methods on each dataset using 100 informative

12

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure 3: Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq
datasets. (a) Clustering accuracies (ARI values) of three clustering methods based on the informative genes selected.
Gene selection methods are ordered from left to right by their average ARI across the three clustering methods and
the seven datasets. (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by
each method. Genes selected by scPNMF lead to a clear separation between naive cytotoxic T cells and regulatory T
cells, while the genes selected by others methods do not.

genes selected by each gene selection method, as 100 genes are commonly used in targeted
gene profiling.

Fig. 3a shows that scPNMF has overall the highest ARI values across datasets and clustering
methods. In particular, scPNMF has the highest average ARI value with each clustering method
(Louvain: 0.83; K-means: 0.74; hierarchical clustering: 0.69) and the highest overall average ARI
(0.75) across datasets and clustering methods. Note that the mean of the overall average ARI
values of all methods except scPNMF is only 0.66.

We further show the UMAP visualization of cells in the Zheng4 dataset based on the informa-
tive genes selected by each of the 12 gene selection methods (Fig. 3b). Only scPNMF leads to
a clear separation of naive cytotoxic T cells and regulatory T cells, while the informative genes
selected by other methods except corFS and irlbaPcaFS cannot distinguish the two cell types at
all.

We also compare the 12 methods under a varying number of informative genes: 20, 50, 200,
and 500, the commonly used gene numbers in targeted gene profiling. We observe that the overall
average ARI values of scPNMF are consistently higher than those of other methods, across all
informative gene numbers (Fig. S6). Moreover, compared with other methods, scPNMF leads to
more stable overall average ARI values under varying numbers of informative genes, indicating
its stronger robustness to the gene number constraint of targeted gene profiling. These results
strongly support the superior performance of scPNMF as an informative gene selection method.

13

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


3.4 scPNMF guides targeted gene profiling experimental design and
cell-type prediction

In this section, we demonstrate how scPNMF can guide the selection of genes to be measured in
a targeted gene profiling experiment, and how scPNMF enables subsequent cell type annotation
on the targeted gene profiling data. We design two case studies with paired scRNA-seq reference
data and “pseudo” targeted gene profiling data, whose per-cell sequencing depth is higher than
that of the corresponding scRNA-seq data.

In the first case study, we use the Zheng8 dataset (measured by the 10x protocol) as the refer-
ence dataset. To generate the pseudo targeted gene profiling data, we use a new single-cell gene
expression simulator that captures gene correlations, scDesign2 [39], to generate data with a 100-
time higher per-cell sequencing depth. In the second case study, we use the PBMC10x dataset
(measured by 10x protocol) as the reference dataset, and we use PBMCSmartseq (measured by
Smart-Seq2) as the pseudo targeted gene profiling data because Smart-Seq2 has a higher per-
gene sequencing depth than 10x does. In both case studies, for each gene selection method,
the corresponding pseudo targeted gene profiling datasets only contain the M informative genes
selected by the method.

We benchmark scPNMF against the 11 gene selection methods in terms of cell type prediction
on the pseudo targeted gene profiling data. To avoid the bias for a specific classification algorithm,
we apply three popular algorithms for cell type prediction: random forest (RF) [40], k-nearest
neighbors (KNN) [41], and support vector machine (SVM) [41]. In each case study, we first train
each classification algorithm on the low-dimensional embeddings of the reference cells SRef

(M)
given

the M = 100 informative genes selected by each gene selection method. Then we apply the
trained classifier to the low-dimensional embeddings of the cells in the pseudo targeted gene
profiling data SNew

(M)
. Table 2 shows that scPNMF leads to the highest average prediction accuracy

(0.81) across six combinations (two case studies × three classification algorithms). Moreover,
scPNMF achieves the highest accuracy in each combination except Zheng8 + random forest where
it is the second best. These results confirm that scPNMF effectively guides the selection of genes
to measure in targeted gene profiling experiments, and it enables accurate cell type annotation on
newly generated targeted gene profiling datasets.

4 Discussion

We propose scPNMF, an unsupervised gene selection and data projection method for scRNA-seq
data. The major goal of scPNMF is to select a fixed number of informative genes to distinguish
cell types and guide gene selection for targeted gene profiling experiments. Moreover, scPNMF
can project a new targeted gene profiling dataset with the selected genes to the low-dimensional
space that embeds a reference scRNA-seq dataset. We perform a comprehensive benchmark to
evaluate scPNMF in terms of informative gene selection against the state-of-the-art gene selection

14

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Table 2: Prediction accuracy of cell types based on 100 informative genes selected by 12 gene selection methods in
the two case studies with paired reference scRNA-seq data and targeted gene profiling data

Method
Zheng8 PBMC Average

RF KNN SVM RF KNN SVM Accuracy

scPNMF 0.85
(0.83,0.87)

0.80
(0.78,0.83)

0.87
(0.85,0.89)

0.84
(0.79,0.88)

0.84
(0.79,0.88)

0.67
(0.61,0.73)

0.81

M3Drop 0.85
(0.83,0.87)

0.80
(0.77,0.83)

0.87
(0.84,0.89)

0.84
(0.79,0.88)

0.77
(0.71,0.82)

0.63
(0.57,0.69)

0.79

SeuratDISP 0.84
(0.81,0.86)

0.78
(0.75,0.81)

0.86
(0.84,0.88)

0.80
(0.75,0.84)

0.75
(0.70,0.80)

0.64
(0.58,0.70)

0.78

corFS 0.80
(0.77,0.82)

0.75
(0.73,0.78)

0.82
(0.80,0.85)

0.82
(0.77,0.86)

0.81
(0.76,0.86)

0.62
(0.56,0.68)

0.77

GiniClust 0.86
(0.83,0.88)

0.79
(0.76,0.81)

0.86
(0.83,0.88)

0.80
(0.75,0.84)

0.76
(0.71,0.81)

0.53
(0.47,0.60)

0.75

scran 0.79
(0.76,0.81)

0.72
(0.69,0.75)

0.82
(0.80,0.85)

0.78
(0.72,0.82)

0.73
(0.67,0.78)

0.67
(0.61,0.72)

0.75

SeuratMVP 0.83
(0.81,0.85)

0.77
(0.74,0.80)

0.85
(0.82,0.87)

0.82
(0.77,0.86)

0.74
(0.69,0.79)

0.47
(0.40,0.53)

0.74

Scanpy 0.79
(0.77,0.82)

0.71
(0.68,0.74)

0.80
(0.78,0.83)

0.80
(0.75,0.84)

0.76
(0.71,0.81)

0.52
(0.46,0.58)

0.73

SCMarker 0.77
(0.74,0.79)

0.68
(0.65,0.71)

0.74
(0.71,0.77)

0.77
(0.71,0.81)

0.71
(0.65,0.76)

0.45
(0.39,0.52)

0.69

SeuratVST 0.73
(0.70,0.76)

0.68
(0.65,0.71)

0.75
(0.73,0.78)

0.74
(0.68,0.79)

0.68
(0.63,0.74)

0.40
(0.34,0.46)

0.67

DANB 0.71
(0.68,0.73)

0.69
(0.66,0.71)

0.75
(0.73,0.78)

0.73
(0.67,0.78)

0.74
(0.68,0.79)

0.28
(0.23,0.34)

0.65

irlbaPcaFS 0.68
(0.65,0.71)

0.61
(0.58,0.64)

0.71
(0.68,0.74)

0.71
(0.65,0.76)

0.77
(0.71,0.82)

0.16
(0.12,0.21)

0.61

Parentheses are 95% confidence intervals. Highest number within each column is labeled by underline.

methods. Our results show that scPNMF consistently outperforms existing methods for a wide
range of informative gene numbers (from 20 to 500) on diverse scRNA-seq datasets. We also
demonstrate that the informative genes selected by scPNMF can effectively guide gene selection
for targeted gene profiling and lead to accurate cell type annotation on targeted gene profiling data
based on reference scRNA-seq data.

Besides gene selection and data projection, scPNMF also works as a dimensionality reduction
method with good interpretability. Each dimension in the low-dimensional space found by scPNMF
can be considered as a new functional “feature” (as a linear combination of correlated and thus
functionally related genes). Moreover, the mutual exclusiveness makes the PNMF bases used
in scPNMF advantageous over the PCA bases in terms of removing confounding effects. For
example, cell-cycle genes obscure the identification of cell types and should be removed from
low-dimensional embeddings of cells. For PCA, cell-cycle genes affect many PCA bases, so the
popular scRNA-seq pipeline Seurat implements a complicated approach that first calculates “cell-
cycle scores” and then regresses each basis (principal component) on these scores to remove
the effects of cell-cycle genes [12]. In contrast, cell-cycle genes are concentrated in only one
PNMF basis, so it is easy to remove that basis to clear the effects of cell-cycle genes. Therefore,
scPNMF has great potentials in deciphering cell heterogeneity in single-cell data by working as an
interpretable dimensionality reduction method.

The current implementation of scPNMF focuses on single-cell gene expression data. Consid-

15

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


ering the rapid development of single-cell multi-omics technologies, we plan to extend scPNMF
to accommodate other technologies that measure other genomics features such chromatin ac-
cessibility landscapes measured by single-cell ATAC-seq [42], or even to integrate data across
multi-omics datasets. Another note is that the multimodality test for basis selection in scPNMF
only accounts for discrete cell types but not continuous cell trajectories. Therefore, other tests or
strategies are needed to select informative bases to capture biological variations along continuous
cell trajectories.

An important question for gene selection is: how many genes should be selected as informative
genes to fully capture the biological variations of interest? In our studies, we observe that, after the
informative gene number reaches 200, the clustering accuracies based on the selected informative
genes plateau for most gene selection methods including scPNMF. Therefore, 200 genes may be
sufficient for capturing biological variations in scRNA-seq data. However, it remains challenging
to decide the minimum number of informative genes, given that the underlying cell sub-population
structure is data-specific and might be complex. We plan to explore this problem in future with the
possible use of information theory.

Software and code

The R package scPNMF is available at https://github.com/JSB-UCLA/scPNMF.

Acknowledgements

We acknowledge the comments and feedback from the members of the Junction of Statistics and
Biology at UCLA (http://jsb.ucla.edu).

Funding

This work was supported by the following grants: NSF DMS-1613338 and DBI-1846216, NIH/NIGMS
R01GM120507, PhRMA Foundation Research Starter Grant in Informatics, Johnson and Johnson
WiSTEM2D Award, and Sloan Research Fellowship (to J.J.L.); NIH/NINDS R01NS117148 (to
R.W).

Competing interests

None.

16

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://github.com/JSB-UCLA/scPNMF
http://jsb.ucla.edu
https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


References

[1] S Steven Potter. Single-cell rna sequencing for the study of development, physiology and
disease. Nature Reviews Nephrology, 14(8):479–492, 2018.

[2] Kenneth D Birnbaum. Power in numbers: single-cell rna-seq strategies to dissect complex
tissues. Annual review of genetics, 52:203–221, 2018.

[3] Chenxu Zhu, Sebastian Preissl, and Bing Ren. Single-cell multimodal omics: the power of
many. Nature methods, 17(1):11–14, 2020.

[4] Olivier Thellin, Willy Zorzi, Bernard Lakaye, B De Borman, Bernard Coumans, Georges
Hennen, Thierry Grisar, Ahmed Igout, and Ernst Heinen. Housekeeping genes as internal
standards: use and limits. Journal of biotechnology, 75(2-3):291–295, 1999.

[5] Eli Eisenberg and Erez Y Levanon. Human housekeeping genes, revisited. TRENDS in
Genetics, 29(10):569–574, 2013.

[6] Aravind Subramanian, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E Natoli,
Xiaodong Lu, Joshua Gould, John F Davis, Andrew A Tubelli, Jacob K Asiedu, et al. A
next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171
(6):1437–1452, 2017.

[7] Arjun Raj, Patrick Van Den Bogaard, Scott A Rifkin, Alexander Van Oudenaarden, and Sanjay
Tyagi. Imaging individual mrna molecules using multiple singly labeled probes. Nature
methods, 5(10):877–879, 2008.

[8] Jeffrey R Moffitt, Junjie Hao, Guiping Wang, Kok Hao Chen, Hazen P Babcock, and Xiaowei
Zhuang. High-throughput single-cell gene-expression profiling with multiplexed error-robust
fluorescence in situ hybridization. Proceedings of the National Academy of Sciences, 113
(39):11046–11051, 2016.

[9] Fatma Uzbas, Florian Opperer, Can Sönmezer, Dmitry Shaposhnikov, Steffen Sass, Christian
Krendl, Philipp Angerer, Fabian J Theis, Nikola S Mueller, and Micha Drukker. Bart-seq:
cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and
single-cell analysis. Genome biology, 20(1):1–16, 2019.

[10] Jamie L Marshall, Benjamin R Doughty, Vidya Subramanian, Philine Guckelberger, Qingbo
Wang, Linlin M Chen, Samuel G Rodriques, Kaite Zhang, Charles P Fulco, Joseph Nasser,
et al. Hypr-seq: Single-cell quantification of chosen rnas via hybridization and sequencing
of dna probes. Proceedings of the National Academy of Sciences, 117(52):33404–33413,
2020.

17

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[11] Christoph Hafemeister and Rahul Satija. Normalization and variance stabilization of single-
cell rna-seq data using regularized negative binomial regression. Genome biology, 20(1):
1–15, 2019.

[12] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William
M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive
integration of single-cell data. Cell, 177:1888–1902, 2019. doi: 10.1016/j.cell.2019.05.031.
URL https://doi.org/10.1016/j.cell.2019.05.031.

[13] Aaron TL Lun, Karsten Bach, and John C Marioni. Pooling across cells to normalize single-
cell rna sequencing data with many zero counts. Genome biology, 17(1):75, 2016.

[14] Tallulah S Andrews and Martin Hemberg. M3drop: dropout-based feature selection for
scrnaseq. Bioinformatics, 35(16):2865–2867, 2019.

[15] Lan Jiang, Huidong Chen, Luca Pinello, and Guo-Cheng Yuan. Giniclust: detecting rare
cell types from single-cell gene expression data with gini index. Genome biology, 17(1):144,
2016.

[16] Fang Wang, Shaoheng Liang, Tapsi Kumar, Nicholas Navin, and Ken Chen. Scmarker: ab
initio marker selection for single cell transcriptome profiling. PLoS computational biology, 15
(10):e1007445, 2019.

[17] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa
Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly
parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell,
161(5):1202–1214, 2015.

[18] Maayan Baron, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo
Vetere, Jennifer Hyoje Ryu, Bridget K Wagner, Shai S Shen-Orr, Allon M Klein, et al. A
single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell
population structure. Cell systems, 3(4):346–360, 2016.

[19] Xun Zhu, Travers Ching, Xinghua Pan, Sherman M Weissman, and Lana Garmire. Detecting
heterogeneity in single-cell rna-seq data by non-negative matrix factorization. PeerJ, 5:e2888,
2017.

[20] Philippe Boileau, Nima S Hejazi, and Sandrine Dudoit. Exploring high-dimensional biological
data with sparse contrastive principal component analysis. Bioinformatics, 36(11):3422–
3430, 2020.

[21] Zhana Duren, Xi Chen, Mahdi Zamanighomi, Wanwen Zeng, Ansuman T Satpathy, Howard Y
Chang, Yong Wang, and Wing Hung Wong. Integrative analysis of single-cell genomics
data by coupled nonnegative matrix factorizations. Proceedings of the National Academy
of Sciences, 115(30):7723–7728, 2018.

18

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1016/j.cell.2019.05.031
https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[22] Ghislain Durif, Laurent Modolo, Jeff E Mold, Sophie Lambert-Lacroix, and Franck Picard.
Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics,
35(20):4011–4019, 2019.

[23] Shuqin Zhang, Liu Yang, Jinwen Yang, Zhixiang Lin, and Michael K Ng. Dimensionality
reduction for single cell rna sequencing data using constrained robust non-negative matrix
factorization. NAR Genomics and Bioinformatics, 2(3):lqaa064, 2020.

[24] Chao Gao and Joshua D Welch. Iterative refinement of cellular identity from single-cell data
using online learning. In International Conference on Research in Computational Molecular
Biology, pages 248–250. Springer, 2020.

[25] Zi Yang and George Michailidis. A non-negative matrix factorization method for detecting
modules in heterogeneous omics multi-modal data. Bioinformatics, 32(1):1–8, 2016.

[26] Joshua D Welch, Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, and
Evan Z Macosko. Single-cell multi-omic integration compares and contrasts features of brain
cell identity. Cell, 177(7):1873–1887, 2019.

[27] Zhijian Yuan, Zhirong Yang, and Erkki Oja. Projective nonnegative matrix factorization:
Sparseness, orthogonality, and clustering. Neural Process. Lett, pages 11–13, 2009.

[28] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization.
IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

[29] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–791, 1999.

[30] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and
molecular pattern discovery using matrix factorization. Proceedings of the national academy
of sciences, 101(12):4164–4169, 2004.

[31] Jose Ameijeiras-Alonso, Rosa M Crujeiras, and Alberto Rodrı́guez-Casal. Mode testing,
critical bandwidth and excess mass. Test, 28(3):900–919, 2019.

[32] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy
Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and
accurate integration of single-cell data with harmony. Nature methods, 16(12):1289–1296,
2019.

[33] Saskia Freytag, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. Comparison
of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data.
F1000Research, 7, 2018.

19

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[34] Robert D Barber, Dan W Harmer, Robert A Coleman, and Brian J Clark. Gapdh as a
housekeeping gene: analysis of gapdh mrna expression in a panel of 72 human tissues.
Physiological genomics, 21(3):389–395, 2005.

[35] Nicholas Silver, Steve Best, Jie Jiang, and Swee Lay Thein. Selection of housekeeping
genes for gene expression studies in human reticulocytes using real-time pcr. BMC molecular
biology, 7(1):33, 2006.

[36] Collin M Blakely, Thomas BK Watkins, Wei Wu, Beatrice Gini, Jacob J Chabon, Caroline E
McCoach, Nicholas McGranahan, Gareth A Wilson, Nicolai J Birkbak, Victor R Olivas, et al.
Evolution and clinical impact of co-occurring genetic alterations in advanced-stage egfr-
mutant lung cancers. Nature genetics, 49(12):1693–1704, 2017.

[37] Ben O’leary, Richard S Finn, and Nicholas C Turner. Treating cancer with selective cdk4/6
inhibitors. Nature reviews Clinical oncology, 13(7):417–430, 2016.

[38] Carminia Maria Della Corte, Umberto Malapelle, Elena Vigliar, Francesco Pepe, Giancarlo
Troncone, Vincenza Ciaramella, Teresa Troiani, Erika Martinelli, Valentina Belli, Fortunato
Ciardiello, et al. Efficacy of continuous egfr-inhibition and role of hedgehog in egfr acquired
resistance in human lung cancer cells with activating mutation of egfr. Oncotarget, 8(14):
23020, 2017.

[39] Tianyi Sun, Dongyuan Song, Wei Vivian Li, and Jingyi Jessica Li. scdesign2: an interpretable
simulator that generates high-fidelity single-cell gene expression count data with gene
correlations captured. bioRxiv, 2020.

[40] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[41] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal
margin classifiers. In Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, 1992.

[42] Sebastian Pott and Jason D Lieb. Single-cell atac-seq: strength in numbers. Genome Biology,
16(1):1–4, 2015.

[43] Angelo Duò, Mark D Robinson, and Charlotte Soneson. A systematic performance evaluation
of clustering methods for single-cell rna-seq data. F1000Research, 7, 2018.

[44] Jiarui Ding, Xian Adiconis, Sean K Simmons, Monika S Kowalczyk, Cynthia C Hession,
Nemanja D Marjanovic, Travis K Hughes, Marc H Wadsworth, Tyler Burks, Lan T Nguyen,
et al. Systematic comparison of single-cell and single-nucleus rna-sequencing methods.
Nature biotechnology, pages 1–10, 2020.

[45] Jose Alquicira-Hernandez, Anuja Sathe, Hanlee P Ji, Quan Nguyen, and Joseph E Powell.
scpred: accurate supervised method for cell-type classification from single-cell rna-seq data.
Genome biology, 20(1):1–17, 2019.

20

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[46] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M
Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human
brain transcriptome diversity at the single cell level. Proceedings of the National Academy of
Sciences, 112(23):7285–7290, 2015.

[47] Itay Tirosh, Benjamin Izar, Sanjay M Prakadan, Marc H Wadsworth, Daniel Treacy, John J
Trombetta, Asaf Rotem, Christopher Rodman, Christine Lian, George Murphy, et al. Dissect-
ing the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science, 352
(6282):189–196, 2016.

21

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Supplementary Materials

S1 Choice of parameters and robustness analysis

S1.1 Low rank K

In the development of scPNMF, motivated by the objective function of the PNMF method,

min
W∈Rp×K≥0

‖X−WWTX‖ , (S1)

PNMF aims to inherit the advantages such as basis orthogonality and the ability to project the new
data from PCA. However, a key constraint in PCA, WTW = I, is sacrificed in order to meet with
the condition W ≥ 0 in PNMF. To get closer to PCA and thus attain its nice properties, we propose
to use the normalized difference between WTW and I to measure the orthonality of W:

dev.ortho = ‖I−WTW‖/K2, (S2)

which is an implication of the performance in the downstream analysis as well.
It naturally follows a method for determining the number of basis, K: we perform PNMF for

a sequence of K’s, calculate the dev.ortho measure for each W ∈ Rp×K≥0 optimized by PNMF for
each K, and then look at the plot of dev.ortho against K. Users can decide cutoff where it reaches
stability or there is a clear elbow in the graph.

In Fig. S1, with Zheng4 [43] dataset, we demonstrate that (1) the dev.ortho measure is highly
correlated with the performance of W in the downstream analysis; (2) in real data application, the
dev.ortho measure shows a clear elbow pattern, which is helpful for users to determine K.

Empirically, we see that dev.ortho reaches stability at K = 20 for most scRNA-seq data. For
the purpose of providing suggestion for users and saving computational energy, we set the default
number of bases in scPNMF to be K = 20.

S1.2 R0: threshold for correlations between score vectors and cell
library sizes in scPNMF step II: basis selection

In real data application, the threshold for correlations between score vectors and cell library sizes
in scPNMF step II: basis selection, R0, needs to be pre-defined. In the field, researchers often use
thresholds as accurate as with one decimal digit, such as 0.5. By empirically running K-means
clustering on the seven datasets (see Table S3) with different thresholds {0.5, 0.6, 0.7, 0.8, 0.9}, as
shown in Fig. S2, we suggest setting R0 = 0.7 for K ≥ 10, and more conservatively, R0 = 0.8
when the basis number K is small (K < 10).

22

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


S2 Functional annotation

We use the R package clusterProfiler [Y] to perform GO analysis. We set the gene ontology
as “BP”, adjusted p-value cutoff as 0.1. The output GO terms are simplified by clusterProfiler.

In this paper, we only perform a very conservative filtering based on functionality. We define
the common housekeeping gene list as ACTB, ACTG1, B2M, GAPDH, MALAT1. If the top 10 high
weight genes from one basis contain any of these genes, this basis will be filtered out.

S3 Data preprocessing

scPNMF only performs minimum data preprocessing to avoid information loss. Denote a scRNA-
seq count matrix scPNMF further investigates as XC ∈ Np×n, with rows representing p genes
and columns representing n cells. Users make the log count matrix X ∈ Rp×n≥0 by taking the log
transformation with a pseudo count 1:

Xij = log
(
XCij + 1

)
, i = 1, · · · , p, j = 1, · · · , n. (S1)

scPNMF takes the log count matrix X ∈ Rp×n≥0 as the input. With log transformation, the effect of
a few extremely large counts will be alleviated, and the transformed continuous values are more
flexible to model. We introduce the pseudo count 1 to avoid negative and infinite values in the later
PNMF optimization step.

For scRNA-seq data used in this paper (Table S1), we filtered out genes that are expressed
in fewer than 5% of the cells, and then filtered out cells that are expressed in fewer than 5% of
the remaining genes. Additionally, MALAT1, mitochondrial and ribosomal genes are filtered for
datasets PBMC10x and PBMCSmartSeq according to the reference paper [44]. Users are able to
adjust the filtering process before they input the log count matrix into scPNMF.

S4 Details in informative gene selection and clustering

In this paper, we compare scPNMF with other 11 different informative gene selection methods
(Table S2). Some gene selection methods cannot let users pre-define an arbitrary gene number;
for those methods (e.g., SCMarker [16]), we shift the tuning parameters until their output gene
numbers equals the desired gene number. Therefore, their outputs might not achieve their the
optimal results.

We apply three clustering algorithm, Louvain clustering (by Seurat), K-means clustering (by R
function kmeans), hierarchical clustering (by R function hclust). We perform PCA on informative
genes and use the top 20 PCs for clustering. The Adjusted Rank Index (ARI) is as the metric of

23

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


clustering accuracy. ARI is defined as:

ARI (P, T) =

∑
l,s

(
nls

2

)
−

[∑
l

(
al

2

)∑
s

(
bs

2

)]
/

(
n

2

)
1
2

[∑
l

(
al

2

)
+
∑

s

(
bs

2

)]
−

[∑
l

(
al

2

)∑
s

(
bs

2

)]
/

(
n

2

) , (S1)

where P = (p1, · · · , pl) denotes the inferred cluster labels, and T = (t1, · · · , ts) denotes the true
cluster labels. l and s are not necessarily to be equal. nls =

∑
ij I(pi = l)I(tj = s), al =

∑
s nls,

bs =
∑

l nls. ARI ∈ [0, 1], an ARI value close to 1 means more accurate inferred clusters. To
minimize the effects caused by parameters (resolution r in Louvain and number of cluster k in
K-means and hierarchical clustering), we try a sequence of parameters:

r ∈{0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} , k ∈{2, 3, 4, · · · , 15} , (S2)

and use the average of top three high ARI across different parameters as the final output.

S5 Details in new data projection and cell type predic-
tion

We use two datasets, Zheng8 and PBMC10x, as the reference scRNA-seq datasets. For Zheng8
dataset, we first use scDesign2 [39] to learn the underlying parameters, and then simulate a new
dataset with same genes and cell types but 100 times higher sequencing depth compared to the
Zheng8 dataset. For PBMC10x dataset, we use the PBMCSmartSeq dataset, which measures the
exact same example and contains all genes measured in PBMC10x. Given M selected genes,
the simulated Zheng8 and PBMC10x are extracted with those certain genes, and play role as the
“pseudo” targeted gene profiling only measuring M genes.

For cell type prediction, we project every targeted gene profiling dataset and its scRNA-seq
reference on the same low-dimensional space, which mainly follows the idea from scPred [45].
When applying scPNMF, we use the weight matrix WS,(M) to project both the reference dataset
and the targeted gene profiling dataset. For other gene selection methods, we first subset the
reference dataset with only M selected genes, run PCA to get a weight matrix WPCA, and use it to
project both the reference dataset (with only M genes) and targeted gene profiling dataset. After
getting two low-dimensional embeddings of reference and targeted gene profiling data, we run
the Harmony algorithm [32] to remove the technical variations between two low-dimensional em-
beddings. Then we apply three classification algorithms, random forest (rf), k-nearest neighbors
(knn) and support vector machine with radial kernel (svmRadial) in R package caret [K]. When
fitting the training model, we use 5-fold cross-validation with three repeats.

24

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Table S1: Top 10 high weight genes in each PNMF basis of FretagGold dataset

Basis Gene symbol Description

1 RPS2, TMSB4X, GAPDH, RPL41, RPL13, FTH1, MALAT1,
COX2, RPL10, RPS18

Highly expressed housekeeping genes

2 CD74, PTGR1, HLA-B, ALDH3A1, C15orf48, LCN2,
IGFBP3, SAA1, CXCL1, HLA-DRA

Immune-related genes

3 SEC61G, CDK4, CCN1, G0S2, ELOC, VOPP1, EGFR, F3,
CDKN2A, EPCAM

Tumor-related genes (oncogenes, tumor suppressor
genes)

4 H4C3, CKS1B, HMGB2, SMC4, PTTG1, KPNA2, CCNB1,
CDKN3, CKS2, CDC20

Genes related to mitotic cell cycle

5 HSPB1, UBE2S, CALD1, TMEM256, FIS1, ISOC2, ZN-
HIT1, C20orf27, NDUFA3, PPP2R1A

Genes related to mitochondrion

25

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Table S2: Overview of informative gene selection used in this study

Method User-defined gene # Language Package Reference

corFS Yes R M3Drop (version 1.14.0) [14]
DANB Yes R M3Drop (version 1.14.0) [14]
GiniClust Yes R M3Drop (version 1.14.0) [14]
irlbaPcaFS Yes R M3Drop (version 1.14.0) [14]
M3Drop Yes R M3Drop (version 1.14.0) [14, 15]
Scanpy Yes Python Scanpy (version 1.6.0) [W]
SCMarker No R SCMarker1 [16]
scran Yes R scran (version 1.18.3) [13]
SeuratDISP Yes R Seurat (version 3.2.2) [11, 12]
SeuratMVP No R Seurat (version 3.2.2) [12]
SeuratVST Yes R Seurat (version 3.2.2) [12]

1: Due to failure in SCMarker R package installation, we run the R script downloaded from https://github.com/KChen-lab/SCMarker
on September 17, 2020.

26

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://github.com/KChen-lab/SCMarker
https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Table S3: Overview of datasets used in this study

Dataset Sequencing proto-
col

Gene # Cell # Cell type # True label Description Ref

Darmanis Smart-Seq2 13256 420 8 No Human adult corti-
cal samples

[46]

FreytagGold 10xGenomics
Chromium

15410 925 3 Yes Mixture of
human lung
adenocarcinoma
cell lines

[33]

Tirosh Smart-Seq2 11934 2887 6 No Human melanoma
tumors

[47]

PBMC10x 10xGenomics
Chromium

11714 3308 9 No Human
peripheral blood
mononuclear
cells. 10x-v2 for
sample 1 in the
original paper.

[44]

PBMCSmartSeq Smart-Seq2 17479 273 6 No Human
peripheral blood
mononuclear
cells. Smart-Seq2
for sample 1 in the
original paper.

[44]

Zheng4 10xGenomics
GemCode

2192 3994 4 Yes Mixture of human
peripheral blood
mononuclear cells

[43, Z]

Zheng8 10xGenomics
GemCode

2390 3994 8 Yes Mixture of human
peripheral blood
mononuclear cells

[43, Z]

27

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


0.000

0.005

0.010

0.015

0.020

0.00

0.25

0.50

0.75

0 10 20 30 40 50 60 70 80 90 100
K

d
e
v.

o
rt

h
o

A
R

I

Choice of K

Figure S1: Comparison of dev.ortho and K-means ARI against low rank K on Zheng4 [43] dataset.

0.5

0.6

0.7

0.8

0.9

1.0

0.5 0.6 0.7 0.8 0.9
R0

A
R

I

Choice of R0

Figure S2: Comparison of K-means ARI against R0, the threshold for correlations between score vectors and cell
library sizes in scPNMF step II: basis selection. The mean ARI and the error bars are calculated across seven datasets
(See Table S3).

28

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure S3: GO annotation on weight matrix of PCA. The enriched GO terms between basis are largely overlapped.

29

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure S4: scPNMF scores versus total log-counts of FregGold dataset colored by cell types. Basis 2 distinguishes
H2228 from the other two cell types and basis 3 distinguishes HCC827 from the other two cell types.

30

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure S5: Benchmarking scPNMF and other informative gene selction methods using 20, 50, 200, 500 genes.

31

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


Figure S6: Comparison of overall average ARI of different methods versus gene numbers. The y-axis indicates the
average ARI values across seven datasets and three clustering methods for each gene selection methods.

32

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


References

[1] S Steven Potter. Single-cell rna sequencing for the study of development, physiology and
disease. Nature Reviews Nephrology, 14(8):479–492, 2018.

[2] Kenneth D Birnbaum. Power in numbers: single-cell rna-seq strategies to dissect complex
tissues. Annual review of genetics, 52:203–221, 2018.

[3] Chenxu Zhu, Sebastian Preissl, and Bing Ren. Single-cell multimodal omics: the power of
many. Nature methods, 17(1):11–14, 2020.

[4] Olivier Thellin, Willy Zorzi, Bernard Lakaye, B De Borman, Bernard Coumans, Georges
Hennen, Thierry Grisar, Ahmed Igout, and Ernst Heinen. Housekeeping genes as internal
standards: use and limits. Journal of biotechnology, 75(2-3):291–295, 1999.

[5] Eli Eisenberg and Erez Y Levanon. Human housekeeping genes, revisited. TRENDS in
Genetics, 29(10):569–574, 2013.

[6] Aravind Subramanian, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E Natoli,
Xiaodong Lu, Joshua Gould, John F Davis, Andrew A Tubelli, Jacob K Asiedu, et al. A
next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171
(6):1437–1452, 2017.

[7] Arjun Raj, Patrick Van Den Bogaard, Scott A Rifkin, Alexander Van Oudenaarden, and Sanjay
Tyagi. Imaging individual mrna molecules using multiple singly labeled probes. Nature
methods, 5(10):877–879, 2008.

[8] Jeffrey R Moffitt, Junjie Hao, Guiping Wang, Kok Hao Chen, Hazen P Babcock, and Xiaowei
Zhuang. High-throughput single-cell gene-expression profiling with multiplexed error-robust
fluorescence in situ hybridization. Proceedings of the National Academy of Sciences, 113
(39):11046–11051, 2016.

[9] Fatma Uzbas, Florian Opperer, Can Sönmezer, Dmitry Shaposhnikov, Steffen Sass, Christian
Krendl, Philipp Angerer, Fabian J Theis, Nikola S Mueller, and Micha Drukker. Bart-seq:
cost-effective massively parallelized targeted sequencing for genomics, transcriptomics, and
single-cell analysis. Genome biology, 20(1):1–16, 2019.

[10] Jamie L Marshall, Benjamin R Doughty, Vidya Subramanian, Philine Guckelberger, Qingbo
Wang, Linlin M Chen, Samuel G Rodriques, Kaite Zhang, Charles P Fulco, Joseph Nasser,
et al. Hypr-seq: Single-cell quantification of chosen rnas via hybridization and sequencing
of dna probes. Proceedings of the National Academy of Sciences, 117(52):33404–33413,
2020.

33

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[11] Christoph Hafemeister and Rahul Satija. Normalization and variance stabilization of single-
cell rna-seq data using regularized negative binomial regression. Genome biology, 20(1):
1–15, 2019.

[12] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William
M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive
integration of single-cell data. Cell, 177:1888–1902, 2019. doi: 10.1016/j.cell.2019.05.031.
URL https://doi.org/10.1016/j.cell.2019.05.031.

[13] Aaron TL Lun, Karsten Bach, and John C Marioni. Pooling across cells to normalize single-
cell rna sequencing data with many zero counts. Genome biology, 17(1):75, 2016.

[14] Tallulah S Andrews and Martin Hemberg. M3drop: dropout-based feature selection for
scrnaseq. Bioinformatics, 35(16):2865–2867, 2019.

[15] Lan Jiang, Huidong Chen, Luca Pinello, and Guo-Cheng Yuan. Giniclust: detecting rare
cell types from single-cell gene expression data with gini index. Genome biology, 17(1):144,
2016.

[16] Fang Wang, Shaoheng Liang, Tapsi Kumar, Nicholas Navin, and Ken Chen. Scmarker: ab
initio marker selection for single cell transcriptome profiling. PLoS computational biology, 15
(10):e1007445, 2019.

[17] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa
Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly
parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell,
161(5):1202–1214, 2015.

[18] Maayan Baron, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo
Vetere, Jennifer Hyoje Ryu, Bridget K Wagner, Shai S Shen-Orr, Allon M Klein, et al. A
single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell
population structure. Cell systems, 3(4):346–360, 2016.

[19] Xun Zhu, Travers Ching, Xinghua Pan, Sherman M Weissman, and Lana Garmire. Detecting
heterogeneity in single-cell rna-seq data by non-negative matrix factorization. PeerJ, 5:e2888,
2017.

[20] Philippe Boileau, Nima S Hejazi, and Sandrine Dudoit. Exploring high-dimensional biological
data with sparse contrastive principal component analysis. Bioinformatics, 36(11):3422–
3430, 2020.

[21] Zhana Duren, Xi Chen, Mahdi Zamanighomi, Wanwen Zeng, Ansuman T Satpathy, Howard Y
Chang, Yong Wang, and Wing Hung Wong. Integrative analysis of single-cell genomics
data by coupled nonnegative matrix factorizations. Proceedings of the National Academy
of Sciences, 115(30):7723–7728, 2018.

34

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1016/j.cell.2019.05.031
https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[22] Ghislain Durif, Laurent Modolo, Jeff E Mold, Sophie Lambert-Lacroix, and Franck Picard.
Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics,
35(20):4011–4019, 2019.

[23] Shuqin Zhang, Liu Yang, Jinwen Yang, Zhixiang Lin, and Michael K Ng. Dimensionality
reduction for single cell rna sequencing data using constrained robust non-negative matrix
factorization. NAR Genomics and Bioinformatics, 2(3):lqaa064, 2020.

[24] Chao Gao and Joshua D Welch. Iterative refinement of cellular identity from single-cell data
using online learning. In International Conference on Research in Computational Molecular
Biology, pages 248–250. Springer, 2020.

[25] Zi Yang and George Michailidis. A non-negative matrix factorization method for detecting
modules in heterogeneous omics multi-modal data. Bioinformatics, 32(1):1–8, 2016.

[26] Joshua D Welch, Velina Kozareva, Ashley Ferreira, Charles Vanderburg, Carly Martin, and
Evan Z Macosko. Single-cell multi-omic integration compares and contrasts features of brain
cell identity. Cell, 177(7):1873–1887, 2019.

[27] Zhijian Yuan, Zhirong Yang, and Erkki Oja. Projective nonnegative matrix factorization:
Sparseness, orthogonality, and clustering. Neural Process. Lett, pages 11–13, 2009.

[28] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization.
IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

[29] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–791, 1999.

[30] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and
molecular pattern discovery using matrix factorization. Proceedings of the national academy
of sciences, 101(12):4164–4169, 2004.

[31] Jose Ameijeiras-Alonso, Rosa M Crujeiras, and Alberto Rodrı́guez-Casal. Mode testing,
critical bandwidth and excess mass. Test, 28(3):900–919, 2019.

[32] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy
Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and
accurate integration of single-cell data with harmony. Nature methods, 16(12):1289–1296,
2019.

[33] Saskia Freytag, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. Comparison
of clustering tools in r for medium-sized 10x genomics single-cell rna-sequencing data.
F1000Research, 7, 2018.

35

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[34] Robert D Barber, Dan W Harmer, Robert A Coleman, and Brian J Clark. Gapdh as a
housekeeping gene: analysis of gapdh mrna expression in a panel of 72 human tissues.
Physiological genomics, 21(3):389–395, 2005.

[35] Nicholas Silver, Steve Best, Jie Jiang, and Swee Lay Thein. Selection of housekeeping
genes for gene expression studies in human reticulocytes using real-time pcr. BMC molecular
biology, 7(1):33, 2006.

[36] Collin M Blakely, Thomas BK Watkins, Wei Wu, Beatrice Gini, Jacob J Chabon, Caroline E
McCoach, Nicholas McGranahan, Gareth A Wilson, Nicolai J Birkbak, Victor R Olivas, et al.
Evolution and clinical impact of co-occurring genetic alterations in advanced-stage egfr-
mutant lung cancers. Nature genetics, 49(12):1693–1704, 2017.

[37] Ben O’leary, Richard S Finn, and Nicholas C Turner. Treating cancer with selective cdk4/6
inhibitors. Nature reviews Clinical oncology, 13(7):417–430, 2016.

[38] Carminia Maria Della Corte, Umberto Malapelle, Elena Vigliar, Francesco Pepe, Giancarlo
Troncone, Vincenza Ciaramella, Teresa Troiani, Erika Martinelli, Valentina Belli, Fortunato
Ciardiello, et al. Efficacy of continuous egfr-inhibition and role of hedgehog in egfr acquired
resistance in human lung cancer cells with activating mutation of egfr. Oncotarget, 8(14):
23020, 2017.

[39] Tianyi Sun, Dongyuan Song, Wei Vivian Li, and Jingyi Jessica Li. scdesign2: an interpretable
simulator that generates high-fidelity single-cell gene expression count data with gene
correlations captured. bioRxiv, 2020.

[40] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[41] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal
margin classifiers. In Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, 1992.

[42] Sebastian Pott and Jason D Lieb. Single-cell atac-seq: strength in numbers. Genome Biology,
16(1):1–4, 2015.

[43] Angelo Duò, Mark D Robinson, and Charlotte Soneson. A systematic performance evaluation
of clustering methods for single-cell rna-seq data. F1000Research, 7, 2018.

[44] Jiarui Ding, Xian Adiconis, Sean K Simmons, Monika S Kowalczyk, Cynthia C Hession,
Nemanja D Marjanovic, Travis K Hughes, Marc H Wadsworth, Tyler Burks, Lan T Nguyen,
et al. Systematic comparison of single-cell and single-nucleus rna-sequencing methods.
Nature biotechnology, pages 1–10, 2020.

[45] Jose Alquicira-Hernandez, Anuja Sathe, Hanlee P Ji, Quan Nguyen, and Joseph E Powell.
scpred: accurate supervised method for cell-type classification from single-cell rna-seq data.
Genome biology, 20(1):1–17, 2019.

36

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/


[46] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M
Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human
brain transcriptome diversity at the single cell level. Proceedings of the National Academy of
Sciences, 112(23):7285–7290, 2015.

[47] Itay Tirosh, Benjamin Izar, Sanjay M Prakadan, Marc H Wadsworth, Daniel Treacy, John J
Trombetta, Asaf Rotem, Christopher Rodman, Christine Lian, George Murphy, et al. Dissect-
ing the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science, 352
(6282):189–196, 2016.

37

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430550doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.09.430550
http://creativecommons.org/licenses/by-nc-nd/4.0/

	Introduction
	Methods
	scPNMF step I: PNMF
	scPNMF step II: basis selection
	Strategy 1: examine bases by functional annotations (optional)
	Data-driven strategies

	Applications of scPNMF output: informative gene selection and new data projection
	M-truncation and informative gene selection
	New data projection


	Results
	scPNMF outputs a sparse and functionally interpretable representation of scRNA-seq data
	Basis selection is an essential step in scPNMF
	scPNMF outperforms state-of-the-art gene-selection methods on diverse scRNA-seq datasets
	scPNMF guides targeted gene profiling experimental design and cell-type prediction


	Discussion
	Choice of parameters and robustness analysis
	Low rank K
	R0: threshold for correlations between score vectors and cell library sizes in scPNMF step II: basis selection

	Functional annotation
	Data preprocessing
	Details in informative gene selection and clustering
	Details in new data projection and cell type prediction