The Landscape of Precision Cancer Combination Therapy: A Single-Cell Perspective


1 
 

The Landscape of Precision Cancer Combination Therapy: A Single-Cell 

Perspective  

Saba Ahmadi1,7^, Pattara Sukprasert2,8^, Rahulsimham Vegesna3, Sanju Sinha3, Fiorella 

Schischlik3, Natalie Artzi4,5,6, Samir Khuller2,8, Alejandro A. Schäffer3*, Eytan Ruppin3* 

1 Dept. of Computer Science, University of Maryland, College Park MD 20742 USA 

2 Dept. of Computer Science, Northwestern University, Evanston IL 60208 USA 

3 Cancer Data Science Laboratory, National Cancer Institute, Bethesda, MD 20892 USA  

4 Dept. of Medicine, Engineering in Medicine Division, Brigham and Women’s Hospital, 

Harvard Medical School, Boston, MA 02139 USA 

5 Broad Institute of Harvard and MIT, Cambridge, MA 02139 USA 

6 Institute for Medical Engineering and Science, MIT, Cambridge, MA 02139 USA 

7 Part of this research done while at Dept. of Computer Science, Northwestern University, 

Evanston IL 60208 USA 

8 Part of this research was done while at Dept. Computer Science, University of Maryland, 

College Park MD 20742 USA 

^ Equally contributing first authors 

* Equally contributing corresponding authors 

Correspondence should be addressed to alejandro.schaffer@nih.gov and eytan.ruppin@nih.gov. 

Physical address: Cancer Data Science Laboratory, National Cancer Institute, Bldg. 15-C1, 

Bethesda, MD 20892 USA  

 
Keywords: targeted cancer therapy, combination therapy, personalized medicine, combinatorial 

optimization, hitting set, single-cell transcriptomics 

  
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

mailto:alejandro.schaffer@nih.gov
mailto:eytan.ruppin@nih.gov
https://doi.org/10.1101/2020.01.28.923532


2 
 

Abbreviations: 

CTS: cohort target set, synonym of global hitting set 

GEO: Gene Expression Omnibus 

GHS: global hitting set, synonym of cohort target set 

GTEx: Genotype-Tissue Expression (project or consortium) 

HPA: Human Protein Atlas 

HUGO: Human Genome Organization 

IHS: individual hitting set, synonym of individual target set 

ILP: integer linear programming 

ITS: individual target set 

lb: lower bound on fraction of tumor cells killed 

RME: receptor-mediated endocytosis 

TPM: transcripts per million 

ub: upper bound on fraction of non-tumor cells killed 

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


3 
 

Abstract 

The availability of single-cell transcriptomics data opens new opportunities for rational design of 

combination cancer treatments in a systematic manner. Mining such data, we employed 

combinatorial optimization techniques to explore the landscape of optimal combination therapies 

in solid tumors, including brain, head and neck, melanoma, lung, breast and colon cancers. We 

assume that each individual therapy can target any one of 1269 genes encoding cell surface 

receptors, which may be targets of CAR-T, conjugated antibodies or coated nanoparticle therapies. 

In most cancer types, personalized combinations composed of at most four targets are sufficient to 

kill at least 80% of the tumor cells while killing at most 10% of the non-tumor cells in each patient. 

The number of distinct targets needed to do that for all patients in 8 of the 9 cohorts we studied is 

at most 11, while one larger melanoma cohort requires over 30 distinct targets. Further requiring 

that the target genes be lowly expressed across many different healthy tissues uncovers 

qualitatively similar trends.  However, as one requires either more stringent killing thresholds or 

more stringent sparing of non-cancerous tissues beyond these baseline values, the number of 

targets needed rises rapidly. Emerging promising targets include the gene PTPRZ1, which is 

frequently found in the optimal combinations for brain and head and neck cancers, and EGFR, a 

recurring target in multiple tumor types. In sum, this is the first systematic single-cell based 

characterization of the landscape of combinatorial receptor-mediated cancer treatments, 

identifying promising targets for future development.  

  
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


4 
 

Introduction 

Personalized oncology offers hope that each patient's cancer can be treated based on its genomic 

characteristics1,2. Several trials have suggested that it is possible to collect genomics data fast 

enough to inform treatment decisions3-5.  Meta-analysis of Phase I clinical trials completed 

during 2011-2013 showed that overall, trials that used molecular biomarker information to 

influence treatment plans gave better results than trials that did not6.  However, most precision 

oncology treatments utilize only one or two medicines, and resistant clones frequently emerge, 

emphasizing the need to deliver personalized medicine as multiple agents combined6-11. 

Important opportunities to combine systems biology and design of nanomaterials have been 

recognized to deliver medicines in combination to overcome drug resistance and combine 

biological effects12. 

 Here, we propose and rigorously study a new conceptual framework for designing future 

precision oncology treatments. It is motivated by the growing recognition that tumors typically 

have considerable intra-tumor heterogeneity (ITH)13,14 and thus need to be targeted with a 

combination of medicines such that as many as possible tumor cells are hit by at least one 

medicine.   Our analysis is based on two recently emerging technologies: (1) the advancement of 

single-cell transcriptomics and proteomics measurements from patients’ tumors, which is 

anticipated to gradually enter into clinical use15, and (2) the introduction of “modular” treatments 

that target specific overexpressed genes/proteins  to recognize cells in a  specific manner and 

then use either the T cell immune response or a lethal toxin to kill the tumor cells preferentially.   

Based on these two foundations, we formulate and systematically answer two basic 

questions. First, how many targeted treatments are needed to selectively kill most tumor cells 

while sparing most of the non-tumor cells in a given patient? And second, given a cohort of 

patients to treat, how many distinct single-target treatments need to be prepared beforehand so 

that there is a combination that kills at least a specified proportion of the tumor cells of each 

patient?  

We focus our analysis on genes encoding protein targets that encode receptors on the cell 

surface, as these may be precisely targeted by any one of  at least six technologies: e.g., by CAR-

T therapy16, immunotoxins ligated to antibodies17-18, immunotoxins ligated to mimicking 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


5 
 

peptides19, conventional chemotherapy ligated to nanoparticles20, degraders associated with 

ubiquitin E3 ligases21 and designed ankyrin repeat proteins (DARPins)22,23. These treatments are 

all “modular”, including one part that specifically targets the tumor cell via one gene/protein and 

another part, the cytotoxic mechanism that kills the cells.  

Two recent genome-wide analyses of modular therapies have focused on CAR-T 

therapy24,25, so we focus first on this technology to put our work in context. In the original 

formulation, CAR-T therapy used one cell surface target that marks the cells of interest, such as 

CD19 as a marker for B cells. To date, CAR-T therapy has been effective in achieving 

remissions for some blood cancers16,26, but less effective for solid tumors. MacKay et al.25 

focused primarily on single targets and looked at combinations of two targets and did all analysis 

in silico. Dannenfelser et al.24 focused on predicting combinations of two and three targets and 

did most of their work in silico, with in vitro validation of two high-scoring predicted 

combinations in renal cancer. Importantly, these studies have analyzed bulk tumor and normal 

expression data to identify likely targets. Here we present the first analysis that aims to identify 

modular targets based on the analysis of tumor single-cell transcriptomics. This enables to study 

the research questions at a higher resolution but presents new analytical challenges that need to 

be addressed.  

Two related difficulties with CAR-T therapy are i) toxicity to non-cancer cells27,28 and ii) 

difficulty in finding single targets that are sufficiently selective25. To address the toxicity 

problem, MacKay et al.25 selected 533 targets that had low expression in most tissues in the 

Genotype-Tissue Expression (GTEx) data; however, their analysis did not require that the targets 

are cell surface proteins. We proceed in a stepwise manner; we start with a formal analysis of a 

space of 1269 candidate cell surface receptors. Then, we proceed to add a low-expression 

requirement like that of MacKay et al.25 and parameterized by a transcripts per million (TPM) 

expression threshold. For completeness, we also tested their set of 533 genes. 

To address the selectivity problem, various groups have engineered composite forms of 

CAR-T treatments that implement Boolean AND, OR, and NOT gates that have been tested for 

combinations of up to three target proteins29-33. Both MacKay et al. and Dannenfelser et al. 

presented in silico methods focusing on AND gates and pairs or trios of targets; Dannenfelser et 

al. analyzed 2538 likely cell surface proteins that are not necessarily receptors. We have chosen 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


6 
 

to focus on the simpler logical OR construction because that can be achieved not only by CAR-T 

technology30,31, but can also be implemented via other modular treatment technologies by 

combining multiple single-target treatments, assuming that the composite treatment kills a cell if 

any one of the single treatments kill the cell. Conceptually, such a logical OR combination 

treatment can still achieve selectivity by choosing targets, each of which is expressed on a much 

higher proportion of cancer cells than non-cancer cells. One of our key contributions is to show 

that by using techniques from combinatorial optimization, one can find such effective 

combinations involving a large number of targets, while previous studies were limited to at most 

three targets.   

Beyond CAR-T, our analysis applies to several additional types of modular treatment 

technologies that rely instead on receptor-mediated endocytosis (RME) delivering a toxin via a 

targeted receptor to enter the cell34,35. Like CAR-T, these RME-based technologies do not 

downregulate the target receptor. For RME technologies and other technologies that work 

intracellularly, we anticipate combining modular treatments from one technology such that all 

treatments use the same toxin or mechanism of cell killing, thereby mitigating the need to test for 

interaction effects between pairs of different treatments.  

To address these research questions, we designed and implemented a computational 

approach named MadHitter (after the Mad Hatter from Alice in Wonderland) to identify optimal 

precision combination treatments that target membrane receptors (Figure 1, A-C). We define 

three key parameters related to the stringency of killing the tumor and protecting the non-tumor 

cells and explore how the optimal treatments vary with those parameters (Figure 1B, C). Solving 

this problem is analogous to solving the classical “hitting set problem” in combinatorial 

algorithms36, which is formally defined in the Methods (see also Supplementary Materials 1). 

Unlike the previous studies on CAR-T targets, we define the problem in a personalized manner, 

intending that each patient will get optimal treatments for her or his tumor from among a 

collection of available treatments. 

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


7 
 

Figure 1. Conceptual schematic example of MadHitter analysis of single-cell data transcriptomics from 

three cancer patients. (A) A cohort of patients (three in this example) arrives for a study in which single-cell 

tumor microenvironment (TME) transcriptomics data are collected from each patient; the data are analyzed with 

MadHitter and each patient receives an optimal personalized combination of targeted therapies from a pre-

specified set (pill bottle).  MadHitter is aimed at optimizing combinations of targeted therapies that are modular, 

that is, having a recognition unit that is gene/protein-specific, and a joint killing subunit (similar for all gene 

targets). Icons of four such modular therapies are shown; for three of these, the target protein must be on the cell 

surface and for two it must be a receptor, so we focus our analyses on cell surface receptors. Three main 

algorithm parameters are denoted near the MadHitter icon in panel A and explained in the later panels. (B) The 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


8 
 

single-cell TME data are represented in two matrices with the genes as rows and cells as columns, partitioned 

into tumor (T) and non-tumor (N) cells. The expression ratio r determines by how much a gene must be 

overexpressed for a cell to be considered as a targeted. A gene is considered ‘overexpressed’ in either a non-

tumor cell or a tumor cell if its expression is at least r times the mean, reference level; e.g, the reference level 

for FLT1 is (7+11+9)/3 = 9 and only cell T3 has FLT1 expression above 9×2 = 18. The matrices on the right 

side show a Boolean representation of which targets kill which cells, based on the expression values presented 

in this toy problem in matrix B and taking r=2. Accordingly, the combination of EGFR and KDR would kill all 

tumor cells and would spare all non-tumor cells. (C) The main algorithm in MadHitter seeks a combination of 

targets that is as small as possible and would kill many tumor cells and few non-tumor cells, in a patient-

specific manner. The 𝑙𝑏 and 𝑢𝑏 parameters are the lower bound on the fraction of tumor cells killed and the 

upper bound on the fraction of non-tumor cells whose killing is tolerated, respectively. Baseline settings used in 

our analyses are 𝑟 = 2, 𝑙𝑏 = 0.8 and 𝑢𝑏 = 0.1, and are varied in some of the analyses.  The right side of the 

panel shows a hypothetical example of the tradeoff between killing tumor cells and sparing non-tumor cells. 

While target set A could kill a larger fraction of tumor cells than target set B, MadHitter would select target set 

B since only it satisfies both our baseline settings and kills at most 0.1 fraction of the non-tumor cells.

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


9 
 

Results 

The Data and the Combinatorial Optimization Framework 

We focused our analysis searching for optimal treatment combinations in nine single-cell 

RNAseq data sets that include tumor cells and non-tumor cells from at least three patients for 

that were publicly available at the onset of our investigation (Methods; Table 1). Those data sets 

include four brain cancer data sets and one each from head and neck, melanoma, lung, breast and 

colon cancers. Most analyses were done for all data sets, but for clarity of exposition, we focused 

in the main text analyses on four data sets from four different cancer types (brain, head and neck, 

melanoma, lung) that are larger than the other five and hence, make the optimization problems 

more challenging.  Results on the other five data sets are provided in Supplementary Materials 

2.  

Analyzing separately each of these data sets, we ask how many targets are needed to kill 

most cells of a given tumor and what is the tradeoff between cancer cells killed and non-cancer 

cells spared?  Figure 2 shows a small schematic example in which there are alternative target 

sets of sizes two and three. One would prefer the target set of size two because the patients 

would need to receive only two distinct treatments rather than three treatments.  

 
Figure 2. A schematic small example of killing a four tumor cells illustrating why choosing a 

minimum-size combination of targets may be non-trivial. The schematic tumor has four cancer cells 

(A, B, C, D in separate columns), which may express any of five cell-surface receptor genes (rows) that 

may be targeted selectively by modular treatments (pills). If one targets {APP, KDR, MET}, all cancer 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


10 
 

cells will be killed (left panel). However, if, instead, one would target {CR2, TEK} then all cancer cells in 

the given example tumor will be killed with just two targets (right panel) instead of three, providing a 

smaller solution.  

To formalize our questions as combinatorial optimization hitting set problems, we define 

the following parameters and baseline values and explore how the optimal answers vary as 

functions of these parameters: We specify a lower bound on the fraction of tumor cells that 

should be killed, 𝑙𝑏, which ranges from 0 to 1. Similarly, we define an upper bound on the 

fraction of non-tumor cells killed, 𝑢𝑏, which also ranges from 0 to 1. Our baseline settings are 

𝑙𝑏 = 0.8 and 𝑢𝑏 = 0.1. To represent the concept that only cells that overexpress the target, we 

introduce an additional parameter 𝑟. The expression ratio 𝑟 defines which cells are killed, as 

follows (Figure 1B):  Denote the mean expression of a gene 𝑔 in non-cancer cells that have non-

zero expression by E(𝑔). A given cell is considered killed if gene 𝑔 is targeted and its expression 

level in that cell is at least 𝑟 ×  𝐸(𝑔). Higher values of 𝑟 thus model more selective killing.  

Having 𝑟 as a modifiable parameter anticipates that in the future one could 

experimentally tune the overexpression level at which cell killing occurs24. In this respect, 

technologies that rely on RME to get a toxin into the cell are particularly tunable because there is 

known to be a non-linear relationship between the number of protein copies on the cell surface 

and the probability that RME occurs successfully37. In these technologies, the toxin  or other 

therapy delivered by the modular treatment enters cells in a gene-specific manner38, while CAR-

T therapy activates T-cell killing against cells in a gene-specific manner24,25. For most of our 

analyses, the expression ratio 𝑟 is varied from 1.5 to 3.0, with a baseline of 2.0, based on 

experiments in the lab of N.A. and related to combinatorial chemistry modeling37; in one 

analysis, we varied r up to 5.0 (Supplementary Table S1).   

Given these definitions, we solve the following combinatorial optimization hitting set 

problem (Methods): Given an input of a single-cell transcriptomics sample of non-tumor and 

tumor cells  for each patient in a cohort of multiple patients, bounds 𝑢𝑏 and 𝑙𝑏, ratio 𝑟, and a set 

of target genes, we seek to find a solution that finds a minimum-size combination of targets in 

each individual patient, while additionally minimizing the size of all targets given to the patients 

cohort. The latter is termed the global minimum-size hitting set (GHS) in computer science 

terminology or the cohort target set (CTS) in terminology specific to our problem, while the 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


11 
 

optimal hitting set of genes targeting one patient is termed the individual target set (ITS). This 

optimum hitting set problem with constraints can be solved to optimality using integer linear 

programming (ILP) (Methods). We solve different optimization problem instances, each of 

which considers a different set of candidate target genes: 1269 genes encoding cell surface 

receptor proteins, and subset of 58 out of these 1269 genes that already have published ligand-

mimicking peptides, and a nested collection of sets of 424-900 out of the 1269 genes that are 

lowly expressed below a series of decreasing gene expression thresholds25. From a 

computational standpoint, there is no inherent limit on the size of the candidate gene set. 

Our formulation is personalized as each patient receives the minimum possible number of 

treatments. The global optimization comes into play only when there are multiple solutions of 

the same size to treat a patient. For example, suppose we have two patients such that patient A 

could be treated by targeting either {EGFR, FGFR2} or {MET, FGFR2} and patient B could be 

treated by targeting either {EGFR, CD44} or {ANPEP, CD44}. Then we prefer the CTS  

{EGFR, FGFR2, CD44} of size 3 and we treat patient A by targeting {EGFR, FGFR2} and 

patient B by targeting {EGFR, CD44}. 

As the number of cells per patient varies by three orders of magnitude across data sets, 

we use random sampling to obtain hitting set instances of comparable sizes and yet adequately 

capture tumor heterogeneity. We found that sampling hundreds of cells from the tumor is 

sufficient to get enough data to represent all cells. In most of the experiments shown, the number 

of cells sampled, which we denote by 𝑐, was 500. In some smaller data sets, we had to sample 

smaller numbers of cells (Methods). As shown in (Supplementary Materials 2, Figures S1-S2), 

500 cells, when available, are roughly sufficient for CTS size to plateau for our baseline 

parameter settings, 𝑙𝑏 = 0.8, 𝑢𝑏 = 0.1, 𝑟 = 2.0. For each individual within a data set, we 

performed independent sampling of c cells 20 times and their results were summarized.  

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


12 
 

Cohort and Individual Target Set Sizes as Functions of Tumor Killing and 

Non-Tumor Sparing Goals 

Given the single-cell tumor data sets and the ILP optimization framework described above, we 

first studied how the resulting optimal cohort target set (CTS) may vary as a function of the 

parameters defining the optimization objectives in different cancer types. Figures 3 and S3-S7 

in Supplementary Materials 3 show heatmaps of CTS sizes when varying lb, ub, and r around 

the baseline values of 0.8, 0.1, and 2.0, respectively. The CTS sizes for melanoma were largest, 

partly due to the larger number of patients in that data set (Table 1). Indeed, as we sampled 

subsets of 5 or 10 patients uniformly and observed that the mean CTS sizes grew from 7.9 (5 

patient subsets) to 12.3 (10 patient subsets) to 31.0 (all patients, as shown in Figure 3). 

Encouragingly, for most data sets and parameter settings, the optimal CTS sizes are in the 

single digits. However, in several data sets, we observe a sharp increase in CTS size as 𝑙𝑏 values 

are increased above 0.8 and/or as the 𝑢𝑏 is decreased below 0.1, with a more pronounced effect 

of varying 𝑙𝑏. This transition is more discernable at the lowest value of 𝑟 (1.5), probably because 

when 𝑟 is lower, it becomes harder to find genes that are individually selective in killing tumor 

cells and sparing non-tumor cells (Supplementary Figures S3-S7). The qualitative transition 

observed in CTS sizes occurs robustly regardless of the threshold for filtering out low expressing 

cells when preprocessing the data (Supplementary Materials 4, Figures S8-S10).  

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


13 
 

Figure 3. Heat maps showing how the cohort target set size (CTS) varies as a function of 𝒍𝒃, 𝒖𝒃, 𝒓 

and across data sets. For each plot the x-axis and y-axis represent lb and ub parameter values, 

respectively. The scale on the right shows the cohort target set sizes by color scale. We show separate 

plots for 𝑟 = 2.0, 3.0 here and a larger set {1.5, 2.0, 2.5, 3.0} in Supplementary Materials 3. Individual 

values are not necessarily integers because each value represents the mean of 20 replicates of sampling 𝑐 

(500 for each of the data sets shown here) cells (Figure S1). 

We next examined what are the resulting individual target set (ITS) sizes obtained in the 

optimal combinations under the same conditions. In all data sets, the mean ITS sizes are in the 

single digits for most values of 𝑙𝑏 and 𝑢𝑏. The distributions of ITS sizes are shown for four data 

sets and two combinations of (𝑙𝑏, 𝑢𝑏) (Figure 4) and for additional data sets in Supplementary 

Materials 5, Figure S11. Overall, the mean ITS sizes with the baseline parameter values (𝑟 =

2.0, 𝑙𝑏 =  0.8, 𝑢𝑏 =  0.1) range from 1.0 to 3.91 among the nine data sets studied 

(Supplementary Table S2); on average 4 targets per patient should hence suffice if enough 

single-target treatments are available in the cohort target set. However, there is considerable 

variability across patients. 

Evidently, as we make the treatment requirements more stringent (by increasing 𝑙𝑏 from 

0.8 to 0.9 and decreasing 𝑢𝑏 from 0.1 to 0.05), the variability in ITS size across patients became 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


14 
 

larger. Importantly, this analysis provides rigorous quantifiable evidence grounding the 

prevailing observation that among tumors of the same type, some individual tumors may be 

much harder to treat than others. Taken together, these results show that we can compute precise 

estimates of the number of targets needed for cohorts (in the tens) and individual patients (in the 

single digits usually) and that these estimates are sensitive to the killing stringency, especially 

when the 𝑙𝑏 increases above 0.8. The variation for more aggressive killing regimes, with values 

of 𝑙𝑏 up to 0.99 for the baseline 𝑟 = 2.0 is displayed in Figures S12-S13 in Supplementary 

Materials 6. For fixed 𝑙𝑏 =  0.8, 𝑢𝑏 =  0.1 and varying 𝑟, smallest CTS sizes are typically 

obtained for 𝑟 values close to 2.0, further motivating our choice of 𝑟 =  2 as the default value 

(Supplementary Materials 7, Figures S14-S15, Supplementary Table S1). Finally, we show 

that, as expected, a ‘control’ greedy heuristic algorithm searching for small and effective target 

combinations finds ITS sizes substantially larger than the optimal ITS sizes identified using our 

optimization algorithm (Figure 4). The greedy CTS size is greater than the ILP optimal CTS size 

for eight out of nine data sets (Table S2 in Supplementary Materials 8, Methods). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


15 
 

Figure 4. The distribution of optimal and greedy individual treatment combination sizes (ITS) 

values in four different cancer types. We study both our baseline parameter setting (upper row panels) 

and a markedly more stringent one (middle row plots).  For the more stringent parameter setting, we 

compare the ITS sizes obtained using MadHitter (middle row plots) and a greedy algorithm that tries to 

add pairs of genes at a time (bottom row plots). In each plot, the patients are sorted from left to right 

according to their mean ITS values in the optimal stringent regime. Additional comparisons between ITS 

sizes at different parameter settings can be found in Supplementary Materials 5. Description of the 

greedy algorithm and more comparisons between the optimal and greedy algorithms are provided in 

Supplementary Materials 8. 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


16 
 

The Landscape of Combinations Achievable with Receptors Currently 

Targetable by Published Ligand-Mimicking Peptides 

To get a view of the combination treatments that are possible with receptor targets for which 

there are already existing modular targeting reagents, we conducted a literature search 

identifying 58 out of the 1269 genes with published ligand-mimicking peptides that have been 

already tested in in vitro models, usually cancer models (Methods; Tables 3 and 4). We asked 

whether we could find feasible optimal combinations in this case and if so, how do the optimal 

CTS and ITS sizes compare vs. those computed for all 1269 genes? 

 
Figure 5. Comparison of Individual Target Set Sizes with 1269 or 58 targets for three out of 

the six data sets that have feasible solutions. We attempted to find feasible solutions for all 

patients using 58 cell surface receptors that have published ligand-mimicking peptides that have 

been tested in vitro or in pre-clinical models. There are feasible solutions for all patients in six 

data sets, but not for the brain (GSE84465), melanoma (GSE115978), and lung (E-MTAB-6149), 

which were displayed in previous figures. Instead, we show here results for breast and colorectal 

cancers, for which other analyses, such as those in Figures 3 and 4, are in the Supplementary 

Materials. Some of the optimal solutions obtained on the 58-receptors restricted set are of the 

same size to those obtained on the whole receptors set and some are larger. 

Computing the optimal CTS and ITS solutions for this basket of 58 targets, we found 

feasible solutions for six of the data sets across all parameter combinations we surveyed and 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


17 
 

three of these six are illustrated for each patient in Figure 5. However, for three data sets, in 

numerous parameter combinations we could not find optimal solutions that satisfy the 

optimization constraints (Supplementary Materials 9, Figures S16-S18).  That is, the currently 

available targets do not allow one to design treatments that may achieve the specified selective 

killing objectives, underscoring the need to develop new targeted cancer therapies, to make 

personalized medicine more effective for more patients.  

Overall, comparing the optimal solutions obtained with 58 targets to those we have 

obtained with the 1269 targets, three qualitatively different behaviors are observed 

(Supplementary Materials 9, Figures S16-S18): (1) In some datasets, it is just a little bit more 

difficult to find optimal ITS and CTS solutions with the 58-gene pool, while in others, the 

restriction to a smaller pool can be a severe constraint making the optimization problem 

infeasible. (2) The smaller basket of gene targets may force more patients to receive similar 

individual treatment sets and thereby reduces the size of the CTS. (3) Unlike the CTS size, the 

ITS size must stay the same or increase when the pool of genes is reduced, because we find the 

optimal ITS size for each patient. Overall, the average ITS sizes across each cohort using the 

pool of 58 genes for baseline settings range from 1.16 to 4.0. Among cases that have any 

solution, the average increases in the ITS sizes at baseline settings in the 58 genes case vs. that of 

the 1269 case were moderate, ranging from 0.16 to 1.33.   

Optimal Fairness-Based Combination Therapies for a Given Cohort of 

Patients 

Until now we have adhered to a patient-centered approach that aims to find the minimum-size 

ITS for each patient, first and foremost. We now study a different, cohort-centered approach, 

where given a cohort of patients, we seek to minimize the total size of the overall CTS size, 

while allowing for some increase in the ITS sizes. The key question is how much larger are the 

resulting ITS sizes if we optimize for minimizing the cohort (CTS size), rather than the 

individuals (ITS size)?  This challenge is motivated by a ‘fairness’ perspective  (Supplementary 

Materials 1), where we seek solutions that are beneficial for the entire community from a social 

or economic perspective (in terms of CTS size) even if they are potentially sub-optimal at the 

individual level (in terms of ITS sizes). Here, the potential benefit is economic since running a 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


18 
 

basket trial would be less expensive if one reduces the size of the basket of available treatments 

(Figure 6A-B).  

We formalized this ‘fair CTS problem’ by adding a cost parameter 𝛼 that specifies the 

limit on the excess number of (ITS) targets selected for any individual patient, compared to the 

number selected in the individual-based approach that was studied up until now (formally, the 

latter corresponds to setting 𝛼 =  0). We formulated and solved via ILP this fair CTS problem 

for up to 1269 possible targets on all nine data sets (Methods). We fixed 𝑟 = 2 and 𝑢𝑏 = 0.1 

while varying 𝛼 and 𝑙𝑏. Figure 6C and Figures S19-S23 in Supplementary Materials 10 show 

the optimal CTS and ITS sizes for 𝛼 = 0, . . . ,5. 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


19 
 

Figure 6. A schematic example demonstrating the rationale and workings of fairness-based 

solutions. (A, B) Let us assume that each of three patients has two tumor cells (columns), each 

displaying five membrane receptors that are highly expressed only on the tumor cells and not on 

the non-tumor ones (rows). If we target {APP, MET} (panel A, 𝛼 = 1) in all patients, then this 

achieves a CTS size of 2, which is the minimum possible. Employing the original individual-

based optimizing objective, each patient could instead be treated by an ITS of size 1 by targeting 

the distinct receptors called Target 1 (specific to Patient 1), Target 2 and Target 3, respectively, 

but this would result in an optimal CTS of size 3 (panel B, 𝛼 = 0). The solution in panel A has 

an unfairness value 𝛼 = 1 because the worst difference among all patients is that a patient 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


20 
 

receives 1 more treatment than necessary. (C) Heatmaps showing how the CTS size varies as 𝛼 

increases (y-axis), starting from its baseline value of 0 where each patient is assigned a 

minimum-sizes individual treatment set (top row). The lower bound on tumor cells killed (x-

axis) is also varied while the upper bound on non-tumor cells killed is kept fixed at 0.1. We are 

particularly interested in finding the smallest value on the y-axis at which the CTS size reaches 

its minimum value, which is circled for the baseline  𝑙𝑏 = 0.8, because this bounds the tradeoff 

between the achievable reduction in the number of targets needed to treat the whole cohort and 

the number of extra targets above the ITS minimum that any patient might need to receive. 

For 8 out of 9 data sets, we encouragingly find that the unfairness cost parameter 𝛼 is 

bounded by a constant of 3; i.e., it is sufficient to increase 𝛼 by no more than 3 to obtain the 

smallest CTS sizes in the optimally fair solutions. For the largest data set (melanoma),  𝛼 = 4. 

As we show in Supplementary Materials 10, empirically, even if one requires lower α  values, 

then as those approach 0, the size of the fairness-based CTS grows fairly moderately and remains 

in the lower double digits, and the mean size of the number of treatments given to each patient 

(their ITS) is overall < 5. Theoretically, we show that one can design instances for which 𝛼 

would need to be at least √𝑛 − 1 to get a CTS of size less than the overall number of targets 𝑛 

(Supplementary Materials 10). However, in practice, we find that given the current tumor 

single-cell expression data, fairness-based treatment strategies are likely to be a reasonable 

economic option in the future.  

The Landscape of Optimal Solutions Targeting Receptors that are Lowly 

Expressed Across Many Healthy Tissues 

We turn to examine the space of optimal solutions when restricting the set of eligible surface 

receptor gene targets to those that have lower expression across many noncancerous human 

tissues (Methods), aiming to mitigate potential damage to tissues unrelated to the tumor site. To 

this end, we selected subsets of the 1269 cell surface receptor targets in which the genes have 

overall lower expression across multiple normal tissues, by mining GTEx and the Human Protein 

Atlas (HPA) (Methods). Varying the selectivity expression thresholds (expressed in transcripts 

per million (TPM)) used to filter out genes whose mean expression across the normal adult 

tissues is above values of 10, 5, 2, 1, 0.5, and 0.25 (i.e., employing more and more extensive 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


21 
 

filtering as this threshold is decreased), decreases the size of the target cell surface receptor gene 

list by more than half (Table 2).  

As shown in Figures 7A, B (and Supplementary Figures S24-S26), MadHitter identifies very 

different cohort target sets (which are larger than the original optimal solutions, as expected) as 

the TPM selectivity threshold value is decreased.  Furthermore, different ITS instances may 

become infeasible (Supplementary Figure S27). At an individual patient level, using lower 

selectivity threshold levels, which leads to a smaller space of membrane receptors to choose 

from, also leads to increased mean ITS sizes (Supplementary Figures S28, S29). Across the 

nine data sets, the selectivity threshold at which the CTS problem became infeasible varied 

(Supplementary Figure S27). The differences observed could be the result of expression 

heterogeneity of the cancer, number of patients within the data set, size of target gene set, lack of 

expression of available gene targets and other unknown factors.  In the future, further 

experimentation is required to identify tissue-specific optimal gene expression thresholds that 

will minimize side effects while allowing cancer cells to be killed by combinations of targeted 

therapies. Finally, for completeness, we also tested MadHitter on the set of 533 lowly expressed 

genes suggested by MacKay et al.25 All instances with default setting of 𝑟, 𝑙𝑏, 𝑢𝑏 have feasible 

solutions for all patients. Mean ITS sizes are below 4 for eight of nine data sets, but close to 10 

for the brain cancer data set GSE84465. More details can be found in Supplementary Materials 

11 and Table S3. 

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


22 
 

Figure 7. Variation in the CTS size and composition as function of the magnitude of 

filtering of genes expressed in noncancerous human tissues, for different tumor types. (A-

B) The number of times a gene (cell-surface receptor) is included in the CTS (out of 20 

replicates, which is therefore the Max count in panels A-B), where each column presents the 

CTS solutions when the input target genes sets are filtered using a specific TPM filtering 

threshold (Methods), for (A) a breast cancer and (B) brain cancer. These data sets were selected 

due to their relatively small cohort target set sizes, permitting their visualization. (C-F) CircOs 

plots of the genes occurring most frequently in optimal CTS solutions (length of arc along the 

circumference) and their pairwise co-occurrence (thickness of the connecting edge) for the four 

main cancer types, in our original target space of 1269 encoding cell-surface receptors. For each 

data set, we sampled up to 50 optimal CTS solutions. Network representations of the 12 most 

common target genes out of 1269 encoding cell-surface receptors (with greater than 5% 

frequency of occurrence) are represented in a cancer specific manner for (C) brain cancer, (D) 

head and neck cancer, (only seven genes have a frequency of 5% or more across optimal 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


23 
 

solutions), (E) melanoma, and (F) lung cancer. Genes and connections have distinct colors for 

improved visibility. 

Key Targets Composing Optimal Solutions Across the Space of 1269 

membrane receptors 

To identify the genes that occur most often in optimal solutions for our baseline settings, since 

there may be multiple distinct optimal solutions composed of different target genes,  we sampled 

up to 50 optimal solutions for each optimization instance solved and recorded how often each 

gene occurs and how often each pair of genes occur together (Methods). We analyzed and 

visualized these gene (co-)occurrences in three ways. First, we constructed co-occurrence circus 

plots in which arcs around the circle represent frequently occurring genes and edges connect 

targets that frequently co-occur in optimal CTS solutions. Figure 7C-F shows the co-occurrence 

visualizations for optimal CTS solutions obtained with the original, unfiltered target space of 

1269 genes and in baseline parameter settings. The genes frequently occurring in optimal 

solutions are quite specific and distinct between different cancer types. In melanoma, the edges 

form a clique-like network because virtually all optimal solutions include the same clique of 12 

genes (Figure 7E). The head and neck cancer data set has only one commonly co-occurring pair 

{GPR87, CXADR} (Figure 7D). Of the cancer types not depicted in Figure 7, the breast cancer 

data set has a commonly co-occurring set of size 4, {CLDN4, INSR, P2RY8, SORL}, and the 

colorectal cancer data set has a different commonly co-occurring set of size 4, {GABRE, GPRR, 

LGR5, PTPRJ} (data not shown). 

We next tabulated sums of how often each gene occurred in optimal solutions for all nine 

data sets (Supplementary Materials 12, Tables S4, S5 and S6), obtained when solving for 

either 58 gene targets or 1269 gene targets. Strikingly, one gene, PTPRZ1 (protein tyrosine 

phosphatase receptor zeta 1), appears far more frequently than others, especially in three brain 

cancer data sets (GSE70630, GSE89567, GSE102130, Supplementary Table S6). PTPRZ1 also 

occurs commonly in optimal solutions for the head and neck cancer data set (Figure 7D). The 

brain cancer finding coincides with previous reports that PTPRZ1 is overexpressed in 

glioblastoma (GBM)39,40. PTPRZ1 also forms a fusion with the nearby oncogene MET in some 

brain tumors that have an overexpression of the fused MET41. Notably, various cell line studies 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


24 
 

and mouse studies have shown that inhibiting PTPRZ1, for example by shRNAs, can slow 

glioblastoma tumor growth and migration42,43. There have been some attempts to inhibit PTPRZ1 

pharmacologically in brain cancer and other brain disorders44,45.   In the four brain cancer data 

sets, PTPRZ1 is expressed selectively above the baseline 𝑟 = 2.0 in 0.99 (GSE89567), 0.84 

(GSE70630), 0.96 (GSE102130) and 0.27 (GSE84465) proportion of cells in each cohort. The 

much lower relative level of PTPRZ1 expression in GSE84465 is likely due to the heterogeneity 

of brain cancer types in this data set46. Among the 58 genes with known ligand-mimicking 

peptides, EGFR stands out as most common in optimal solutions (Supplementary Table S4). 

Even when all 1269 genes are available, EGFR is most commonly selected for the brain cancer 

data set (GSE84465) in which PTPRZ1 is not as highly overexpressed (Figure 7C). 

PTPRZ1 was the fifth most frequently occurring gene in optimal solutions for the head 

and neck cancer data set (GSE103322). The two most common genes by a large margin are 

CXADR and GPR87. CXADR has been studied primarily by virologists and immunologists 

because it encodes a receptor for cocksackieviruses and adenoviruses47.  In one breast cancer 

study, CXADR was found to play a role in regulating PTEN in the AKT pathway, but CXADR 

was underexpressed in breast cancer48 whereas it is overexpressed in the head and neck cancer 

data we analyzed. GPR87 is a rarely studied G protein-coupled receptor with an unknown natural 

ligand49. In the context of cancer, GPR87 has previously been reported as overexpressed in 

several tumor types including lung and liver49 and its overexpression may play an oncogenic role 

via either the p53 pathway50 the NFκB pathway51 or other pathways. 

Finally, we analyzed the set of genes in optimal solutions via the STRING database and 

associated tools52 to perform several types of gene set and pathway enrichment analyses. Figures 

S30-S33 (Supplementary Materials 12) show STRING-derived protein-protein interaction 

networks for the 25 most common genes in the same four data for which we showed co-

occurrence graphs in Figure 7C-F. Again, EGFR stands out as being a highly connected protein 

node in the solution networks for both the brain cancer and head and neck cancer data sets.  

Among the 30 genes in the 1269-gene set most commonly in optimal solutions (Supplementary 

Table S5), there are six kinases (out of 88 total human transmembrane kinases with a catalytic 

domain, STRING gene set enrichment 𝑝 < 1𝑒 − 6), namely {EGFR, EPHB4, ERBB3, FGFR1, 

INSR, NTRK2} and two phosphatases {PTPRJ, PTPRZ1}. The KEGG pathways most 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


25 
 

significantly enriched, all at 𝐹𝐷𝑅 < 0.005, are  (“proteoglycans in cancer”) represented by 

{CD44, EGFR, ERBB3, FGFR1, PLAUR}, (“adherens junction”) represented by {EGFR, 

FGFR1, INSR, PTPRJ}, and (“calcium signaling pathway”) represented by {EDNRB, EGFR, 

ERBB3, GRPR, P2RX6}. The one gene in the intersection of all these pathways and functions is 

EGFR. 

Discussion 

In this multi-disciplinary study, we harnessed techniques from combinatorial optimization to 

analyze publicly available single-cell tumor transcriptomics data to chart the landscape of future 

personalized combinations that are based on ‘modular’ therapies, including CAR-T therapy.  We 

showed that, for most tumors we studied, four modular medications targeting different 

overexpressed receptors may suffice to selectively kill most tumor cells, while sparing most of 

the non-cancerous cells (Figures 3 and 4 and Table S2). For the more restricted sets of low-

expression genes25 or the 58 receptors with validated ligand-mimicking peptides (Tables 3 and 

4), some patients do not have feasible solutions, especially as we reduce the TPM expression 

used for filtering the gene set to avoid targeting non-cancerous tissues.  These findings indicate, 

on one hand, that researchers designing ligand-mimicking peptides have been astute in choosing 

targets relevant to cancer. On the other hand, these results suggest that there is a need for 

extending the set of cell surface receptors that can be targeted to enter tumor cells with ligated 

chemotherapy agents.  

Remarkably, we found that if one designs the optimal set of treatments for an entire 

cohort adopting a fairness-based policy, then the size of the projected treatment combinations for 

individual patients are at most 3 targets larger, and in most data sets at most 1 target receptor 

larger than the optimal solutions that would have been assigned for these patients based on an 

individual-centric policy (Figure 6, Supplementary Materials 10). This suggests that the 

concern that the personalized treatment for any individual will be suboptimal solely because that 

individual happens to have registered for a cohort trial appears to be tightly bounded. 

Like the study of MacKay et al.25, our study is a conceptual computational investigation. 

We studied nine data single-cell expression data sets for the first time, but it would be helpful to 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


26 
 

analyze more and larger data sets in the future. Even among four data sets of the same (brain) 

cancer type, we observed considerable variability in CTS and ITS sizes. Since our approach is 

general and the software is freely available as source code, other researchers can test the method 

on new data sets or add new variables, constraints, and optimality criteria. These investigations 

may of course lead other to further improve our method and broaden its applicability. In future 

work, we plan to apply our approach to study ways for selectively killing specific populations of 

immune cells, such as myeloid-derived suppressor cells, because they inhibit tumor killing, while 

sparing most other non-cancer cells.  

We compared gene expression levels between non-cancer cells and cancer cells sampled 

from the same patient, which avoids inter-patient expression variability53. However, we did little 

to account for “expression dropout” beyond the normalization performed by the providers of the 

data sets, aiming to preserve the public data as it was submitted to GEO or Array Express. To 

achieve some uniformity and also to take as cautious an approach as possible, we added a step to 

filter low expressing cells because some data sets had already been filtered in this way. One 

could instead apply imputation methods such as MAGIC54 or scImpute55 to infer denser gene 

expression matrices and then apply our method to the adjusted input data.  Therefore, our results 

about the sizes of optimal ITS should be viewed as estimated upper bounds that are likely to 

decrease if the dropout rate decreases or if cells expressing few genes are eliminated from the 

analysis more stringently. Another limitation of our method is that we viewed the measured gene 

expression as being valid over all time, even though gene expression is known to be a stochastic 

process56. In the future, we would like to extend our approach to use the single-cell data to infer 

how stochastic is the expression of each gene57 and to prefer targets whose expression is more 

stable.  

Even though the combinatorial optimization problems solved here are in the worst-case 

exponentially hard (NP-complete36 in computer science terminology), the actual  instances that 

arise from the single-cell data could be either formally solved to optimality or shown to be 

infeasible with modern optimization software. Of note, Delaney et al., have recently formalized a 

related optimization problem in analysis of single-cell clustered data for immunology38. Their 

optimization problem is also NP-complete in the worst case and they could solve sets of up to 

size four using heuristic methods38. We have shown that the optimal ILP solutions we obtained 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


27 
 

are often substantially smaller than solutions obtained via a greedy heuristic (Figure 4, 

Supplementary Materials 8 including Table S2). 

On the cautionary side, experiments with target gene sets that were further filtered by low 

expression in normal tissues showed that the individual target set problem can become infeasible 

in many instances. Even when the instance remained feasible, optimal cohort treatment set sizes 

increased rapidly as the expression levels allowed decreased (Figure 7), pointing to potential 

inherent limitations of applying such combination approaches to patients in the clinic and the 

need to carefully monitor their putative safety and toxicity in future applications. Finally, 

functional enrichment analysis of genes commonly occurring in the optimal target sets reinforced 

the central role of the widely studied oncogene EGFR and other transmembrane kinases. We also 

found that that the less-studied phosphatase PTPRZ1 is a useful target, especially in brain cancer. 

In summary, this study is the first to harness combinatorial optimization tools to analyze 

emerging single-cell data to portray the landscape of feasible personalized combinations in 

cancer medicine. Our findings uncover promising membranal targets for the development of 

future oncology medicines that may serve to optimize the treatment of cancer patient cohorts in 

several cancer types. The MadHitter approach presented and the accompanying software made 

public can be readily applied to address additional fundamental related research questions and 

analyze additional cancer data sets as they become available.  

Methods  

Data Sets 

We retrieved and organized data sets from NCBI’s Gene Expression Omnibus (GEO)58 and 

Ensembl’s ArrayExpess59 and the Broad Institute’s Single Cell Portal 

(https://portals.broadinstitute.org/single_cell). Nine data sets had sufficient tumor and non-tumor 

cells and were used in this study; an additional five data sets had sufficient tumor cells only and 

were used in testing early versions of MadHitter. Suitable data sets were identified by searching 

scRNASeqDB60, CancerSea61, GEO, ArrayExpress, Google Scholar, and the 10x Genomics list 

of publications (https://www.10xgenomics.com/resources/publications/). We required that each 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://portals.broadinstitute.org/single_cell
https://www.10xgenomics.com/resources/publications/
https://doi.org/10.1101/2020.01.28.923532


28 
 

data set contain measurements of RNA expression on single cells from human primary solid 

tumors of at least two patients and the metadata are consistent with the primary data. We are 

grateful to several of the data depositing authors of data sets for resolving metadata 

inconsistencies by e-mail correspondence and by sending additional files not available at GEO or 

ArrayExpress. We excluded blood cancers and data sets with single patients. When it was easily 

possible to separate cancer cells from non-cancer cells of a similar type, we did so.  

The main task in organizing each data set was to separate the cells from each sample or 

each patient into one or more single files. Representations of the expression as binary, as read 

counts, or as normalized quantities such as transcripts per million (TPM) were retained from the 

original data. When the data set included cell type assignments, we retained those to classify 

cells as “cancer” or “non-cancer”, except in the data set of Karaayvaz et al.62 where it was 

necessary to reapply filters described in the paper to exclude cells expressing few genes and to 

identify likely cancer and likely non-cancer cells. If cell types were not distinguished, all cells 

were treated as cancer cells. To achieve partial consistency in the genes included, we filtered 

data sets to include only those gene labels recognized as valid by the HUGO Gene Nomenclature 

Committee  (http://genenames.org), but otherwise we retained whatever recognized genes that 

the data submitters chose to include. After filtering out the non-HUGO genes, but before 

reducing the set of genes to 1269 or 900 or 424 or 58, we filtered out cells as follows. Some data 

sets came with low expressing cells filtered out. To achieve some homogeneity, we filtered out 

any cells expressing fewer than 10% of all genes before we reduced the number of genes. In 

Supplementary Materials 4, we tested the robustness of this 10% threshold. Finally, we 

retained either all available genes from among either our set of 1269 genes encoding cell-surface 

receptors that met additional criteria on low expression or available ligand-mimicking peptides. 

Table 1. Summary descriptions of single-cell data sets from solid tumors used either for analysis 

(9) or preliminary testing (5 additional). Data sets are ordered so that those from the same or 

similar tumor types are on consecutive rows. The first 13 data sets were obtained either from 

GEO or the Broad Institute Single Cell Portal, but the GEO code is shown. The data set on the 

last row was obtained from ArrayExpress. In some data sets that have both cancer and non-

cancer cells, there may be samples for which only one type or the other is provided. Hence, the 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

http://genenames.org/
https://doi.org/10.1101/2020.01.28.923532


29 
 

numbers in parentheses in the third and fourth columns may differ. Data set GSE11597863 

supersedes and partly subsumes GSE725664. 

 
Data set 

code 

Cancer type(s) Cancer 

cells(samples) 

Non-

cancer 

cells 

(samples) 

Clinical 

follow-up 

Reference(s) 

GSE75688 Breast 441(11) -- Metastasis or 

not 

65 

GSE118389 Breast 804(6) 314(6) Metastasis or 

not 

62 

GSE89567 Brain (glioma) 5097(10) 1146(9) No 66 

GSE103224 Brain (glioma) 23793(8) -- No 61 

GSE70630 Brain (glioma) 4044(6) 303(6) No 67 

GSE57872 Brain (glioma) 440(6) -- No 68 

GSE102130 Brain (6 glioma 

and 3 glioblastoma) 

2858(9) 94(5) No 69 

GSE84465 Brain 

(glioblastoma) 

1091(4) 651(4) No 46 

GSE81861 Colorectal 272(10) 160(6) No 70 

GSE103322 Head and Neck 2093(13) 3197(15) No 71 

GSE115978 Melanoma 2018(23) 4334(32) Yes, immuno-

therapy 

63,64 

GSE118828 Ovarian 1415(11 

primary) 

973 (5 

metastasis) 

578(2) No 72 

GSE67980 Prostate 124(21) -- Metastasis or 

not 

73 

E-MTAB-

6149 

Lung 7351(5) 2730(5) No 74 

 
105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


30 
 

Sampling Process to Generate Replicates of Data Sets 

As shown in Table 1, the number of cells available in the different single-cell data sets varies by 

three orders of magnitude; to enable us to compare the findings across different data sets and 

cancer types on more equal footing, we employed sampling from the larger sets to reduce this 

difference to one order of magnitude. This goes along with the data collection process in the real 

world as we might get measurements from different samples at different times. Suppose for a 

data set we have 𝑛 genes, and 𝑚 cells comprising tumor cells and non-tumor cells. We want to 

select a subset of 𝑚 
′ < 𝑚 cells. We select a set of 𝑚′ cells uniformly at random without 

replacement from among all cells. Then we partition the selected cells into 𝑚𝑡 ′  tumor cells and 

𝑚𝑛
′   non-tumor cells to define one replicate. In most of the computational experiments shown we 

used 20 replicates and we report either the arithmetic mean or entire distribution of quantities 

such as the CTS size. 

Considering a previously defined set of target genes and of HPA gene 

expression across different normal tissues 

The general aim of our methods is to target the cancer cells while sparing the adjacent non-

cancer cells as much as possible. A related concern is that genes within the target set could be 

expressed at high levels in other normal tissues that are not part of the non-cancer cells from the 

tumor microenvironment included in the input data sets. One way to address this problem is to 

identify genes that have low expression in the majority of the tissues and to use them to obtain a 

target set. This approach has been pioneered in a recent paper on selecting gene targets suitable 

for CAR-T therapy25. The authors selected 533 candidate genes that they judged could be 

reasonable targets for CAR-T. They made this selection based on expression data from the 

Human Protein Atlas75 and the Genotype-Tissue Expression consortium (GTEx)76, which have 

expression information from multiple tissues which was used to identify low expressed target 

genes.  

McKay et al.25 used a threshold of 15 TPM units of expression (written in their work as 

log2(TPM+1) ≤ 4), but they allowed a small number of tissues to exceed this threshold. Instead, 

we used quantitative levels of expression for finer granularity in analysis, as described in the next 

subsection. One clinical difference is that we looked only at adult tissues because we are 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


31 
 

analyzing adult tumors, while CAR-T therapy can be used for either childhood or adult tumors. 

The reason to focus on cell-surface receptors, as suggested by Dannenfelser et al.24, is that CAR-

T therapy requires a cell-surface target that may or may not be a receptor, antibody technologies 

require a cell surface receptor, and the ligand-mimicking peptide nanotechnology that we 

summarized in the Introduction also requires cell surface receptor targets. 

Construction of target gene sets that are lowly expressed in normal tissues 

To analyze the tissue specificity of the 1269 candidate target genes, the RNAseq based multiple 

tissue expression data was obtained from the Human Protein Atlas (HPA) database 

(https://www.proteinatlas.org/about/download ; Date: May 29, 2020). The HPA database 

includes expression values (in units of transcripts per million (TPM)) for 37 tissues from HPA 

(rna_tissue_hpa.tsv.zip)75 and 36 tissues from the Genotype-Tissue Expression consortium 

(rna_tissue_gtex.tsv.zip)76. Next, to identify target genes with low or no expression within 

majority of adult human tissues, for the 1269 candidate genes we identified genes whose average 

expression across tissues is below certain threshold value (0.25, 0.5, 1, 2, 5, and 10 TPM) in both 

HPA and GTEx data sets. Using the intersection of low expression candidate genes from HPA 

and GTEx data sets, we generated lists of high confidence targets. The size of the resulting high 

confidence target genes varied from 424 (average expression less than 0.25 TPM) to 900 

(average expression across tissue less than 10 TPM) genes (Table 2). While the total number of 

genes decreases slowly, the decrease is much steeper if one excludes olfactory receptors and 

taste receptors (Table 2). These sensory receptors are not typically considered as cancer targets, 

although a few of these receptors are selected in optimal target sets when there are few 

alternatives (Figure 7).  MadHitter was run on all nine data sets using the expression information 

from the high confidence gene lists. 

 
Table 2: Size of high confidence target gene sets for different thresholds. 

Thresholds 

expression across 

Size of gene set No of genes which 

are NOT olfactory 

No of genes with 

ligand mimicking 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://www.proteinatlas.org/about/download
https://doi.org/10.1101/2020.01.28.923532


32 
 

tissues (TPM) (OR*) and taste 

receptors (TAS*) 

peptides (intersection 

with Tables 3 and 4) 

0.25 424 97 1 

0.5 494 141 3 

1 547 187 3 

2 632 269 6 

5 762 398 10 

10 900 536 19 

 
Assembling Lists of Membrane Target Genes 

We are interested in the set of genes 𝐺 that i) have the encoded protein expressed on the cell 

surface and ii) for which some biochemistry lab has found a small peptide (i.e. amino acid 

sequences of 5-30 amino acids) that can attach itself to the target protein and get inside the cell 

carrying a tiny cargo of a toxic drug that will kill the cell and iii) encode proteins that are 

receptors. The third condition is needed because many proteins that reside on the cell surface are 

not receptors that can undergo RME. The first condition can be reliably tested using a recently 

published list of 2799 genes encoding human predicted cell surface proteins77; we reduced the 

list to 1269 by requiring that the proteins be receptors, which is necessary for RME-based 

therapies but not for CAR-T therpy24. For condition ii), we found two review articles in the 

chemistry literature19-20 that list targets effectively meeting this condition. Intersecting the lists 

meeting conditions i) and ii) gave us 38 genes/proteins that could be targeted (Table 3).  

Most of the data sets listed in Table 1 had expression data on 1200-1220 of these genes 

because the list of 1269 includes many olfactory receptor genes that may be omitted from 

standard genome-wide expression experiments.  Among the 38 genes in Table 3, 13/14 data sets 

have all 38 genes, but GSE57872 was substantially filtered and has only 10/38 genes; since 

GSE57872 lacks non-tumor cells, we did not use this data set in any analyses shown. 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


33 
 

Because the latter review20 was published in 2017, we expected that there are now 

additional genes for which ligand-mimicking peptides are known. We found 20 additional genes 

and those are listed in Table 4. Thus, our target set analyses restricted to genes with known 

ligand-mimicking peptides use 58 =  38 + 20 targets. 

 
Table 3. Single proteins that can be targeted by peptides based on references 18, 19 and are 

expressed on the cell surface77. For easier correspondence with the gene expression data, the 

entries are listed in alphabetical order by gene symbol. In this table, we follow the clinical 

genetics formatting convention that proteins are in Roman and gene symbols are in italics. 

 
Protein Gene Symbol 

APN/CD13 ANPEP 

APP APP 

PD-L1 CD274 

CD44 CD44 

P32/gC1qR CD93 

E-cadherin CDH1 

N-cadherin CDH2 

CD21 CR2 

EGFR EGFR 

Epha2 EPHA2 

EphB4 EPHB4 

HER2 ERBB2 

FGFR1 FGFR1 

FGFR2 FGFR2 

FGFR3 FGFR3 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


34 
 

FGFR4 FGFR4 

VEGFR1 FLT1 

VEGFR3 FLT4 

PSMA FOLH1 

GPC3 GPC3 

IL-10RA IL10RA 

IL-11Rα IL11RA 

IL-13Rα2 IL13RA2 

IL-6Rα IL6R 

GP130 IL6ST 

VEGFR2 KDR 

MUC18 MCAM 

Met MET 

MMP9 MMP9 

Thomsen-Friedenreich carbohydrate antigen MUC1 

NRP-1 NRP1 

PDGFRβ PDGFRB 

CD133 PROM1 

PTPRJ PTPRJ 

HSPG SDC2 

E-selectin SELE 

Tie2 TEK 

VPAC1 VIPR1 

 
Table 4. Single proteins that can be targeted by ligand-mimicking peptides but are not included 

in the two principal reviews that we consulted19-20 and are among 1269 cell surface receptors77. 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


35 
 

Since the evidence that these 20 genes have ligand-mimicking peptides is scattered in the 

literature, we include at least one PubMed ID of a paper describing a suitable peptide. 

 
Protein Gene Symbol At Least One PubMed ID 

ActRIIB ACVR2B 28955765 

CD163 CD163 27563889 

CXCR4 CXCR4 19482312, 22523575 

ephrin A4 EPHA4 15681844, 22523575 

ephrin B1 EPHB1 15722342, 22523575 

ephrin B2  EPHB2 15722342, 22523575 

ephrin B3  EPHB3 15722342, 22523575 

gonadotrophin releasing 

hormone receptor 

GNRHR 20814857, 22523575 

G Protein coupled receptor 55 GPR55 28029647 

bombesin receptor 2 GRPR 20814857, 22523575 

IL4 receptor IL4R 19012727 

low density lipoprotein 

receptor 

LDLR 27656777 

leptin receptor LEPR 19233229, 26265355 

LRP1 LRP1 29090274 

melanocortin 1 receptor MC1R 22964391 

melanocortin 4 receptor MC4R 17591746 

CD206 MRC1 30768279 

urokinase plasminogen 

activator receptor 

PLAUR 25080049 

neurokinin-1 receptor TACR1 29498264 

VPAC2 VIPR2 30077368 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


36 
 

Definition of the Minimum Hitting Set Problem and Solution Feasibility 

One of Karp’s original NP-complete problems is called “hitting set” and is defined as follows36. 

Let 𝑈 be a finite universal set of elements. Let 𝑆1, 𝑆2, 𝑆3, . . . , 𝑆𝑘  be subsets of 𝑈. Is there a small 

subset 𝐻 ⊆ 𝑈 such that for 𝑖 = 1, 2, 3, . . . , 𝑘, 𝑆𝑖 ∩ 𝐻 is non-empty. In our setting, U is the set of 

target genes and the subsets 𝑆𝑖are the single cells.  In reference 78, numerous applications for 

hitting set and the closely related problems of subset cover and dominating set are described; in 

addition, practical algorithms for hitting set are compared on real and synthetic data.  

Among the applications of hitting set and closely related NP-complete problems in 

biology and biochemistry are stability analysis of metabolic networks79-83, identification of 

critical paths in gene signaling and regulatory networks84-86 and selection of a set of drugs to treat 

cell lines87-88 or single patients89-90.  More information about related work can be found in 

Supplementary Materials 1. 

Two different difficulties arising in problems such as hitting set are that 1) an instance 

may be infeasible meaning that there does not exist a solution satisfying all constraints and 2) an 

instance may be intractable meaning that in the time available, one cannot either i) prove that the 

instance is infeasible or ii) find an optimal feasible solution. All instances of minimum hitting set 

that we considered were tractable on the NIH Biowulf system. Many instances were provably 

infeasible; in almost all cases. we did not plot the infeasible parameter combinations. However, 

in Figure 4, the instance for the melanoma data set with the more stringent parameters was 

infeasible because of only one patient sample, so we omitted that patient for both parameter 

settings in Figure 4. 

Basic Optimal Target Set Formulation 

Given a collection 𝑆 = {𝑆1, 𝑆2, 𝑆3, . . . } of subsets of a set 𝑈, the hitting set problem is to find the 

smallest subset 𝐻 ⊆ 𝑈 that intersects every set in 𝑆. The hitting set problem is equivalent to the 

set cover problem and hence is NP-complete. The following ILP formulates this target set 

problem: 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


37 
 

𝑚𝑖𝑛 ∑  

 
𝑔∈𝑈

𝑥(𝑔) 

∑   𝑔∈𝐶  𝑥(𝑔)  ≥ 1    ∀𝑆𝑖 ∈ 𝑆    (1) 

In this formulation, there is a binary variable 𝑥(𝑔) for each element 𝑔 ∈ 𝑈that denotes whether 

the element 𝑔  is selected or not. Constraint (1) makes sure that from each set 𝑆𝑖 in S, at least one 

element is selected.  

For any data set of tumor cells, we begin with the model that we specify a set of genes 

that can be targeted, and that is 𝑈. Each cell is represented by the subset of genes in 𝑈 whose 

expression is greater than zero. In biological terms, a cell is killed (hit) if it expresses at any level 

on one of the genes that is selected to be a target (i.e., in the optimal target set) in the treatment. 

In this initial formulation, all tumor cells are combined as if they come from one patient because 

we model that the treatment goal is to kill (hit) all tumor cells (all subsets).  In a later subsection, 

we consider a fair version of this problem, taking into account that each patient is part of a 

cohort. Before that, we model the oncologist’s intuition that we want to target genes that are 

overexpressed in the tumor. 

Combining Data on Tumor Cells and Non-Tumor Cells 

To make the hitting set formulation more realistic, we would likely model that a cell (set) is 

killed (hit) only if one of its targets is overexpressed compared to typical expression in non-

cancer cells. Such modeling can be applied in the nine single-cell data sets that have data on non-

cancer cells to reflect the principle that we would like the treatment to kill the tumor cells and 

spare the non-tumor cells.  

Let 𝑁𝑇 be the set of non-tumor cells. For each gene 𝑔, define its average expression 

𝐸(𝑔) as the arithmetic mean among all the non-zero values of the expression level of 𝑔 and cells 

in 𝑁𝑇. The zeroes are ignored because many of these likely represent dropouts in the expression 

measurement. Following the design of experiments in the lab of N. A., we define an expression 

ratio threshold factor 𝑟 whose baseline value is 2.0. We adjust the formulation of the previous 

subsection, so that the set representing a cell (in the tumor cell set) contains only those genes 𝑔 

such that the expression of 𝑔 is greater than 𝑟 × 𝐸(𝑔) instead of greater than zero. We keep the 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


38 
 

objective function limited to the tumor cells, but we also store a set to represent each non-tumor 

cell, and we tabulate which non-tumor cells (sets) would be killed (hit) because for at least one of 

the genes in the optimal target set, the expression of that gene in that non-tumor cell exceeds the 

threshold 𝑟 × 𝐸(𝑔). We add parameters 𝑙𝑏 and 𝑢𝑏  each in the range [0,1] and representing 

respectively a lower bound on the proportion of tumor cells killed and an upper bound on the 

proportion of non-tumor cells killed. The parameters 𝑙𝑏, ub are used only in two constraints, and 

we do not favor optimal solutions that kill more tumor cells or fewer non-tumor cells, so long as 

the solutions obey the constraints. 

The Fair Cohort Target Set Problem for a Multi-Patient Cohort 

We want to formulate an integer linear program that selects a set of genes 𝑆∗ from available 

genes in such a way that, for each patient, there exists an individual target set 𝐻𝑖
𝑆∗ ⊆ 𝑆∗of a 

relative small size (compared to the optimal ITS of that patient alone which is denoted by 𝐻(𝑖)). 

Let U = {g1, g2, ..., g|U|} be the set of genes. There are 𝑛 patients. For the i
th patient, we denote by 

𝑆𝑃(𝑖), the set of tumor cells related to patient i. For each tumor cell 𝐶 ∈ 𝑆𝑃(𝑖), we describe it as a 

set of genes which is known to be targetable to cell 𝐶. That is, 𝑔 ∈ 𝐶 if and only if a drug 

containing 𝑔 can target the cell 𝐶. 

In the ILP, there is a variable 𝑥(𝑔) corresponding to each gene 𝑔 ∈ 𝑈 that shows whether 

the gene g is selected or not. There is a variable 𝑥(𝑔, 𝑃(𝑖)) which shows whether a gene g is 

selected in the target set of patient 𝑃(𝑖). The objective function is to minimize the total number 

of genes selected, subject to having a target set of size at most 𝐻(𝑖) + 𝛼 for patient 𝑃(𝑖) where 

1 ≤ 𝑖 ≤ 𝑛. 

Constraint (3) ensures that, for patient 𝑃(𝑖),we do not select any gene 𝑔  that are not selected in 

the global set. 

Constraint (4) ensures all the sets corresponding to tumor cells of patient 𝑃(𝑖) are hit.   

𝑚𝑖𝑛 ∑   𝑔∈𝑈  𝑥(𝑔)       (1) 

∑   𝑔∈𝑆𝑃(𝑖)  𝑥(𝑔, 𝑃(𝑖))  ≤ 𝐻(𝑖) + 𝛼    ∀𝑖    (2) 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


39 
 

𝑥(𝑔, 𝑃(𝑖))  ≤  𝑥(𝑔)    ∀𝑖∀𝑔 ∈ 𝑈      (3) 

∑   𝑔∈𝐶  𝑥(𝑔, 𝑃(𝑖))  ≥ 1    ∀𝑖∀𝐶 ∈ 𝑆𝑃(𝑖)       (4) 

Parameterization of the Fair Cohort Target Set Problem 

In the Fair Cohort Target Set ILP shown above, we give more preference towards minimizing 

number of genes needed in the CTS. However, we do not take into account the number of non-

tumor cells killed. Killing (covering) too many non-tumor cells potentially hurts patients. In 

order to avoid that, we add an additional constraint to both the ILP for the local instances and the 

global instance. Intuitively, for patient 𝑃(𝑖), given an upper bound of the portion of the non-

tumor cell killed 𝑈𝐵, we want to find the smallest cohort target set 𝐻(𝑖)  with the following 

properties: 

1. 𝐻(𝑖)  covers all the tumor cells of patient 𝑃(𝑖). 

2. 𝐻(𝑖)  covers at most 𝑈𝐵 ∗ |𝑁𝑇𝑃(𝑖)| where 𝑁𝑇𝑃(𝑖) is the set of non-tumor cells known for 

patient 𝑃(𝑖); the number of non-tumor cells killed is represented by the variable 𝑦. 

The ILP can be formulated as follows: 

𝑚𝑖𝑛 ∑   𝑔∈𝑈  𝑥(𝑔)       (1) 

∑   𝑔∈𝐶  𝑥(𝑔)  ≥ 1   ∀𝐶 ∈ 𝑆𝑃(𝑖)       (2) 

𝑦(𝐶) ≥ 𝑚𝑎𝑥 
𝑔∈𝐶

𝑥(𝑔)   ∀𝐶 ∈ 𝑁𝑇𝑃(𝑖)       (3) 

∑   𝐶  𝑦(𝐶)  ≤ 𝑈𝐵 ∗ |𝑁𝑇𝑃(𝑖)|   ∀𝐶 ∈ 𝑁𝑇𝑃(𝑖)       (4) 

With this formulation, the existence of a feasible solution is not guaranteed. However, covering 

all tumor cells might not always be necessary either. This statement can be justified as (1) 

measuring data is not always accurate, and some tumor cells could be missing and (2) in some 

cases, it might be possible to handle uncovered tumor cells using different methods. Hence, we 

add another parameter 𝐿𝐵 to let us model this scenario. In the high-level, this is the ratio of the 

tumor cells we want to cover. The ILP can be formulated as follows: 

𝑚𝑖𝑛 ∑   𝑔∈𝑈  𝑥(𝑔)       (1) 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


40 
 

∑   𝐶  𝑦(𝐶)  ≥ 𝐿𝐵 ∗ |𝑆𝑃(𝑖)|   ∀𝐶 ∈ 𝑆𝑃(𝑖)       (2) 

𝑦(𝐶)  ≥ 𝑚𝑎𝑥 
𝑔∈𝐶

𝑥(𝑔)   ∀𝐶 ∈ 𝑆𝑃(𝑖) ∪ 𝑁𝑇𝑃(𝑖)       (3) 

∑   𝐶  𝑦(𝐶)  ≤ 𝑈𝐵 ∗ |𝑁𝑇𝑃(𝑖)|   ∀𝐶 ∈ 𝑁𝑇𝑃(𝑖)       (4) 

Notice that the constraint (2) here is different from the one above as we only care about the total 

number of tumor cells covered. 

Even with both 𝑈𝐵 and 𝐿𝐵, the feasibility of the ILP is still not guaranteed. However, 

modeling the ILP in this way allows us to parameterize the ILP for various other scenarios of 

interest. While the two ILPs above are designed for one patient, one can extend these ILPs for 

multi-patient cohort.  

𝑚𝑖𝑛 ∑   𝑔∈𝑈  𝑥(𝑔)       (1) 

∑   𝑔∈𝐶  𝑥(𝑔, 𝑃(𝑖))  ≤ 𝐻(𝑖) + 𝛼    ∀𝑖∀𝐶 ∈ 𝑆𝑃(𝑖)    (2) 

𝑥(𝑔, 𝑃(𝑖))  ≤  𝑥(𝑔)    ∀𝑖, 𝑔 ∈ 𝑈      (3) 

𝑦(𝐶, 𝑃𝑃(𝑖)) ≥ 𝑚𝑎𝑥 
𝑔 ∈𝐶

𝑥(𝑔, 𝑃(𝑖)) ∀𝑖∀𝐶 ∈ 𝑆𝑃(𝑖)    (4)   

∑   𝐶  𝑦(𝐶, 𝑃𝑃(𝑖)) ≥ 𝐿𝐵 ∗ |𝑆𝑃(𝑖)| ∀𝑖     (5) 

∑   𝐶  𝑦(𝐶, 𝑃𝑃(𝑖)) ≤ 𝑈𝐵 ∗ |𝑁𝑇𝑃(𝑖)| ∀𝑖       (6) 

Implementation Note, Accounting for Multiple Optima and Software 

Availability 

We implemented in Python 3 the above fair cohort target set formulations, with the expression 

ratio 𝑟 as an option when non-tumor cells are available. The parameters 𝛼, 𝑙𝑏, 𝑢𝑏 can be set by 

the user in the command line. To solve the ILPs to optimality we usually used the SCIP library 

and its Python interface891. To obtain multiple optimal solutions of equal size we used the Gurobi 

library (https://www.gurobi.com) and its python interface. When evaluating multiple optima, for 

all feasible instances, we sampled 50 optimal solutions that may or may not be distinct, using the 

Gurobi function select_solution(). To determine how often each gene or pair of genes occur in 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://www.gurobi.com/
https://doi.org/10.1101/2020.01.28.923532


41 
 

optimal solutions, we computed the arithmetic mean of gene frequencies and gene pair 

frequencies over all sampled optimal solutions.  

The software package is called MadHitter. The main program is called hitting_set.py. We 

include in MadHitter a separate program to sample cells and generate replicates, called 

sample_columns.py. So long as one seeks only single optimal solutions for each instance, exactly 

one of SCIP and Gurobi is sufficient to use MadHitter. We verified that SCIP and Gurobi give 

optimal solutions of the same size. If one wants to sample multiple optima, this can be done only 

with the Gurobi library. The choice between SCIP and Gurobi and the number of optima to 

sample are controlled by command-line parameters use_gurobi and num_sol, respectively. The 

MadHitter software is available on GitHub at https://github.com/ruppinlab/madhitter 

Acknowledgements 

This research is supported in part by the Intramural Research program of the National Institutes 

of Health, National Cancer Institute. This research is supported in part by the University of 

Maryland Year of Data Science Program. This research is supported in part by start-up funds 

from Northwestern University and a research award from Amazon to support the research of 

S.K. This work utilized the computational resources of the NIH HPC Biowulf cluster. 

(http://hpc.nih.gov). Thanks to E. Michael Gertz for technical assistance with SCIP, Gurobi, and 

Biowulf. Thanks to Allon Wagner, Keren Yizhak and Sushant Patkar for assistance in 

identifying and retrieving suitable single-cell RNAseq data sets. Thanks to Leandro Hermida for 

technical advice. 

Competing Interests 

The authors declare that they have no competing interests. 

References 

1. Von Hoff D.D., et al. Pilot study using molecular profiling of patients’ tumors to find 
potential targets and select treatments for their refractory cancers. J. Clin. Oncol. 28(33), 

4877-4883 (2010). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://github.com/ruppinlab/madhitter
http://hpc.nih.gov/
https://doi.org/10.1101/2020.01.28.923532


42 
 

2. Schütte M., et al. Cancer precision medicine: Why more is more and DNA is not enough. 
Pub. Health Genomics 20(2):70-80 (2017). 

3. Jameson, G.S., et al. A pilot study utilizing multi-omic molecular profiling to find potential 
targets and select individualized treatments for patients with previously treated metastatic 

breast cancer. Breast Cancer Res. Treat. 147(3), 579-588 (2014).  

4. Saulnier Sholler, G.L., et al. Feasibility of implementing molecular-guided therapy for the 
treatment of patients with relapsed or refractory neuroblastoma. Cancer Med. 4(6):871-886 

(2015). 

5. Byron, S.A., et al. Prospective feasibility trial for genomics-informed treatment in recurrent 
and progressive glioblastoma. Clin. Cancer Res. 24(2), 295-305 (2018). 

6. Schwaederle, M., et al. Association of biomarker-based treatment strategies with response 
rates and progression-free survival in refractory malignant neoplasms: A meta-analysis. 

JAMA Oncology  2(11), 1452-1459 (2016). 

7. Arnedos, M., Vielh, P., Soria, J.C. & Andre, F. The genetic complexity of common cancers 
and the promise of personalized medicine: is there any hope? J. Pathol.  232(2): 274-282 

(2014). 

8. Nikanjam, M., Liu, S., Yang, J. & Kurzrock, R. Dosing three-drug combinations that include 
targeted anti-cancer agents: Analysis of 37,763 patients. Oncologist 22(5), 576-584 (2017). 

9. Rebollo, J. et al. Gene expression profiling of tumors from heavily pretreated patients with 
metastatic cancer for the selection of therapy: A pilot study. Am. J. Clin. Oncol. 40(2), 140-

145 (2017). 

10. Sureda, M., et al. Determining personalized treatment by gene expression profiling in 
metastatic breast carcinoma patients: a pilot study. Clin. Trans. Oncol. 20(6), 785-793 

(2018). 

11. Sicklick, J.K., et al. Molecular profiling of cancer patients enables personalized combination 
therapy: the I-PREDICT study. Nat. Med. 25(5), 744-750 (2019). 

12. Joo, J.I., et al. Realizing cancer precision medicine by integrating systems biology and 
nanomaterial engineering. Adv. Mater. 32(35):e1906783 (2020). 

13. Marusyk, A. & Polyak, K. Tumor heterogeneity: Causes and consequences. Biochimica et 
Biophysica Acta 1805(1), 105-117 (2010). 

14. McGranahan, N. & Swanton, C. Biological and therapeutic impact of intratumor 
heterogeneity in cancer evolution. Cancer Cell 27(1):15-26 (2015). 

15. Yofe, I., Dahan, R. & Amit I. Single-cell genomic approaches for developing the next 
generation of immunotherapies. Nat. Med. 26(2), 171-177 (2020). 

16. Neelapu, S.S., et al. Axicabatagene ciloleucel CAR T-cell therapy in refractory large B-cell 
lymphoma. New Engl. J. Med. 377(26):2531-2544, 2017. 

17. Bjorn, M.J., Ring, D. & Frankel, A. Evaluation of monoclonal antibodies for the 
development of breast cancer immunotoxins. Cancer Res. 45(3), 1214-1221 (1985). 

18. Pastan, I., Willingham, M.C. & FitzGerald D.J.P. Immunotoxins. Cell 47(5), 641-648 (1986). 
19. Gray, B.P. & Brown, K.C. Combinatorial peptide libraries: mining for cell-binding peptides. 

Chem. Rev.  114(2), 1020-1081 (2014). 

20. Liu, R., Li, X., Xiao, W. & Lam, K.S. Tumor-targeting peptides from combinatorial libraries. 
Adv. Drug Delivery Rev. 110-111, 13-37 (2017). 

21. Fisher, S.L. & Phillips, A.J. Targeted protein degradation and the enzymology of degraders. 
Curr. Opin. Chem. Biol.  44, 47-55 (2018). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://www.ncbi.nlm.nih.gov/pubmed/28595192
https://www.ncbi.nlm.nih.gov/pubmed/?term=Jameson%20GS%5BAuthor%5D&cauthor=true&cauthor_uid=25209003
https://www.ncbi.nlm.nih.gov/pubmed/?term=Jameson+GS%5Bauth%5D+2014
https://www.ncbi.nlm.nih.gov/pubmed/25720842
https://www.ncbi.nlm.nih.gov/pubmed/25720842
https://www.ncbi.nlm.nih.gov/pubmed/27273579
https://www.ncbi.nlm.nih.gov/pubmed/27273579
https://www.ncbi.nlm.nih.gov/pubmed/28424323
https://www.ncbi.nlm.nih.gov/pubmed/28424323
https://www.ncbi.nlm.nih.gov/pubmed/25144266
https://www.ncbi.nlm.nih.gov/pubmed/25144266
https://doi.org/10.1101/2020.01.28.923532


43 
 

22. Plückthun A. Designed ankyrin repeat proteins (DARPins): binding proteins for research, 
diagnostics, and therapy. Annu. Rev. Pharmacol. Toxicol. 55, 489-511 (2015). 

23. Sokolova, E.A., et al. HER2-specific targeted toxin DARPin-LoPE: Immunogenicity and 
antitumor effect on intraperitoneal ovarian cancer xenograft model. Int. J. Mol. Sci. 20(10). 

pii: E2399 (2019). 

24. Dannenfelser, R., et al., Discriminatory power of combinatorial antigen recognition in cancer 
T cell therapies. Cell Syst. 11(3):215-228 (2020). 

25. MacKay, M., et al. The therapeutic landscape for cells engineered with chimeric antigen 
receptors. Nat. Biotech. 38(2), 233-244 (2020). 

26. Maude, S.L., et al. Tisagenlecleusel in children and young adults with B-cell lymphoblastic 
leukemia. New Engl. J. Med. 378(5):439-448 (2018). 

27. Lamers, C.H., et al. Treatment of metastatic renal cell carcinoma with CAIX CAR-
engineered T cells: clinical evaluation and management of on-target toxicity. Mol. Ther. 

21(4):904-912 (2013). 

28. Thistlethwaite, F.C., et al. The clinical efficacy of first-generation carcinoembryonic antigen 
(CEACAM5)-specific CAR T cells is limited by poor persistence and transient pre-

conditioning-dependent respiratory toxicity. Cancer Immunol. Immunother. 66(11):1425-

1436 (2017). 

29. Fedorov, V.D., Themeli, M., Sadelain, M. PD-1- and CTLA-4-based inhibitory chimeric 
antigen receptors (iCARs) divert off-target immunotherapy responses. Sci. Transl. Med. 

5(215):215ra172 (2013). 

30. Grada, Z., et al. TanCAR: A novel bispecific chimeric antigen receptor for cancer 
immunotherapy. Mol. Ther. Nucl. Acids 2:e105 (2013). 

31. Hegde, M., et al. Combinational targeting offsets antigen escape and enhances effector 
function of adoptively transferred T cells in glioblastoma. Mol. Ther. 21(11):2087-2101 

(2013). 

32. Roybal, K.T., et al. Engineering T cells with customized therapeutic response using synthetic 
Notch receptors. Cell 167(2):419-43.e16 (2016). 

33. Williams, J.Z., et al. Precise T cell recognition programs designed by transcriptionally 
linking multiple receptors. Science 370(6520):1099-1104 (2020). 

34. Říhová, B. Receptor-mediated targeted drug or toxin delivery. Adv. Drug Deliv. Rev. 29(3), 
273-289 (1998). 

35. Tortorella, S. & Karagiannis, T.C. Transferrin receptor-mediated endocytosis: a useful target 
for cancer therapy. J. Membr. Biol. 247(4), 291-307 (2014). 

36. Karp, R.M. Reducibility among combinatorial problems. In Complexity of Computer 
Computations, pp. 85-103 (Plenum Press, New York, 1972). 

37. Martinez-Veracoechea, F.J. & Frenkel, D. Designing super selectivity in multivalent nano-
particle binding. Proc. Natl Acad. Sci. USA 108, 10963-10968 (2011). 

38. Delaney, C., et al. Combinatorial prediction of marker panels from single-cell transcriptomic 
data. Mol. Syst. Biol. 15(10), e9005 (2019).  

39. Müller, S., et al. A role for receptor tyrosine phosphatase zeta in glioma cell migration. 
Oncogene 22(43), 661-668 (2003). 

40. Ulbricht, U. et al. Expression and function of the receptor protein tyrosine phosphatase zeta 
and its ligand pleiotrophin in human astrocytomas. J. Neuropathol. Exp. Neurol. 62(12), 

1265-1275 (2003). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


44 
 

41. Chen, H.M., et al. Enhanced expression and phosphorylation of the MET oncoprotein by 
glioma-specific PTPRZ1-MET fusions. FEBS Lett. 589(13):1437-1443 (2015). 

42. Ulbricht, U., Eckerich, C., Fillbrandt, R., Westphal, M. & Lamszus, K. RNA interference 
targeting protein tyrosine phosphatase zeta/receptor-type protein tyrosine phosphatase beta 

suppresses glioblastoma growth in vitro and in vivo. J. Neurochem. 98, 1497–1506 (2006). 

43. Bourgonje. A.M., et al. Intracellular and extracellular domains of protein tyrosine 
phosphatase PTPRZ-B differentially regulate glioma cell growth and motility. Oncotarget 

5(18), 8690-8702 (2014). 

44. Fujikawa, A., et al. Targeting PTPRZ inhibits stem cell-like properties and tumorigenicity in 
glioblastoma cells. Sci. Rep. 7:5609 (2017). 

45. Pastor, M., et al. Development of inhibitors of receptor protein tyrosine phosphatase β/ζ 
(PTPRZ1) as candidates for CNS disorders. Eur. J. Med. Chem. 144:318-329 (2018). 

46. Darmanis, S., et al. Single-cell RNA-Seq analysis of infiltrating neoplastic cells at the 
migrating front of human glioblastoma. Cell Rep. 21, 1399-1410 (2017). 

47. Bergelson, J.M., et al. Isolation of a common receptor for coxsackie B viruses and 
adenoviruses 2 and 5. Science 275(5304): 1320-1323 (1997). 

48. Nilchian, A., et al. CXADR-mediated formation of an AKT inhibitory signalosome at tight 
junctions controls epithelial-mesenchymal plasticity in breast cancer. Cancer Res. 79(1):47-

60 (2019). 

49. Arfelt K.N., et al. Signaling via G proteins mediates tumorigenic effects of GPR87. Cell. 
Signal. 30:9-18 (2017). 

50. Zhang, Y., Qian, Y., Lu, W., Chen, X. The G protein-coupled receptor 87 is necessary for 
p53-dependent cell survival in response to genotoxic stress.  Cancer Res. 69(15):6049-6056 

(2009). 

51. Wang, L., et al. Overexpression of G protein-coupled receptor GPR87 promotes pancreatic 
cancer aggressiveness and activates NF-κB signaling pathway. Mol. Cancer 16:61 (2017). 

52. Szklarczyk, D., et al. STRING v11: protein-protein association networks with increased 
coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic 

Acids Res. 2019; 47(D1), D607-D613 (2019). 

53. Seoane, J. & De Mattos-Arruda L. The challenge of intratumour heterogeneity in precision 
medicine. J. Intern. Med.  276(1), 41-51 (2014). 

54. Van Dijk D, et al. Recovering gene interactions from single-cell data using data diffusion. 
Cell 174(3):716-729.e27 (2018). 

55. Li, W.V. & Li, J.J. An accurate and robust imputation method scImpute for single-cell RNA-
seq data. Nat. Comm. 9, 997 (2018). 

56. Raj, A. & van Oudenaarden, A. Stochastic gene expression and its consequences. Cell 
135(2): 216-226 (2008). 

57. Kim, J.Y. & Marioni, J.C. Inferring the kinetics of stochastic gene expression from single-
cell RNA-sequencing data. Genome Biol. 14, R7 (2013). 

58. Clough, E. & Barrett, T. The Gene Expression Omnibus Database. Meth. Mol. Biol. 1418, 
93-110 (2016). 

59. Kolesnikov, N., et al. ArrayExpress update--simplifying data submissions. Nucleic Acids Res. 
43(Database Issue), D1113-D1116 (2015). 

60. Cao, Y., Zhu, J., Jia, P. & Zhao Z. scRNASeqDB: A database for RNA-seq based gene 
expression profiles in human single cells. Genes 8(12), 368 (2017). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


45 
 

61. Yuan J, et al. Single-cell transcriptome analysis of lineage diversity in high-grade glioma. 
Genome Med. 10(1), 57 (2018). 

62. Karaayvaz, M., et al. Unravelling subclonal heterogeneity and aggressive disease states in 
TNBC through single-cell RNA-seq. Nat. Comm. 9(1), 3588 (2018). 

63. Jerby-Arnon, L., et al. A cancer cell program promotes T cell exclusion and resistance to 
checkpoint blockade. Cell 175(4):984-997.e24 (2018). 

64. Tirosh, I., et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell 
RNA-seq. Science 352(6282), 189-196 (2016). 

65. Chung, W., et al. Single-cell RNA-seq enables comprehensive tumour and immune cell 
profiling in primary breast cancer. Nat. Comm.  8,15081 (2017). 

66. Venteicher, A.S., et al. Decoupling genetics, lineages, and microenvironment in IDH-mutant 
gliomas by single-cell RNA-seq. Science 355(6332): pii:eaai8478 (2017). 

67. Tirosh, I., et al. Single-cell RNA-seq supports a developmental hierarchy in human 
oligodendroglioma. Nature 539(7628), 309-313 (2016a). 

68. Patel, A.P., et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary 
glioblastoma. Science 344(6190), 1396-1401 (2014). 

69. Filbin, M.G., et al. Developmental and oncogenic programs in H3K27M gliomas dissected 
by single-cell RNA-seq. Science 360(6386), 331-335 (2018). 

70. Li, H., et al. Reference component analysis of single-cell transcriptomes elucidates cellular 
heterogeneity in human colorectal tumors. Nat. Genet. 49(5), 708-718 (2017). 

71. Puram, S.V., et al. Single-cell transcriptomic analysis of primary and metastatic tumor 
ecosystems in head and neck cancer. Cell 171(7), 1611-1624.e24 (2017). 

72. Shih, A.J., et al. Identification of grade and origin specific cell populations in serous 
epithelial ovarian cancer by single cell RNA-seq. PLoS ONE 13(11), e0208778 (2018). 

73. Miyamoto, D.T., et al. RNA-Seq of single prostate CTCs implicates noncanonical Wnt 
signaling in antiandrogen resistance. Science 349, 1351-1356 (2015). 

74. Lambrechts, D., et al. Phenotype molding of stromal cells in the lung tumor 
microenvironment. Nat. Med. 24, 1277-1289 (2018). 

75. Uhlén, M., et al. Tissue-based map of the human proteome. Science 347(6220), 1260419 
(2015). 

76. The GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. 45(6), 
580-585 (2013). 

77. Bausch-Fluck, D., et al. The in silico human surfaceome. Proc. Natl. Acad. Sci USA 115, 
E10988-E10997 (2018). 

78. Gainer-Dewar, A. & Vera-Lincona, P. The minimal hitting set generation problem: 
Algorithms and computation. SIAM J. Discr. Math. 31, 63-100 (2017). 

79. Haedlicke, O. & Klamt, S. Computing complex metabolic intervention strategies using 
constrained minimal cut sets. Metabolic Eng. 13, 204-213 (2011). 

80. Haus, U.-U., Klamt, S. & Stephen, T. Computing knock-out strategies in metabolic networks. 
J. Comput. Biol. 15(3), 259-268 (2008). 

81. Jarrah, A.S., Laubenbacher, R., Stigler, B. & Stillman, M. Reverse-engineering of 
polynomial dynamical systems. Adv. Appl. Math.  39, 477-489 (2007). 

82. Klamt, S. & Gilles, E.D. Minimal cut sets in biochemical reaction networks. Bioinformatics 
20, 226-234 (2004). 

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532


46 
 

83. Trinh, C.T., Wlaschin, A. & Srienc F. Elementary mode analysis: a useful metabolic pathway 
analysis tool for characterizing cellular metabolism. Appl. Microbiol. Biotech. 81, 813-826 

(2009). 

84. Ideker, T. Discovery of regulatory interactions through perturbation: inference and 
experimental design. Pac. Symp. Biocomput. 5, 302-313 (2000). 

85. Wang, R.S. & Albert, R. Elementary signaling modes predict the essentiality of signal 
transduction network components. BMC Syst. Biol. 5, 44 (2011). 

86. Zvedei-Oancea, I. & Schuster, S. A theoretical framework for detecting signal transfer routes 
in signaling networks. Comput.  Chem. Engineer. 29, 597-617 (2005). 

87. Vazquez A. Optimal drug combinations and minimal hitting sets. BMC Syst. Biol., 3, 81 
(2009). 

88. Mellor, D., Prieto, E., Mathieson, L. & Moscato, P. A kernelisation approach for multiple d-
hitting set and its application in optimal multi-drug therapeutic combinations. PLoS ONE  

5(10), e13055 (2010). 

89. Vera-Licona, P., Bonnet, E., Brillot, E & Zinovyev, A. OCSANA: optimal combinations of 
interventions from network analysis. Bioinformatics 29, 1571-1573 (2013). 

90. Pang, K., et al. Combinatorial therapy discovery using mixed integer linear programming. 
Bioinformatics 30, 1456-1463 (2014). 

91. Achterberg, T. SCIP: Solving constraint integer programs. Math. Program. Comput.  1(1), 1-
41 (2009).  

105 and is also made available for use under a CC0 license. 
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 

The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.01.28.923532doi: bioRxiv preprint 

https://doi.org/10.1101/2020.01.28.923532