key: cord-0822667-oq64d1mb
authors: Goradia, Tushar Madhu; Lange, Kenneth
title: Applications of coding theory to the design of somatic cell hybrid panels
date: 1988-10-31
journal: Mathematical Biosciences
DOI: 10.1016/0025-5564(88)90014-4
sha: 5b1c1827c38b7898739d23b335dd606eb9f57514
doc_id: 822667
cord_uid: oq64d1mb

Abstract The somatic cell hybridization technique for gene mapping depends on assembling panels of rodent-human hybrid clones containing random subsets of the human chromosomes. Such panels should be as informative as possible and permit error detection and error correction for assays of the human gene in the various clones. We derive estimates of the number of randomly generated clones required to be reasonably confident of accurately and unambiguously assigning a gene to a particular human chromosome. The collection of clones in such a random panel is contrasted with minimal panels suggested by algebraic coding theory. To approximate minimal panels we suggest the method of simulated annealing for selecting small, informative panels from larger existing collections of clones. These theoretical insights emphasize the need for more collaboration and coordination among gene mapping groups so that optimal clone panels can be assembled, stored, and distributed.

ing hybrid somatic cells retain all of the rodent chromosomes while losing a random subset of the human chromosomes. A few generations after fusion, clones can be identified with stable subsets of the human chromosomes. All chromosomes, human and rodent, normally remain functional. With a broad enough collection of different hybrid clones, it is possible to establish a correspondence between the presence or absence of a given human gene and the presence or absence of each of the 24 distinct human chromosomes. From this pattern one can infer the particular chromosome on which the gene resides. With the above outline in mind, it is of interest to determine the minimal number of distinct hybrid clones required to accurately and unambiguously assign a gene to a particular human chromosome. Such collections or panels of hybrid clones ideally should be designed to detect and correct a small number of errors in assays for the human gene or its protein product. By using some ideas from algebraic coding theory and probability, we contrast such minimal panels with the random panels typically generated by molecular geneticists.

In developing some of the logical and mathematical consequences of the somatic cell hybrid technique, it seems prudent to state explicitly its underlying assumptions. Some violations of these assumptions will be noted below. The major assumptions are:

(a) The human gene G to be mapped is present on exactly one chromosome.

(b) Any rodent analogue of G is distinguishable from G at either the protein or the DNA level.

(c) Each of the 24 distinct human chromosomes (22 autosomes and the X and Y sex chromosomes) is either absent from a clone or is cytologically or biochemically detectable in the clone.

(d) All cells within a clone share the same chromosome constitution.

(e) The presence or absence of G can be accurately detected in each clone.

Although assumption (a) is generally satisfied, a few human genes are scattered at multiple dispersed sites within the human genome; the gene encoding ribosomal RNA is a case in point [9] . Assumption (b) can be fulfilled if the human gene G and its rodent analogue are distinguishable by either electrophoresis of their gene products or by hybridization of an appropriate human DNA probe. Assumption (c) is easily met because human chromosomes can be accurately identified by a combination of human specific isozyme marker assays and karyotypic analysis using standard banding procedures. Violations of this condition usually result from the presence of unrecognized chromosomal aberrations such as insertions, deletions, translocations, and fragmentations. However, such chromosome aberrations are often intentionally employed for the regional mapping of genes on a specific chromosome. Assumption (d) can be violated if some of the cells in a clone continue to undergo human chromosome loss. For this reason several cells from a clone should be karyotyped. It is the maximal subset of human chromosomes in the clone which is relevant to gene localization. As a safeguard, ambiguous clones should be disregarded. The last assumption, (e), causes the most trouble. For instance, ambiguities can arise when phenotypic polymorphism in structural genes cannot be distinguished from phenotypic polymorphism in associated regulatory genes [6] . In addition, not all genes are constitutive in the sense that they are expressed at all times in all cell types [32] . Use of DNA probes to detect the human gene neatly circumvents both of these problems. Finally, laboratory error can enter into both enzyme and probe hybridization assays. The method of in situ hybridization carries the technique of human probe hybridization one step further. 
If a human cell is karyotyped and radioactive grains corresponding to the hybridized probe cluster predominantly on a given chromosome, then the gene is declared to reside on that chromosome. In practice, the independent results of somatic cell hybridization and in situ hybridization tend to reinforce each other [22] . In the current paper we will be concerned solely with issues of redundancy and efficiency in the somatic cell hybrid method. It has long been appreciated that certain redundancies in panels of somatic cell hybrid clones can self-detect and self-correct phenotyping errors representing violations of assumption (e) [17] . We will attempt to explain the utility of these error detection and correction capabilities as well as the amount of effort necessary to generate at random panels of clones for such purposes. These randomly generated panels we contrast with minimal panels suggested by algebraic coding theory. We also provide a practical solution to the combinatorial problem of selecting small, informative panels from larger existing collections of clones. As examples, we select good panels of sizes 5 through 20 clones from 189 published clones.

To model mathematically the design of hybrid clone panels, we borrow and reinterpret some concepts from communications engineering. A modicum of notation is required. Let n denote the number of distinct hybrid clones in a panel. Since in females the 22 autosomes and the X chromosome occur in homologous pairs, and since the Y chromosome bears few genes of interest, we focus on clones derived from human female cells. We may construct a karyotype matrix K consisting of n rows and 23 columns. The entry in row i and column j of K is 1 if clone i contains chromosome j; otherwise it is 0. See Figure 1 for an example. Note that the X chromosome

Besides comparing columns of K, it is appropriate to compare the columns of K with the results of testing for a given gene G in the different hybrid clones. We can construct a phenotype column vector p whose ith entry is 1 when the ith hybrid clone contains G; otherwise it is 0. From the assumptions (a) through (e) of the introduction, G can reside on chromosome r only if $p = c_r$, where $c_r$ denotes column r of K. If $c_r$ is distinct from all other columns of K, then G can be assigned to chromosome r. In terms of the Hamming distance $\rho$, which counts the positions in which two vectors differ, this is equivalent to the two conditions $\rho(p, c_r) = 0$ and $\rho(c_r, c_s) > 0$ for all $s \ne r$. As the number n of randomly generated clones increases, satisfying these two conditions for unique chromosome assignment becomes more likely.

In practice, errors can occur in detecting G in the various hybrid clones. These errors affect p and are probably more common than errors affecting the definition of the karyotype matrix K. (Karyotype errors are considered in the discussion.) Let $p_{obs}$ represent the observed phenotype test results for the different hybrid clones. The number of phenotyping errors is $\rho(p, p_{obs})$. Some of these errors will be false positives in detecting G, and some will be false negatives. If $p$ and $p_{obs}$ are identical, then there are no phenotyping errors.

in K can compensate for a limited number of errors in $p_{obs}$, as two well-known propositions from coding theory demonstrate [14]. The first deals with the ability to detect errors. Suppose G lies on chromosome r and we know a priori that the number of errors $\rho(p, p_{obs}) \le m$ for some positive integer m. If the minimal Hamming distance to column $c_r$ satisfies

$$\min_{s \ne r} \rho(c_r, c_s) > m, \qquad (1)$$

then the fact that there are errors in $p_{obs}$ is detectable. The idea of the proof consists in showing that $p_{obs}$ is incompatible with G residing on any chromosome if $\rho(c_r, p_{obs}) > 0$. Indeed, consider any chromosome s different from r. Then by the triangle inequality and the condition (1),

$$\rho(c_s, p_{obs}) \ge \rho(c_s, c_r) - \rho(c_r, p_{obs}) > m - m = 0.$$

Hence, $p_{obs}$ cannot coincide with any column, and there must be at least one error.

Although error detection can alert us to inconsistencies, it will not remedy them. For error correction, suppose we know a priori that the number of errors $\rho(p, p_{obs}) \le m$ for some positive integer m. Assuming again that G resides on chromosome r, the more stringent condition

$$\min_{s \ne r} \rho(c_r, c_s) > 2m \qquad (2)$$

permits error correction. To prove this proposition one needs to argue that $\rho(p_{obs}, c_s) > m$ for any chromosome s different from r, so that $c_r$ is the only column within distance m of $p_{obs}$. By the triangle inequality and the condition (2),

$$\rho(p_{obs}, c_s) \ge \rho(c_r, c_s) - \rho(c_r, p_{obs}) > 2m - m = m.$$

As a consequence one can infer that G resides on r.
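The two propositions can be illustrated with a short Python sketch. The toy panel below (four hypothetical chromosome columns observed in five clones) and the helper names `hamming`, `min_column_distance`, and `assign_gene` are assumptions made for illustration only; they do not come from Figure 1 or from any published panel. Every pair of toy columns is at least 3 apart, so with m = 1 the condition (2) holds and a single phenotyping error can be corrected by picking the unique column within distance m of the observed phenotype vector.

```python
from itertools import combinations
from typing import List, Optional, Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions in which two equal-length 0/1 vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v))


def min_column_distance(columns: List[Sequence[int]]) -> int:
    """Smallest Hamming distance over all pairs of columns."""
    return min(hamming(u, v) for u, v in combinations(columns, 2))


def assign_gene(columns: List[Sequence[int]],
                p_obs: Sequence[int], m: int) -> Optional[int]:
    """Index of the unique column within distance m of p_obs, else None."""
    close = [j for j, c in enumerate(columns) if hamming(c, p_obs) <= m]
    return close[0] if len(close) == 1 else None


# Hypothetical toy panel: each inner list is one chromosome column over 5 clones.
TOY_COLUMNS = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

# True phenotype equals column 1; the first clone is mis-scored (one error).
P_OBS = [0, 1, 1, 0, 0]

# Detection: the observed vector matches no column exactly.
ERROR_DETECTED = all(hamming(c, P_OBS) > 0 for c in TOY_COLUMNS)
# Correction: column 1 is the only column within distance 1.
ASSIGNED = assign_gene(TOY_COLUMNS, P_OBS, m=1)
```

With two or more errors the guarantee no longer applies to this toy panel, and `assign_gene` may return `None` or even a wrong column.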

Because we do not know in advance what chromosome G lies on, it is useful to construct karyotype matrices for which the condition (1) or (2) holds for all possible columns r. The matrix K in Figure 1 furnishes some examples. The first five rows alone permit correct gene placement in the absence of errors, since all columns are distinct. In fact, these columns just represent the binary expansions of various numbers between 0 and 31. The first six rows of Figure 1 have all column pairs a distance 2 or more apart. Hence, these six rows permit detection of one error regardless of which chromosome G lies on. The sixth row was constructed by forcing the column sums of the first six rows to be even. The pairwise column distances for the whole matrix in Figure 1 are always at least 3. Hence, all nine rows permit the correction of a single phenotyping error. There are no simple algorithms to construct this and more complicated examples, but many such matrices have been published in the coding theory literature [27, 36, 42]. The karyotype matrix in Figure 1 has much better error detecting and correcting properties than most random karyotype matrices of the same size. Our next aim is to investigate the number of random clones a laboratory geneticist would have to generate to achieve comparable results. To facilitate our analysis we require some more definitions and the introduction of simplifying assumptions.
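The parity construction described above can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the particular choice of columns (the binary expansions of 1 through 23) and the function names are assumptions for illustration; only the idea of appending a row that makes every column sum even is taken from the text.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions in which two equal-length 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def distinct_columns(n_cols=23, n_rows=5):
    """Column j holds the n_rows-bit binary expansion of j (assumed: 1..n_cols)."""
    return [[(j >> i) & 1 for i in range(n_rows)] for j in range(1, n_cols + 1)]


def add_parity_row(columns):
    """Append one entry per column so that every column sum becomes even."""
    return [col + [sum(col) % 2] for col in columns]


def min_distance(columns):
    """Smallest Hamming distance over all pairs of columns."""
    return min(hamming(u, v) for u, v in combinations(columns, 2))


five_rows = distinct_columns()          # 23 distinct columns over 5 clones
six_rows = add_parity_row(five_rows)    # same columns plus a parity clone

d5 = min_distance(five_rows)   # distinct columns: distance at least 1
d6 = min_distance(six_rows)    # even-weight columns: distance at least 2
```

Two distinct columns that differ in exactly one of the first five positions have different parities, so the appended entry raises their distance to 2. This is why the six-row panel detects any single error.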

To begin with, we now view the intercolumn distances $\rho(c_s, c_t)$ as random variables $X_{st}$; the number of clones is still fixed at n. The joint distribution of the $X_{st}$ over all column pairs {s, t} can be well approximated under the following assumptions:

(f) Human chromosomes are lost independently of one another during the formation of a stable clone.

(g) The probability that at least one member of a homologous pair of human chromosomes is retained by a clone is 1/2.

(h) The chromosome complements of different clones are independently determined.

These assumptions are almost certainly false in any strict accounting [33] . However, they are conservative assumptions in the sense that departures from them will result in panels with less information content on the average. In other words, generating good random panels is more difficult if they are violated. Assumption (g) represents a compromise motivated by the range of chromosome retention probabilities of .07 to .75 published by Rushton [33] . Assumption (h) can be almost guaranteed if different clones are generated by fusing different parental cell lines.

As a consequence of assumptions (f) through (h), the entries of the karyotype matrix K are independent random variables equally likely to take the values 0 or 1. From this, it is evident that each random distance $X_{st}$ follows a binomial probability distribution

$$P(X_{st} = m) = \binom{n}{m}\left(\tfrac{1}{2}\right)^{m}\left(\tfrac{1}{2}\right)^{n-m} = \binom{n}{m}\left(\tfrac{1}{2}\right)^{n}.$$

Due to the central limit theorem, $X_{st}$ will be approximately normally distributed even for n as small as 10 [10]. Less obvious is the fact that the $X_{st}$ are uncorrelated.

If collectively the $X_{st}$ actually followed a multivariate normal distribution, lack of correlation among the $X_{st}$ would imply they were independent [29]. We will exploit this near-independence momentarily. Returning to the problem of showing that they are uncorrelated, we note that when two pairs {s, t} and {u, v} do not overlap, it is intuitively obvious that $X_{st}$ and $X_{uv}$ are independent. Independence is a stronger property than lack of correlation.

If the pairs {s, t} and {u, v} share one column in common, say t = u, then $X_{st}$ and $X_{uv}$ are still independent. This becomes clear when one conditions on the outcome of column $c_t$. It must be emphasized here that assumption (g) is critical. Only a retention probability of 1/2 is consistent with independence.

Even larger subsets of the $X_{st}$ are independent. For instance, the collection of random distances from column r, $\{X_{rs} : s \ne r,\ r \text{ fixed}\}$, is independent. Again this follows by conditioning on column $c_r$. However, it is false that the whole collection of $X_{st}$ is independent. This subtlety enormously complicates the exact calculation of probabilities.
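These independence claims can be checked by exact enumeration in the smallest possible case. The Python sketch below is an illustration under stated assumptions: it considers a single clone (n = 1) and three columns, with each chromosome retained independently with probability p, and the function names are hypothetical. It confirms that two distances sharing a column are independent when p = 1/2 but not when p = 3/4, and that the three distances taken together are never independent because they cannot all equal 1.

```python
from fractions import Fraction
from itertools import product


def joint_distribution(p):
    """Exact law of (X12, X23, X13) for one clone and three columns.

    p is the retention probability of each chromosome (a Fraction).
    """
    dist = {}
    for bits in product((0, 1), repeat=3):
        prob = Fraction(1)
        for b in bits:
            prob *= p if b else 1 - p
        x = (int(bits[0] != bits[1]),
             int(bits[1] != bits[2]),
             int(bits[0] != bits[2]))
        dist[x] = dist.get(x, Fraction(0)) + prob
    return dist


def overlapping_pair_independent(p):
    """True if X12 and X23 (sharing column 2) are independent."""
    dist = joint_distribution(p)
    p12 = sum(pr for x, pr in dist.items() if x[0] == 1)
    p23 = sum(pr for x, pr in dist.items() if x[1] == 1)
    both = sum(pr for x, pr in dist.items() if x[0] == 1 and x[1] == 1)
    # For two 0/1 variables, this single product check settles independence.
    return both == p12 * p23


half = Fraction(1, 2)
independent_at_half = overlapping_pair_independent(half)
independent_at_three_quarters = overlapping_pair_independent(Fraction(3, 4))
# All three distances equal to 1 would need three mutually different bits.
prob_all_three_differ = joint_distribution(half).get((1, 1, 1), Fraction(0))
```

Exact fractions are used so that the comparisons are not affected by rounding.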

With these preliminaries, it is possible to approximate the probability distributions of two important random variables. $N_d^r$ denotes the random number of clones required for a fixed column r of K to be a distance d or greater from all other columns of K. $N_d$ is the random number of clones required for all pairs of columns to be a distance d or greater apart. $N_d$ is more relevant than $N_d^r$ when a laboratory group intends to map a large number of different genes using the same panel. $N_d^r$ is appropriate for mapping a single gene.

The distributions of $N_d^r$ and $N_d$ can be derived using the random variables $X_{st}$. Thus,

$$P(N_d^r \le n) = P\Big(\min_{s \ne r} X_{rs} \ge d\Big) = \Big[\sum_{m=d}^{n} \binom{n}{m}\left(\tfrac{1}{2}\right)^{n}\Big]^{c-1}, \qquad (3)$$

$$P(N_d \le n) = P\Big(\min_{\{s,t\}} X_{st} \ge d\Big) \approx \Big[\sum_{m=d}^{n} \binom{n}{m}\left(\tfrac{1}{2}\right)^{n}\Big]^{\binom{c}{2}},$$

where c = 23 is the number of columns. The formula (3)

$N_d$ denotes the random number of distinct hybrid clones required to achieve a karyotype matrix with every column at least a distance d from every other column. Because $N_d$ is discrete, we define the $\alpha$th percentile as the first integer n such that $P(N_d \le n) \ge \alpha/100$.

One of the side effects of employing panels with large numbers of clones is that we increase the expected number of gene detection errors. A rigorous analysis of the chances for correct gene placement should take this fact into account. To model errors in the phenotype column vector $p_{obs}$, suppose they occur independently in the various clones and have common rate q. The total number of errors will then be binomially distributed. If there are m such errors, and G resides on chromosome r, then we can correct the errors provided

$$\min_{s \ne r} X_{rs} > 2m.$$

This is just the condition (2). Thus the probability of correct gene placement given n clones reduces to

$$P(\text{correct gene placement} \mid n \text{ clones}) = \sum_{m=0}^{n} \binom{n}{m} q^{m} (1-q)^{n-m} \Big[\sum_{k=2m+1}^{n} \binom{n}{k}\left(\tfrac{1}{2}\right)^{n}\Big]^{c-1}$$

with c = 23. Figure 2 plots this probability versus n for various values of q. For instance, with q = 0.01, about 12 randomly generated clones suffice to place a given gene with 95% certainty. Also, for absurdly large q, e.g., q = 0.5, the probability of correct gene placement diminishes as more randomly generated clones are added.
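The probability of correct gene placement can be evaluated numerically. The Python sketch below assumes the model stated above: errors occur independently with rate q, the c - 1 distances from column r are independent Binomial(n, 1/2) variables, and m errors are corrected when every one of those distances exceeds 2m. The function names are hypothetical, and the sum over m is a reconstruction from these assumptions rather than a transcription of the original display.

```python
from math import comb


def tail_at_least(n, d):
    """P(X >= d) for X ~ Binomial(n, 1/2)."""
    if d <= 0:
        return 1.0
    return sum(comb(n, k) for k in range(d, n + 1)) / 2 ** n


def prob_correct_placement(n, q, c=23):
    """Probability that all errors are correctable with n random clones.

    n: number of clones, q: per-clone phenotyping error rate,
    c: number of chromosome columns.
    """
    total = 0.0
    for m in range(n + 1):
        # Probability of exactly m phenotyping errors among n clones.
        p_m = comb(n, m) * q ** m * (1 - q) ** (n - m)
        # All c - 1 distances from the true column must exceed 2m.
        total += p_m * tail_at_least(n, 2 * m + 1) ** (c - 1)
    return total


p11 = prob_correct_placement(11, 0.01)
p12 = prob_correct_placement(12, 0.01)
```

Under these assumptions 12 is the first panel size at which the probability reaches 0.95 for q = 0.01, in agreement with the figure quoted in the text, and for q = 0.5 the probability is smaller at n = 30 than at n = 10.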

As a final comment, we note that all the above mathematical results continue to hold when hybrid clones are cultured in a selective medium which promotes the retention of a particular human chromosome. For example, in a medium selecting for thymidine kinase, only hybrid cells which carry the gene encoding thymidine kinase on chromosome 17 will survive. In this case, column 17 of the karyotype matrix is predetermined to be a column of 1's. Thus, the distributions of $N_d^r$ and $N_d$ must be calculated conditional on a column of 1's. However, it is intuitively clear from symmetry considerations that the two events A = {a given column contains all 1's} and B = {$N_d^r \le n$} or B = {$N_d \le n$} are independent. Thus,

$$P(N_d^r \le n \mid \text{column of 1's}) = P(N_d^r \le n), \qquad P(N_d \le n \mid \text{column of 1's}) = P(N_d \le n).$$

Given an existing collection of clones, there are two prerequisites for choosing a subset of them to form a small, informative panel. First, some criterion of merit must be established for measuring the information content of each panel. As we have attempted to demonstrate, one reasonable criterion is the minimum Hamming distance between the column pairs of a panel. A refinement of this criterion is to take into account the number of column pairs which attain this minimum distance. Thus, we will adopt the criterion

where u is a given panel of clones, d is the minimum Hamming distance for u, k is the number of column pairs attaining this minimum distance, and $\binom{23}{2} = 253$ is the total number of pairs. The best panels have a low value for E. Because of the scaling of the second term in (5), a panel with a higher d will always be preferred to a panel with a lower d. Note that a fixed panel size n is implicit in the definition (5).
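The quantities d and k are easy to compute for any candidate panel. In the Python sketch below the function names are hypothetical, and the energy form E = -d + k / (number of column pairs) is an assumption: it is one expression that matches the verbal description (low values are best, and the second term never exceeds 1, so a higher d always wins), but it should not be read as the exact definition (5).

```python
from itertools import combinations
from math import comb


def min_distance_and_count(panel_rows):
    """Return (d, k) for a panel given as a list of clone rows of 0/1 entries.

    d is the minimum Hamming distance between chromosome columns and
    k is the number of column pairs attaining it.
    """
    columns = list(zip(*panel_rows))
    dists = [sum(a != b for a, b in zip(u, v))
             for u, v in combinations(columns, 2)]
    d = min(dists)
    return d, dists.count(d)


def energy(panel_rows):
    """Assumed criterion: -d + k / (number of column pairs). Lower is better."""
    d, k = min_distance_and_count(panel_rows)
    n_pairs = comb(len(panel_rows[0]), 2)
    return -d + k / n_pairs


# Two hypothetical toy panels over three chromosomes.
panel_a = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]   # columns pairwise 2 apart
panel_b = [[1, 0, 0], [0, 1, 0]]              # two column pairs only 1 apart
```

Here `panel_a` has (d, k) = (2, 3) and `panel_b` has (d, k) = (1, 2), so `panel_a` receives the lower energy.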

Having decided on the criterion E, the next problem is to find a panel which furnishes a minimum or near-minimum of E. Exhaustive enumeration of all possible panels is infeasible. For instance, with 50 clones and a panel size of 12, there are $\binom{50}{12} \approx 1.2 \times 10^{11}$ possible panels. Since no good deterministic algorithms exist for finding the minimum of E, we will describe three random sampling techniques. All three are implemented by a random exchange mechanism. Given an existing panel, a random clone currently in the panel is selected for exchange with a random clone outside the panel, but in the existing collection of clones. The first and most naive algorithm is to always exchange the two clones, producing a new panel with exactly one new member. As the exchanges progress, a record is kept of the best panel encountered. This simple algorithm basically amounts to random sampling from the collection of all possible panels.

A second and more directed algorithm is to make an exchange only if the value $E_2$ of the criterion for the new panel is at least as low as the value $E_1$ for the current panel. We will call this the random downhill algorithm. It wastes no time taking poor steps, but it potentially can get trapped at a local minimum.

Our third algorithm, the method of simulated annealing, represents a compromise between the first two algorithms [28, 19]. The early stages of simulated annealing resemble random sampling; later stages resemble the random downhill algorithm. Simulated annealing is motivated by the observation that a liquid cooled very slowly from a high temperature to a low temperature will crystallize in a state of minimum energy. To implement simulated annealing a parameter T analogous to temperature is gradually reduced to 0. The objective function E to be minimized is termed energy. In the present context simulated annealing can be realized as follows: Suppose a current panel with energy $E_1$ exists. Generate a new random panel with energy $E_2$ by the exchange step. Move to this new panel if $E_2 \le E_1$. If $E_2 > E_1$, then move to the new panel with the Boltzmann probability $\exp[-(E_2 - E_1)/T]$. As T tends to 0, this probability tends to 0 also. Thus, fewer and fewer unfavorable steps are taken as T approaches 0. As with the other two algorithms, a record is kept of the best panel encountered.
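The exchange step and the Boltzmann acceptance rule can be put together in a compact Python sketch. Several ingredients are assumptions made for illustration and are not taken from the paper: the energy form -d + k / (number of column pairs), the geometric cooling schedule with its start and end temperatures, the number of steps, and all function and parameter names. The implementation actually used for Table 4 follows [28].

```python
import math
import random
from itertools import combinations


def energy(clones, panel):
    """Assumed energy of a panel: -d + k / (number of column pairs)."""
    columns = list(zip(*(clones[i] for i in panel)))
    dists = [sum(a != b for a, b in zip(u, v))
             for u, v in combinations(columns, 2)]
    d = min(dists)
    return -d + dists.count(d) / len(dists)


def anneal(clones, size, steps=2000, t_start=1.0, t_end=0.001, seed=0):
    """Select `size` clone indices by simulated annealing; needs size < len(clones)."""
    rng = random.Random(seed)
    panel = rng.sample(range(len(clones)), size)
    e_cur = energy(clones, panel)
    best, e_best = list(panel), e_cur
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Exchange step: swap one random member for one random outsider.
        outside = rng.choice([i for i in range(len(clones)) if i not in panel])
        candidate = list(panel)
        candidate[rng.randrange(size)] = outside
        e_new = energy(clones, candidate)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / t):
            panel, e_cur = candidate, e_new
            if e_cur < e_best:
                best, e_best = list(panel), e_cur
    return sorted(best), e_best
```

Given a list `clones` of 0/1 rows with 23 entries each, a call such as `anneal(clones, size=12)` returns the indices of the best panel encountered together with its energy. Replacing the acceptance test by `e_new <= e_cur` alone gives the random downhill algorithm, and accepting every exchange gives plain random sampling.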

To illustrate the above three algorithms for panel design we have amassed 189 different hybrid clones from several published articles [37, 11, 23, 25, 31, 24, 15, 39, 45, 44, 7, 8, 26, 16, 4, 30, 21, 1, 38, 5, 34, 35, 41]. Table 3 lists an appropriate identification code for each clone. We have omitted duplicate clones found in more than one reference and clones having ambiguous human chromosome complements. Table 4 presents the best panels resulting from the application of the three algorithms. The total number of iterations for each panel size ranged from 55,000 to 75,000 and was determined by the convergence criterion for simulated annealing. See [28] for a detailed description of how simulated annealing is implemented. Listed in Table 4 are the panel composition, the minimum Hamming distance, and the number of pairs of columns attaining this distance. Table 4 makes it clear that random sampling is not a contender for panel design. The best panels are relatively rare and can only be identified by some type of directed search. Both the random downhill algorithm and simulated annealing produce excellent panels. Simulated annealing performs better, particularly for panel sizes n = 12, 17, and 19. In these three cases the best simulated annealing panels have a higher minimum Hamming distance than the best random downhill panels. As expected, random downhill typically achieved its best panels in relatively few iterations, whereas simulated annealing often attained its best panels in the final stages of simulation.

It is interesting to contrast Tables 2 and 4. For instance, under simulated annealing d = 6 is first reached for the panel size n = 17. The minimum n possible for this d is 13. The average n when panels are randomly generated is 25.3, with a standard deviation of 2.6. In other words, by pooling clones one is able to reach the level d = 6 much sooner than by assembling a sequence of panels from clones which are randomly generated one after another. Note that when d = 6, up to five assay errors can be detected and up to two can be corrected.
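The mean and standard deviation quoted for randomly generated panels can be approximately recovered with a short calculation. The Python sketch below rests on the near-independence approximation discussed earlier: all 253 pairwise column distances are treated as independent Binomial(n, 1/2) variables, so that P(N_d <= n) is approximated by P(X >= d) raised to the power 253. The function names and the truncation point `n_max` are assumptions made for illustration.

```python
from math import comb, sqrt


def cdf_all_pairs(n, d, c=23):
    """Approximate P(N_d <= n), treating the C(c, 2) distances as independent."""
    tail = sum(comb(n, k) for k in range(d, n + 1)) / 2 ** n
    return tail ** comb(c, 2)


def mean_and_sd(d, c=23, n_max=120):
    """Mean and standard deviation of N_d under the approximation above."""
    mean = 0.0
    second_moment = 0.0
    prev = 0.0
    for n in range(1, n_max + 1):
        cur = cdf_all_pairs(n, d, c)
        pmf = cur - prev            # P(N_d = n)
        mean += n * pmf
        second_moment += n * n * pmf
        prev = cur
    return mean, sqrt(second_moment - mean ** 2)


mean_6, sd_6 = mean_and_sd(6)
```

For d = 6 this approximation gives a mean close to 25.3 and a standard deviation close to 2.6, consistent with the values cited above.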

We have attempted to formalize some notions of redundancy, efficiency, error detection, and error correction for the somatic cell hybrid method. For all practical purposes, it is clear that as the number of clones in a panel increases, the chance of correctly mapping a given gene also increases. Yet it is hardly economical to use large randomly constructed panels when small purposely designed ones will suffice. Even in the context of purely randomly generated panels, Figure 2 demonstrates a phenomenon of diminishing returns in adding further clones to an existing panel of hybrids. It is our contention that current laboratory practice encourages the use of random panels with too many hybrid clones.

Little thought has been devoted to the engineering rather than intuitive construction of panels. (See [17] for a partial exception to this observation.) By applying some simple concepts from algebraic coding theory, rational construction of panels is feasible. We have focused on the minimum Hamming distance for a panel as a measure of its discriminatory power. Selection of nearly optimal panels by this criterion from existing collections of clones is practical using random sampling techniques and results in panels which are uniformly good for all chromosomes. The alternative of choosing good panels by visual inspection is not practical. Once again the method of simulated annealing has proved its versatility. Two other applications of simulated annealing in genetics appear in [13] and [40] . We conjecture that the combinatorial optimization problem of panel design is intrinsically hard in the precise technical sense of being NP-complete [12] . For problems of this category, like the traveling salesman problem, simulated annealing offers a practical, easy to implement approximate solution [3] .

Partially on the basis of this study, we recommend more collaboration and coordination among gene mapping groups so that good panels can be assembled, stored, and distributed. Besides being more efficient, small panels with the same information content as large panels can actually reduce the number of assay errors. More systematic design and distribution of panels will also enhance the proper cytogenetic characterization of clones within the best panels. In fact, the error detection and correction capabilities of a good panel permit careful monitoring of it for corrupted clones and clones experiencing continued chromosome loss. The panels in Table 4 are not meant to be definitive. Some of the clones represented may no longer be available or have stable chromosome complements. However, our techniques for achieving maximal panel redundancy with minimal panel size offer the opportunity to design and disseminate good panels regardless of exactly what clones are currently available. 

The structural lecithin:cholesterol acyl transferase (LCAT) maps to 16q22

Bounds for binary codes of length less than 25

The N-city travelling salesman problem: Statistical mechanics and the Metropolis algorithm

Mapping of the gene coding for the human GM2 activator protein to chromosome 5

The gene for human alpha-lactalbumin is assigned to chromosome 12q13

Somatic cell genetics and gene families

Assignment of the gene determining human carbonic anhydrase, CAI, to chromosome 8

The gene for human muscle specific carbonic anhydrase (CA III) is assigned to chromosome 8

Location of the genes coding for 18S and 28S ribosomal RNA in the human genome

The coding sequence for the 32,000-dalton pulmonary surfactant-associated protein A is located on chromosome 10 and identifies two separate restriction-fragment-length polymorphisms

Computers and intractability: A guide to the theory of NP-completeness

Mapping DNA by stochastic relaxation

Error-correcting codes

Chromosomal location of human genes encoding major heat-shock protein HSP70

The structural gene for aldolase B (ALDB) maps to 9q13 + 32

Somatic cell hybrid mapping panels

Somatic cell genetics and gene mapping

Optimization by simulated annealing

Linear codes

The human placental alkaline phosphatase gene and related sequences map to chromosome 2 band q37

The morbid anatomy of the human genome: A review of gene mapping in clinical medicine

Linkage disequilibrium of plasminogen polymorphisms and assignment of the gene to human chromosome 6q26-6q21

Human gastrin-releasing peptide gene is located on chromosome 18

cDNA cloning and mapping of the human creatine kinase M gene to 19q13

A cytochrome P-450 gene family mapped to human chromosome 19

Binary codes with specified minimum distance

Minimization or maximization of functions

Multivariate analysis

Localization of the oncogene c-erbA2 to human chromosome 3

Human genes encoding prothrombin and ceruloplasmin map to 11p11-q12 and 3q21-24, respectively

Hybrid cells and human genetics

Quantitative analysis of human chromosome segregation in man-mouse somatic cell hybrids

Coronavirus 229E susceptibility in man-mouse hybrids is located on human chromosome 15

Gene for glutathione S-transferase-1 (GST1) is on human chromosome 11

New family of single-error correcting codes

The gene for human liver arginase (ARG1) is assigned to chromosome band 6q23

Chromosomal assignment of the gene encoding the human tissue inhibitor of metalloproteinases to xpll.l-~11

Mapping of the gene coding for the human liver/bone/kidney isozyme of alkaline phosphatase to chromosome 1

Optimal computation of probability functions for pedigree analysis, IMA J. Math. Appl. Med. Biol. 3:167-178 (1986)

Table 4

Panel   Random        Random        Simulated     Simulated annealing
size    sampling      downhill      annealing     panel composition (b)
        energy (a)    energy (a)    energy (a)
5       (0,2)         (1,43)        (1,38)        {6,78,79,92,151}
6       (1,23)        (1,13)        (1,12)        {29,56,73,78,90,167}
7       (1,8)         (1,3)         (1,1)         {19,46,52,56,67,80,151}
8       (1,3)         (2,40)        (2,27)        {4,43,50,67,90,113,177,186}
9       (2,30)        (2,5)         (2,6)         {3,10,14,43,50,58,80,105,131}
10      (2,7)         (3,37)        (3,24)        {12,28,29,73,79,80,90,92,140,179}
11      (2,3)         (3,7)         (3,3)         {3,4,29,38,43,53,73,74,79,90,181}
12      (2,2)         (3,1)         (4,43)        {14,36,58,63,71,78,79,105,131,144,179,189}
13      (3,13)        (4,27)        (4,12)        {6,12,23,27,28,36,42,51,79,81,90,105,141}
14      ...           ...           ...           ...
15      ...           ...           ...           ...
16      ...           ...           ...           ...
17      ...           ...           ...           ...
18      ...           ...           ...           {14,19,31,38,49,50,53,59,78,80,90,91,104,106,116,154,160,165}
19      (5,10)        (6,2)         (7,32)        {7,10,21,28,56,58,63,74,78,79,104,105,106,131,144,167,174,179,181}
20      (5,5)         (7,7)         (7,17)        {4,21,22,28,29,31,36,42,59,61,78,101,113,119,128,131,140,146,159,179}

(a) The first number listed is the minimum Hamming distance for the column pairs, and the second number is the number of pairs attaining this distance.
(b) The numbers in braces correspond to the clone numbers in Table 3.