    w · ξ^(j) + w_0 ≥ d     if f^(j) = 1
    w · ξ^(j) + w_0 ≤ −d    if f^(j) = 0,    j = 1, ..., M,               (2.4)

where w · ξ^(j) is the inner product Σ_{i=1}^{N} w_i ξ_i^(j) and d > 0 is called the margin of the classifier.* The margin is provided to secure reliable operation even when a small deviation in w · ξ^(j) is caused by noise.

* Even if the value of the left-hand side of (2.4) does not exceed d (does not reach −d) for a vector of category A (B), the classifier is supposed to work properly as long as the value is positive (negative). This is why d is called the "margin". d is usually set to 1 without loss of generality.

In order to facilitate our computation based on linear programming, let each variable w_i be decomposed into non-negative variables as follows:

    w_i = w_i^+ − w_i^−    (i = 0, 1, 2, ..., N)
where                                                                     (2.5)
    w_i^+ ≥ 0,  w_i^− ≥ 0    (i = 0, 1, 2, ..., N).

Thus (2.4) is rewritten as

    Σ_{i=0}^{N} (w_i^+ − w_i^−) ξ_i^(j) ≥ d     if f^(j) = 1
    Σ_{i=0}^{N} (w_i^+ − w_i^−) ξ_i^(j) ≤ −d    if f^(j) = 0,  j = 1, ..., M,
where                                                                     (2.6)
    ξ_0^(j) = 1.

If (2.6) (i.e. (2.4)) is consistent, there are generally an infinite number of solutions (w, w_0). Henceforth we will consider only a solution which minimizes a linear objective function of the weight vector,

    u · w_± ,                                                             (2.7)

where u is a (2N + 2)-dimensional vector and

    w_± = (w_0^+, w_0^−, w_1^+, w_1^−, ..., w_N^+, w_N^−).                (2.8)

The coordinates of u will be determined in the following paragraphs, depending upon which parameters of the classifier we want to minimize. The minimization of the objective function (2.7) under the constraints (2.5) and (2.6) is a typical linear programming problem.

The objective function is expressed in a general form in (2.7). However, it can be correlated to the reliable operation of the classifier as follows. Assume that the deviation of each input ξ_i due to noise or other fluctuations in the circuit of the classifier is δ_i.
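The rewriting (2.5)-(2.6) is mechanical and easy to express in code. A minimal sketch (the function name and data layout are our own, not from the report): each pattern contributes one row over the split variables (w_0^+, w_0^−, w_1^+, w_1^−, ...), with the f = 0 rows negated so that every constraint reads "row · x ≥ d".

```python
import numpy as np

def split_constraint_rows(patterns, labels, d=1.0):
    """Build the system (2.6) over the 2(N+1) split variables
    (w0+, w0-, w1+, w1-, ..., wN+, wN-).  Rows for f = 0 are
    multiplied by -1 so every constraint has the form row . x >= d."""
    rows, rhs = [], []
    for xi, f in zip(patterns, labels):
        ext = np.concatenate(([1.0], xi))     # xi_0 = 1 carries w_0
        row = np.empty(2 * len(ext))
        row[0::2], row[1::2] = ext, -ext      # w_i = w_i+ - w_i-
        rows.append(row if f == 1 else -row)
        rhs.append(d)
    return np.array(rows), np.array(rhs)

A, b = split_constraint_rows([(1, 1), (0, 1)], [1, 0])
# First row: pattern (1,1) with f=1 gives
# (w0+ - w0-) + (w1+ - w1-) + (w2+ - w2-) >= 1.
print(A[0])  # [ 1. -1.  1. -1.  1. -1.]
```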
Then the actual value of the left side of (2.4) is

    Σ_{i=1}^{N} w_i (ξ_i^(j) + δ_i) + w_0 = Σ_{i=1}^{N} w_i ξ_i^(j) + Σ_{i=1}^{N} w_i δ_i + w_0 .

Now let δ be the maximum of |δ_i| over all i: |δ_i| ≤ δ, i = 1, 2, ..., N. Then we have

    | Σ_{i=1}^{N} w_i δ_i | ≤ δ Σ_{i=1}^{N} |w_i| .

Therefore if δ satisfies

    Σ_{i=1}^{N} w_i ξ_i^(j) − δ Σ_{i=1}^{N} |w_i| + w_0 > 0    if f^(j) = 1
                                                                          (2.9)
    Σ_{i=1}^{N} w_i ξ_i^(j) + δ Σ_{i=1}^{N} |w_i| + w_0 < 0    if f^(j) = 0,  j = 1, ..., M,

the linear classifier operates correctly* even when all δ_i have the maximum deviation +δ or −δ (i.e. |δ_i| = δ for all i). The maximum value of δ such that the classifier operates correctly is called the input tolerance. Denoting the input tolerance by γ, we have, by (2.9),

    γ = Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | / Σ_{i=1}^{N} |w_i| .     (2.10)

* See the footnote on page 3.

If we determine a solution (w, w_0) such that γ is maximized, the classifier is allowed the maximum deviation in its inputs, and therefore we have maximized the reliability of the operation of the classifier. We will prove that the maximization of (2.10) is equivalent to the minimization of Σ_{i=1}^{N} |w_i|.

First of all, if w is a weight vector which maximizes γ, then k·w also provides the same input tolerance, where k is any positive number such that k·w is still a solution of (2.4). Now we can assume

    Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = d                           (2.11)

without loss of generality. If Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = t > d, we can simply multiply by a certain positive number k = d/t < 1 in order to obtain (2.11), which of course still satisfies (2.4). Consequently the maximization of γ amounts to minimizing

    Σ_{i=1}^{N} |w_i|

under conditions (2.4) and (2.11). In this case, however, we can delete condition (2.11). This is because if we minimize Σ_{i=1}^{N} |w_i| under condition (2.4) alone and obtain Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = e > d, then d/e < 1 and

    w' = (d/e) · w

satisfies (2.4). For this new weight vector w',

    Σ_{i=1}^{N} |w_i'| < Σ_{i=1}^{N} |w_i|

holds, which is a contradiction. Therefore we must have e = d, i.e.
(2.11) is satisfied. An alternative proof is found in the literature.

By setting u_0^+ = u_0^− = 0 and u_i^+ = u_i^− = 1 for i = 1, 2, ..., N, the objective function (2.7) becomes

    Σ_{i=1}^{N} (w_i^+ + w_i^−),                                          (2.12)

which represents Σ_{i=1}^{N} |w_i|.* Solution of the linear program composed of (2.12), (2.5) and (2.6) will lead to the design of a linear classifier with the maximum reliability of operation.

* Obviously the minimization of (2.12) leads to the condition that either w_i^+ or w_i^− is always 0. Then Σ_{i=1}^{N} (w_i^+ + w_i^−) = Σ_{i=1}^{N} |w_i| follows.

Another case was discussed in earlier papers, where the values of ξ_i are limited to +1 and −1 instead of real numbers. If w_i ξ_i (i = 0, 1, 2, ..., N) is permitted to deviate as much as w_i ξ_i (1+δ) or w_i ξ_i (1−δ), where δ is now a percentage deviation, then the input tolerance of a majority element is

    (1/W) Min_j | Σ_{i=0}^{N} w_i ξ_i^(j) | ,                             (2.13)
where
    W = Σ_{i=0}^{N} |w_i| .

And it was proved that the minimization of W is equivalent to the maximization of the input tolerance, i.e. the maximization of the reliability of operation. For this case we set u_i^+ = u_i^− = 1, i = 0, 1, 2, ..., N, in (2.7).

Although the input tolerance is a typical objective to be optimized, if a certain parameter of the classifier can be represented in the form of (2.7), we can optimize other characteristics of the classifier rather than the input tolerance. In particular, when integer programming is used, a wider variety of objective functions may be available for our choice.

So far we have assumed that the linear classifier processes the set of distinct patterns {ξ^(1), ..., ξ^(M)}. In other words, only ξ^(1), ..., ξ^(M) are supplied repeatedly to the linear classifier as a time series ..., ξ^(t−1), ξ^(t), ξ^(t+1), ξ^(t+2), .... The structure of the linear classifier was optimized for the set {ξ^(1), ..., ξ^(M)}.
Let us consider the case where the time series of pattern vectors gradually comes to contain new pattern vectors for which the structure of the linear classifier is not optimum, while some of the existing pattern vectors are eliminated from the time series. A few different schemes are conceivable for deciding whether an incoming pattern vector is to be considered a new pattern vector to which the structure of the classifier should be optimized. These schemes will be discussed in Section 6. Here for the moment let us assume that the incoming pattern vector is found new by a certain scheme at some instant, and that the linear classifier is supposed to process the new set of distinct pattern vectors {ξ^(2), ..., ξ^(M+1)} instead of {ξ^(1), ..., ξ^(M)}. In other words, ξ^(1) is replaced by ξ^(M+1). Now the structure of the classifier is supposed to be optimized for {ξ^(2), ..., ξ^(M+1)}. The linear classifier should adapt itself to this new environment.

By changing the notation appropriately and multiplying the second inequality of (2.6) by −1, our linear program to minimize (2.12) under the constraints (2.5) and (2.6) is converted into

    minimize    c · x
    subject to  A x ≥ b                                                   (2.14)
                x ≥ 0,

where x is an n-dimensional vector of unknown variables to be determined and

    c = (c_1, ..., c_n),

    A = | a_11 ... a_1n |
        |  .         .  |                                                 (2.15)
        | a_m1 ... a_mn |

    b = (b_1, ..., b_m).

Note that A corresponds to the original given set of pattern vectors, b to the d's, and x to the original w; c represents the coefficients of the objective function.

Thus the adaptability problem of the classifier with the change of environment may be stated as follows: given an optimum solution for the linear program (2.14), find an optimum solution for the new linear program with the new coefficient row a^(m+1) replacing the oldest row a^(1), representing the change from ξ^(1) to ξ^(M+1); i.e.

    minimize    c · x
    subject to  A' x ≥ b'                                                 (2.16)
                x ≥ 0,
where
    A' = | a_21      ...  a_2n      |
         |  .               .       |                                     (2.17)
         | a_{m+1,1} ...  a_{m+1,n} |

    b' = (b_2, ..., b_{m+1}).
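Once stated as (2.14), the design problem can be handed to any LP solver. A sketch using scipy.optimize.linprog (our assumption; the report of course predates such libraries), on the small separable set of an AND gate, with the split-variable ordering (w_1^+, w_1^−, w_2^+, w_2^−, w_0^+, w_0^−) and objective |w_1| + |w_2|:

```python
import numpy as np
from scipy.optimize import linprog

# Patterns of a two-input AND gate: only (1,1) yields f = 1.
patterns = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels   = [1, 0, 0, 0]
d = 1.0

# One row per pattern over the split variables (w1+, w1-, w2+, w2-, w0+, w0-);
# f = 0 rows are negated so every constraint reads row . x >= d.
rows = []
for (x1, x2), f in zip(patterns, labels):
    ext = np.array([x1, -x1, x2, -x2, 1.0, -1.0])
    rows.append(ext if f == 1 else -ext)
A = np.array(rows)

# Objective (2.12): |w1| + |w2| = w1+ + w1- + w2+ + w2-  (w0 carries no cost).
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)

# linprog uses A_ub x <= b_ub, so negate the ">= d" system.
res = linprog(c, A_ub=-A, b_ub=-d * np.ones(len(A)), bounds=(0, None))
w1, w2, w0 = res.x[0] - res.x[1], res.x[2] - res.x[3], res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum here is w1 = 2, w2 = 2, w0 = -3
```

The AND constraints force w_1 ≥ 2 and w_2 ≥ 2 (each f = 1 row combined with an f = 0 row), so the minimum of |w_1| + |w_2| is 4, attained only at w = (2, 2) with w_0 = −3.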
The effectiveness of the methods which we will discuss depends on the validity of the assumption that an optimum solution of (2.16) does not differ "too much" from the one of (2.14).

3. Outline of the Simplex Method

We assume that readers are familiar with the basic properties of linear inequalities and the simplex method, which is a computational procedure to solve a linear programming problem. However, let us sketch important concepts which will be used often in the rest of this paper.

3.1 Duality Theorem of a Linear Program

Consider the following linear program:

    maximize    b · v
    subject to  A^T v ≤ c                                                 (3.1.1)
                v ≥ 0,

where v is an m-dimensional vector of unknown variables. This linear program is called the dual of (2.14). The coefficient matrix is the transpose of A in (2.14), and b and c are interchanged. It is known that when we solve either (3.1.1) or (2.14) by the simplex method, we will find either optimum solutions to both or infeasibility of the problems (one of them possibly unbounded). Therefore we can solve whichever of (3.1.1) and (2.14) is more convenient.

3.2 Simplex Method

Let us sketch the simplex method. See the literature, [8] and [9], for details. In this paper we will work on the dual problem (3.1.1) rather than the primal problem (2.14), because (3.1.1) in our case is more advantageous than (2.14) in several respects, as will be seen later. We reformulate (3.1.1) in the following form by introducing so-called slack variables s_1, s_2, ..., s_n:

    maximize    b · v
    subject to  A^T v + I s = c                                           (3.2.1)
                v ≥ 0,  s ≥ 0,

where I is the n × n unit matrix and s = (s_1, ..., s_n). The simplex method is a systematic procedure to choose a sequence of sets of n basic variables (only basic variables can assume non-zero values; non-basic variables are assigned zero) out of the m+n variables v_1, ..., v_m, s_1, ..., s_n, until we obtain an optimal solution consisting of basic variables which maximizes b · v.
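The duality relation of Section 3.1 is easy to check numerically. A minimal sketch (the 2 × 2 data are illustrative, not from the report), solving the primal (2.14) and its dual (3.1.1) side by side with scipy.optimize.linprog and comparing the optimal values:

```python
import numpy as np
from scipy.optimize import linprog

# Primal (2.14): minimize c.x  subject to  A x >= b, x >= 0.
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 5.0])
c = np.array([6.0, 7.0])
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))

# Dual (3.1.1): maximize b.v  subject to  A^T v <= c, v >= 0
# (linprog minimizes, so we minimize -b.v and negate the result).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(0, None))

print(primal.fun, -dual.fun)  # the two optimal values coincide
```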
A solution which satisfies the constraints of (3.2.1) but is not necessarily optimal is called a feasible solution. When c_1, c_2, ..., c_n ≥ 0, we may choose

    v_1 = v_2 = ... = v_m = 0
    s_i = c_i    (i = 1, 2, ..., n)                                       (3.2.2)

as an initial feasible solution. Let us assume that u of (2.7) is non-negative, as seen in (2.12) for example. Then c_1, ..., c_n are all non-negative, as required in (3.2.2).

The simplex method is most conveniently described by using a tableau representation. The first simplex tableau is shown in Fig. 3.2.1, whose elements are the column vectors p and q, the row vector r, and the matrix H. Here

    q = (q_1, ..., q_n)^T

is the vector of values of the basic variables in the current tableau (therefore q ≥ 0 means that the solution q is feasible), and

    p = (p_1, ..., p_n)

denotes the coefficients of the basic variables in the objective function (i.e. the b_j's of b in (3.2.1) corresponding to the basic variables in q).

[Fig. 3.2.1 Simplex Tableau]

H may be divided into two parts,

    H = [h_1, h_2, ..., h_{m+n}] = [D, B^{-1}],                           (3.2.3)

where h_i for 1 ≤ i ≤ m is the column associated with the variable v_i, and h_i for m+1 ≤ i ≤ m+n is the column associated with the slack variable s_{i−m}. The row r = (r_0, r_1, ..., r_{m+n}) can be obtained by the following:

    r_0 = p · q
    r_i = p · h_i − b_i    (i = 1, 2, ..., m+n)                           (3.2.4)

where b_{m+1}, ..., b_{m+n} are set to 0 (see (3.2.1)). If we start with the initial feasible solution (3.2.2), the initial tableau consists of

    q = c
    D = A^T
    B^{-1} = I (the unit matrix)                                          (3.2.5)
    r_0 = 0
    r_i = −b_i    (i = 1, 2, ..., m+n).

At each simplex tableau, there are three possibilities:

(1) r ≥ 0.
(2) Existence of i such that r_i < 0 and h_i ≤ 0.
(3) Neither (1) nor (2).

Case (1) means that the feasible solution at the current tableau is optimal, and case (2) means that the linear program has an unbounded solution (i.e. the infeasibility of the primal problem (2.14)).
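The initial tableau (3.2.5) is simple to materialize. A sketch (the function name and array layout are our own): H = [A^T | I], q = c, p = 0 (the objective coefficients of the all-slack basis are zero, so r_i = p · h_i − b_i reduces to −b_i):

```python
import numpy as np

def initial_tableau(A, b, c):
    """Initial tableau (3.2.5) for the dual problem (3.2.1).
    The slack variables form the first basis, so p = 0 and q = c."""
    m, n = A.shape
    H = np.hstack([A.T, np.eye(n)])      # D = A^T, B^{-1} = I
    q = np.array(c, dtype=float)         # values of the basic variables
    p = np.zeros(n)                      # objective coeffs of the slacks
    r = np.concatenate(([0.0], -np.array(b, float), np.zeros(n)))
    return p, q, H, r

A = np.array([[1.0, 2.0], [3.0, 1.0]])
p, q, H, r = initial_tableau(A, b=[4, 5], c=[6, 7])
print(H)  # columns: v1, v2, then the 2x2 unit matrix for the slacks
print(r)  # [ 0. -4. -5.  0.  0.]
```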
Whenever case (3) is encountered, the so-called pivot operation is applied to the current tableau and the entries are transformed according to the simplex rule, deriving the next tableau. Then we examine which of the above three possibilities holds in the new tableau. With repeated derivation of tableaux we will eventually reach case (1) or (2). The pivot operation is essentially a process of replacing one basic variable (i.e. a column of the current tableau) by a new basic variable. See the literature [8] and [9] for the details of the derivation of simplex tableaux.

If the simplex method ends up with case (1), an optimal solution of (3.1.1) will be in column q of the last tableau, and an optimal solution of (2.14) will be (r_{m+1}, ..., r_{m+n}), whose coordinates are the values of x_1, ..., x_n, respectively.

Note that the following relations hold for every simplex tableau:

    q = B^{-1} c
    h_i = B^{-1} ā_i    (i = 1, 2, ..., m+n)                              (3.2.6)

where ā_i is the i-th column of [A^T I] of (3.2.1). These relations will be used later.

4. Adaptation by the Simplex Method

When a classifier characterized by the linear program (2.14) adapts itself to a new environment, with an old pattern vector replaced by a new one, it is characterized by the new linear program (2.16). Let us assume for a while that the linear programs (2.14) and (2.16) are both feasible. The case where they are infeasible will be considered at the end of this section and also in the next section.

Now suppose that we have solved the linear program (2.14) by using the dual formulation (3.1.1) and have obtained an optimum solution in the last simplex tableau, as shown in Fig. 4.1. The adaptation is essentially an efficient method for deriving an initial tableau of the new problem (2.16) (strictly speaking, the dual formulation of (2.16)) and applying the pivot operations until a new optimal solution is obtained.

[Fig. 4.1 Last Simplex Tableau]
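Both the simplex iterations of Section 3 and the adaptation steps below repeatedly invoke the pivot operation, which is ordinary Gauss-Jordan elimination on a chosen pivot element. A minimal sketch (the row/column conventions are our own):

```python
import numpy as np

def pivot(T, i, j):
    """Pivot the tableau T on element (i, j): normalize row i so the
    pivot becomes 1, then eliminate column j from every other row.
    Afterwards column j is a unit vector, i.e. the variable of
    column j has entered the basis in row i."""
    T = T.astype(float).copy()
    T[i] /= T[i, j]
    for k in range(T.shape[0]):
        if k != i:
            T[k] -= T[k, j] * T[i]
    return T

T = np.array([[2.0, 1.0, 4.0],
              [1.0, 3.0, 5.0]])
T2 = pivot(T, 0, 0)
print(T2[:, 0])  # [1. 0.] : column 0 is now a unit vector
```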
The entries of this last tableau will be denoted by p(F), q(F), H(F) and r(F). This type of problem has already been studied and is included in the subject of parametric programming. In this paper, however, we propose a new procedure rather than using the standard technique of parametric programming, because (1) the pattern classification problem is suitable for the dual formulation with respect to the signs of the coefficients (no need of artificial variables), (2) the subsequent procedure for adaptation is simpler and more straightforward (the dual simplex method is not needed), and (3) our formulation is easily extendable to the integer linear programming formulation which will be discussed in Section 5.

This initial tableau is expected to yield faster convergence than the ordinary initial tableau given in (3.2.5). In order to obtain this new initial tableau, we have to (1) eliminate the effect of the inequality

    a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≥ b_1 ,                          (4.1)

which corresponds to the pattern vector to be eliminated, and (2) introduce a new inequality

    a_{m+1,1} x_1 + a_{m+1,2} x_2 + ... + a_{m+1,n} x_n ≥ b_{m+1} ,       (4.2)

which corresponds to the new pattern vector, without increasing the size of the simplex tableau.

The first step, corresponding to (1), is to set b_1 equal to −S, where S is a sufficiently large positive number such that (4.1) is satisfied by all feasible solutions of the remaining inequalities. This means that (4.1) is now non-restrictive. Let us recall that the current linear program is treated by the dual formulation, and therefore the inequality (4.1) is associated with the first column h_1(F) of H(F) in the last simplex tableau. The change of b_1 causes a change of the entry of p(F) which corresponds to the variable v_1 (or h_1(F)) if v_1 is in the basis, but it causes no change if h_1(F) is not in the basis. The entries of r(F) are accordingly recalculated by (3.2.4). After this, simply delete the first column h_1(F) and r_1.
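The two steps just described reduce to a handful of matrix-vector products via (3.2.4) and (3.2.6). A sketch (array names and the toy data are our own; r here holds r_1, ..., r_{m+n} with r_0 omitted): make the old inequality non-restrictive by setting its b entry to −S, then overwrite the freed column with B^{-1} ā_new and its recomputed r entry.

```python
import numpy as np

def replace_column(H, p, r, b, col, a_new, b_new, S=1e6):
    """Replace column `col` of the tableau (the old pattern vector)
    by the column of a new inequality  a_new . x >= b_new.
    B^{-1} is read off the slack part of H, as in (3.2.3)."""
    H, r, b = H.copy(), r.copy(), b.copy()
    b[col] = -S                       # step 1: old inequality non-restrictive
    n = H.shape[0]
    B_inv = H[:, -n:]                 # slack columns hold B^{-1}
    H[:, col] = B_inv @ a_new         # h = B^{-1} a_new
    r[col] = p @ H[:, col] - b_new    # r = p . h - b_new, by (3.2.4)
    return H, r, b

# Toy 2x4 tableau whose slack part is still the unit matrix.
H = np.array([[1.0, 2.0, 1.0, 0.0],
              [3.0, 1.0, 0.0, 1.0]])
p = np.array([4.0, 5.0])
r = np.zeros(4)
b = np.array([1.0, 1.0])
H2, r2, b2 = replace_column(H, p, r, b, col=0,
                            a_new=np.array([1.0, -1.0]), b_new=1.0)
print(H2[:, 0], r2[0])  # column (1, -1); r = 4*1 + 5*(-1) - 1 = -2
```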
The deletion is permissible because h_1(F) henceforth will never enter the basis (since r_1 will not be negative).

The second step, corresponding to (2) above, is to introduce the new inequality (4.2) into the column where (4.1) was eliminated. The new entries can be obtained by (3.2.4) and (3.2.6), i.e.

    h = B^{-1} ā_{m+1} ,                                                  (4.3)

where ā_{m+1} = (a_{m+1,1}, a_{m+1,2}, ..., a_{m+1,n})^T, and

    r = p · h − b_{m+1} .                                                 (4.4)

Note that q was not changed in the above two steps, and therefore the new tableau still has a feasible solution (although it may not be optimal). Thus this new tableau can be used as an initial tableau for the dual of the problem (2.16). If all the entries of r are non-negative, the old solution is optimal for the new problem also. But if r contains some negative entries, apply the pivot operations until an optimal solution is derived for the new problem.

Although we discussed the procedure only for the case in which one inequality is replaced, extension to the case of more than one inequality is obvious. The procedure will need less computation time if the coordinates of the new solution do not deviate "too much" from those of the old solution, since the new pivot operations start from a solution "closer" to the new problem than the ordinary initial solution of (3.2.5).

Example.

[Fig. 4.2 Linear Classifier]

Let us pursue the adaptation of the linear classifier f(x_1, x_2) shown in Fig. 4.2 when the pattern vectors change as follows, from (4.5) through (4.7):

    (i)   (x_1, x_2):  (11), (10) yield f = 1
                       (01) yields f = 0                                  (4.5)

    (ii)  (x_1, x_2):  (11), (00) yield f = 1
                       (01) yields f = 0                                  (4.6)

    (iii) (x_1, x_2):  (11) yields f = 1
                       (01), (00) yield f = 0.                            (4.7)

Corresponding to these sets of pattern vectors we have the following sets of inequalities by (2.4):

    (i)   w_1 + w_2 + w_0 ≥ 1
          w_1 + w_0 ≥ 1                                                   (4.8)
          w_2 + w_0 ≤ −1

    (ii)  w_1 + w_2 + w_0 ≥ 1
          w_0 ≥ 1                                                         (4.9)
          w_2 + w_0 ≤ −1

    (iii) w_1 + w_2 + w_0 ≥ 1
          w_2 + w_0 ≤ −1                                                  (4.10)
          w_0 ≤ −1

Let us split these variables w_i
as seen in (2.5), in order to keep the non-negativeness of the variables in the simplex method. The objective function to be minimized is now

    |w_1| + |w_2| = w_1^+ + w_1^− + w_2^+ + w_2^− .                       (4.11)

Renaming these split variables and changing the direction of some inequalities, the above sets of inequalities may be rewritten as follows, corresponding to (2.14):*

    (i)   x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          x_1 − x_2 + x_5 − x_6 ≥ 1                v_2                    (4.12)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

    (ii)  x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          x_5 − x_6 ≥ 1                            v_4                    (4.13)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

    (iii) x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          − x_5 + x_6 ≥ 1                          v_5                    (4.14)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

where each v_j is shown simply for identification of the corresponding inequality. The objective function in the renamed variables is

    x_1 + x_2 + x_3 + x_4 .                                               (4.15)

* d in (2.4) is set to 1 for simplicity.

Note that the following relations hold among the original and renamed variables:

    x_1 − x_2 = w_1^+ − w_1^− = w_1
    x_3 − x_4 = w_2^+ − w_2^− = w_2                                       (4.16)
    x_5 − x_6 = w_0^+ − w_0^− = w_0

Assume that our classifier has the first set of pattern vectors (i). An initial tableau for the dual of this linear program is derived by (3.2.5). Table 4.1 shows this initial tableau. After applying the pivot operation three times (Tables 4.2, 4.3 and 4.4), an optimal solution results, since Table 4.4 contains no negative entry in r. The solution is obtained in (r_4, ..., r_9), i.e.

    x_1 = 2,  x_6 = 1
    x_2 = x_3 = x_4 = x_5 = 0,                                            (4.17)

which implies

    w_1 = 2,  w_2 = 0,  w_0 = −1.                                         (4.18)

Then assume that the first set of pattern vectors is changed to the second set (ii).
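The solution (4.17)-(4.18) can be cross-checked by feeding the system (4.12) with objective (4.15) to an off-the-shelf LP solver (scipy here, our stand-in for the hand-pivoted tableaux of the report):

```python
import numpy as np
from scipy.optimize import linprog

# System (4.12) over x1..x6, one row per inequality v1, v2, v3:
A = np.array([[1, -1,  1, -1,  1, -1],   # v1: pattern (11), f = 1
              [1, -1,  0,  0,  1, -1],   # v2: pattern (10), f = 1
              [0,  0, -1,  1, -1,  1]],  # v3: pattern (01), f = 0
             dtype=float)
b = np.ones(3)
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)   # objective (4.15)

res = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))
w1 = res.x[0] - res.x[1]
w2 = res.x[2] - res.x[3]
w0 = res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum w1 = 2, w2 = 0, w0 = -1 of (4.18)
```

The optimum is unique: v2 and v3 together force w_1 − w_2 ≥ 2, so |w_1| + |w_2| ≥ 2, with equality only at (4.18).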
According to the procedure discussed, eliminate the inequality which corresponds to the old pattern vector to be replaced,

    x_1 − x_2 + x_5 − x_6 ≥ 1    v_2 ,                                    (4.19)

and instead introduce the inequality

    x_5 − x_6 ≥ 1    v_4 .                                                (4.20)

[Table 4.1 Initial Tableau for the First Set of Pattern Vectors]
[Table 4.2 Intermediate Tableau for the First Set of Pattern Vectors]
[Table 4.3 Intermediate Tableau for the First Set of Pattern Vectors]
[Table 4.4 Optimal Tableau for the First Set of Pattern Vectors]
[Table 4.5 Elimination of the Old Vector v_2]

Replace the 1 in b (of v_2) by a sufficiently small number, say −100. Since v_2 is not in the basis in Table 4.4, the replacement does not cause any change of any entry except r_2 of Table 4.4. Table 4.5 shows the resultant tableau. Next delete h_2 and r_2 from Table 4.5 and fill in the new entries which are derived from the new inequality (4.20) according to (4.3) and (4.4). Note that B^{-1} is [h_4, h_5, ..., h_9] in this case. The result is shown in Table 4.6. Since there is a negative entry in r, apply the pivot operation. After two applications, a new optimal solution is derived, as shown in Table 4.7. It is

    x_1 = 2,  x_4 = 2,  x_5 = 1
    x_2 = x_3 = x_6 = 0,                                                  (4.21)

which leads to

    w_1 = 2,  w_2 = −2,  w_0 = 1.                                         (4.22)

[Table 4.6 Initial Tableau for the Second Set of Pattern Vectors]
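The re-optimized structure (4.21)-(4.22) can likewise be cross-checked by solving the replaced system (4.13) directly (scipy again, as our substitute for the tableau manipulations):

```python
import numpy as np
from scipy.optimize import linprog

# System (4.13) over x1..x6, one row per inequality v1, v4, v3:
A = np.array([[1, -1,  1, -1,  1, -1],   # v1: pattern (11), f = 1
              [0,  0,  0,  0,  1, -1],   # v4: pattern (00), f = 1
              [0,  0, -1,  1, -1,  1]],  # v3: pattern (01), f = 0
             dtype=float)
b = np.ones(3)
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)   # objective (4.15)

res = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))
w1 = res.x[0] - res.x[1]
w2 = res.x[2] - res.x[3]
w0 = res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum w1 = 2, w2 = -2, w0 = 1 of (4.22)
```

Here v4 forces w_0 ≥ 1, v3 then forces w_2 ≤ −2, and v1 forces w_1 ≥ 2, so the minimum of |w_1| + |w_2| is 4, attained only at (4.22).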
[Table 4.7 Optimal Tableau for the Second Set of Pattern Vectors]

The change from the second set of pattern vectors (ii) to the third set (iii) can be treated in a similar manner. In this case the p column is changed, because v_4 is in the basis, as seen in Table 4.8. Table 4.9 is then the initial tableau for the third set (iii). An optimal solution is obtained in Table 4.10 after two pivot operations. It is

    w_1 = 2,  w_2 = 0,  w_0 = −1.                                         (4.23)

From (4.18), (4.22) and (4.23) we can see that the structure of the linear classifier has changed as shown in Fig. 4.3.

[Fig. 4.3 Adaptation of Linear Classifier]
[Table 4.8 Elimination of the Old Vector v_4]
[Table 4.9 Initial Tableau for the Third Set of Pattern Vectors]
[Table 4.10 Optimal Tableau for the Third Set of Pattern Vectors]

Now let us consider the case in which the given set of pattern vectors is not separable, i.e., (2.14) (or (2.16)) is not feasible. By adding an artificial variable to each inequality of (2.14), (2.14) can be rewritten as

    a_i1 x_1 + ... + a_in x_n + t_i ≥ b_i
    t_i ≥ 0    (i = 1, 2, ..., m)                                         (4.24)
    x_j ≥ 0    (j = 1, 2, ..., n).

Obviously (4.24) is feasible, because the set of inequalities is satisfied by a sufficiently large positive value of each t_i. However, the positiveness of t_i does not mean that the corresponding i-th inequality of (2.14) is satisfied. In order to circumvent this fact, let us use the following new objective function:

    V Σ_{i=1}^{m} t_i + c · x ,                                           (4.25)

where V is some large positive number. Roughly speaking, the minimization of this objective function means that it first minimizes Σ_{i=1}^{m} t_i ,
which will tend to minimize the number of incorrectly separated vectors, and that it next minimizes c · x, the original objective function.* The set of inequalities (4.24) with the objective function (4.25) can be solved in a similar manner as in the case of (2.14), since (4.24) is now feasible. Although the number of artificial variables would increase indefinitely if the adaptation were repeated infinitely, the increase of storage space can be avoided by replacing the old artificial variable to be eliminated by the new artificial variable.

* Ideally, Σ_{i=1}^{m} t_i is to be minimized no matter what value c · x assumes. However, this is not achieved by (4.25), because the t_i's are continuous variables. In other words, the value of Σ_{i=1}^{m} t_i when (4.25) is minimized depends upon the relative size of V and c · x, and may not be minimized. Note that if the t_i's are discrete variables, Σ_{i=1}^{m} t_i is minimized, as will be discussed in Section 5.

The objective function (4.25), however, does not always lead to the minimum number of erroneously separated pattern vectors. As will be discussed in the next section, the exact minimization of the number of incorrectly separated vectors can be obtained by using integer linear programming.

5. Formulation and Adaptation by Integer Linear Programming

Given a non-separable set of pattern vectors, one realization of a linear classifier is one which minimizes the number of incorrectly separated vectors. This can be achieved by the following integer linear programming approach. Corresponding to each input vector ξ^(j) of (2.4),

    w · ξ^(j) + w_0 ≥ d     if f(ξ^(j)) = 1                               (5.1A)
or
    −w · ξ^(j) − w_0 ≥ d    if f(ξ^(j)) = 0,                              (5.1B)

formulate two inequalities for each of (5.1A) and (5.1B) as follows:

    ξ^(j) · w + w_0 + U P_j ≥ d                                           (5.2A)
    ξ^(j) · w + w_0 − U(1 − P_j) ≤ −d       if f(ξ^(j)) = 1               (5.3A)
or
    −ξ^(j) · w − w_0 + U P_j ≥ d                                          (5.2B)
    −ξ^(j) · w − w_0 − U(1 − P_j) ≤ −d      if f(ξ^(j)) = 0,              (5.3B)

where P_j
is a variable which assumes 1 or 0, and where U is a sufficiently large positive number, to insure the following property. When P_j = 0, (5.2A and B) are obviously identical to (5.1A and B) respectively, and (5.3A and B) are satisfied by any value of ξ^(j). This means (5.1A or B) is satisfied by w, and accordingly the pattern vector is classified correctly. On the other hand, when P_j = 1, (5.2A and B) are satisfied by any value of ξ^(j), and (5.3A and B) become

    w · ξ^(j) + w_0 ≤ −d      if f = 1                                    (5.4A)
or
    −w · ξ^(j) − w_0 ≤ −d     if f = 0.                                   (5.4B)

The pattern vector is then separated incorrectly. Therefore

    Σ_{j=1}^{m} P_j                                                       (5.5)

shows the number of incorrectly separated pattern vectors.

Now consider the objective function:

    minimize  V Σ_{j=1}^{m} P_j + u · w_± ,                               (5.6)

where u is the vector in (2.7) and where V is a sufficiently large positive number (i.e. V > Max u · w_±). Different from the objective function (4.25) in Section 4, the minimization of this objective function means the minimization of both Σ P_j and u · w_±, because Σ P_j assumes discrete values only. This is the property which we desired.* Of course, we can consider other objective functions if they are needed. The incorporation of integer variables extensively widens the concept of objective functions. For example, it is possible to minimize the number of non-zero weights, i.e., the number of inputs actually needed, by reformulating the integer linear program with additional integer variables and changing the objective function.**

* A weighted sum Σ e_j P_j may be used when there is preference among pattern vectors.
** This formulation is due to F. Chen.

This new linear program is a mixed-integer linear programming problem, because the P_j must be integral. There are several known methods to solve this type of problem. Among them, we will use Gomory's all-integer linear programming method, by assuming that the other variables are also integers, without loss of generality in the
sense that both formulations will yield the same minimum number of incorrectly separated pattern vectors. This integral condition is adopted because his method is very similar to the simplex method which was applied to the dual problem discussed in the earlier sections. The adaptation process will accordingly be analogous to the case of those earlier sections. In order to apply Gomory's method we need additional constraints,

    P_j ≤ 1 ,    j = 1, 2, ..., m.                                        (5.7)

For notational convenience, let us henceforth denote our new problem of (5.2), (5.3) and (5.7), together with the objective function (5.6), by:

    minimize    c* · x
    subject to  A* x ≥ b*                                                 (5.8)
                x ≥ 0 and integer,

where x is an n'-dimensional vector of unknown variables to be determined and

    c* = (c*_1, ..., c*_{n'}),

    A* = | a*_11   ...  a*_1n'  |
         |  .             .     |                                         (5.9)
         | a*_m'1  ...  a*_m'n' |

    b* = (b*_1, ..., b*_{m'}).

In our case, m' is equal to 3M, and n' is the number of the variables x_1, ..., x_n, P_1, ..., P_m, which is 2(N + 1) + 3M.

Before discussing our adaptation process, let us outline Gomory's method. A rigorous description and proofs are found in [10]. The method is very similar to the ordinary simplex method applied to the dual formulation. A simplex tableau of the method is shown in Fig. 5.1. Each column of H corresponds to an inequality of (5.8). (Slack variables are taken into account; see (3.2.1).) However, when some entries of the row r are negative, we pick a certain column according to a rule stated in [10], and form a new column called a "cut" from that column. The cut is the column (h_{m'+n'+1}, r_{m'+n'+1}) in Fig. 5.1. The procedure to derive the cut is also discussed in [10]. Then the ordinary pivot operation is performed using this cut. This process is repeated until we obtain all non-negative entries in r, or find at least one column, say i, such that r_i < 0 and h_i ≤ 0 (infeasibility of (5.8)). Thus, only cuts can enter the basis.

[Fig. 5.1 Tableau for Integer Linear Programming]
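The mixed-integer formulation (5.2)-(5.7) can be checked with a modern MILP solver. A sketch using scipy.optimize.milp (a branch-and-bound solver, our substitute for Gomory's all-integer algorithm) on the classic non-separable XOR set, minimizing only the error count Σ P_j (the secondary term u · w_± of (5.6) is omitted for brevity; U = 100 and d = 1 are illustrative choices):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# XOR: no linear classifier separates all four vectors.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels   = [0, 1, 1, 0]
d, U = 1.0, 100.0

# Variables: w1, w2, w0 (continuous, unrestricted), P1..P4 (binary).
# For each pattern, (5.2) and (5.3) combine into the two-sided row
#     d <= s_j (w . xi + w0) + U P_j <= U - d,
# with s_j = +1 if f = 1 and s_j = -1 if f = 0.
rows = []
for k, ((x1, x2), f) in enumerate(zip(patterns, labels)):
    s = 1.0 if f == 1 else -1.0
    row = np.zeros(7)
    row[:3] = s * np.array([x1, x2, 1.0])
    row[3 + k] = U
    rows.append(row)
con = LinearConstraint(np.array(rows), lb=d, ub=U - d)

# Minimize the misclassification count sum P_j of (5.5).
c = np.array([0, 0, 0, 1, 1, 1, 1], dtype=float)
res = milp(c, constraints=con, integrality=[0, 0, 0, 1, 1, 1, 1],
           bounds=Bounds([-np.inf] * 3 + [0] * 4, [np.inf] * 3 + [1] * 4))
print(int(round(res.x[3:].sum())))  # 1: exactly one vector misclassified
```

Since XOR is not separable, Σ P_j = 0 is infeasible, while misclassifying a single vector (e.g. (1,1)) is achievable, so the optimum is 1.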
The initial tableau can be the one shown in (3.2.5). Although Gomory did not include the column p in his method, p is necessary for our adaptation procedure. In the simplex method of Section 3, the column p had the following meaning: suppose p_i corresponds to column h_j, which was transformed from the inequality

    a_j1 x_1 + ... + a_jn x_n ≥ b_j ;

then p_i = b_j holds. The column p in Fig. 5.1 of integer linear programming also has a similar meaning, except that, since each basis column is derived through a cut rather than through an inequality as in the simplex method, p_i corresponds to the right-hand side of the cut through which p_i was introduced.

p can be calculated by (3.2.4) each time a cut is introduced into the basis. Assume that a cut is to be introduced into the i-th row by the pivot operation; then the equation from (3.2.4),

    r_{m'+n'+1} = p · h_{m'+n'+1} − b*_{m'+n'+1} ,

is rewritten as

    b*_{m'+n'+1} = p · h_{m'+n'+1} − r_{m'+n'+1} ,                        (5.10)

where only b*_{m'+n'+1} is unknown, because all the other entries of the (m'+n'+1)-th column are obtained by Gomory's algorithm [10]. Now the new p_i which will be used in the next step is

    p_i = b*_{m'+n'+1} .                                                  (5.11)

The other entries of p do not change. Applying the pivot operation until all the entries in r become non-negative, an optimal integer solution of (5.8) is found in (r_{m'+1}, ..., r_{m'+n'}), as in the case of the ordinary simplex method.

Suppose that we have obtained an optimal solution for the problem (5.8). Now let us eliminate the effect of the oldest pattern vector and introduce a new pattern vector. Let us express the oldest pattern vector by

    a_11 x_1 + ... + a_1n x_n + U P_1 ≥ b_1                               (5.12)
    a_11 x_1 + ... + a_1n x_n − U(1 − P_1) ≤ b̄_1 ,                        (5.13)
or, by renaming the variable coefficients and the constants and by changing the direction of the second inequality,

    a*_11 x_1 + ... + a*_1n' x_n' ≥ b*_1                                  (5.14)
    a*_21 x_1 + ... + a*_2n' x_n' ≥ b*_2 .                                (5.15)

Similarly, let us denote the new pattern vector by

    a*_{m'+1,1} x_1 + ... + a*_{m'+1,n'} x_n' ≥ b*_{m'+1}                 (5.16)
    a*_{m'+2,1} x_1 + ... + a*_{m'+2,n'} x_n' ≥ b*_{m'+2} ,               (5.17)

which correspond to (5.2) or (5.3) for the given pattern vector.

The elimination of the oldest pattern vector can be done in a similar way as before: replace b*_1 and b*_2 by −S, where S is a sufficiently large positive number. However, this affects the tableau only through the cuts which are derived from the inequalities (5.14) and (5.15). (The classifier must have memory space for this information.) In other words, this means replacement by −S of the entries of p which correspond to the cuts which were derived from (5.14) and (5.15), so that the cuts become non-restrictive. More than one entry or no entry may exist, depending on what the current basis is.

The next step is to delete the columns h_1, h_2 and the entries r_1, r_2 which correspond to (5.14) and (5.15), and then to calculate the new columns h_1, h_2 from the new inequalities (5.16) and (5.17) by using the relation (3.2.6). Of course, B^{-1} = [h_{m'+1}, ..., h_{m'+n'}]. Note that the variable P_{m+1}, which is implicit in (5.16) and (5.17), should be replaced by P_1 of the oldest pattern vector in order to prevent an increase of the number of variables. Finally, the new row r can be obtained by (3.2.4). (Here we need p, which was obtained by (5.11).)

If there are negative entries in the new row r, Gomory's pivot operation is repeated, as discussed before, until we obtain an optimal solution. This completes our adaptation procedure. (Note that the condition r_i < 0 and h_i ≤ 0 will not be reached, since our problem is always feasible.) The whole process is repeated whenever the pattern vectors change.

6.
6. Entire Scheme of Adaptation and New Pattern Vector Identification

The adaptation procedure of a linear classifier which we have discussed in the previous sections is summarized as follows. Given an optimal structure for the set of distinct pattern vectors {ξ^(1), ξ^(2), ..., ξ^(M)}, the classifier readjusts itself so that its structure is optimal for the new set of distinct pattern vectors {ξ^(2), ..., ξ^(M), ξ^(M+1)}, where ξ^(1) is replaced by ξ^(M+1). However, we did not discuss how to identify ξ^(M+1) among the pattern vectors arriving as a time series, in order to get the new set {ξ^(2), ..., ξ^(M+1)} for which the classifier's structure ought to be optimal. There are a few different identification schemes conceivable. These schemes will be discussed in this section.

The entire system of the linear classifier is illustrated in Fig. 6.1. Block C stores the last simplex tableau for {ξ^(1), ..., ξ^(M)}. Block A examines an incoming vector ξ and decides whether it should be considered as a new pattern vector ξ^(M+1), by checking the information about {ξ^(1), ..., ξ^(M)} stored in Block C (in the first case an adjustment of the structure of the classifier results; otherwise no adjustment results). If the new pattern vector ξ^(M+1) is identified, Block B computes the new optimal structure for {ξ^(2), ..., ξ^(M), ξ^(M+1)} by starting from the last simplex tableau for {ξ^(1), ..., ξ^(M)} stored in Block C. Using this new structure, Block D classifies the incoming vector ξ. (The vector ξ must be kept in a buffer memory during the identification process by Block A and the computation of the new structure by Block B.) The last simplex tableau is stored in Block C. Note that the last simplex tableau now stores the information about {ξ^(2), ..., ξ^(M+1)} instead of {ξ^(1), ..., ξ^(M)}. This completes one cycle of the adaptation procedure.

[Fig. 6.1. Adaptation Scheme. Block diagram: an incoming pattern vector ξ feeds Block A (identification of a new pattern vector), which consults Block C (last simplex tableau stored); Block B (linear programming computer) recomputes the structure; Block D (linear classifier) produces the output.]
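The data flow of Fig. 6.1 can be caricatured as one function per cycle. This is only a schematic sketch: in the real system Block B restarts a simplex computation from the stored tableau, whereas here recompute and classify are caller-supplied placeholders, and the stored information is a plain Python list.

```python
def adaptation_cycle(xi, stored, M, recompute, classify):
    """One cycle of Fig. 6.1: Block A decides whether xi is new,
    Block B recomputes the structure, Block C keeps the latest M
    patterns, and Block D classifies the buffered vector xi."""
    if xi not in stored:          # Block A: identification
        stored.append(xi)         # xi becomes the newest pattern
        if len(stored) > M:
            stored.pop(0)         # the oldest pattern leaves the set
        recompute(stored)         # Block B; result held in Block C
    return classify(xi)           # Block D: classify xi
```

The `xi not in stored` test corresponds to scheme (2) described below; schemes (1) and (3) would replace that test.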
The entire procedure is characterized by specifying how Block A identifies the new pattern vector ξ^(M+1) which is then processed by Block B. If Block A very often identifies an incoming vector ξ as a new pattern vector ξ^(M+1), then the whole adaptation procedure must also run that often, and this slows down the processing speed of the classifier. In this case, however, the accuracy of classification by the classifier is maintained. On the other hand, if very few incoming pattern vectors are processed for adaptation, the processing speed will be much faster. (Since the interval between adjacent adaptations is long, the buffer memory will not be filled up by incoming pattern vectors which are waiting for classification. Therefore the transmission rate of pattern vectors can be increased.)

When the memory space of Block C is limited, the size of M is limited. Therefore, if the number of distinct vectors in the time series is more than M, some selection of M vectors out of all the distinct vectors must be made for adaptation. Generally speaking, then, even if the time series contains new pattern vectors, the adaptation may not take place and the classifier may not work correctly for every pattern vector. In the following, three simple schemes to identify new pattern vectors are arranged in descending order of adaptation frequency.

Adaptation scheme (1): Regard every incoming pattern vector as the new vector ξ^(M+1), no matter whether it is actually new or not. This eliminates the checking procedure of Block A. This scheme guarantees that the current structure is optimal for the last M pattern vectors. But the classifier performs adaptation all the time, and the optimality of the structure is valid only over the last M pattern vectors.

Adaptation scheme (2): Block A checks whether each incoming pattern vector is new or not, by comparing it with those stored in Block C. If it is new, the classifier solves a linear program; otherwise it does not.
This scheme differs from (1) in that the structure in (2) is optimal for the M distinct pattern vectors which appeared most recently. These M distinct pattern vectors are, of course, stored in the simplex tableau.

Adaptation scheme (3): This scheme is very similar to the so-called error-correction procedure of an adaptive element, as far as the selection of a pattern vector is concerned. ξ is regarded as the new pattern vector ξ^(M+1) only if ξ is classified erroneously by the current structure. The check of whether it is classified correctly or not is done simply by substituting the current solution into the corresponding inequality. Note that if the pattern ξ is separated correctly, the current optimum structure for {ξ^(1), ..., ξ^(M)} is also optimum for ξ, in the sense that it is optimum for {ξ^(1), ..., ξ^(M), ξ}. However, the optimum structure for {ξ^(2), ..., ξ^(M), ξ}, which is obtained by replacing ξ^(1) by ξ, may obviously be different from the current structure. As a result, adaptation will take place less frequently than under the other schemes. In other words, the majority of incoming vectors are not included in the M vectors, but it is likely that they will be correctly classified when they appear again. The current structure, however, is optimum only for the last M pattern vectors for which the classifier solved linear programs, because the readjusted structure may no longer be optimum for the pattern vectors which were not identified as new pattern vectors and which therefore have nothing to do with the current structure.

In addition to the above three schemes, various more sophisticated approaches might also be conceivable depending upon the given situation, including the addition of a random-sampling approach. However, various aspects of the incoming pattern vectors should be examined, and experimental justification is possibly necessary, in order to find a proper scheme.
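The three schemes reduce to three different tests in Block A. A minimal sketch, under the assumptions that the stored M vectors are available as a list, that the current classifier is a callable returning the predicted category, and that the true category of ξ accompanies it (these names are ours, not the report's):

```python
def scheme_1(xi, stored, classify, category):
    """Scheme (1): treat every incoming vector as new; adapt always."""
    return True

def scheme_2(xi, stored, classify, category):
    """Scheme (2): adapt only when xi differs from all M stored vectors."""
    return xi not in stored

def scheme_3(xi, stored, classify, category):
    """Scheme (3), error-correction style: adapt only when the
    current structure misclassifies xi."""
    return classify(xi) != category
```

Each predicate returns True exactly when the classifier should solve a new linear program, so the three functions are indeed ordered by decreasing adaptation frequency.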
7. Computational Experiment of Adaptation by the Simplex Method

The adaptation algorithm by the simplex method discussed in Section 4 was tested by actually solving a sequence of sets of pattern vectors. Each pattern set consisted of 30 thirty-dimensional pattern vectors whose coordinates are 1 or 0. Each two consecutive pattern sets differ in exactly one pattern vector, so that the adaptation algorithm may be applied. First, we generated 60 pattern vectors, together with the categories to which the pattern vectors should belong, by using a random number generator which produces 1 and 0 with equal probability. Second, the i-th pattern set S^(i) was defined as follows:

    S^(i) = {ξ^(i), ..., ξ^(i+29)},    i = 1, 2, ..., 31,

where ξ^(j), j = 1, 2, ..., 60, is the j-th generated pattern vector. These 31 problems, corresponding to S^(1), ..., S^(31), were transformed into sets of linear inequalities as illustrated in the example in Section 4, and then solved on the IBM 360/75 computer by using the IBM Mathematical Programming System. (Although MPS cannot perform exactly the same procedure as described in Section 4, an equivalent modified test was tried in order to observe the number of necessary pivot operations.) The result is that when we solve the 31 problems separately, the average number of pivot operations is 51.3 for each problem, whereas 18.5 pivot operations are needed under our adaptation scheme. The saving in pivot operations was about two thirds. This may be encouraging, because the tested problem seems difficult for our approach: the pattern vectors which come in or go out are generated independently of the other pattern vectors in the set (i.e., by the random number generator), and accordingly the new solution may not have much relationship with the old one.

As is expected, the number of pivot operations fluctuates more widely under our adaptation scheme than in the case of solving the 31 problems separately.
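The sliding-window construction of the 31 pattern sets can be reproduced in a few lines. The seed and data layout here are our assumptions; only the dimensions (60 vectors, 30 coordinates, windows of 30) come from the text.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

# 60 random 30-dimensional 0/1 pattern vectors, with random categories,
# each coordinate 1 or 0 with equal probability
vectors = [tuple(random.randint(0, 1) for _ in range(30)) for _ in range(60)]
categories = [random.randint(0, 1) for _ in range(60)]

# S(i) = {xi^(i), ..., xi^(i+29)}, i = 1, ..., 31
pattern_sets = [vectors[i:i + 30] for i in range(31)]

# consecutive sets share 29 vectors: one leaves, one enters
assert all(pattern_sets[i][1:] == pattern_sets[i + 1][:-1] for i in range(30))
```

Each window would then be converted into a set of linear inequalities and handed to the adaptation procedure, one pattern replacement per step.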
This is because the adaptation procedure usually needs more pivot operations if the pattern vector which leaves the current pattern set is in the basis of the last simplex tableau than if it is not in the basis.

8. Conclusion

We have discussed the adaptation procedure of a linear classifier based on a linear programming method, making use of the easy incorporation of a new constraint into the simplex method for the dual formulation. Its advantage over the existing ordinary adaptive procedures (such as the error-correction method) is that the optimality of a parameter of the classifier, such as the input tolerance, which is a measure of the reliability of the classifier operation, is guaranteed. Even when a given set of pattern vectors is not linearly separable, the adaptation procedure based on integer linear programming attains the minimality of the number of incorrectly separated pattern vectors as well as the input tolerance of the classifier. On the other hand, the disadvantage of the approach is the need for somewhat complicated computation for each adaptation and for storage space for the simplex tableau. The computational experiment in Section 7 is encouraging and may indicate the feasibility of this kind of system.

Acknowledgment

The authors would like to thank C. R. Baugh for his assistance in performing the experiment of Section 7 and for his valuable comments on the manuscript.

References

(1) N. J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.
(2) F. Rosenblatt, Principles of Neurodynamics, Spartan Books, Washington, D.C., 1961.
(3) O. L. Mangasarian, "Linear and Nonlinear Separation of Patterns by Linear Programming", Operations Research, vol. 13, no. 3, pp. 444-452, May-June 1965.
(4) F. W. Smith, "Pattern Classifier Design by Linear Programming", IEEE Trans. on Computers, vol. C-17, no. 4, pp. 367-372, April 1968.
(5) S.
Muroga, Threshold Logic, Lecture notes for EE 497 and EE 498, Department of Computer Science, University of Illinois, 1965-1966.
(6) S. Muroga, "Majority Logic and Problems of Probabilistic Behavior", in Self-Organizing Systems, Spartan Books, 1962, pp. 243-281.
(7) A. J. Goldman and A. W. Tucker, "Theory of Linear Programming", in Linear Inequalities and Related Systems, edited by H. W. Kuhn and A. W. Tucker, Princeton University Press, 1956, pp. 53-97.
(8) G. B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, New Jersey, 1963.
(9) G. Hadley, Linear Programming, Addison-Wesley Series in Industrial Management, Addison-Wesley, 1962.
(10) R. E. Gomory, "An All-Integer Integer Programming Algorithm", in Industrial Scheduling, edited by Muth and Thompson, Prentice-Hall International Series in Management, Prentice-Hall, 1963, pp. 193-206.