    w · ξ^(j) + w_0 ≥ d     if f^(j) = 1
    w · ξ^(j) + w_0 ≤ −d    if f^(j) = 0,    j = 1, ..., M,               (2.4)

where w · ξ^(j) is the inner product Σ_{i=1}^{N} w_i ξ_i^(j) and d > 0 is called the margin of the classifier.* The margin is provided to secure reliable operation even when a small deviation in w · ξ^(j) is caused by noise.

* Even if the value of the left-hand side of (2.4) does not exceed d (does not reach −d) for a vector of category A (B), the classifier is supposed to work properly as long as the value is positive (negative). This is why d is called the "margin". d is usually set to 1 without loss of generality.

In order to facilitate our computation based on linear programming, let each variable w_i be decomposed into non-negative variables as follows:

    w_i = w_i^+ − w_i^−    (i = 0, 1, 2, ..., N)
where                                                                     (2.5)
    w_i^+ ≥ 0,  w_i^− ≥ 0    (i = 0, 1, 2, ..., N).

Thus (2.4) is rewritten as

    Σ_{i=0}^{N} (w_i^+ − w_i^−) ξ_i^(j) ≥ d     if f^(j) = 1
    Σ_{i=0}^{N} (w_i^+ − w_i^−) ξ_i^(j) ≤ −d    if f^(j) = 0,  j = 1, ..., M,
where                                                                     (2.6)
    ξ_0^(j) = 1.

If (2.6) (i.e. (2.4)) is consistent, there are generally an infinite number of solutions (w, w_0). Henceforth we will consider only a solution which minimizes a linear objective function of the weight vector,

    u · w_± ,                                                             (2.7)

where u is a (2N + 2)-dimensional vector and

    w_± = (w_0^+, w_0^−, w_1^+, w_1^−, ..., w_N^+, w_N^−).                (2.8)

The coordinates of u will be determined in the following paragraphs, depending upon which parameters of the classifier we want to minimize. The minimization of the objective function (2.7) under the constraints (2.5) and (2.6) is a typical linear programming problem.

The objective function is expressed in a general form in (2.7). However, it can be correlated to the reliable operation of the classifier as follows. Assume that the deviation of each input ξ_i due to noise or other fluctuations in the circuit of the classifier is δ_i.
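The rewriting (2.5)-(2.6) is mechanical and easy to express in code. A minimal sketch (the function name and data layout are our own, not from the report): each pattern contributes one row over the split variables (w_0^+, w_0^−, w_1^+, w_1^−, ...), with the f = 0 rows negated so that every constraint reads "row · x ≥ d".

```python
import numpy as np

def split_constraint_rows(patterns, labels, d=1.0):
    """Build the system (2.6) over the 2(N+1) split variables
    (w0+, w0-, w1+, w1-, ..., wN+, wN-).  Rows for f = 0 are
    multiplied by -1 so every constraint has the form row . x >= d."""
    rows, rhs = [], []
    for xi, f in zip(patterns, labels):
        ext = np.concatenate(([1.0], xi))     # xi_0 = 1 carries w_0
        row = np.empty(2 * len(ext))
        row[0::2], row[1::2] = ext, -ext      # w_i = w_i+ - w_i-
        rows.append(row if f == 1 else -row)
        rhs.append(d)
    return np.array(rows), np.array(rhs)

A, b = split_constraint_rows([(1, 1), (0, 1)], [1, 0])
# First row: pattern (1,1) with f=1 gives
# (w0+ - w0-) + (w1+ - w1-) + (w2+ - w2-) >= 1.
print(A[0])  # [ 1. -1.  1. -1.  1. -1.]
```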
Then the actual value of the left side of (2.4) is

    Σ_{i=1}^{N} w_i (ξ_i^(j) + δ_i) + w_0 = Σ_{i=1}^{N} w_i ξ_i^(j) + Σ_{i=1}^{N} w_i δ_i + w_0 .

Now let δ be the maximum of |δ_i| over all i: |δ_i| ≤ δ, i = 1, 2, ..., N. Then we have

    | Σ_{i=1}^{N} w_i δ_i | ≤ δ Σ_{i=1}^{N} |w_i| .

Therefore if δ satisfies

    Σ_{i=1}^{N} w_i ξ_i^(j) − δ Σ_{i=1}^{N} |w_i| + w_0 > 0    if f^(j) = 1
                                                                          (2.9)
    Σ_{i=1}^{N} w_i ξ_i^(j) + δ Σ_{i=1}^{N} |w_i| + w_0 < 0    if f^(j) = 0,  j = 1, ..., M,

the linear classifier operates correctly* even when all δ_i have the maximum deviation +δ or −δ (i.e. |δ_i| = δ for all i). The maximum value of δ such that the classifier operates correctly is called the input tolerance. Denoting the input tolerance by γ, we have, by (2.9),

    γ = Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | / Σ_{i=1}^{N} |w_i| .     (2.10)

* See the footnote on page 3.

If we determine a solution (w, w_0) such that γ is maximized, the classifier is allowed the maximum deviation in its inputs, and therefore we have maximized the reliability of the operation of the classifier. We will prove that the maximization of (2.10) is equivalent to the minimization of Σ_{i=1}^{N} |w_i|.

First of all, if w is a weight vector which maximizes γ, then k·w also provides the same input tolerance, where k is any positive number such that k·w is still a solution of (2.4). Now we can assume

    Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = d                           (2.11)

without loss of generality. If Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = t > d, we can simply multiply by a certain positive number k = d/t < 1 in order to obtain (2.11), which of course still satisfies (2.4). Consequently the maximization of γ amounts to minimizing

    Σ_{i=1}^{N} |w_i|

under conditions (2.4) and (2.11). In this case, however, we can delete condition (2.11). This is because if we minimize Σ_{i=1}^{N} |w_i| under condition (2.4) alone and obtain Min_j | Σ_{i=1}^{N} w_i ξ_i^(j) + w_0 | = e > d, then d/e < 1 and

    w' = (d/e) · w

satisfies (2.4). For this new weight vector w',

    Σ_{i=1}^{N} |w_i'| < Σ_{i=1}^{N} |w_i|

holds, which is a contradiction. Therefore we must have e = d, i.e.
(2.11) is satisfied. An alternative proof is found in the literature.

By setting u_0^+ = u_0^− = 0 and u_i^+ = u_i^− = 1 for i = 1, 2, ..., N, the objective function (2.7) becomes

    Σ_{i=1}^{N} (w_i^+ + w_i^−),                                          (2.12)

which represents Σ_{i=1}^{N} |w_i|.* Solution of the linear program composed of (2.12), (2.5) and (2.6) will lead to the design of a linear classifier with the maximum reliability of operation.

* Obviously the minimization of (2.12) leads to the condition that either w_i^+ or w_i^− is always 0. Then Σ_{i=1}^{N} (w_i^+ + w_i^−) = Σ_{i=1}^{N} |w_i| follows.

Another case was discussed in earlier papers, where the values of ξ_i are limited to +1 and −1 instead of real numbers. If w_i ξ_i (i = 0, 1, 2, ..., N) is permitted to deviate as much as w_i ξ_i (1+δ) or w_i ξ_i (1−δ), where δ is now a percentage deviation, then the input tolerance of a majority element is

    (1/W) Min_j | Σ_{i=0}^{N} w_i ξ_i^(j) | ,                             (2.13)
where
    W = Σ_{i=0}^{N} |w_i| .

And it was proved that the minimization of W is equivalent to the maximization of the input tolerance, i.e. the maximization of the reliability of operation. For this case we set u_i^+ = u_i^− = 1, i = 0, 1, 2, ..., N, in (2.7).

Although the input tolerance is a typical objective to be optimized, if a certain parameter of the classifier can be represented in the form of (2.7), we can optimize other characteristics of the classifier rather than the input tolerance. In particular, when integer programming is used, a wider variety of objective functions may be available for our choice.

So far we have assumed that the linear classifier processes the set of distinct patterns {ξ^(1), ..., ξ^(M)}. In other words, only ξ^(1), ..., ξ^(M) are supplied repeatedly to the linear classifier as a time series ..., ξ^(t−1), ξ^(t), ξ^(t+1), ξ^(t+2), .... The structure of the linear classifier was optimized for the set {ξ^(1), ..., ξ^(M)}.
Let us consider the case where the time series of pattern vectors gradually comes to contain new pattern vectors for which the structure of the linear classifier is not optimum, while some of the existing pattern vectors are eliminated from the time series. A few different schemes are conceivable for deciding whether an incoming pattern vector is to be considered a new pattern vector to which the structure of the classifier should be optimized. These schemes will be discussed in Section 6. Here for the moment let us assume that the incoming pattern vector is found new by a certain scheme at some instant, and that the linear classifier is supposed to process the new set of distinct pattern vectors {ξ^(2), ..., ξ^(M+1)} instead of {ξ^(1), ..., ξ^(M)}. In other words, ξ^(1) is replaced by ξ^(M+1). Now the structure of the classifier is supposed to be optimized for {ξ^(2), ..., ξ^(M+1)}. The linear classifier should adapt itself to this new environment.

By changing the notation appropriately and multiplying the second inequality of (2.6) by −1, our linear program to minimize (2.12) under the constraints (2.5) and (2.6) is converted into

    minimize    c · x
    subject to  A x ≥ b                                                   (2.14)
                x ≥ 0,

where x is an n-dimensional vector of unknown variables to be determined and

    c = (c_1, ..., c_n),

    A = | a_11 ... a_1n |
        |  .         .  |                                                 (2.15)
        | a_m1 ... a_mn |

    b = (b_1, ..., b_m).

Note that A corresponds to the original given set of pattern vectors, b to the d's, and x to the original w; c represents the coefficients of the objective function.

Thus the adaptability problem of the classifier with the change of environment may be stated as follows: given an optimum solution for the linear program (2.14), find an optimum solution for the new linear program with the new coefficient row a^(m+1) replacing the oldest row a^(1), representing the change from ξ^(1) to ξ^(M+1); i.e.

    minimize    c · x
    subject to  A' x ≥ b'                                                 (2.16)
                x ≥ 0,
where
    A' = | a_21      ...  a_2n      |
         |  .               .       |                                     (2.17)
         | a_{m+1,1} ...  a_{m+1,n} |

    b' = (b_2, ..., b_{m+1}).
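Once stated as (2.14), the design problem can be handed to any LP solver. A sketch using scipy.optimize.linprog (our assumption; the report of course predates such libraries), on the small separable set of an AND gate, with the split-variable ordering (w_1^+, w_1^−, w_2^+, w_2^−, w_0^+, w_0^−) and objective |w_1| + |w_2|:

```python
import numpy as np
from scipy.optimize import linprog

# Patterns of a two-input AND gate: only (1,1) yields f = 1.
patterns = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels   = [1, 0, 0, 0]
d = 1.0

# One row per pattern over the split variables (w1+, w1-, w2+, w2-, w0+, w0-);
# f = 0 rows are negated so every constraint reads row . x >= d.
rows = []
for (x1, x2), f in zip(patterns, labels):
    ext = np.array([x1, -x1, x2, -x2, 1.0, -1.0])
    rows.append(ext if f == 1 else -ext)
A = np.array(rows)

# Objective (2.12): |w1| + |w2| = w1+ + w1- + w2+ + w2-  (w0 carries no cost).
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)

# linprog uses A_ub x <= b_ub, so negate the ">= d" system.
res = linprog(c, A_ub=-A, b_ub=-d * np.ones(len(A)), bounds=(0, None))
w1, w2, w0 = res.x[0] - res.x[1], res.x[2] - res.x[3], res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum here is w1 = 2, w2 = 2, w0 = -3
```

The AND constraints force w_1 ≥ 2 and w_2 ≥ 2 (each f = 1 row combined with an f = 0 row), so the minimum of |w_1| + |w_2| is 4, attained only at w = (2, 2) with w_0 = −3.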
The effectiveness of the methods which we will discuss depends on the validity of the assumption that an optimum solution of (2.16) does not differ "too much" from the one of (2.14).

3. Outline of the Simplex Method

We assume that readers are familiar with the basic properties of linear inequalities and the simplex method, which is a computational procedure to solve a linear programming problem. However, let us sketch important concepts which will be used often in the rest of this paper.

3.1 Duality Theorem of a Linear Program

Consider the following linear program:

    maximize    b · v
    subject to  A^T v ≤ c                                                 (3.1.1)
                v ≥ 0,

where v is an m-dimensional vector of unknown variables. This linear program is called the dual of (2.14). The coefficient matrix is the transpose of A in (2.14), and b and c are interchanged. It is known that when we solve either (3.1.1) or (2.14) by the simplex method, we will find either optimum solutions to both or infeasibility of the problems (one of them possibly unbounded). Therefore we can solve whichever of (3.1.1) and (2.14) is more convenient.

3.2 Simplex Method

Let us sketch the simplex method. See the literature, [8] and [9], for details. In this paper we will work on the dual problem (3.1.1) rather than the primal problem (2.14), because (3.1.1) in our case is more advantageous than (2.14) in several respects, as will be seen later. We reformulate (3.1.1) in the following form by introducing so-called slack variables s_1, s_2, ..., s_n:

    maximize    b · v
    subject to  A^T v + I s = c                                           (3.2.1)
                v ≥ 0,  s ≥ 0,

where I is the n × n unit matrix and s = (s_1, ..., s_n). The simplex method is a systematic procedure to choose a sequence of sets of n basic variables (only basic variables can assume non-zero values; non-basic variables are assigned zero) out of the m+n variables v_1, ..., v_m, s_1, ..., s_n, until we obtain an optimal solution consisting of basic variables which maximizes b · v.
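The duality relation of Section 3.1 is easy to check numerically. A minimal sketch (the 2 × 2 data are illustrative, not from the report), solving the primal (2.14) and its dual (3.1.1) side by side with scipy.optimize.linprog and comparing the optimal values:

```python
import numpy as np
from scipy.optimize import linprog

# Primal (2.14): minimize c.x  subject to  A x >= b, x >= 0.
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 5.0])
c = np.array([6.0, 7.0])
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))

# Dual (3.1.1): maximize b.v  subject to  A^T v <= c, v >= 0
# (linprog minimizes, so we minimize -b.v and negate the result).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(0, None))

print(primal.fun, -dual.fun)  # the two optimal values coincide
```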
A solution which satisfies the constraints of (3.2.1) but is not necessarily optimal is called a feasible solution. When c_1, c_2, ..., c_n ≥ 0, we may choose

    v_1 = v_2 = ... = v_m = 0
    s_i = c_i    (i = 1, 2, ..., n)                                       (3.2.2)

as an initial feasible solution. Let us assume that u of (2.7) is non-negative, as seen in (2.12) for example. Then c_1, ..., c_n are all non-negative, as required in (3.2.2).

The simplex method is most conveniently described by using a tableau representation. The first simplex tableau is shown in Fig. 3.2.1, whose elements are the column vectors p and q, the row vector r, and the matrix H. Here

    q = (q_1, ..., q_n)^T

is the vector of values of the basic variables in the current tableau (therefore q ≥ 0 means that the solution q is feasible), and

    p = (p_1, ..., p_n)

denotes the coefficients of the basic variables in the objective function (i.e. the b_j's of b in (3.2.1) corresponding to the basic variables in q).

[Fig. 3.2.1 Simplex Tableau]

H may be divided into two parts,

    H = [h_1, h_2, ..., h_{m+n}] = [D, B^{-1}],                           (3.2.3)

where h_i for 1 ≤ i ≤ m is the column associated with the variable v_i, and h_i for m+1 ≤ i ≤ m+n is the column associated with the slack variable s_{i−m}. The row r = (r_0, r_1, ..., r_{m+n}) can be obtained by the following:

    r_0 = p · q
    r_i = p · h_i − b_i    (i = 1, 2, ..., m+n)                           (3.2.4)

where b_{m+1}, ..., b_{m+n} are set to 0 (see (3.2.1)). If we start with the initial feasible solution (3.2.2), the initial tableau consists of

    q = c
    D = A^T
    B^{-1} = I (the unit matrix)                                          (3.2.5)
    r_0 = 0
    r_i = −b_i    (i = 1, 2, ..., m+n).

At each simplex tableau, there are three possibilities:

(1) r ≥ 0.
(2) Existence of i such that r_i < 0 and h_i ≤ 0.
(3) Neither (1) nor (2).

Case (1) means that the feasible solution at the current tableau is optimal, and case (2) means that the linear program has an unbounded solution (i.e. the infeasibility of the primal problem (2.14)).
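The initial tableau (3.2.5) is simple to materialize. A sketch (the function name and array layout are our own): H = [A^T | I], q = c, p = 0 (the objective coefficients of the all-slack basis are zero, so r_i = p · h_i − b_i reduces to −b_i):

```python
import numpy as np

def initial_tableau(A, b, c):
    """Initial tableau (3.2.5) for the dual problem (3.2.1).
    The slack variables form the first basis, so p = 0 and q = c."""
    m, n = A.shape
    H = np.hstack([A.T, np.eye(n)])      # D = A^T, B^{-1} = I
    q = np.array(c, dtype=float)         # values of the basic variables
    p = np.zeros(n)                      # objective coeffs of the slacks
    r = np.concatenate(([0.0], -np.array(b, float), np.zeros(n)))
    return p, q, H, r

A = np.array([[1.0, 2.0], [3.0, 1.0]])
p, q, H, r = initial_tableau(A, b=[4, 5], c=[6, 7])
print(H)  # columns: v1, v2, then the 2x2 unit matrix for the slacks
print(r)  # [ 0. -4. -5.  0.  0.]
```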
Whenever case (3) is encountered, the so-called pivot operation is applied to the current tableau and the entries are transformed according to the simplex rule, deriving the next tableau. Then we examine which of the above three possibilities holds in the new tableau. With repeated derivation of tableaux we will eventually reach case (1) or (2). The pivot operation is essentially a process of replacing one basic variable (i.e. a column of the current tableau) by a new basic variable. See the literature [8] and [9] for the details of the derivation of simplex tableaux.

If the simplex method ends up with case (1), an optimal solution of (3.1.1) will be in column q of the last tableau, and an optimal solution of (2.14) will be (r_{m+1}, ..., r_{m+n}), whose coordinates are the values of x_1, ..., x_n, respectively.

Note that the following relations hold for every simplex tableau:

    q = B^{-1} c
    h_i = B^{-1} ā_i    (i = 1, 2, ..., m+n)                              (3.2.6)

where ā_i is the i-th column of [A^T I] of (3.2.1). These relations will be used later.

4. Adaptation by the Simplex Method

When a classifier characterized by the linear program (2.14) adapts itself to a new environment, with an old pattern vector replaced by a new one, it is characterized by the new linear program (2.16). Let us assume for a while that the linear programs (2.14) and (2.16) are both feasible. The case where they are infeasible will be considered at the end of this section and also in the next section.

Now suppose that we have solved the linear program (2.14) by using the dual formulation (3.1.1) and have obtained an optimum solution in the last simplex tableau, as shown in Fig. 4.1. The adaptation is essentially an efficient method for deriving an initial tableau of the new problem (2.16) (strictly speaking, the dual formulation of (2.16)) and applying the pivot operations until a new optimal solution is obtained.

[Fig. 4.1 Last Simplex Tableau]
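Both the simplex iterations of Section 3 and the adaptation steps below repeatedly invoke the pivot operation, which is ordinary Gauss-Jordan elimination on a chosen pivot element. A minimal sketch (the row/column conventions are our own):

```python
import numpy as np

def pivot(T, i, j):
    """Pivot the tableau T on element (i, j): normalize row i so the
    pivot becomes 1, then eliminate column j from every other row.
    Afterwards column j is a unit vector, i.e. the variable of
    column j has entered the basis in row i."""
    T = T.astype(float).copy()
    T[i] /= T[i, j]
    for k in range(T.shape[0]):
        if k != i:
            T[k] -= T[k, j] * T[i]
    return T

T = np.array([[2.0, 1.0, 4.0],
              [1.0, 3.0, 5.0]])
T2 = pivot(T, 0, 0)
print(T2[:, 0])  # [1. 0.] : column 0 is now a unit vector
```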
The entries of this last tableau will be denoted by p(F), q(F), H(F) and r(F). This type of problem has already been studied and is included in the subject of parametric programming. In this paper, however, we propose a new procedure rather than using the standard technique of parametric programming, because (1) the pattern classification problem is suitable for the dual formulation with respect to the signs of the coefficients (no need of artificial variables), (2) the subsequent procedure for adaptation is simpler and more straightforward (the dual simplex method is not needed), and (3) our formulation is easily extendable to the integer linear programming formulation which will be discussed in Section 5.

This initial tableau is expected to yield faster convergence than the ordinary initial tableau given in (3.2.5). In order to obtain this new initial tableau, we have to (1) eliminate the effect of the inequality

    a_11 x_1 + a_12 x_2 + ... + a_1n x_n ≥ b_1 ,                          (4.1)

which corresponds to the pattern vector to be eliminated, and (2) introduce a new inequality

    a_{m+1,1} x_1 + a_{m+1,2} x_2 + ... + a_{m+1,n} x_n ≥ b_{m+1} ,       (4.2)

which corresponds to the new pattern vector, without increasing the size of the simplex tableau.

The first step, corresponding to (1), is to set b_1 equal to −S, where S is a sufficiently large positive number such that (4.1) is satisfied by all feasible solutions of the remaining inequalities. This means that (4.1) is now non-restrictive. Let us recall that the current linear program is treated by the dual formulation, and therefore the inequality (4.1) is associated with the first column h_1(F) of H(F) in the last simplex tableau. The change of b_1 causes a change of the entry of p(F) which corresponds to the variable v_1 (or h_1(F)) if v_1 is in the basis, but it causes no change if h_1(F) is not in the basis. The entries of r(F) are accordingly recalculated by (3.2.4). After this, simply delete the first column h_1(F) and r_1.
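The two steps just described reduce to a handful of matrix-vector products via (3.2.4) and (3.2.6). A sketch (array names and the toy data are our own; r here holds r_1, ..., r_{m+n} with r_0 omitted): make the old inequality non-restrictive by setting its b entry to −S, then overwrite the freed column with B^{-1} ā_new and its recomputed r entry.

```python
import numpy as np

def replace_column(H, p, r, b, col, a_new, b_new, S=1e6):
    """Replace column `col` of the tableau (the old pattern vector)
    by the column of a new inequality  a_new . x >= b_new.
    B^{-1} is read off the slack part of H, as in (3.2.3)."""
    H, r, b = H.copy(), r.copy(), b.copy()
    b[col] = -S                       # step 1: old inequality non-restrictive
    n = H.shape[0]
    B_inv = H[:, -n:]                 # slack columns hold B^{-1}
    H[:, col] = B_inv @ a_new         # h = B^{-1} a_new
    r[col] = p @ H[:, col] - b_new    # r = p . h - b_new, by (3.2.4)
    return H, r, b

# Toy 2x4 tableau whose slack part is still the unit matrix.
H = np.array([[1.0, 2.0, 1.0, 0.0],
              [3.0, 1.0, 0.0, 1.0]])
p = np.array([4.0, 5.0])
r = np.zeros(4)
b = np.array([1.0, 1.0])
H2, r2, b2 = replace_column(H, p, r, b, col=0,
                            a_new=np.array([1.0, -1.0]), b_new=1.0)
print(H2[:, 0], r2[0])  # column (1, -1); r = 4*1 + 5*(-1) - 1 = -2
```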
The deletion is permissible because h_1(F) henceforth will never enter the basis (since r_1 will not be negative).

The second step, corresponding to (2) above, is to introduce the new inequality (4.2) into the column where (4.1) was eliminated. The new entries can be obtained by (3.2.4) and (3.2.6), i.e.

    h = B^{-1} ā_{m+1} ,                                                  (4.3)

where ā_{m+1} = (a_{m+1,1}, a_{m+1,2}, ..., a_{m+1,n})^T, and

    r = p · h − b_{m+1} .                                                 (4.4)

Note that q was not changed in the above two steps, and therefore the new tableau still has a feasible solution (although it may not be optimal). Thus this new tableau can be used as an initial tableau for the dual of the problem (2.16). If all the entries of r are non-negative, the old solution is optimal for the new problem also. But if r contains some negative entries, apply the pivot operations until an optimal solution is derived for the new problem.

Although we discussed the procedure only for the case in which one inequality is replaced, extension to the case of more than one inequality is obvious. The procedure will need less computation time if the coordinates of the new solution do not deviate "too much" from those of the old solution, since the new pivot operations start from a solution "closer" to the new problem than the ordinary initial solution of (3.2.5).

Example.

[Fig. 4.2 Linear Classifier]

Let us pursue the adaptation of the linear classifier f(x_1, x_2) shown in Fig. 4.2 when the pattern vectors change as follows, from (4.5) through (4.7):

    (i)   (x_1, x_2):  (11), (10) yield f = 1
                       (01) yields f = 0                                  (4.5)

    (ii)  (x_1, x_2):  (11), (00) yield f = 1
                       (01) yields f = 0                                  (4.6)

    (iii) (x_1, x_2):  (11) yields f = 1
                       (01), (00) yield f = 0.                            (4.7)

Corresponding to these sets of pattern vectors we have the following sets of inequalities by (2.4):

    (i)   w_1 + w_2 + w_0 ≥ 1
          w_1 + w_0 ≥ 1                                                   (4.8)
          w_2 + w_0 ≤ −1

    (ii)  w_1 + w_2 + w_0 ≥ 1
          w_0 ≥ 1                                                         (4.9)
          w_2 + w_0 ≤ −1

    (iii) w_1 + w_2 + w_0 ≥ 1
          w_2 + w_0 ≤ −1                                                  (4.10)
          w_0 ≤ −1

Let us split these variables w_i
as seen in (2.5), in order to keep the non-negativeness of the variables in the simplex method. The objective function to be minimized is now

    |w_1| + |w_2| = w_1^+ + w_1^− + w_2^+ + w_2^− .                       (4.11)

Renaming these split variables and changing the direction of some inequalities, the above sets of inequalities may be rewritten as follows, corresponding to (2.14):*

    (i)   x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          x_1 − x_2 + x_5 − x_6 ≥ 1                v_2                    (4.12)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

    (ii)  x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          x_5 − x_6 ≥ 1                            v_4                    (4.13)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

    (iii) x_1 − x_2 + x_3 − x_4 + x_5 − x_6 ≥ 1    v_1
          − x_5 + x_6 ≥ 1                          v_5                    (4.14)
          − x_3 + x_4 − x_5 + x_6 ≥ 1              v_3

where each v_j is shown simply for identification of the corresponding inequality. The objective function in the renamed variables is

    x_1 + x_2 + x_3 + x_4 .                                               (4.15)

* d in (2.4) is set to 1 for simplicity.

Note that the following relations hold among the original and renamed variables:

    x_1 − x_2 = w_1^+ − w_1^− = w_1
    x_3 − x_4 = w_2^+ − w_2^− = w_2                                       (4.16)
    x_5 − x_6 = w_0^+ − w_0^− = w_0

Assume that our classifier has the first set of pattern vectors (i). An initial tableau for the dual of this linear program is derived by (3.2.5). Table 4.1 shows this initial tableau. After applying the pivot operation three times (Tables 4.2, 4.3 and 4.4), an optimal solution results, since Table 4.4 contains no negative entry in r. The solution is obtained in (r_4, ..., r_9), i.e.

    x_1 = 2,  x_6 = 1
    x_2 = x_3 = x_4 = x_5 = 0,                                            (4.17)

which implies

    w_1 = 2,  w_2 = 0,  w_0 = −1.                                         (4.18)

Then assume that the first set of pattern vectors is changed to the second set (ii).
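The solution (4.17)-(4.18) can be cross-checked by feeding the system (4.12) with objective (4.15) to an off-the-shelf LP solver (scipy here, our stand-in for the hand-pivoted tableaux of the report):

```python
import numpy as np
from scipy.optimize import linprog

# System (4.12) over x1..x6, one row per inequality v1, v2, v3:
A = np.array([[1, -1,  1, -1,  1, -1],   # v1: pattern (11), f = 1
              [1, -1,  0,  0,  1, -1],   # v2: pattern (10), f = 1
              [0,  0, -1,  1, -1,  1]],  # v3: pattern (01), f = 0
             dtype=float)
b = np.ones(3)
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)   # objective (4.15)

res = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))
w1 = res.x[0] - res.x[1]
w2 = res.x[2] - res.x[3]
w0 = res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum w1 = 2, w2 = 0, w0 = -1 of (4.18)
```

The optimum is unique: v2 and v3 together force w_1 − w_2 ≥ 2, so |w_1| + |w_2| ≥ 2, with equality only at (4.18).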
According to the procedure discussed, eliminate the inequality which corresponds to the old pattern vector to be replaced,

    x_1 − x_2 + x_5 − x_6 ≥ 1    v_2 ,                                    (4.19)

and instead introduce the inequality

    x_5 − x_6 ≥ 1    v_4 .                                                (4.20)

[Table 4.1 Initial Tableau for the First Set of Pattern Vectors]
[Table 4.2 Intermediate Tableau for the First Set of Pattern Vectors]
[Table 4.3 Intermediate Tableau for the First Set of Pattern Vectors]
[Table 4.4 Optimal Tableau for the First Set of Pattern Vectors]
[Table 4.5 Elimination of the Old Vector v_2]

Replace the 1 in b (of v_2) by a sufficiently small number, say −100. Since v_2 is not in the basis in Table 4.4, the replacement does not cause any change of any entry except r_2 of Table 4.4. Table 4.5 shows the resultant tableau. Next delete h_2 and r_2 from Table 4.5 and fill in the new entries which are derived from the new inequality (4.20) according to (4.3) and (4.4). Note that B^{-1} is [h_4, h_5, ..., h_9] in this case. The result is shown in Table 4.6. Since there is a negative entry in r, apply the pivot operation. After two applications, a new optimal solution is derived, as shown in Table 4.7. It is

    x_1 = 2,  x_4 = 2,  x_5 = 1
    x_2 = x_3 = x_6 = 0,                                                  (4.21)

which leads to

    w_1 = 2,  w_2 = −2,  w_0 = 1.                                         (4.22)

[Table 4.6 Initial Tableau for the Second Set of Pattern Vectors]
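The re-optimized structure (4.21)-(4.22) can likewise be cross-checked by solving the replaced system (4.13) directly (scipy again, as our substitute for the tableau manipulations):

```python
import numpy as np
from scipy.optimize import linprog

# System (4.13) over x1..x6, one row per inequality v1, v4, v3:
A = np.array([[1, -1,  1, -1,  1, -1],   # v1: pattern (11), f = 1
              [0,  0,  0,  0,  1, -1],   # v4: pattern (00), f = 1
              [0,  0, -1,  1, -1,  1]],  # v3: pattern (01), f = 0
             dtype=float)
b = np.ones(3)
c = np.array([1, 1, 1, 1, 0, 0], dtype=float)   # objective (4.15)

res = linprog(c, A_ub=-A, b_ub=-b, bounds=(0, None))
w1 = res.x[0] - res.x[1]
w2 = res.x[2] - res.x[3]
w0 = res.x[4] - res.x[5]
print(w1, w2, w0)  # the unique optimum w1 = 2, w2 = -2, w0 = 1 of (4.22)
```

Here v4 forces w_0 ≥ 1, v3 then forces w_2 ≤ −2, and v1 forces w_1 ≥ 2, so the minimum of |w_1| + |w_2| is 4, attained only at (4.22).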
[Table 4.7 Optimal Tableau for the Second Set of Pattern Vectors]

The change from the second set of pattern vectors (ii) to the third set (iii) can be treated in a similar manner. In this case the p column is changed, because v_4 is in the basis, as seen in Table 4.8. Table 4.9 is then the initial tableau for the third set (iii). An optimal solution is obtained in Table 4.10 after two pivot operations. It is

    w_1 = 2,  w_2 = 0,  w_0 = −1.                                         (4.23)

From (4.18), (4.22) and (4.23) we can see that the structure of the linear classifier has changed as shown in Fig. 4.3.

[Fig. 4.3 Adaptation of Linear Classifier]
[Table 4.8 Elimination of the Old Vector v_4]
[Table 4.9 Initial Tableau for the Third Set of Pattern Vectors]
[Table 4.10 Optimal Tableau for the Third Set of Pattern Vectors]

Now let us consider the case in which the given set of pattern vectors is not separable, i.e., (2.14) (or (2.16)) is not feasible. By adding an artificial variable to each inequality of (2.14), (2.14) can be rewritten as

    a_i1 x_1 + ... + a_in x_n + t_i ≥ b_i
    t_i ≥ 0    (i = 1, 2, ..., m)                                         (4.24)
    x_j ≥ 0    (j = 1, 2, ..., n).

Obviously (4.24) is feasible, because the set of inequalities is satisfied by a sufficiently large positive value of each t_i. However, the positiveness of t_i does not mean that the corresponding i-th inequality of (2.14) is satisfied. In order to circumvent this fact, let us use the following new objective function:

    V Σ_{i=1}^{m} t_i + c · x ,                                           (4.25)

where V is some large positive number. Roughly speaking, the minimization of this objective function means that it first minimizes Σ_{i=1}^{m} t_i ,
which will tend to minimize the number of incorrectly separated vectors, and that it next minimizes c · x, the original objective function.* The set of inequalities (4.24) with the objective function (4.25) can be solved in a similar manner as in the case of (2.14), since (4.24) is now feasible. Although the number of artificial variables would increase indefinitely if the adaptation were repeated infinitely, the increase of storage space can be avoided by replacing the old artificial variable to be eliminated by the new artificial variable.

* Ideally, Σ_{i=1}^{m} t_i is to be minimized no matter what value c · x assumes. However, this is not achieved by (4.25), because the t_i's are continuous variables. In other words, the value of Σ_{i=1}^{m} t_i when (4.25) is minimized depends upon the relative size of V and c · x, and may not be minimized. Note that if the t_i's are discrete variables, Σ_{i=1}^{m} t_i is minimized, as will be discussed in Section 5.

The objective function (4.25), however, does not always lead to the minimum number of erroneously separated pattern vectors. As will be discussed in the next section, the exact minimization of the number of incorrectly separated vectors can be obtained by using integer linear programming.

5. Formulation and Adaptation by Integer Linear Programming

Given a non-separable set of pattern vectors, one realization of a linear classifier is one which minimizes the number of incorrectly separated vectors. This can be achieved by the following integer linear programming approach. Corresponding to each input vector ξ^(j) of (2.4),

    w · ξ^(j) + w_0 ≥ d     if f(ξ^(j)) = 1                               (5.1A)
or
    −w · ξ^(j) − w_0 ≥ d    if f(ξ^(j)) = 0,                              (5.1B)

formulate two inequalities for each of (5.1A) and (5.1B) as follows:

    ξ^(j) · w + w_0 + U P_j ≥ d                                           (5.2A)
    ξ^(j) · w + w_0 − U(1 − P_j) ≤ −d       if f(ξ^(j)) = 1               (5.3A)
or
    −ξ^(j) · w − w_0 + U P_j ≥ d                                          (5.2B)
    −ξ^(j) · w − w_0 − U(1 − P_j) ≤ −d      if f(ξ^(j)) = 0,              (5.3B)

where P_j
is a variable which assumes 1 or 0, and where U is a sufficiently large positive number, to insure the following property. When P_j = 0, (5.2A and B) are obviously identical to (5.1A and B) respectively, and (5.3A and B) are satisfied by any value of ξ^(j). This means (5.1A or B) is satisfied by w, and accordingly the pattern vector is classified correctly. On the other hand, when P_j = 1, (5.2A and B) are satisfied by any value of ξ^(j), and (5.3A and B) become

    w · ξ^(j) + w_0 ≤ −d      if f = 1                                    (5.4A)
or
    −w · ξ^(j) − w_0 ≤ −d     if f = 0.                                   (5.4B)

The pattern vector is then separated incorrectly. Therefore

    Σ_{j=1}^{m} P_j                                                       (5.5)

shows the number of incorrectly separated pattern vectors.

Now consider the objective function:

    minimize  V Σ_{j=1}^{m} P_j + u · w_± ,                               (5.6)

where u is the vector in (2.7) and where V is a sufficiently large positive number (i.e. V > Max u · w_±). Different from the objective function (4.25) in Section 4, the minimization of this objective function means the minimization of both Σ P_j and u · w_±, because Σ P_j assumes discrete values only. This is the property which we desired.* Of course, we can consider other objective functions if they are needed. The incorporation of integer variables extensively widens the concept of objective functions. For example, it is possible to minimize the number of non-zero weights, i.e., the number of inputs actually needed, by reformulating the integer linear program with additional integer variables and changing the objective function.**

* A weighted sum Σ e_j P_j may be used when there is preference among pattern vectors.
** This formulation is due to F. Chen.

This new linear program is a mixed-integer linear programming problem, because the P_j must be integral. There are several known methods to solve this type of problem. Among them, we will use Gomory's all-integer linear programming method, by assuming that the other variables are also integers, without loss of generality in the
sense that both formulations will yield the same minimum number of incorrectly separated pattern vectors. This integral condition is adopted because his method is very similar to the simplex method which was applied to the dual problem discussed in the earlier sections. The adaptation process will accordingly be analogous to the case of those earlier sections. In order to apply Gomory's method we need additional constraints,

    P_j ≤ 1 ,    j = 1, 2, ..., m.                                        (5.7)

For notational convenience, let us henceforth denote our new problem of (5.2), (5.3) and (5.7), together with the objective function (5.6), by:

    minimize    c* · x
    subject to  A* x ≥ b*                                                 (5.8)
                x ≥ 0 and integer,

where x is an n'-dimensional vector of unknown variables to be determined and

    c* = (c*_1, ..., c*_{n'}),

    A* = | a*_11   ...  a*_1n'  |
         |  .             .     |                                         (5.9)
         | a*_m'1  ...  a*_m'n' |

    b* = (b*_1, ..., b*_{m'}).

In our case, m' is equal to 3M, and n' is the number of the variables x_1, ..., x_n, P_1, ..., P_m, which is 2(N + 1) + 3M.

Before discussing our adaptation process, let us outline Gomory's method. A rigorous description and proofs are found in [10]. The method is very similar to the ordinary simplex method applied to the dual formulation. A simplex tableau of the method is shown in Fig. 5.1. Each column of H corresponds to an inequality of (5.8). (Slack variables are taken into account; see (3.2.1).) However, when some entries of the row r are negative, we pick a certain column according to a rule stated in [10], and form a new column called a "cut" from that column. The cut is the column (h_{m'+n'+1}, r_{m'+n'+1}) in Fig. 5.1. The procedure to derive the cut is also discussed in [10]. Then the ordinary pivot operation is performed using this cut. This process is repeated until we obtain all non-negative entries in r, or find at least one column, say i, such that r_i < 0 and h_i ≤ 0 (infeasibility of (5.8)). Thus, only cuts can enter the basis.

[Fig. 5.1 Tableau for Integer Linear Programming]
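The mixed-integer formulation (5.2)-(5.7) can be checked with a modern MILP solver. A sketch using scipy.optimize.milp (a branch-and-bound solver, our substitute for Gomory's all-integer algorithm) on the classic non-separable XOR set, minimizing only the error count Σ P_j (the secondary term u · w_± of (5.6) is omitted for brevity; U = 100 and d = 1 are illustrative choices):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# XOR: no linear classifier separates all four vectors.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels   = [0, 1, 1, 0]
d, U = 1.0, 100.0

# Variables: w1, w2, w0 (continuous, unrestricted), P1..P4 (binary).
# For each pattern, (5.2) and (5.3) combine into the two-sided row
#     d <= s_j (w . xi + w0) + U P_j <= U - d,
# with s_j = +1 if f = 1 and s_j = -1 if f = 0.
rows = []
for k, ((x1, x2), f) in enumerate(zip(patterns, labels)):
    s = 1.0 if f == 1 else -1.0
    row = np.zeros(7)
    row[:3] = s * np.array([x1, x2, 1.0])
    row[3 + k] = U
    rows.append(row)
con = LinearConstraint(np.array(rows), lb=d, ub=U - d)

# Minimize the misclassification count sum P_j of (5.5).
c = np.array([0, 0, 0, 1, 1, 1, 1], dtype=float)
res = milp(c, constraints=con, integrality=[0, 0, 0, 1, 1, 1, 1],
           bounds=Bounds([-np.inf] * 3 + [0] * 4, [np.inf] * 3 + [1] * 4))
print(int(round(res.x[3:].sum())))  # 1: exactly one vector misclassified
```

Since XOR is not separable, Σ P_j = 0 is infeasible, while misclassifying a single vector (e.g. (1,1)) is achievable, so the optimum is 1.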
The initial tableau can be the one shown in (3.2.5). Although Gomory did not include the column p in his method, p is necessary for our adaptation procedure. In the simplex method of Section 3, the column p had the following meaning: suppose p_i corresponds to column h_j, which was transformed from the inequality

    a_j1 x_1 + ... + a_jn x_n ≥ b_j ;

then p_i = b_j holds. The column p in Fig. 5.1 of integer linear programming also has a similar meaning, except that, since each basis column is derived through a cut rather than through an inequality as in the simplex method, p_i corresponds to the right-hand side of the cut through which p_i was introduced.

p can be calculated by (3.2.4) each time a cut is introduced into the basis. Assume that a cut is to be introduced into the i-th row by the pivot operation; then the equation from (3.2.4),

    r_{m'+n'+1} = p · h_{m'+n'+1} − b*_{m'+n'+1} ,

is rewritten as

    b*_{m'+n'+1} = p · h_{m'+n'+1} − r_{m'+n'+1} ,                        (5.10)

where only b*_{m'+n'+1} is unknown, because all the other entries of the (m'+n'+1)-th column are obtained by Gomory's algorithm [10]. Now the new p_i which will be used in the next step is

    p_i = b*_{m'+n'+1} .                                                  (5.11)

The other entries of p do not change. Applying the pivot operation until all the entries in r become non-negative, an optimal integer solution of (5.8) is found in (r_{m'+1}, ..., r_{m'+n'}), as in the case of the ordinary simplex method.

Suppose that we have obtained an optimal solution for the problem (5.8). Now let us eliminate the effect of the oldest pattern vector and introduce a new pattern vector. Let us express the oldest pattern vector by

    a_11 x_1 + ... + a_1n x_n + U P_1 ≥ b_1                               (5.12)
    a_11 x_1 + ... + a_1n x_n − U(1 − P_1) ≤ b̄_1 ,                        (5.13)
or, by renaming the variable coefficients and the constants and by changing the direction of the second inequality,

    a*_11 x_1 + ... + a*_1n' x_n' ≥ b*_1                                  (5.14)
    a*_21 x_1 + ... + a*_2n' x_n' ≥ b*_2 .                                (5.15)

Similarly, let us denote the new pattern vector by

    a*_{m'+1,1} x_1 + ... + a*_{m'+1,n'} x_n' ≥ b*_{m'+1}                 (5.16)
    a*_{m'+2,1} x_1 + ... + a*_{m'+2,n'} x_n' ≥ b*_{m'+2} ,               (5.17)

which correspond to (5.2) or (5.3) for the given pattern vector.

The elimination of the oldest pattern vector can be done in a similar way as before: replace b*_1 and b*_2 by −S, where S is a sufficiently large positive number. However, this affects the tableau only through the cuts which are derived from the inequalities (5.14) and (5.15). (The classifier must have memory space for this information.) In other words, this means replacement by −S of the entries of p which correspond to the cuts which were derived from (5.14) and (5.15), so that the cuts become non-restrictive. More than one entry or no entry may exist, depending on what the current basis is.

The next step is to delete the columns h_1, h_2 and the entries r_1, r_2 which correspond to (5.14) and (5.15), and then to calculate the new columns h_1, h_2 from the new inequalities (5.16) and (5.17) by using the relation (3.2.6). Of course, B^{-1} = [h_{m'+1}, ..., h_{m'+n'}]. Note that the variable P_{m+1}, which is implicit in (5.16) and (5.17), should be replaced by P_1 of the oldest pattern vector in order to prevent an increase of the number of variables. Finally, the new row r can be obtained by (3.2.4). (Here we need p, which was obtained by (5.11).)

If there are negative entries in the new row r, Gomory's pivot operation is repeated, as discussed before, until we obtain an optimal solution. This completes our adaptation procedure. (Note that the condition r_i < 0 and h_i ≤ 0 will not be reached, since our problem is always feasible.) The whole process is repeated whenever the pattern vectors change.

6.
6. Entire Scheme of Adaptation and New Pattern Vector Identification

The adaptation procedure of a linear classifier which we have discussed in the previous sections is summarized as follows. Given an optimal structure for the set of distinct pattern vectors {ξ^(1), ξ^(2), ..., ξ^(M)}, the classifier readjusts itself so that its structure is optimal for the new set of distinct pattern vectors {ξ^(2), ..., ξ^(M), ξ^(M+1)}, where ξ^(1) is replaced by ξ^(M+1). However, we did not discuss how to identify ξ^(M+1) among the pattern vectors arriving as a time series, in order to get the new set {ξ^(2), ..., ξ^(M+1)} for which the classifier's structure ought to be optimal. There are a few different identification schemes conceivable. These schemes will be discussed in this section.

The entire system of the linear classifier is illustrated in Fig. 6.1. Block C stores the last simplex tableau for {ξ^(1), ..., ξ^(M)}. Block A examines an incoming vector ξ and decides whether it should be considered as a new pattern vector ξ^(M+1), by checking the information about {ξ^(1), ..., ξ^(M)} stored in Block C (in the first case an adjustment of the structure of the classifier results; otherwise no adjustment results). If the new pattern vector ξ^(M+1) is identified, Block B computes the new optimal structure for {ξ^(2), ..., ξ^(M), ξ^(M+1)} by starting from the last simplex tableau for {ξ^(1), ..., ξ^(M)} stored in Block C. Using this new structure, Block D classifies the incoming vector ξ. (The vector ξ must be kept in a buffer memory during the identification process by Block A and the computation of the new structure by Block B.) The last simplex tableau is stored in Block C. Note that the last simplex tableau now stores the information about {ξ^(2), ..., ξ^(M+1)} instead of {ξ^(1), ..., ξ^(M)}. This completes one cycle of the adaptation procedure.

[Fig. 6.1. Adaptation Scheme. Block diagram: an incoming pattern vector ξ feeds Block A (identification of a new pattern vector), which consults Block C (last simplex tableau stored); Block B (linear programming computer) recomputes the structure; Block D (linear classifier) produces the output.]
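The data flow of Fig. 6.1 can be caricatured as one function per cycle. This is only a schematic sketch: in the real system Block B restarts a simplex computation from the stored tableau, whereas here recompute and classify are caller-supplied placeholders, and the stored information is a plain Python list.

```python
def adaptation_cycle(xi, stored, M, recompute, classify):
    """One cycle of Fig. 6.1: Block A decides whether xi is new,
    Block B recomputes the structure, Block C keeps the latest M
    patterns, and Block D classifies the buffered vector xi."""
    if xi not in stored:          # Block A: identification
        stored.append(xi)         # xi becomes the newest pattern
        if len(stored) > M:
            stored.pop(0)         # the oldest pattern leaves the set
        recompute(stored)         # Block B; result held in Block C
    return classify(xi)           # Block D: classify xi
```

The `xi not in stored` test corresponds to scheme (2) described below; schemes (1) and (3) would replace that test.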
The entire procedure is characterized by specifying how Block A identifies the new pattern vector ξ^(M+1) which is then processed by Block B. If Block A very often identifies an incoming vector ξ as a new pattern vector ξ^(M+1), then the whole adaptation procedure must also run that often, and this slows down the processing speed of the classifier. In this case, however, the accuracy of classification by the classifier is maintained. On the other hand, if very few incoming pattern vectors are processed for adaptation, the processing speed will be much faster. (Since the interval between adjacent adaptations is long, the buffer memory will not be filled up by incoming pattern vectors which are waiting for classification. Therefore the transmission rate of pattern vectors can be increased.)

When the memory space of Block C is limited, the size of M is limited. Therefore, if the number of distinct vectors in the time series is more than M, some selection of M vectors out of all the distinct vectors must be made for adaptation. Generally speaking, then, even if the time series contains new pattern vectors, the adaptation may not take place and the classifier may not work correctly for every pattern vector. In the following, three simple schemes to identify new pattern vectors are arranged in descending order of adaptation frequency.

Adaptation scheme (1): Regard every incoming pattern vector as the new vector ξ^(M+1), no matter whether it is actually new or not. This eliminates the checking procedure of Block A. This scheme guarantees that the current structure is optimal for the last M pattern vectors. But the classifier performs adaptation all the time, and the optimality of the structure is valid only over the last M pattern vectors.

Adaptation scheme (2): Block A checks whether each incoming pattern vector is new or not, by comparing it with those stored in Block C. If it is new, the classifier solves a linear program; otherwise it does not.
This scheme differs from (1) in that the structure in (2) is optimal for the M distinct pattern vectors which appeared most recently. These M distinct pattern vectors are, of course, stored in the simplex tableau.

Adaptation scheme (3): This scheme is very similar to the so-called error-correction procedure of an adaptive element, as far as the selection of a pattern vector is concerned. ξ is regarded as the new pattern vector ξ^(M+1) only if ξ is classified erroneously by the current structure. The check of whether it is classified correctly or not is done simply by substituting the current solution into the corresponding inequality. Note that if the pattern ξ is separated correctly, the current optimum structure for {ξ^(1), ..., ξ^(M)} is also optimum for ξ, in the sense that it is optimum for {ξ^(1), ..., ξ^(M), ξ}. However, the optimum structure for {ξ^(2), ..., ξ^(M), ξ}, which is obtained by replacing ξ^(1) by ξ, may obviously be different from the current structure. As a result, adaptation will take place less frequently than under the other schemes. In other words, the majority of incoming vectors are not included in the M vectors, but it is likely that they will be correctly classified when they appear again. The current structure, however, is optimum only for the last M pattern vectors for which the classifier solved linear programs, because the readjusted structure may no longer be optimum for the pattern vectors which were not identified as new pattern vectors and which therefore have nothing to do with the current structure.

In addition to the above three schemes, various more sophisticated approaches might also be conceivable depending upon the given situation, including the addition of a random-sampling approach. However, various aspects of the incoming pattern vectors should be examined, and experimental justification is possibly necessary, in order to find a proper scheme.
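The three schemes reduce to three different tests in Block A. A minimal sketch, under the assumptions that the stored M vectors are available as a list, that the current classifier is a callable returning the predicted category, and that the true category of ξ accompanies it (these names are ours, not the report's):

```python
def scheme_1(xi, stored, classify, category):
    """Scheme (1): treat every incoming vector as new; adapt always."""
    return True

def scheme_2(xi, stored, classify, category):
    """Scheme (2): adapt only when xi differs from all M stored vectors."""
    return xi not in stored

def scheme_3(xi, stored, classify, category):
    """Scheme (3), error-correction style: adapt only when the
    current structure misclassifies xi."""
    return classify(xi) != category
```

Each predicate returns True exactly when the classifier should solve a new linear program, so the three functions are indeed ordered by decreasing adaptation frequency.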
7. Computational Experiment of Adaptation by the Simplex Method

The adaptation algorithm by the simplex method discussed in Section 4 was tested by actually solving a sequence of sets of pattern vectors. Each pattern set consisted of 30 thirty-dimensional pattern vectors whose coordinates are 1 or 0. Each two consecutive pattern sets differ in exactly one pattern vector, so that the adaptation algorithm may be applied. First, we generated 60 pattern vectors, together with the categories to which the pattern vectors should belong, by using a random number generator which produces 1 and 0 with equal probability. Second, the i-th pattern set S^(i) was defined as follows:

    S^(i) = {ξ^(i), ..., ξ^(i+29)},    i = 1, 2, ..., 31,

where ξ^(j), j = 1, 2, ..., 60, is the j-th generated pattern vector. These 31 problems, corresponding to S^(1), ..., S^(31), were transformed into sets of linear inequalities as illustrated in the example in Section 4, and then solved on the IBM 360/75 computer by using the IBM Mathematical Programming System. (Although MPS cannot perform exactly the same procedure as described in Section 4, an equivalent modified test was tried in order to observe the number of necessary pivot operations.) The result is that when we solve the 31 problems separately, the average number of pivot operations is 51.3 for each problem, whereas 18.5 pivot operations are needed under our adaptation scheme. The saving in pivot operations was about two thirds. This may be encouraging, because the tested problem seems difficult for our approach: the pattern vectors which come in or go out are generated independently of the other pattern vectors in the set (i.e., by the random number generator), and accordingly the new solution may not have much relationship with the old one.

As is expected, the number of pivot operations fluctuates more widely under our adaptation scheme than in the case of solving the 31 problems separately.
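The sliding-window construction of the 31 pattern sets can be reproduced in a few lines. The seed and data layout here are our assumptions; only the dimensions (60 vectors, 30 coordinates, windows of 30) come from the text.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

# 60 random 30-dimensional 0/1 pattern vectors, with random categories,
# each coordinate 1 or 0 with equal probability
vectors = [tuple(random.randint(0, 1) for _ in range(30)) for _ in range(60)]
categories = [random.randint(0, 1) for _ in range(60)]

# S(i) = {xi^(i), ..., xi^(i+29)}, i = 1, ..., 31
pattern_sets = [vectors[i:i + 30] for i in range(31)]

# consecutive sets share 29 vectors: one leaves, one enters
assert all(pattern_sets[i][1:] == pattern_sets[i + 1][:-1] for i in range(30))
```

Each window would then be converted into a set of linear inequalities and handed to the adaptation procedure, one pattern replacement per step.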
This is because the adaptation procedure usually needs more pivot operations if the pattern vector which leaves the current pattern set is in the basis of the last simplex tableau than if it is not in the basis.

8. Conclusion

We have discussed the adaptation procedure of a linear classifier based on a linear programming method, making use of the easy incorporation of a new constraint into the simplex method for the dual formulation. Its advantage over the existing ordinary adaptive procedures (such as the error-correction method) is that the optimality of a parameter of the classifier, such as the input tolerance, which is a measure of the reliability of the classifier operation, is guaranteed. Even when a given set of pattern vectors is not linearly separable, the adaptation procedure based on integer linear programming attains the minimality of the number of incorrectly separated pattern vectors as well as the input tolerance of the classifier. On the other hand, the disadvantage of the approach is the need for somewhat complicated computation for each adaptation and for storage space for the simplex tableau. The computational experiment in Section 7 is encouraging and may indicate the feasibility of this kind of system.

Acknowledgment

The authors would like to thank C. R. Baugh for his assistance in performing the experiment of Section 7 and for his valuable comments on the manuscript.

References

(1) N. J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.
(2) F. Rosenblatt, Principles of Neurodynamics, Spartan Books, Washington, D.C., 1961.
(3) O. L. Mangasarian, "Linear and Nonlinear Separation of Patterns by Linear Programming", Operations Research, vol. 13, no. 3, pp. 444-452, May-June 1965.
(4) F. W. Smith, "Pattern Classifier Design by Linear Programming", IEEE Trans. on Computers, vol. C-17, no. 4, pp. 367-372, April 1968.
(5) S.
Muroga, Threshold Logic, Lecture notes for EE 497 and EE 498, Department of Computer Science, University of Illinois, 1965-1966.
(6) S. Muroga, "Majority Logic and Problems of Probabilistic Behavior", in Self-Organizing Systems, Spartan Books, 1962, pp. 243-281.
(7) A. J. Goldman and A. W. Tucker, "Theory of Linear Programming", in Linear Inequalities and Related Systems, edited by H. W. Kuhn and A. W. Tucker, Princeton University Press, 1956, pp. 53-97.
(8) G. B. Dantzig, Linear Programming and Extensions, Princeton University Press, Princeton, New Jersey, 1963.
(9) G. Hadley, Linear Programming, Addison-Wesley Series in Industrial Management, Addison-Wesley, 1962.
(10) R. E. Gomory, "An All-Integer Integer Programming Algorithm", in Industrial Scheduling, edited by Muth and Thompson, Prentice-Hall International Series in Management, Prentice-Hall, 1963, pp. 193-206.