Report No. UIUCDCS-R-74-659

DATA COMPRESSION FOR CHARACTER STRINGS

by

Alfred C. Weaver

July, 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

Digitized by the Internet Archive in 2013
http://archive.org/details/datacompressionf659weav

TABLE OF CONTENTS

1. DATA COMPRESSION FOR CHARACTER STRINGS
   1.1. The Necessity of Compression
   1.2. A Software Solution
2. DATA COMPRESSION
   2.1. The Transmitted Message
   2.2. A Solution Utilizing Duplicate Character Compression
        2.2.1. The Encoding Algorithm
        2.2.2. The Decoding Algorithm
   2.3. A Solution Utilizing Common Phrase Detection and Replacement
        2.3.1. Common Phrase Detection
        2.3.2. A Conjecture
        2.3.3. An Example
        2.3.4. Common Phrase Replacement
3. SUMMARY

1. DATA COMPRESSION FOR CHARACTER STRINGS

1.1. The Necessity of Compression

In the field of data communications, one recurring problem is how best to transmit the most information in the least time and at the least cost. Within the scope of a centralized computer system, the problem is often solved by "hardwiring" the devices which wish to communicate. As the number and speed of the peripherals increase (by design changes or equipment upgrading), previously adequate data paths begin to show signs of saturation and new techniques must be employed. Three popular techniques are: (1) increasing the bandwidth of the data path to achieve a greater parallelism in the architecture; (2) switching the mode of operation of data path controllers from bit-serial to bit-parallel, with an appropriate increase in hardware; and (3) multiplexing a single high-speed line among many slower speed devices.
While these techniques work adequately well, they are somewhat dependent upon a controlled environment in which device speeds are well-matched, data paths are relatively short, error rates are small, and the equipment involved is well-understood. But as one leaves the central system and begins to explore the world of remote computing and remote conversational access in general, a new series of problems and woes arises. Primary among these is the basic inadequacy of the common or "voice-grade" telephone line for carrying information at high speed with acceptably low error rates. The rate at which a state-of-the-art minicomputer can produce information now far outstrips the capabilities of the voice-grade line. Yet the alternative, a "conditioned line," which is capable of high speed operation, is not only expensive to install and costly to use, but denies the basic advantage of telephone circuitry as a data path: the ability to communicate from one point to another point, however unlikely, provided each end has access to a common household telephone.

The problem gets completely out of hand when one considers more ambitious projects such as remote graphics. A machine with the size, speed, and capacity of an IBM 360/75, talking to, say, a Calcomp plotter, generates literally millions of instructions to control the plotting of even simple graphs. Drawing a 10" straight line at a 45° angle requires some 2000 commands, all of which must be transmitted over some data link. The expected result is that remote graphics is unbearably slow, and when in use tends to completely saturate the data path, thus effectively blocking use of the line for any other device.

1.2. A Software Solution

It is this impasse which prompted research on the subject in question: if line speed is constrained to some constant value, how can the density of information transmitted best be increased?
The answer appears to involve manipulation of the transmitted data itself, and in some cases the addition of pre-processors and post-processors to encode/decode the transmitted information. While this solution also requires additional hardware, the "intelligence" required is small and well within the range of minicomputers and even microcomputers (processor-on-a-chip technology). Such hardware is not expensive: one popular 8-bit microprocessor sells for $60 in quantity. In fact, the utilization of microcomputers can be shown to represent good economy when compared with the "conditioned line" alternative. Thus, the hardware capabilities for data compression/decompression exist, and what remains is to determine what software methodology should be applied.

2. DATA COMPRESSION

2.1. The Transmitted Message

The purpose of this project was to determine, for a specific class of "messages" which exist initially as print files on disk, how that file should be pre-processed locally, transmitted across telephone circuitry, and post-processed at a remote site so as to minimize the number of characters actually transmitted (and, hence, telephone charges). The boundary conditions of the problem were:

1. All information to be transmitted was basically of the same type: relay ladders for process control logic (i.e., a graph). The following pages show the sequence of equations, object code, and graph which are to be transmitted. Only the graph is of sufficient length to justify condensation.

2. The graphs were stored as line images, 1 to 80 characters/line, and varied in length from 30 to 1500 lines.

3. There was similarity, and indeed duplication, of the basic "building blocks" within each line and further repetition of these building blocks among lines of the graph.
It seemed reasonable to assume that detecting common phrases of the graph, and reducing them from the physical character string itself to either a character and a repetition factor or a pointer into a "dictionary" of common phrases, would be a workable approach. The anticipated result was a significant decrease in the total length of transmitted text.

2.2. A Solution Utilizing Duplicate Character Compression

2.2.1. The Encoding Algorithm

If a string of two or more contiguous characters in the message are identical, replace the string with a two-character phrase where the first character is both a phrase marker and a count of the number of duplicate characters and the second character of the pair is the normal bit-code of the repeated character.

[Figure: sample transmission consisting of an EQUATIONS listing, the object code, and the relay-ladder graph, whose lines are built from contact symbols of the form +--( )--+ and +--( / )--+ with labels such as CR1, MX1, PB4-NC and TR5-XX0. The listing is not legible in the scanned copy.]

Thus the string:

    ABBBCBBBBDEEF         length=13     (*)

reduces to:

    A(3)BC(4)BD(2)EF      length=10

while the string:

    XYYYZXYYYZ            length=10     (**)

reduces to:

    X(3)YZX(3)YZ          length=8

Note that in string (*) the replacement of 'EE' with '(2)E' represents neither a saving nor a loss with regard to the actual number of characters transmitted; the larger common phrase 'XYYYZ' is not detected in string (**) because of the contiguous duplicate character requirement. Nevertheless, the method shows promise and features simplicity of the encoding/decoding mechanism, as illustrated in the following Knuth-style description of the algorithm.

Step 1 (initialization)
        F ← first character of character string MSG
        S ← second character of MSG
        i ← 1
        j ← 2
        L ← total length of MSG

Step 2 (longest contiguous string of Fs)
        while (F = S & j ≤ L) do
            j ← j + 1
            S ← j-th character of MSG

Step 3 (output)
        if (j - i) = 1 then output (F)
        else do
            output (j - i)
            output (F)

Step 4 (halt?)
        if j > L then halt

Step 5 (scan remaining characters)
        i ← j
        j ← j + 1
        F ← S
        S ← j-th character of MSG
        go to step 2

2.2.2. The Decoding Algorithm

The decoding algorithm is equally simple (L is again the total length of MSG).

Step 1 (initialization)
        F ← first character of encoded string MSG
        S ← second character of MSG
        i ← 2

Step 2 (marker?)
        if F is a marker
        then do
            output F copies of S
            i ← i + 1
            if i > L then halt
            F ← i-th character of MSG
            i ← i + 1
            if i > L then do
                output (F)
                halt
            S ← i-th character of MSG
            go to step 2
        else do
            output (F)
            i ← i + 1
            if i > L then do
                output (S)
                halt
            F ← S
            S ← i-th character of MSG
            go to step 2

Only the final statements of the PL/I encoding program survive in the scanned copy:

              CH=TRANSLATE(CH,'0',' ');
              PUT EDIT ('*', CH, FIRST) (3 A);   /* MARKER LITERAL ILLEGIBLE IN SCAN; '*' ASSUMED */
              LR=LR+2;
           END;
           I=J;
           J=J+1;
           FIRST=SECOND;
           SECOND=SUBSTR(MSG,J,1);
        END;
        PUT EDIT ('''') (A);                     /* LITERAL ILLEGIBLE IN SCAN; CLOSING QUOTE ASSUMED */
     END;
    END COMP;
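The encoding and decoding steps of Section 2.2 can be illustrated with a minimal Python sketch. It is not the report's PL/I program: the token representation (a plain character, or a `(count, char)` pair standing for the two-character phrase) and the function names `rle_encode`, `rle_decode`, `encoded_length` and `show` are assumptions made for illustration, and the sketch ignores any upper limit on the count that a single marker character could hold.

```python
from itertools import groupby


def rle_encode(msg):
    """Encode msg as a list of tokens.

    A run of one character is kept as that character; a run of two or
    more identical characters becomes a (count, char) pair, standing for
    the report's two-character phrase (marker-and-count, character).
    """
    tokens = []
    for char, run in groupby(msg):
        count = len(list(run))
        tokens.append(char if count == 1 else (count, char))
    return tokens


def rle_decode(tokens):
    """Rebuild the original message from the token list."""
    return "".join(t if isinstance(t, str) else t[1] * t[0] for t in tokens)


def encoded_length(tokens):
    """Transmitted length: 1 per plain character, 2 per compressed run."""
    return sum(1 if isinstance(t, str) else 2 for t in tokens)


def show(tokens):
    """Render tokens in the report's notation, e.g. A(3)BC(4)BD(2)EF."""
    return "".join(t if isinstance(t, str) else f"({t[0]}){t[1]}" for t in tokens)


if __name__ == "__main__":
    for message in ("ABBBCBBBBDEEF", "XYYYZXYYYZ"):
        tokens = rle_encode(message)
        print(message, "->", show(tokens), "length =", encoded_length(tokens))
        assert rle_decode(tokens) == message
```

Run on the two strings of Section 2.2.1, the sketch gives lengths 10 and 8, matching the text. Like the report's algorithm, it still encodes a run of exactly two characters, which neither saves nor costs anything.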
[Program output: for each line of the graph the encoding program prints an ORIGINAL MESSAGE line followed by its REDUCED MESSAGE line, in which each compressed run appears as a marker character, a three-digit count, and the repeated character. The individual lines are not legible in the scanned copy. The closing STATISTICS block reports ORIGINAL MESSAGE LENGTH, REDUCED MESSAGE LENGTH, and SAVING: 672 (= 67.6%); the two length values are not legible.]
[Program output: the remainder of the encoder listing, followed by decoder output in which each ENCODED MESSAGE IS line is paired with its RECONSTRUCTED MESSAGE IS line. Not legible in the scanned copy.]

Again it is clear that a set of phrases of S can be generated by enumeration. Now, given a set P of common phrases in S, what do we know about them?

(1) If S is of length N, then no phrase $p \in P$ can be of length greater than $\lfloor N/2 \rfloor$, else it could not be repeated and, hence, could not be "common";

(2) A string of length N will have: N phrases of length 1, (N-1) phrases of length 2, (N-2) phrases of length 3, ..., and $(N - \lfloor N/2 \rfloor + 1)$ phrases of length $\lfloor N/2 \rfloor$.

Would the replacement of all phrases p with pointers to P yield a minimal length S'? Not unless there is an ordering to P, for replacement of $p_i$ = 'AB' and $p_j$ = 'ABC', applied in the order (i,j) to S = 'ABCABC', yields S' = $(p_i)$C$(p_i)$C, with length=4, while the order (j,i) yields S' = $(p_j)(p_j)$ with length=2.

Suppose P were to be ordered by length of the $p_i$ such that replacements always removed the longest phrases first. This is not sufficient since $p_i$ = 'ABCD' and $p_j$ = 'CDEAB', applied in the order (j,i) since $|p_j| > |p_i|$, to S = 'ABCDEABCD', yields S' = 'AB$(p_j)$CD' with length=5, while the order (i,j) yields S' = '$(p_i)$E$(p_i)$' with length=3. So length of common phrases alone does not establish an order on P.

Does frequency of usage affect the order of P? In addition to all phrases $p_i \in P$, also keep a count of how many times that phrase appears in the message, $f_i$. Now order P by decreasing $f_i$, i.e., largest number of uses first. This method also fails since for

    $p_i$ = 'ABC'      $f_i$ = 3
    $p_j$ = 'XABCY'    $f_j$ = 1

applied to S = 'ABCXABCYABC' in the order (i,j) because $f_i > f_j$ yields S' = $(p_i)$X$(p_i)$Y$(p_i)$ with length=5, while the ordering (j,i) yields S' = $(p_i)(p_j)(p_i)$ with length=3.

But now let us combine these two methods to account for the effects of both phrase length and frequency in suggesting a possible solution to the problem of how to pick and order a set P.

2.3.2. A Conjecture

Given a string of characters S of length N, let P be the set of phrases of S such that each element of P is a phrase $p_j$, where $p_j$ is the j-th phrase, of length $|p_j|$, and $p_j$ has frequency of occurrence $f_j \ge 2$ in S. Then for each phrase $p_j$ define a reduction factor

    $r_j = f_j \, (|p_j| - 1)$

which represents the number of characters saved when $|p_j|$ characters of text are replaced with one one-character phrase identifier (pointer) for each of the $f_j$ occurrences of $p_j$.

Construct the set of phrases P', which are the $p_j$ sorted into descending order by $r_j$. Within groups of phrases with equal r value, sort again by descending length of phrases. Now replace, in order, every occurrence of $p'_j$ in S with the appropriate reference pointer until all $p'_j$ have been examined, thus yielding a new string of characters and/or phrase pointers, S'. Then S' is the minimal length text string which can be created from S.

2.3.3. An Example

Consider the (arbitrary) message 'ABCXABCYABCZXABCY'. The set of unique phrases appearing at least twice includes:

    phrase #    phrase      f     r
        1       'XABCY'     2     8
        2       'XABC'      2     6
        3       'ABCY'      2     6
        4       'ABC'       4     8
        5       'XAB'       2     4
        6       'BCY'       2     4
        7       'AB'        4     4
        8       'BC'        4     4
        9       'XA'        2     2
       10       'CY'        2     2

The ordering of replacement of the phrases with r=8 is not arbitrary; the longest such phrase (XABCY) must be applied first.
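The phrase table for this example can be reproduced with a short Python sketch of the conjecture's bookkeeping. This is an illustration under stated assumptions, not the report's program: the function name `phrase_table` and the `min_len` parameter are hypothetical, and frequency is taken to be Python's `str.count`, which counts non-overlapping occurrences from left to right.

```python
def phrase_table(s, min_len=2):
    """Return (phrase, f, r) rows for every phrase occurring at least twice in s.

    f is the number of non-overlapping occurrences (str.count), and
    r = f * (len(phrase) - 1) is the reduction factor of Section 2.3.2.
    Rows are ordered by descending r, ties broken by descending length.
    """
    n = len(s)
    seen = set()
    rows = []
    # No repeated phrase can be longer than floor(n / 2).
    for length in range(n // 2, min_len - 1, -1):
        for start in range(n - length + 1):
            phrase = s[start:start + length]
            if phrase in seen:
                continue
            seen.add(phrase)
            f = s.count(phrase)
            if f >= 2:
                rows.append((phrase, f, f * (length - 1)))
    # sorted() is stable, so equal keys keep their order of discovery.
    return sorted(rows, key=lambda row: (-row[2], -len(row[0])))


if __name__ == "__main__":
    for number, (phrase, f, r) in enumerate(phrase_table("ABCXABCYABCZXABCY"), 1):
        print(f"{number:>3}  {phrase!r:<9} f={f}  r={r}")
```

For 'ABCXABCYABCZXABCY' the sketch finds the same ten phrases, with 'XABCY' (r=8) ranked ahead of 'ABC' (r=8) because of the length tie-break.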
Replacing 'XABCY' first we derive:

    ABCXABCYABCZXABCY      length=17
    ABC(1)ABCZ(1)          length=9
    (4)(1)(4)Z(1)          length=5

while replacing 'ABC' first derives:

    ABCXABCYABCZXABCY      length=17
    (4)X(4)Y(4)ZX(4)Y      length=9

The algorithm described has been programmed and commented to accept a message string, determine all possible phrases which occur repeatedly, sort them by descending r values, and replace common phrases with pointers into phrase table P.

    /* OPTIMAL PHRASE EXTRACTION FROM TEXT STRINGS                  */
    /* COMPUTER SCIENCE 389 PROJECT                                 */
    /* ALFRED C. WEAVER                                             */

    /* AN ALGORITHM TO EXTRACT SUBPHRASES FROM A MESSAGE 'S'        */
    /* SUCH THAT REPLACEMENT OF SUBPHRASES WITH "PHRASE POINTERS"   */
    /* YIELDS A MINIMUM LENGTH MESSAGE                              */

    PHRASES: PROC OPTIONS(MAIN);

    /* 'S' IS THE ORIGINAL MESSAGE (CHARACTER STRING) */
    /* 'P' IS THE ARRAY OF SUBPHRASES OF S            */
    /* 'F' IS THE ARRAY OF FREQUENCY COUNTS           */
    /* 'R' IS THE ARRAY OF REDUCTION FACTORS          */

    DCL S CHAR(100) VAR,
        (P(500), PHRASE) CHAR(50) VAR,
        CH CHAR(3),
        (F(500), R(500), A, I, J, K, L, M, N) FIXED BIN(31);

    /* REPEAT UNTIL INPUT STREAM IS EXHAUSTED */
    ON ENDFILE(SYSIN) STOP;
    FOREVER: DO WHILE('1'B);
       GET LIST(S);
       N = LENGTH(S);
       K = 0; F = 0; R = 0;
       PUT SKIP(2) EDIT ('ORIGINAL STRING:', '''', S, '''',
                         'SUBPHRASE', 'FREQUENCY', 'REDUCTION FACTOR')
                        (A, COL(30), 3 A, SKIP, A, COL(20), A, COL(30), A);

       /* SEARCHING FOR THE LONGEST SUBPHRASE FIRST IS         */
       /* ESSENTIAL TO PREVENT A RESORT LATER ON PHRASE LENGTH */
       DO L = FLOOR(N/2) TO 2 BY -1;
          /* EXTRACT EACH OF THE SUBPHRASES OF LENGTH 'L' IN 'S' */
          DO I = 1 TO N-L+1;
             PHRASE = SUBSTR(S,I,L);
             /* THIS LOOP DISCARDS COMMON SUBPHRASES       */
             /* THIS STEP IS NOT ESSENTIAL, BUT SAVES TIME */
             /* IN THE REPLACEMENT STEP                    */
             DO A = 1 TO K;
                IF PHRASE = P(A) THEN GOTO X;
             END;
             /* 'M' POINTS TO THE NEXT POSSIBLE OCCURRENCE OF 'PHRASE' */
             M = I+L;
             IF M > N THEN GOTO X;
             J = INDEX(SUBSTR(S,M), PHRASE);
             /* DETERMINE WHETHER 'PHRASE' OCCURS AT LEAST TWICE */
             IF J > 0   /* FREQUENCY > 1 */
             THEN DO;
                K = K+1;
                P(K) = PHRASE;
                F(K) = 2;
                M = M+J+L-1;
                /* FIND ALL OCCURRENCES OF 'PHRASE' IN 'S' */
                DO WHILE (J > 0);
                   IF M+L > N THEN GOTO X;
                   J = INDEX(SUBSTR(S,M), PHRASE);
                   IF J > 0 THEN DO;
                      M = M+J+L-1;
                      F(K) = F(K)+1;
                   END;
                END;
             END;
          X: END;
       END;

       /* ESTABLISH THE REDUCTION FACTOR FOR EACH SUBPHRASE */
       DO A = 1 TO K;
          R(A) = F(A) * (LENGTH(P(A)) - 1);
          PUT SKIP EDIT ('''', P(A), '''', F(A), R(A))
                        (3 A, COL(20), F(4), COL(30), F(8));
       END;

       /* NOW SORT THE PHRASES INTO DESCENDING ORDER BY R(I) */
       /* A SIMPLE JUMP-DOWN SORT WILL DO                    */
       DO I = 1 TO K-1;
          DO J = I+1 TO K;
             IF R(I) < R(J) THEN   /* INTERCHANGE */
             DO;
                PHRASE = P(I); L = F(I); M = R(I);
                P(I) = P(J); F(I) = F(J); R(I) = R(J);
                P(J) = PHRASE; F(J) = L; R(J) = M;
             END;
          END;
       END;

       /* NOW LABEL AND PRINT ALL SUBPHRASES IN THEIR OPTIMAL ORDER */
       PUT SKIP(2) EDIT ('SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT',
                         'PHRASE#   SUBPHRASE', 'FREQUENCY', 'REDUCTION FACTOR')
                        (A, SKIP, A, COL(20), A, COL(30), A);
       DO A = 1 TO K;
          PUT SKIP EDIT (A, '''', P(A), '''', F(A), R(A))
                        (F(5), COL(11), 3 A, COL(20), F(4), COL(30), F(8));
       END;

       /* REPLACE ALL SUBPHRASES WITH PHRASE REFERENCES */
       /* (MARKER LITERAL ILLEGIBLE IN SCAN; '*' ASSUMED FROM THE PRINTED OUTPUT) */
       DO I = 1 TO K;
          DO WHILE (INDEX(S,P(I)) > 0);
             J = INDEX(S,P(I));
             PUT STRING(CH) EDIT (I) (F(3));
             CH = TRANSLATE(CH,'0',' ');
             (NOSTRINGRANGE):
             S = SUBSTR(S,1,J-1) || '*' || CH || SUBSTR(S,J+LENGTH(P(I)));
          END;
       END;
       PUT SKIP(2) EDIT ('THE MINIMAL STRING IS:', S) (A, COL(30), A);
    END FOREVER;
    END PHRASES;

The program's output for the test strings follows. In each "optimal order" table the first column is the phrase number used in the phrase references of the minimal string.

ORIGINAL STRING: 'ABCXABCYABCZXABCY'

    SUBPHRASE   FREQUENCY   REDUCTION FACTOR
    'XABCY'         2              8
    'XABC'          2              6
    'ABCY'          2              6
    'ABC'           4              8
    'XAB'           2              4
    'BCY'           2              4
    'AB'            4              4
    'BC'            4              4
    'XA'            2              2
    'CY'            2              2

    SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
    PHRASE#   SUBPHRASE   FREQUENCY   REDUCTION FACTOR
       1      'XABCY'         2              8
       2      'ABC'           4              8
       3      'ABCY'          2              6
       4      'XABC'          2              6
       5      'XAB'           2              4
       6      'BCY'           2              4
       7      'AB'            4              4
       8      'BC'            4              4
       9      'XA'            2              2
      10      'CY'            2              2

    THE MINIMAL STRING IS: *002*001*002Z*001

ORIGINAL STRING: 'XXXXXXXXXX'

    PHRASE#   SUBPHRASE   FREQUENCY   REDUCTION FACTOR
       1      'XXXXX'         2              8
       2      'XXXX'          2              6
       3      'XXX'           3              6
       4      'XX'            4              4

    THE MINIMAL STRING IS: *001*001

ORIGINAL STRING: 'ABCDABCDABCDEF'

    PHRASE#   SUBPHRASE   FREQUENCY   REDUCTION FACTOR
       1      'ABCD'          3              9
       2      'BCDA'          2              6
       3      'CDAB'          2              6
       4      'DABC'          2              6
       5      'ABC'           3              6
       6      'BCD'           3              6
       7      'CDA'           2              4
       8      'DAB'           2              4
       9      'AB'            3              3
      10      'BC'            3              3
      11      'CD'            3              3
      12      'DA'            2              2

    THE MINIMAL STRING IS: *001*001*001EF

ORIGINAL STRING: 'ABCDEFABCDEFRSTRSTABCRST'

    PHRASE#   SUBPHRASE   FREQUENCY   REDUCTION FACTOR
       1      'ABCDEF'        2             10
       2      'ABCDE'         2              8
       3      'BCDEF'         2              8
       4      'ABCD'          2              6
       5      'BCDE'          2              6
       6      'CDEF'          2              6
       7      'ABC'           3              6
       8      'RST'           3              6
       9      'CDE'           2              4
      10      'DEF'           2              4
      11      'BCD'           2              4
      12      'AB'            3              3
      13      'BC'            3              3
      14      'RS'            3              3
      15      'ST'            3              3
      16      'EF'            2              2
      17      'CD'            2              2
      18      'DE'            2              2

    THE MINIMAL STRING IS: *001*001*008*008*007*008

Strings of 5 to 12 consecutive X's (subphrases already in optimal order):

    ORIGINAL STRING     SUBPHRASE (FREQUENCY, REDUCTION FACTOR)                               MINIMAL STRING
    'XXXXX'             'XX' (2,2)                                                            *001*001X
    'XXXXXX'            'XXX' (2,4), 'XX' (2,2)                                               *001*001
    'XXXXXXX'           'XXX' (2,4), 'XX' (3,3)                                               *001*001X
    'XXXXXXXX'          'XXXX' (2,6), 'XXX' (2,4), 'XX' (3,3)                                 *001*001
    'XXXXXXXXX'         'XXXX' (2,6), 'XXX' (2,4), 'XX' (4,4)                                 *001*001X
    'XXXXXXXXXX'        'XXXXX' (2,8), 'XXXX' (2,6), 'XXX' (3,6), 'XX' (4,4)                  *001*001
    'XXXXXXXXXXX'       'XXXXX' (2,8), 'XXXX' (2,6), 'XXX' (3,6), 'XX' (5,5)                  *001*001X
    'XXXXXXXXXXXX'      'XXXXXX' (2,10), 'XXXXX' (2,8), 'XXXX' (2,6), 'XXX' (3,6), 'XX' (5,5) *001*001

[Further test runs on strings drawn from the relay-ladder graph follow in the original. One lists 78 subphrases; another lists six subphrases with (frequency, reduction factor) pairs (2,12), (2,10), (3,9), (2,8), (4,8), (7,7). The strings and phrases themselves are not legible in the scanned copy.]

2.3.4. Common Phrase Replacement

Assuming now that we have selected by hand, or generated automatically using the method described, a particular set of phrases P to be used for replacement in S, we can now attack the question of how best to recognize and replace a common phrase in a general string. Clearly, the optimal way is not the linear search which was used for convenience in the previous program.

Let us attack the problem somewhat mathematically. Assume that a set of common phrases is given. Only those phrases may be referenced (replaced) within messages. The problem then is to discover for each message that "parse" into nonoverlapping phrases which minimizes the new message's length.

Let a general character string (message) to be transmitted be described by the following context-free grammar:

    message          ::= phrase-reference message | character-string message | end-marker
    phrase-reference ::= P integer
    end-marker       ::= E
    character-string ::= C integer string
    string           ::= (any string of 1 to 256 printable characters)
    integer          ::= (any integer in the range 0 to 255)

[The nonterminal names are restored from the surrounding text; the scanned copy preserves only the terminals P, E and C and the two parenthesized descriptions.]

The integer occurring in a phrase reference indicates which of 256 phrases (maximum) is intended for substitution (it is the index into phrase table P); the integer in a character string is a count of the characters in the string, minus 1.

Assuming that no string exceeds 256 characters in length, the space requirements for each component of the message are:

    phrase reference  - 2 units
    end marker        - 1 unit
    character string  - 2+L units, where L is the number of characters in the string.
This encoding scheme is not space optimal. In fact, the storage requirements for the character string of length L could be reduced to L, but at the cost of having the decoding mechanism examine every character in the string looking for a phrase reference (P) or end marker (E). Since this is supposed to be an algorithm for application purposes, this appeared to be unacceptably slow; hence the addition of two extra characters (C and the character count), allowing the direct application of the IBM 360 "Move Characters" instruction to move the string all at once without examining individual characters.

An efficient algorithm for producing space-optimal parses has been developed by using this strategy. Consider one message as a simple character string. Number the character positions from 1 through N. Suppose that one can compute the function

    f(j) = least space necessary to store characters j, j+1, ..., N of the given message, for $1 \le j \le N$.

Then f(1) will be the space-optimal parse length for the entire message. Let P be the set of all phrases. For each $p \in P$ let |p| = length of p. Let ST(j,p) be a predicate which is true when phrase p matches character positions j, j+1, ..., j+|p|-1 of the given message string. ST(j,p) is false when p is not a phrase or when p does not match the string beginning at position j in the message. To define f(I), let

    $P(I) = \{\, p \mid ST(I,p) \,\}$

    $F(I) = \min\Bigl\{ \min_{p \in P(I)} \bigl[ F(I+|p|) + 2 \bigr],\; F(I+1) + 1 \Bigr\}$   for $1 \le I \le N$.

Assume by induction that f(j) = F(j) for $I < j \le N$. Assume that phrase $p \in P(I)$ is used in the parse at I: it will match characters I, I+1, ..., I+|p|-1, and that storage space can be reduced to two characters (the phrase marker and the phrase reference number). Then the remainder of the message, characters I+|p|, I+|p|+1, ..., N, will require f(I+|p|) characters for storage. But f(I+|p|) = F(I+|p|) by the induction hypothesis. Now assume that no phrase could be used at I.
Then the one-character string at I can be stored, followed by the optimal parse of characters I+1, I+2, ..., N. Since a one-character string requires one character of storage, f(I+1) + 1 = F(I+1) + 1. Now simply minimize over all alternatives at each I and set f(I) = F(I).

Finally, the search for phrases in P is accelerated by using a "hash table" technique. Since this algorithm will incur a cost of two characters overhead for each phrase reference, only phrases with $|p| \ge 3$ are considered for replacement. The characters I, I+1, and I+2 of the message are then hashed to accelerate the search for phrases beginning with that three-character segment.

3. SUMMARY

This study has illustrated the necessity of data compression in a telecommunications environment. Two techniques have been presented which accomplish data compression by very different methods: duplicate character compression and common phrase replacement. For the type of data under consideration, both work quite well. The success of the duplicate character compression method is due in large part to the specific type of data being transmitted, which did, in fact, have many occurrences of contiguous duplicate characters. The common phrase detection and replacement method is more general and will apply to a large number of situations, but incurs a much larger overhead. Thus, the method selected is, predictably, a strong function of the transmitted message.

BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-74-659
 3. Recipient's Accession No.:
 4. Title and Subtitle: Data Compression for Character Strings
 5. Report Date: July, 1974
 7. Author(s): Alfred C. Weaver
 8. Performing Organization Rept. No.: UIUCDCS-R-74-659
 9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
10. Project/Task/Work Unit No.:
11. Contract/Grant No.:
12. Sponsoring Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
13. Type of Report & Period Covered:
15. Supplementary Notes:
16. Abstracts: Two approaches to data compression in a telecommunications environment are examined: contiguous duplicate character compression and common phrase detection/replacement. Algorithms for each method are presented. Each method is shown to be useful for a given class of transmitted messages.
17. Key Words and Document Analysis.
    17a. Descriptors: Data Compression; Telecommunications; Common Phrase Detection
    17b. Identifiers/Open-Ended Terms:
    17c. COSATI Field/Group:
18. Availability Statement: Release Unlimited
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 44
22. Price:
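As an illustration of the space-optimal parse described in Section 2.3.4, the recurrence for F(I) can be sketched in Python as a backward pass over the message. This is a sketch under stated assumptions rather than the report's implementation: the function name `optimal_parse` is hypothetical, the cost model is the one used in the recurrence (2 units per phrase reference, 1 unit per literal character), the boundary value F(N+1) = 0 is assumed, and a linear scan over the phrase list stands in for the three-character hash table.

```python
def optimal_parse(message, phrases):
    """Return (cost, parse) for the cheapest parse of message.

    parse is a list whose items are either a literal character (str) or
    the integer index of a phrase in `phrases`.
    """
    n = len(message)
    # cost[i] plays the role of F(i + 1) in the report's 1-based notation;
    # cost[n] = 0 (nothing left to store) is an assumed boundary value.
    cost = [0] * (n + 1)
    choice = [None] * n
    for i in range(n - 1, -1, -1):
        # Alternative 1: store the character at position i literally.
        cost[i] = cost[i + 1] + 1
        choice[i] = message[i]
        # Alternative 2: use any phrase that matches at position i.
        for k, p in enumerate(phrases):
            if message.startswith(p, i) and cost[i + len(p)] + 2 < cost[i]:
                cost[i] = cost[i + len(p)] + 2
                choice[i] = k
    # Walk forward through the recorded choices to recover the parse.
    parse, i = [], 0
    while i < n:
        step = choice[i]
        parse.append(step)
        i += 1 if isinstance(step, str) else len(phrases[step])
    return cost[0], parse


if __name__ == "__main__":
    total, parse = optimal_parse("ABCXABCYABCZXABCY", ["XABCY", "ABC"])
    print(total, parse)
```

With the phrase set ['XABCY', 'ABC'], the example message of Section 2.3.3 parses into four phrase references and one literal 'Z', for a cost of 9 units under this model. A two-character phrase never lowers the cost, which agrees with the report's decision to consider only phrases of three or more characters.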