key: cord-0712338-e9app2cv
authors: Lu, Chin Lung; Huang, Yen Pin
title: A memory-efficient algorithm for multiple sequence alignment with constraints
date: 2005-01-01
journal: Bioinformatics
DOI: 10.1093/bioinformatics/bth468
sha: 4b0e97f3c2c4402b3174623db30ce1a1cde46b50
doc_id: 712338
cord_uid: e9app2cv

Motivation: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in 𝒪(γn(2)) memory, where γ is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences~only. Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only 𝒪(αn) space, where α is the sum of the lengths of constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. Availability: http://genome.life.nctu.edu.tw/MUSICME Contact: cllu@mail.nctu.edu.tw

Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002) . Moreover, MSA is one of the most challenging * To whom correspondence should be addressed. problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001) , which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.

Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the userspecified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003) . Tang et al. (2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for the two sequences runs in O(γ n 4 ) time and consumes O(n 4 ) space, where γ is the number of constrained residues and n is the maximum lengths of the sequences. Later, this result was improved independently by two groups of researchers to O(γ n 2 ) time and O(γ n 2 ) space using the same approach of dynamic programming (Yu, 2003; Chin et al., 2003) . In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family and each conserved site may consist of a short segment of residues/nucleotides, instead of a single residue/nucleotide. In other words, the constraint specified by the biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mismatches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, Tsai et al. (2004) studied such a kind of the constrained sequence alignment and designed an algorithm of O(γ n 2 ) time and O(γ n 2 ) space for two sequences. The improvements and extension above greatly increase the performances and practical usage of the CMSA tools developed using the progressive approach. However, the requirement of O(γ n 2 ) memory still limits the existing CMSA tools to align a set of short sequences, at most several hundreds of residues/nucleotides. To align large genomic sequences of at least several thousands of residues/nucleotides, there is a need to design a memoryefficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-andconquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in O(γ n 2 ) time, but consumes only O(αn) space, where α is the sum of the lengths of constraints and usually α n in practical applications. Based on this algorithm, we have finally developed a memoryefficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memoryefficiently align two or more sequences without any constraints has been studied extensively (Myers and Miller, 1988; Chao et al., 1994; Tönges et al., 1996; Stoye et al., 1997a,b; Stoye, 1998) . In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al., 1996; Stoye et al., 1997a,b; Stoye, 1998) considered the input sequences simultaneously and heuristically compute the good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSAs have been proposed from various perspectives, even using different approaches (Schuler et al., 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al., 1996; Notredame et al., 2000; Thompson et al., 2000; Sammeth et al., 2003) . Of these various CMSAs, it is worth mentioning that Myers et al. (1996) obtained their CMSA by performing progressive multiple alignment under position-based constraints that are given by users; Sammeth et al. (2003) got their CMSA by performing simultaneous multiple alignment under segment-based constraints (as same as we studied here) that are pre-computed via a local segmented-based algorithm (Morgenstern, 1999) . We refer the reader to their papers for details.

Let S = {S 1 , S 2 , . . . , S χ } be the set of χ sequences over the alphabet . Then an MSA of S is a rectangular matrix consisting of χ rows of characters of ∪ {-} such that no column consists entirely of dashes and removing dashes from row i leaves S i for any 1 ≤ i ≤ χ . The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of S with the optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002) .

Let δ(T 1 , T 2 ) denote the Hamming distance between two subsequences T 1 and T 2 of equal length, which is equal to the number of mismatched pairs in the alignment of T 1 and T 2 without any gap. Given an alignment L of S, a band is defined as a block of consecutive columns in L (i.e. a submatrix of L). For any band L of L, let subseq(S i , L ) denote the subsequence of S i whose residues/nucleotides are all in the band L , where 1 ≤ i ≤ χ . A subsequence T = t 1 t 2 . . . t λ is said to appear in L if L contains a band L of λ columns, say π 1 , π 2 , . . . , π λ , such that the characters of column π j , 1 ≤ j ≤ λ, are all equal to t j , or equivalently, subseq(S i , L ) = T for each 1 ≤ i ≤ χ . If δ[subseq(S i , L ), T ] ≤ λ × for a given error ratio 0 ≤ < 1 [i.e. some mismatches are allowed between subseq(S i , L ) and T ], then T is said to approximately appear in L. From the biological viewpoint, T can be considered as the consensus among the subsequences in L and hence T is also called as an induced consensus by the band L . For any two subsequences T 1 and T 2 , T 1 ≺ T 2 is used to denote that T 1 (approximately) appears strictly before T 2 in L (i.e. their corresponding bands do not overlap). Let = (C 1 , C 2 , . . . , C γ ) be an ordered set of γ constraints (i.e. subsequences), each

Then the CMSA of S with respect to is defined as an alignment L of S in which all the constraints of approximately appear in the order C 1 ≺ C 2 ≺ · · · ≺ C γ such that δ(subseq(S i , L j ), C j ) ≤ λ j × for all 1 ≤ i ≤ χ and 1 ≤ j ≤ γ , where L j is the band of L whose induced consensus is C j . Given a set S of χ sequences along with an ordered set of γ constraints and an error ratio , the so-called CMSA problem is to find a CMSA w.r.t. with the optimal SP score. When the number of sequences in S is restricted to two (i.e. χ = 2), the CMSA problem is called as the CPSA problem.

In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a 1 a 2 . . . a m and B = b 1 b 2 . . . b n , a given ordered set

and a given error threshold . After that, we shall use it as the kernel to heuristically solve the CMSA problem.

For any sequence T , let pref(T , l) [respectively, suff(T , l)] phase don't change denote the prefix (respectively, suffix) of T with length l. For any two characters a, b ∈ , let σ (a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with w o +l ×w e , where w o > 0 is the gap-open penalty and w e > 0 is the gap-extension penalty. For convenience, let A i = pref(A, i) = a 1 a 2 . . . a i ,

Let M k (i, j) denote the score of an optimal constrained alignment of A i and B j w.r.t. k . Clearly, M γ (m, n) is the score of an optimal constrained alignment of A and B w.r.t. . An alignment L is called as a semi-constrained alignment of A i and B j w.r.t. k if it is a constrained alignment of A i and B j w.r.t. k−1 and also ends (or begins) with a band whose induced consensus is equal to a prefix of C k (or a suffix of C 1 ). N k (i, j , h) is defined to be the score of an optimal semi-constrained alignment of A i and B j w.r.t. k that ends with an induced consensus equal to

, which is defined to be the maximum score of all constrained alignments of A i and B j w.r.t. k that end with a substitution pair (a i , b j ). Let 

Case 3: The last aligned pair of L is an insertion pair. Then the score of L is M I k (i − 1, j) and (a i , −) is charged by a gap-open penalty and a gap-extension penalty in M D k (i, j).

However, by including an extra M D k (i − 1, j) − w o − w e into the right-hand side of the above recurrence, we can reformulate

According to the recurrences above, we designed an algorithm to compute M γ (m, n) and its corresponding constrained alignment using the technique of dynamic programming as follows. For convenience, we depicted the recurrences of matrices M k , M D k , M I k and N k for all 0 ≤ k ≤ γ by a three-dimensional (3D) grid graph G, which consists of (m + 1) × (n + 1) × (γ + 1) entries and each entry (i, j , k) Figure 1 shows the relationship of four adjacent entries

Note that there is a directed edge, which is not shown in Figure 1 

node of entry (m, n, γ ) corresponds to a constrained alignment of A and B w.r.t. . As a result, an optimal constrained alignment of A and B can be obtained by backtracking a shortest path from M γ (m, n) to M 0 (0, 0) in G. It is not hard to see that the algorithm costs both computer time and memory in the order of O(γ mn). We call the above algorithm based on the dynamic programming approach as CPSA-DP algorithm. Hirschberg (1975) had developed a linear-space algorithm for solving the longest common subsequence problem based on the divide-and-conquer technique. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences (Myers and Miller, 1988; Chao et al., 1994) . In this paper, we generalize the Hirschberg's algorithm so that it is capable of dealing with the CPSA. As compared with others, our generalization is more complicated because the grid graph G dealt here is 3D, instead of 2D, and the input sequences are accompanied with several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (i mid , j mid , k mid ) on an optimal path from M 0 (0, 0) to M γ (m, n) in G so that we were able to divide the constrained alignment problem into two smaller constrained alignment problems; then these smaller constrained alignment problems are continued to be divided in the same manner, and finally the optimal constrained alignment is obtained completely by merging the series of the calculated mid-points (Fig. 2) .

Before describing our algorithm, some notation must be introduced as follows. Let A i and B j denote the suffixes a i+1 a i+2 . . . a m and b j +1 b j +2 . . . b n of A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let k denote the ordered subset (C k+1 , C k+2 , . . . , C γ ) for 1 ≤ k ≤ γ . 

Schematic diagram of divide-and-conquer approach: two light gray areas are the reduced subproblems after middle position (i mid , j mid , k mid ) is determined, each of which will be further divided into two subproblems of dark gray areas.

Define M k (i, j) to be the score of an optimal constrained alignment of A i and B j w.r.t. k , and define M S k (i, j) (M D k (i, j) and M I k (i, j), respectively) to be the maximum score of all constrained alignments of A i and B j w.r.t. k that begin with a substitution [deletion and insertion, 

Next, we describe our divide-and-conquer algorithm, termed as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t.

as follows. The key point is to determine the middle position (i mid , j mid , k mid ) of the optimal path in G to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment L, we use score(L) to denote the score of L. Let L γ (A, B) be an optimal constrained alignments of A and B w.r.t. and clearly score[L γ (A, B)] = M γ (m, n). Let i mid = m 2 . Then, we partition L γ (A, B) into two parts by cutting it at the position immediately after a i mid and we let L k mid (A i mid , B j mid ) denote the part containing a i mid and L k mid (A i mid , B j mid ) denote the remaining part, where b j mid denotes the last character in L k mid (A i mid , B j mid ) from B, and k mid denotes the largest index so that pref(C k mid , h mid ) (approximately) appears in L k mid (A i mid , B j mid ). Then there are two possibilities when we consider the last aligned pair of L k mid (A i mid , B j mid ).

Case 1: The last aligned pair of L k mid (A i mid , B j mid ) is a substitution pair [i.e. (a i mid , b j mid )]. In this case, we have M γ (m, n) = score(L γ (A, B)) = score(L k mid (A i mid , B j mid )) + score(L k mid (A i mid , B j mid )). If (a i mid , b j mid ) is not a constrained column in L γ (A, B) , then L k mid (A i mid , B j mid ) is an optimal constrained alignment of A i mid and B j mid w.r.t. k mid ending with a substitution pair (a i mid , b j mid ), and L k mid (A i mid , B j mid ) is an optimal constrained alignment of A i mid and B j mid w.r.t. k mid . Hence,

is an optimal semi-constrained alignment of A i mid and B j mid w.r.t. k mid (h mid ) ending with a band L whose induced consensus is equal to pref(C k mid , h mid ). If h mid < λ k mid , then L k mid (A i mid , B j mid ) is an optimal semi-constrained alignment of A i mid and B j mid w.r.t. k mid (h mid ) beginning with a band L whose induced consensus is equal to suff(C k mid , λ k mid − h mid ). Moreover, the induced consensus of the merge of L and L have to be equal to C k mid . In this case, we have M γ (m, n) = N k mid (i mid , j mid , h mid ) + N k mid (i mid , j mid , h mid ). If h mid = λ k mid , then L k mid (A i mid , B j mid ) is an optimal constrained alignment of A i mid and B j mid w.r.t. k mid (h mid ), and hence M γ (m, n) = N k mid (i mid , j mid , λ k mid ) + M k mid (i mid , j mid ).

Case 2: The last aligned pair of L k mid (A i mid , B j mid ) is a deletion pair [i.e. (a i mid , −)]. If the first aligned

We need to compensate it by adding w o because the open penalty of the gap containing a i mid and a i mid +1 in L γ (A, B) is charged twice by M D k mid (i mid , j mid ) and M D k mid (i mid , j mid ). In summary, the recurrence of M γ (m, n) is derived as follows:

is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:

In other words, j mid , k mid and h mid are the indices j , k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h < λ k , such that the following maximal value is the maximum.

Now, we show how to use O(αn), instead of O(γ mn), memory to determine j mid , k mid and h mid , where α = 1≤k≤γ λ k and α ≤ min{m, n} intrinsically. In fact, a single matrix E of size (γ + 1) × (n + 1) with each entry E(k, j) of λ k + 4 space is enough to compute M k (i mid , j), M S k (i mid , j), M D k (i mid , j) M I k (i mid , j) and N k (i mid , j , h), for 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λ k . When reaching the entry (i, j , k) of 3D grid graph G, we use entry E(k, j) of E to hold the most recently computed values of M k (i, j), M S k (i, j), M D k (i, j) M I k (i, j) and N k (i, j , h), which clearly needs a total of λ k + 4 space. Note that the old values in entry E(k, j) will be moved into an extra entry, termed as V k whose space is equal to E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into V k ; however, we need to first move M k (i −1, j −1) in V k into a space, named as v k,k+1 , where 1 ≤ i ≤ m. The mechanism above will enable us to compute N k (i, j , 1), which needs to

, which needs to refer M k (i − 1, j − 1) that is kept in V k ; and finally we were able to compute M k (i, j). Figure 3 shows the grid locations of E(k − 1), E(k) and the values in V k−1 and V k when we reach the entry (i, j , k) of G for the computation, where E(k) denotes the k-th row of E. Hence, the total needed space for computing and storing all M k (i mid , j), M S k (i mid , j), M D k (i mid , j) M I k (i mid , j) and N k (i mid , j , h) is the sum of the space of matrix E, the space of all V k and the space of all v k,k+1 , where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λ k , which is equal to O(αn). Similarly, the required matrix, denoted by E, for computing all M k (i mid , j), M S k (i mid , j), M D k (i mid , j) M I k (i mid , j) and N k (i mid , j , h) still needs O(αn) space. Hence, the determination of j mid , k mid and h mid can be performed in O(αn) space. The details of CPSA-DC algorithm are described as follows. Note that the program code of BestScoreRev is similar to that of BestScore and hence is omitted here. In the codes, the variable E(M k (i mid , j)) is used to denote the value of M k (i mid , j) in E(k, j) and others are analogous. The global variables

are computed in Algorithm BestScore so that they can be used directly in Algorithm CPSA-DC.

Algorithm CPSA-DC(i start , i end , j start , j end , k start , k end ) Input: Sequences a i start · · · a i end and b j start · · · b j end with constraints (C k start , . . . , C k end )

Align the nonempty sequence with spaces; else i mid = i start +i end 2 ; BestScore(i start , i mid , j start , j end , k start , k end );

if E(N k (i mid , j , λ k )) + E(M k (i mid , j)) > max then max = E(N k (i mid , j , λ k )) + E(M k (i mid , j)); j mid = j ; k mid = k; h mid = h; type = case 5; end if end if end if end for end for 3: if type = case 1 then CPSA-DC(i start , i mid − 1, j start , j mid , k start , k mid ); Align a i mid with a space; CPSA-DC(i mid + 1, i end , j mid + 1, j end , k mid + 1, k end ); end if if type = case 2 then CPSA-DC(i start , i mid − 1, j start , j mid , k start , k mid ); Align a i mid a i mid +1 with two spaces; CPSA-DC(i mid + 2, i end , j mid + 1, j end , k mid + 1, k end ); end if if type = case 3 then CPSA-DC(i start , i mid − 1, j start , j mid − 1, k start , k mid ); Align a i mid with b j mid ; CPSA-DC(i mid + 1, i end , j mid + 1, j end , k mid + 1, k end ); end if if type = case 4 then CPSA-DC(i start , i mid − h mid , j start , j mid − h mid , k start , k mid − 1); Align a i mid −h mid +1 · · · a i mid +λ k −h mid with b j mid −h mid +1 · · · b j mid +λ k −h mid ; CPSA-DC(i mid + λ k − h mid + 1, i end , j mid + λ k − h mid + 1, j end , k mid + 1, k end ); end if if type = case 5 then CPSA-DC(i start , i mid − λ k , j start , j mid − λ k , k start , k mid − 1); Align a i mid −λ k +1 · · · a i mid with b j mid −λ+1 · · · b j mid ; CPSA-DC(i mid + 1, i end , j mid + 1, j end , k mid + 1, k end ); end if Algorithm BestScore(i start , i end , j start , j end , k start , k end ) Input: Sequences a i start · · · a i end and b j start · · · b j end with constraints (C k start , . . . , C k end ) 1: /* Reindex */ m = i start − i end + 1; n = j start − j end + 1; γ = k start − k end + 1; 2: /* Initialization */ for j = 0 to n do for k = 0 to γ do

if k ≥ 1 then for h = 1 to λ k do E(N k (i mid , j , h)) = −∞; end if end for end for 3: /* Computation */ for i = 1 to m do for k = 0 to γ do /* For the case of

end if end for end if end for end for end for Now, we analyze the time-complexity of our CPSA-DC algorithm for solving the CPSA. As shown in Figure 2 , after determining the middle position (i mid , j mid , k mid ) of the optimal path in G, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (i mid , j mid , k mid ), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in G. It is not hard to see that the time-complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let denote the size of the original problem (i.e. = γ mn). Then the total time-complexity of our CPSA-DC algorithm is equal to + 2 + 4 + · · · = 2 , which is twice as high as the CPSA-DP algorithm. Using the CPSA-DC algorithm as a kernel, we were able to design a memory-efficient algorithm, termed CMSA-DC, for progressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The above progressive method we adopted was proposed by Tang et al. (2003) . Owing to space limitation, we refer the reader to their paper for the details of its implementation.

We use Java language to implement the CMSA-DC algorithm as a web server, called as MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of residue/nucleotide that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree The memory usage includes JVM (Java Virtual Machine), code (MuSiC/MuSiC-ME) and data, and MuSiC cannot deal with the case of CoV-3 -UTR due to running out of memory.

of similarity to a constraint are aligned together. For its biological applications, we refer the reader to other related papers (Tang et al., 2003; Tsai et al., 2004) . In the following, we evaluate our memory-efficient MuSiC-ME system and compare its running time and memory to the original MuSiC system (Tsai et al., 2004) , whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al., 1994; Chin et al., 2003; Tang et al., 2003; Tsai et al., 2004) : (1) the aspartic acid protease family (Protease), (2) the hemoglobins family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3 -untranslated region of the coronaviruses (CoV-3 -UTR). From each family, we have selected a representative set of sequences and adopted the ordered series of conserved motifs as the constraints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on IBM PC with 1.26 GHz processor and 512 MB RAM under Linux system. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC for large-scale sequences, and the CPU time required by MuSiC-ME is smaller than that required by MuSiC for short sequences, since we have simplified the recurrences of the dynamic programming here.

It is worth mentioning that in MuSiC-ME system, the letters representing the constraints are not just the individual residues/nucleotides, but also the IUPAC (International Union of Pure and Applied Chemistry) codes. For example, nucleotides N and R have the meanings of any nucleotides and purine (i.e. A or G), respectively. This enhanced improvement will enable the user to define more flexible constraints or combine several small constraints with fixed distances into a large one. For example, consider our fifth experiment above related to the 3 -UTRs of the coronavirus sequences, including HCV-229E (human coronavirus), PEDV (porcine epidemic diarrhea virus), TGEV (porcine transmissible gastroenteritis virus), BCV (bovine coronavirus), MHV (mouse hepatitis virus) and SARS-TW1 (severe acute respiratory syndrome virus). All the 12 adopted constraints appear in the fragment sequences that were able to fold themselves into a stable pseudoknot structure (Williams et al., 1999; Tsai et al., 2004) . However, these adopted constraints are too short to correctly align the truly conserved motifs of sequences together, since the short constraints occur frequently in the large genomic sequences that led to the difficulty in identifying the true occurrences. In fact, four pairs of two consecutive constraints appear in the stem regions (containing no loops) of pseudoknots and each paired constraints is separated by a non-conserved subsequence of fixed length. Hence, we can combine each pair of constraints into a new and larger constraint by representing the non-conserved part with N. Consequently, we got eight new constraints with the order of (CUNNNNC, A, AA, G, C, UNNNA, GNNNNAG, UNNNA) for this dataset. After running MuSiC-ME, a satisfied CMSA was found (Figure 4) , where the band of the resulting CMSA corresponding to a constraint is black and its corresponding constraint is displayed beneath it. This resulting CMSA implies that the fragment of SARS-TW1 between the first band and the last band may fold into a pseudoknot structure that is possibly involved in replicating SARS viruses (Pleij, 1994; Deiman and Pleij, 1997) . In fact, this fragment is the pseudoknot sequence of SART-TW1 that was found by Tsai et al. (2004) using MuSiC to align the 3 -UTR of SARS-TW1 with the pseudoknot sequences, instead of 3 -UTRs, of other coronaviruses. The input sequences of the above experiment were also tested by Clustal W 1.82, the most commonly used MSA tool. According to its resulting MSA as shown in Figure 5 , the fragments of all pseudoknots, including our detected pseudoknot for SARS-TW1, were not able to align well so that it is difficult for us to identify the exact fragment of the SARS-TW1 pseudoknot from this MSA.

In this paper, we designed a memory-efficient program for performing the CMSA, which can incorporate the knowledge of biologists about the structures/functionalities/consensuses of their datasets into sequence alignment such that the userspecified residues/nucleotides are aligned together. We first used the divide-and-conquer approach to design a memoryefficient algorithm for optimally aligning two sequences with constraints, and then based on this algorithm, we used the progressive method to develop a memory-efficient tool, called MuSiC-ME, for heuristically aligning multiple sequences with constraints. The proposed MuSiC-ME system makes it possible to align several large-scale protein/DNA/RNA sequences with constraints through the desktop PC with the limited memory. In this system, moreover, the letters allowed to represent the constraints are the IUPAC codes, which will enable the user to define more flexible constraints or combine several small constraints with fixed distances into a large one. It is worth mentioning that the A * algorithm, a heuristic search method in Artificial Intelligence, has been extensively used to time-and/or memory-efficiently solve the general MSA problem without constraints Imai, 1994, 1999; Kobayashi and Imai, 1999; Lermen and Reinert, 2000) . Hence, it is interesting to study whether or not the A * algorithm can still be applied to the CMSA problem.

Approximation algorithms for multiple sequence alignment

The complexity of multiple sequence alignment with SP-score that is a metric

The multiple sequence alignment problem in biology

A survey of multiple sequence comparison methods

Recent developments in linear-space alignment methods: a survey

Efficient constrained multiple sequence alignment with performance guarantee

Multiple sequence alignment with hierarchical clustering

Pseudoknots: a vital feature in viral RNA

MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences

Progressive sequence alignment as a prerequisite to correct phylogenetic trees

Efficient methods for multiple sequence alignment with guaranteed error bounds

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology

CLUSTAL: a package for performing multiple sequence alignment on a microcomputer

A linear space algorithm for computing maximal common subsequences

Fast A * algorithms for multiple sequence alignment

Enhanced A * algorithms for multiple alignments: optimal alignments for several sequences and k-opt approximate alignments for large cases

The maximum weight trace problem in multiple sequence alignment

Improvement of the A * algorithm for multiple sequence alignment

The practical use of the A * algorithm for exact multiple sequence alignment

Near optimal multiple alignment within a band in polynomial time

Comparative analysis of multiple protein-sequence alignment methods

DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment

Optimal alignment in linear space

Progressive multiple alignment with constraints

Strategies for multiple sequence alignment

Recent progresses in multiple sequence alignment: a survey

T-Coffee: a novel method for fast and accurate multiple sequence alignment

Multiple alignment, communication cost, and graph matching

RNA pseudoknots

Divide-andconquer multiple alignment with segment-based constraints

A workbench for multiple alignment construction and analysis

Multiple sequence alignment with the divide-andconquer method

DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment

Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment

Constrained multiple sequence alignment tool development and its application to RNase family alignment

Multiple sequence alignment by a pairwise algorithm

Motif-biased protein sequence alignment

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice

DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches

A general method for fast multiple sequence alignment

MuSiC: a tool for multiple sequence alignment with constraints

On the complexity of multiple sequence alignment

A phylogenetically conserved hairpin-type 39 untranslated region pseudoknot functions in coronavirus RNA replication

Efficient algorithms for constrained sequence alignment problems

The authors would like to thank the anonymous referees for their constructive comments to the presentation of this paper. This work was supported in part by National Science Council of Republic of China under grant NSC93-2213-E-009-113.