key: cord-018963-2lia97db authors: Xu, Ying; Liu, Zhijie; Cai, Liming; Xu, Dong title: Protein Structure Prediction by Protein Threading date: 2010-04-29 journal: Computational Methods for Protein Structure Prediction and Modeling DOI: 10.1007/978-0-387-68825-1_1 sha: doc_id: 18963 cord_uid: 2lia97db
The seminal work of Bowie, Lüthy, and Eisenberg (Bowie et al., 1991) on “the inverse protein folding problem” laid the foundation of protein structure prediction by protein threading. By using simple measures for fitness of different amino acid types to local structural environments defined in terms of solvent accessibility and protein secondary structure, the authors derived a simple and yet profoundly novel approach to assessing if a protein sequence fits well with a given protein structural fold. Their follow-up work (Elofsson et al., 1996; Fischer and Eisenberg, 1996; Fischer et al., 1996a,b) and the work by Jones, Taylor, and Thornton (Jones et al., 1992) on protein fold recognition led to the development of a new brand of powerful tools for protein structure prediction, which we now term “protein threading.” These computational tools have played a key role in extending the utility of all the experimentally solved structures by X-ray crystallography and nuclear magnetic resonance (NMR), providing structural models and functional predictions for many of the proteins encoded in the hundreds of genomes that have been sequenced up to now.
What has made protein threading particularly attractive as a protein structure prediction tool is the observation that the number of unique structural folds in nature is a few orders of magnitude smaller than the number of proteins in nature (Finkelstein and Ptitsyn, 1987; Bairoch et al., 2005). Although this is still not a fully resolved issue, both theoretical and statistical studies (Murzin et al., 1995; Brenner et al., 1996; Li et al., 1996, 1998, 2002; Wang, 1996; Orengo et al., 1997; Holm and Sander, 1996a; Zhang and DeLisi, 1998) suggest that the number of unique structural folds in nature ranges from a few hundred to a few thousand. Clearly this is a significantly smaller number than the number of proteins in nature: as we understand now, the number of different living organisms on earth could range from millions to hundreds of millions (May, 1988). Since each organism often has at least thousands of different proteins encoded in its genome, the total number of different proteins in nature is at least in the tens of billions or possibly significantly higher, even without considering protein variants such as alternatively spliced proteins.
This disparity suggests an effective paradigm for possibly solving all protein structures through combining experimental and computational approaches, that is, to solve structures of proteins with unique structural folds using the expensive and time-consuming experimental techniques and computationally model the rest of the proteins using the experimental structures as templates. This is the key strategy currently being employed by the worldwide Structural Genomics efforts (Gaasterland, 1998; Skolnick et al., 2000; Baker and Sali, 2001).
The basic idea of protein threading is to place (align or thread) the amino acids of a query protein sequence, following their sequential order and allowing gaps, into structural positions of a template structure in an optimal way measured by fitness scores outlined above. This procedure is repeated against a collection of previously solved protein structures for a given query protein. These sequence-structure alignments, i.e., the query sequence against different template structures, are then assessed using statistical or energetic measures for the overall likelihood of the query protein adopting each of the structural folds. The "best" sequence-structure alignment provides a prediction of the backbone atoms of the query protein, based on their placements in the template structure.
Currently, protein threading is being widely used in molecular biology and biochemistry labs, often for initial studies of target proteins, as it may quickly provide structural and functional information, which could be used to guide further experimental design and investigation. As a prediction technique, protein threading has a number of highly challenging computational and modeling problems.
These include (a) how to effectively and accurately measure the fitness of a sequence placed in a template structure, (b) how to accurately and efficiently find the best alignment between a query sequence and a template structure based on a given set of fitness measures, (c) how to assess which sequence-structure alignment among the ones against different template structures represents a correct fold recognition and an accurate (backbone) structure prediction, and (d) how to identify which parts of a predicted structure are accurate and which parts are not. As researchers find more effective solutions to these and other challenging problems, we expect that protein threading will play an increasingly significant role in structural and functional studies of proteins.
As of now, over a million protein sequences have been determined (Bairoch et al., 2005), among which ~30,000 have had their tertiary structures experimentally solved (Dutta and Berman, 2005). Given that there could be at least tens of billions of different proteins in nature as discussed above, one interesting question, particularly relevant to the idea of protein threading, is how many unique protein structures or structural folds these proteins might have adopted. To answer this question, we need to first look at the basic structural units of proteins, called protein domains (Wetlaufer, 1973; Richardson, 1981). Protein domains are extensively discussed in Chapter 4 of this book. Here we describe briefly the concept of a domain from the perspective of threading. A structural domain is a distinct and compact structural unit that could fold independently of other domains. While many proteins are single-domain proteins, there are proteins with two, three, or even more structural domains (Ekman et al., 2005).
Our study shows that in the FSSP (Fold classification based on Structure-Structure alignment of Proteins) nonredundant set (Holm and Sander, 1996b) of the PDB, 67% of the proteins have single domains, 21% have two domains, 7% have three domains, and the remaining 5% have four or more domains (Xu et al., 2000a). This distribution may not necessarily reflect the actual domain distribution for all the proteins in nature, as the set of known protein structures in the PDB might overrepresent small proteins due to the relative ease of solving these structures. For eukaryotic organisms, it was estimated that at least two-thirds of their proteins are multidomain proteins (Gerstein and Hegyi, 1998; Gerstein, 1997, 1998; Doolittle, 1995; Apic et al., 2001a,b). This number is somewhat smaller for bacterial and archaeal organisms but still represents a significant percentage of all their proteins.
Previously, domain partition of multidomain proteins was typically done manually. To keep up with the rate at which protein structures are being solved, there is a clear need for more automated domain-partitioning methods to process the newly solved structures. Currently computer programs are being used for partitioning a protein structure into individual domains. Popular programs for this purpose include DALI (Dietmann and Holm, 2001), DomainParser (Xu et al., 2000a), and PDP (Alexandrov and Shindyalov, 2003).
A protein domain could be part of different protein structures, through combining with other domains. Figure 12.1 shows an example of a domain in different proteins. Both proteins, RNA 3'-terminal phosphate cyclase and glutathione S-transferase, have the thioredoxin fold domain, which has two layers, one with two α-helices and one with four antiparallel β-strands, although some details differ in the two proteins. The parts other than the thioredoxin fold domain in the two proteins have no structural relationship.
Since domains are the basic structural units of proteins, current studies on the number of unique structural folds in nature have been carried out on protein domains rather than whole proteins (hereafter, the term "proteins" will refer to single-domain proteins for simplicity of discussion). To estimate the number of unique folds of proteins, one popular approach is through examining all protein families and the relationships between protein families and unique structural folds. Using the definition of the SCOP classification scheme (Murzin et al., 1995; Brenner et al., 1996), a protein family represents a group of orthologous proteins (Makarova et al., 1999; Gerlt and Babbitt, 2000; Tatusov et al., 2000; Gelfand et al., 2000). The number of protein families in nature could possibly be estimated through finding orthologous gene groups covering all the genes in the genomes that have been sequenced. One such estimate suggests that there are 23,100 such protein families (Orengo et al., 1994). This number has been used in later studies on estimating the number of unique structural folds in nature. Other studies estimate this number to be in the tens of thousands (Koonin et al., 2002). Whether it is 23,100 or some other number ranging from 10,000 to 50,000, the idea is that all proteins in nature fall into one of these families.
The estimate of the number of unique structural folds is obtained through estimating the number of families that each structural fold covers and studying the coverage distribution by all the known structural folds. One of the estimates, by Zhang and DeLisi (1998), suggests that the number N of unique structural folds is probably around 700. This estimate is based on the observation that the number of structural folds covering X protein families follows a power-law distribution, with X being a variable.
In essence it says that a few structural folds each cover many families (e.g., the TIM barrel fold covers 31 protein families) while many structural folds each cover only a small number of families; more generally, the number of structural folds decreases as their coverage of protein families increases. Specifically, Zhang and DeLisi proposed a formula which matches well with the known protein families and structural folds. Let M and N represent the number of families and the number of unique folds in nature, respectively. The probability that a fold covers exactly x families is given by [equation not preserved in the source text]. Let $M_s$ and $N_s$ be the numbers of protein families and unique structural folds currently having solved structures, respectively. Through a simple algebraic transformation, Zhang and DeLisi derived a relation among M, N, $M_s$, and $N_s$ [equation not preserved in the source text], which is used to estimate the number of unique structural folds. Using the known numbers of $M_s = 736$ and $N_s = 361$ at the time of the estimation, they estimated that N is roughly 700, which many researchers will argue to be too small (see the following for more discussion). Similar estimations have been made by other researchers, with the size of N ranging from a few hundred to a few thousand (Orengo et al., 1994; Wang, 1996), based on somewhat different assumptions.
Coulson and Moult (2002) developed a new model for estimating N, based on the work of Zhang and DeLisi. Using more recent data on the numbers of genes, gene families, and structural folds, they argued that there are two "special" cases that have not been treated well by previous estimation models. Based on their argument, they consider that there are three classes of structural folds, which are termed unifolds, mesofolds, and superfolds. Unifolds represent structural folds that each cover only one family of proteins, superfolds represent structural folds each of which covers many protein families, and mesofolds represent structural folds in between.
For example, the TIM barrel covers 31 families, while many unifolds exist in SCOP. Based on their observation, they argued that previous models such as that of Zhang and DeLisi did not fit well with the data of unifolds and superfolds. So a new piecewise model was developed which treats unifolds, mesofolds, and superfolds separately. Using this new model, Coulson and Moult (2002) estimated that less than 20% of the protein families belong to unifolds, while 20% of the families belong to a few dozen superfolds and the rest of the protein families belong to mesofolds. Considering that the estimated number of protein families ranges from 10,000 to 50,000 (or 23,100 as one of the popular estimates suggests), and taking 20% as the approximate fraction of families that are unifolds, we can infer that the number of unifolds ranges from roughly 2,000 to 10,000. The number of mesofolds could be estimated using the Zhang and DeLisi model, based on the sizes of $M_s$ and $N_s$, after excluding the unifolds and superfolds. Hence, Coulson and Moult concluded that the most probable size for the number of mesofolds is about 400. The number of superfolds is believed to be very small, possibly in the range of low dozens.
Overall this model suggests that over 80% of the protein families fold into a little over 400 structural folds, the majority of which are already known, while the rest of the protein families each belong to a unique unifold. The implication of this estimation is that about 80% of the protein families are amenable to structural modeling using protein threading techniques, assuming that at least one protein in each of the meso- and superfolds has its structure solved. If experimental facilities for structure solution strategically select their solution targets to maximally cover all the meso- and superfolds, we could expect that at least 80% of the protein families will be structurally modelable in the near future. This is exactly the strategy that the National Institutes of Health (NIH) is using in its Protein Structure Initiative (http://www.nigms.nih.gov/psi/).
For the remaining 20% of protein families, it might take some time to have at least one solved structure in each of the unifolds. Hence, the threading technique will be less applicable for this class of proteins, at least in the near future.
There are a number of popular schemes and associated databases for classification of proteins into structural folds, including SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997), and FSSP (Holm and Sander, 1996b). These classification schemes classify all solved protein structures into different structural folds and subclasses of structural folds. The classification of protein structures is essentially achieved through grouping protein structures into clusters of similar structures, which can be done computationally through structure-structure alignments (Holm and Sander, 1996a).
SCOP (Murzin et al., 1995; Brenner et al., 1996; Andreeva et al., 2004) groups all protein structures essentially into a three-level classification tree. At the top level, SCOP (SCOP 1.65) currently consists of about 800 structural folds, each of which is further divided into superfamilies and then into families. While a family represents a group of orthologous proteins, a superfamily represents a group of homologous proteins, possibly made of multiple families. Currently SCOP consists of about 1300 superfamilies and about 2400 families. Among the 800 structural folds, 489 have only one family each, which might represent unifolds; and 9 cover a large number of families each, which are considered as superfolds by Coulson and Moult. One thing worth noting is that among the 800 SCOP folds, only 36 represent membrane proteins. This is a reflection of the fact that only 2% of all the solved protein structures are membrane proteins (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html). This suggests that threading is generally not applicable to structure prediction of membrane proteins at the present time.
SCOP's hierarchical classification of structural folds provides a convenient tool for applications of threading methods, as query proteins falling into a SCOP protein family are generally expected to have accurate structure predictions, while proteins with structural homologues in a SCOP superfamily will have a good chance to have the correct structural folds identified and some portions of their backbone structures predicted accurately. In general, it still represents a challenge to correctly identify the structural fold by a threading method if a query protein only has a structural analogue (i.e., similar structure but not homologous) in SCOP.
The realization that protein structures are clustered into structural folds in the structure space and that the number of such clusters is possibly quite small has led to a new way of predicting protein structures in a more efficient and effective manner. The general belief is that different proteins fold into similar 3D shapes because at some level, they share similar interaction patterns among their residues and between the residues and the environments. It has been shown that these interaction patterns could possibly be captured using simple statistics-based energy models as exemplified by the earlier work of Eisenberg and colleagues (Bowie et al., 1991; Fischer et al., 1996a,b; Fischer and Eisenberg, 1996), the work of Sippl and colleagues (Sippl, 1990), and others (Jones et al., 1992; Rost et al., 1997). These simple statistics-based energy functions have been used, in many cases, to distinguish the correct structural folds from the incorrect ones and to distinguish the correct placements of the residues in a query protein into the structural positions of a correct structural template from the incorrect ones. Placing the (backbone atoms of) residues of a query protein into the correct structural positions in a correct structural fold gives a prediction of the backbone structure of the query protein.
To accomplish this, one would need two capabilities: (a) an energy function whose global minimum will correspond to the correct placement of residues into the correct structural template, and (b) a computational algorithm that can find the global minimum of the given energy function. We explain the basic idea of developing such energy functions in this section and leave the algorithmic issues to the next one.
In their earlier work (Bowie et al., 1991; Fischer et al., 1996a,b; Fischer and Eisenberg, 1996), Eisenberg and colleagues demonstrated that simple residue-based, instead of atom-based, energy functions could provide substantial discriminating power in separating good from poor placements of individual residue types into different structural environments, justifying the usage of residue-based energy functions. In their work, structural environments are simply defined in terms of two parameters, solvent accessibility sol and secondary structure ss. Specifically the quantity sol of solvent accessibility is discretized into a number of intervals, say 30-40% exposed to the solvent. A secondary structure could be a helix, a beta-strand, or a loop, or it could be defined in terms of more refined categories of secondary structures, say including different types of turns. Then a structural environment for each residue in a template structure could be defined using (sol, ss), say (0-10% exposed, alpha-helix). Statistics could be collected from a collection of solved protein structures about how frequently a particular type of amino acid appears in a particular structural environment as we just defined. This can be done by going through all protein structures under consideration to count the number of occurrences of each amino acid type in each encountered structural environment. If we consider, say, three levels of solvent accessibility, {exposed, intermediately exposed, buried}, and three types of secondary structures, we will have nine types of structural environment.
Under this assumption, the result of counting the numbers of occurrences above will be a 20 by 9 table, with each of the 20 rows representing an amino acid type and each of the 9 columns representing a structural environment. Based on the collected statistics, we can build a simple preference model to measure how preferred a particular amino acid type is to a particular structural environment. This can be done using the following measure:
$$-\ln\frac{O_{i,j}}{E_{i,j}}$$
where $O_{i,j}$ represents the observed frequency of amino acid type i in structural environment j and $E_{i,j}$ represents the expected frequency of amino acid type i in structural environment j. In the work of Eisenberg and colleagues, $E_{i,j}$ is estimated using the frequency of amino acid type i in all proteins under consideration. Hence, if an amino acid type i has a higher frequency in a particular structural environment j than its frequency overall, it will be assigned a negative score $-\ln(O_{i,j}/E_{i,j})$; otherwise it will get a positive score or zero (when $O_{i,j} = E_{i,j}$). The higher $O_{i,j}$ is compared to $E_{i,j}$, the more negative the corresponding energy is. A popular name for this type of energy function is singleton energy. By performing statistical analysis on a database, one can get the scoring function using the above formulation for the 20 amino acid types appearing in the nine structural environments.
When building such statistics-based energy functions, one needs to be careful in selecting the data set for statistics collection. For example, some proteins in SCOP (or in PDB) have more homologous structures than the others, which could possibly lead to biased statistics. To remove this type of statistical bias, one needs to remove homologous structures from the data set used for statistics collection. There are a number of databases for this purpose, such as the nonredundant sequence representatives in FSSP, PDB-select (Hobohm et al., 1992), and PISCES (Wang and Dunbrack, 2003), where no two proteins have higher than a certain level of sequence similarity.
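The counting procedure and the $-\ln(O/E)$ score described above can be sketched as follows. This is a minimal illustration, not the code of any published threading program: the function name `singleton_energy`, the environment labels, and the pseudocount used to avoid taking the logarithm of zero are all assumptions introduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Nine illustrative environments: 3 solvent-accessibility levels x 3 secondary structures.
ENVIRONMENTS = [
    (sol, ss)
    for sol in ("buried", "intermediate", "exposed")
    for ss in ("helix", "strand", "loop")
]


def singleton_energy(observations, pseudocount=1.0):
    """Return {(amino_acid, environment): -ln(O/E)} for the 20 x 9 table.

    observations: iterable of (amino_acid, environment) pairs collected from
    a nonredundant set of structures. The pseudocount is an assumption added
    here so that unseen cells do not produce log(0).
    """
    raw = Counter(observations)
    table = {
        (aa, env): raw[(aa, env)] + pseudocount
        for aa in AMINO_ACIDS
        for env in ENVIRONMENTS
    }
    total = sum(table.values())
    aa_total = {aa: sum(table[(aa, env)] for env in ENVIRONMENTS) for aa in AMINO_ACIDS}
    env_total = {env: sum(table[(aa, env)] for aa in AMINO_ACIDS) for env in ENVIRONMENTS}

    energy = {}
    for (aa, env), count in table.items():
        observed = count / env_total[env]   # frequency of aa within this environment
        expected = aa_total[aa] / total     # overall frequency of aa
        energy[(aa, env)] = -math.log(observed / expected)
    return energy
```

With such a table, an amino acid type that is enriched in an environment relative to its overall frequency receives a negative energy, and a depleted one receives a positive energy, matching the sign convention in the text.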
Another statistics-based energy function widely used in threading programs is often called pairwise interaction energy. It measures the preference of having two particular types of amino acids that are spatially close. One particular form of such an energy function was developed by Sippl (1990). The basic idea of this energy function is to compare the observed frequency of a pair of amino acids within a certain distance in solved protein structures with the expected frequency of this pair of amino acid types in a protein structure. The basic idea of such an energy function comes from statistical mechanics, where the probability $P_{ij}$ of having a pairwise interaction between residues i and j has the Boltzmann relation to its energy $g_{ij}$ (Gibbs free energy), defined as
$$P_{ij} = \frac{1}{Z}\exp\left(-\frac{g_{ij}}{kT}\right)$$
where k is the Boltzmann constant, T is the temperature, and Z is a partition function. When using a residue-averaged state as a reference state, a knowledge-based potential can be derived [equation not preserved in the source text]. Specifically, if $N_O(i, j)$ and $N_E(i, j)$ represent the observed and expected numbers of amino acid types i and j within a certain distance, respectively, we can use the following to measure the preference of having these two types of amino acids close to each other:
$$-\ln\frac{N_O(i,j)}{N_E(i,j)}$$
While $N_O(i, j)$ can be collected by going through the protein structures in the sample set, an accurate estimation of $N_E(i, j)$ represents a challenge. There have been a number of proposed models for estimating this quantity. Among these models, the simplest one is the "independent reference state" model (Xu et al., 1998), in which $N_E(i, j)$ is estimated as follows: [equation not preserved in the source text]. Table 12.2 shows a scoring function for the preference between 20 types of amino acids using the above formulation (Xu et al., 1998). There are more sophisticated models for defining the reference state, one of which is the uniform distribution model (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), as discussed later.
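As a rough illustration of the observed-versus-expected comparison described above, the sketch below counts residue pairs within a distance cutoff and scores them with $-\ln(N_O/N_E)$. The expected count used here (total contacts times the product of the two amino acid frequencies) is only one plausible reading of an independent reference state and is an assumption, not necessarily the formula of Xu et al. (1998). The function name, the 7.0 Å cutoff, the minimum sequence separation, and the pseudocount are likewise hypothetical choices.

```python
import math
from collections import Counter
from itertools import combinations


def contact_energy(structures, cutoff=7.0, min_separation=3, pseudocount=1.0):
    """Return {(aa_a, aa_b): -ln(N_O / N_E)} with aa_a <= aa_b alphabetically.

    structures: list of proteins, each a list of (amino_acid, (x, y, z))
    tuples, e.g. using C-beta coordinates. All defaults are illustrative.
    """
    observed = Counter()     # contacts per unordered amino acid type pair
    composition = Counter()  # residue counts per amino acid type
    for residues in structures:
        for aa, _ in residues:
            composition[aa] += 1
        pairs = combinations(enumerate(residues), 2)
        for (a_idx, (aa_a, xyz_a)), (b_idx, (aa_b, xyz_b)) in pairs:
            if b_idx - a_idx < min_separation:
                continue  # skip residues that are close along the chain
            if math.dist(xyz_a, xyz_b) <= cutoff:
                observed[tuple(sorted((aa_a, aa_b)))] += 1

    total_contacts = sum(observed.values())
    total_residues = sum(composition.values())
    types = sorted(composition)
    energy = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            freq_a = composition[a] / total_residues
            freq_b = composition[b] / total_residues
            multiplicity = 1 if a == b else 2  # unordered pair (a, b) counts both orders
            # Assumed independent reference state: pairs form in proportion
            # to the product of the two amino acid frequencies.
            expected = total_contacts * multiplicity * freq_a * freq_b
            ratio = (observed[(a, b)] + pseudocount) / (expected + pseudocount)
            energy[(a, b)] = -math.log(ratio)
    return energy
```

The reference state in this sketch is the simplest possible choice; the uniform distribution model and the other reference-state models mentioned above would change only how `expected` is computed.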
These more complex models take more factors into consideration in building the reference state, hence making the energy models more likely to be accurate.
In addition to using physics- or statistics-based energy functions, researchers have incorporated evolutionary information into energy model building. One of the earlier major improvements in energy function modeling is the incorporation of sequence profile information (Panchenko et al., 2000; Zhou and Zhou, 2005) derived from homologous proteins into the energy models outlined above. It was noticed that when using a sequence profile of a protein family (or superfamily) rather than a single (query) protein sequence, threading accuracy could be significantly improved (Panchenko et al., 2000; Zhou and Zhou, 2005). The very basic idea of this generalized approach is that rather than asking the question "will protein sequence A fit well with structural fold B?" we ask the more general question "will the whole family of protein A fit well with structural fold B?" Clearly if done properly, this approach could iron out some of the spurious predictions caused by the appearance of specific individual sequences. Now a threading problem becomes a fitting problem between a sequence profile and a structural fold. A sequence profile is defined in terms of a multiple sequence alignment of the members of the query protein's family (or superfamily), with each element being the frequency distribution of the 20 amino acid types in this aligned position rather than a specific amino acid. To generalize the aforementioned energy functions to take into account the profile information, we can simply use the relative frequency of each amino acid in the position-dependent distribution as a weight factor when calculating the energy values for each amino acid or amino acid pair, and then sum over all possible amino acids or amino acid pairs.
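The frequency-weighted generalization just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, gap characters are simply ignored when computing column frequencies, and the `singleton` and `pair` lookup tables are assumed to be dictionaries keyed by (amino acid, environment) and by alphabetically sorted amino acid pairs, respectively.

```python
from collections import Counter


def profile_columns(aligned_sequences, gap="-"):
    """Turn a multiple sequence alignment into per-position frequency dicts."""
    columns = []
    for residues in zip(*aligned_sequences):
        counts = Counter(r for r in residues if r != gap)
        total = sum(counts.values())
        columns.append({aa: n / total for aa, n in counts.items()} if total else {})
    return columns


def profile_singleton_energy(column, environment, singleton):
    """Frequency-weighted singleton energy of one profile column."""
    return sum(p * singleton[(aa, environment)] for aa, p in column.items())


def profile_pair_energy(column_a, column_b, pair):
    """Frequency-weighted pairwise energy between two profile columns."""
    return sum(
        p_a * p_b * pair[tuple(sorted((a, b)))]
        for a, p_a in column_a.items()
        for b, p_b in column_b.items()
    )
```

For a column in which leucine appears in two of three aligned sequences and isoleucine in one, the singleton energy is two-thirds of the leucine energy plus one-third of the isoleucine energy for the environment in question.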
Specifically, let $p_i$ be the relative frequency of amino acid type i in a particular aligned position with $\sum_i p_i = 1.0$. Then the Eisenberg type of energy for structural environment j can be calculated as
$$-\sum_{i} p_i \ln\frac{O_{i,j}}{E_{i,j}}$$
Similarly, pairwise interaction energy could be generalized as follows:
$$-\sum_{i}\sum_{j} p_i\, p'_j \ln\frac{N_O(i,j)}{N_E(i,j)}$$
where $p'_j$ is the relative frequency of amino acid type j in the second aligned position. Other types of energy functions have also been used in existing threading programs, including fitness scores between specific amino acids and the secondary structures in the template structure and threading alignment gap penalties. Typically these energy scores are combined using a simple weighted sum, while the scaling factors are often empirically determined, based on some training data.
It has been observed that distance-dependent pairwise interaction energy could provide more accurate threading results than distance-independent models as outlined above. A distance-dependent energy $u(i, j, r)$ could be estimated as follows: [equation not preserved in the source text], where r is the distance between residues i and j (possibly measured between their C-beta atoms), $N_O(i, j, r)$ is the observed number of pairs of residues (i, j) within a distance bin from $r - \Delta r/2$ to $r + \Delta r/2$ in a database of folded structures for some bin width $\Delta r$, and $N_E(i, j, r)$ is the expected number of pairs (i, j) within the same distance bin. The challenging issue in accurately estimating the interaction energy $u(i, j, r)$ is how to estimate $N_E(i, j, r)$. Under the assumption that we are dealing with an ideal infinite liquid-state system within a volume V and residues are distributed uniformly (Sippl, 1990; DeWitte and Shakhnovich, 1996; Lu and Skolnick, 2001; Samudrala and Moult, 1998), $N_E(i, j, r)$ can be estimated using
$$N_E(i,j,r) = N_i N_j \frac{4\pi r^2 \Delta r}{V}$$
where $N_i$ and $N_j$ are the numbers of residues of amino acid types i and j in the protein database, respectively. Researchers have realized that this model needs to be corrected when dealing with finite systems like a protein structure, to make the model more accurate when used in threading programs.
Two particular corrections are made in the DFIRE (distance-scaled ideal gas reference state) energy model (Zhou and Zhou, 2002), a popular energy function for threading. In the first correction, DFIRE uses $r^{\alpha}$ instead of $r^2$, considering that the number of interaction pairs in a finite system could not actually reach the level of $r^2$ as in an infinite system; here $\alpha < 2$ is determined through minimizing the distribution fluctuation of interaction distance on a set of training data. In the second correction, DFIRE assumes that only short-range interactions need to be considered. That is, interaction energy becomes zero when the distance between the interacting pairs is beyond a cutoff distance $r_{cut}$. After these corrections, the interaction energy could be estimated using the following formula: [equation not preserved in the source text], where the constant $\eta$ is related to the system temperature and can be determined empirically.
These simple energy models have played key roles in making threading programs as popular and as useful as we have seen today. While they have been used to help solve many structure prediction problems, the limitations of these simple models have also become quite clear, as we have seen from the recent CASP prediction results: the improvement in prediction accuracy has been only incremental in the past few CASPs (Kinch et al., 2003). One of the key reasons for the incremental improvement comes from the crudeness of the threading energy functions. Currently, the algorithmic techniques for protein threading have advanced to a stage that should be able to handle more sophisticated energy models in the threading framework, which could lead to more accurate predicted structures. We can expect that more detailed and more physics-based energy functions will emerge in the near future, as the field is clearly in need of more accurate energy models for protein threading.
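To make the two DFIRE corrections discussed above concrete, the following is a minimal sketch of a distance-scaled energy for one residue-type pair and one distance bin. It assumes the commonly quoted DFIRE form, in which the expected count at distance r is obtained by scaling the observed count in the cutoff bin by $(r/r_{cut})^{\alpha}$ and by the ratio of bin widths; that form, the function name, and all default parameter values are assumptions introduced here rather than values stated in this chapter. Returning zero for empty bins is also a simplification.

```python
import math


def dfire_energy(n_obs, n_obs_cut, r, r_cut=14.5, dr=0.5, dr_cut=0.5,
                 alpha=1.61, eta=1.0):
    """Distance-scaled pair energy for one (i, j) type pair at distance r.

    n_obs:     observed number of (i, j) pairs in the bin centered at r
    n_obs_cut: observed number of (i, j) pairs in the bin at r_cut
    alpha:     exponent replacing 2 in the ideal-gas r**2 scaling (alpha < 2)
    eta:       empirical, temperature-related prefactor
    The numeric defaults are illustrative assumptions.
    """
    # Second correction: no interaction at or beyond the cutoff distance.
    if r >= r_cut:
        return 0.0
    # Simplification for this sketch: treat empty bins as uninformative.
    if n_obs <= 0 or n_obs_cut <= 0:
        return 0.0
    # First correction: finite-system scaling with r**alpha instead of r**2.
    expected = (r / r_cut) ** alpha * (dr / dr_cut) * n_obs_cut
    return -eta * math.log(n_obs / expected)
```

A pair type that is observed more often at short range than the scaled reference predicts receives a negative energy, and any pair at or beyond `r_cut` contributes nothing, which is what keeps the pairwise term short-ranged.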
The general form of a threading energy function could be written as a weighted sum of three terms, $E_s$, $E_p$, and $E_{gap}$, with scaling factors $\alpha$ and $\beta$, where $E_s$ measures the overall fitness of putting individual residue types into specific structural environments, $E_p$ measures the total interaction energy among pairs within the cutoff distance, and $E_{gap}$ represents the total penalty for the gaps in a sequence-structure alignment. The scaling factors $\alpha$ and $\beta$ are typically determined empirically through optimizing the performance of a threading program on a representative set of protein pairs. With the optimized $\alpha$ and $\beta$ values, the goal of threading is to find an alignment (or placement) between a query protein sequence and a template structure that optimizes the energy function.
In a sense, protein threading is like sequence-sequence alignment, as it finds an alignment between a sequence of amino acids and a sequence of structural positions in a 3D structure. What makes threading more difficult than a sequence alignment problem is the pairwise interaction energy term $E_p$. For a sequence alignment problem, simple dynamic programming is guaranteed to find the globally optimal alignment between two sequences, as the problem formulation follows the principle of optimality (Brassard and Bratley, 1996). This type of simple dynamic programming algorithm does not work for a threading problem, as the globally optimal threading alignment cannot easily be reduced to a small number of optimal threading alignments for the partial problems as in a sequence alignment problem. Intuitively, a simple dynamic programming approach, like the one used for sequence alignment that goes from the starts of the sequences to their ends and extends partial optimal alignments to include more elements at each step, will not work for protein threading because at each current point, we do not know what residues will be available to be assigned to future structural positions, which might have interactions with the previously aligned positions. Such interactions will have a global impact on a sequence-structure alignment.
It is this global nature of the problem that makes threading more challenging from the algorithmic point of view. There have been a number of studies attempting to understand the "intrinsic" difficulty, or computational complexity, of the threading problem. Under the assumption that all pairwise interactions need to be considered, the threading problem was proved to be NP-hard by a number of authors (Lathrop, 1994; Calland, 2003). While these mathematical proofs provide some evidence that the problem is computationally difficult, they might not be particularly relevant to the true difficulty of a threading problem, as previous studies have shown that pairwise interactions beyond certain cutoff distances (e.g., 10-12 Å between C-beta atoms) do not contribute to fold recognition and threading alignment (Lund et al., 1997; Melo and Feytmans, 1997; Xu et al., 1998; Zhang and Skolnick, 2004), and hence need not be considered. As of now, it remains an open problem whether the threading problem, using a distance cutoff for pairwise interactions, is polynomial-time solvable. The challenge in studying the "intrinsic" computational complexity of a threading problem with a cutoff for pairwise interactions is that we do not have a good and realistic characterization of the overall interaction patterns for such a threading problem.

Because of the algorithmic challenges, earlier threading programs have employed various heuristic strategies for solving the optimal sequence-structure alignment problem. One particular strategy is called the "frozen approximation" (Westhead et al., 1995). The basic idea is to use a dynamic programming approach to find a sequence-structure alignment and an approximation scheme to calculate the interaction energy.
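A minimal sketch of such an approximation scheme is given here, assuming that any interaction partner not yet aligned is scored with the residue found at that position in the template itself. The function name, the `neighbors` table, and the `pair` lookup are hypothetical and serve only to illustrate the idea.

```python
def frozen_pair_energy(residue, pos, assigned, template_seq, neighbors, pair):
    """Approximate interaction energy of placing `residue` at position `pos`.

    assigned maps already-aligned template positions to query residue types;
    neighbors[pos] lists the positions within the cutoff distance of pos.
    """
    total = 0.0
    for other in neighbors[pos]:
        # Use the query residue if the partner position is already aligned;
        # otherwise fall back on the template's own residue ("frozen").
        partner = assigned.get(other, template_seq[other])
        total += pair.get(frozenset((residue, partner)), 0.0)
    return total
```

Because the partner residue is always known at the time a position is scored, the score of a position no longer depends on future decisions, which is what allows ordinary dynamic programming to be used.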
When the algorithm assigns an amino acid to a specific structural position, moving from the beginning to the end of the query sequence during the dynamic programming procedure, it calculates the relevant interaction energy using the amino acids in the template structure rather than assigned amino acids from the query protein, for all the unassigned structural positions up to the current point of the dynamic programming procedure, within a certain cutoff distance. Intuitively, the algorithm should capture, to some degree, some of the interaction "patterns" encoded in the query protein sequence, as some of the position-equivalent residues between the native structure and the native-like template structure should have similar physicochemical properties, which supports the validity of the frozen approximation scheme. Practical applications have also confirmed that the frozen approximation, while not guaranteeing a globally optimal threading alignment, does have an advantage over threading programs that do not consider pairwise interactions (Westhead et al., 1995; Skolnick and Kihara, 2001; Zhang et al., 1997).

A number of rigorous threading algorithms have been developed that are guaranteed to find the globally optimal threading alignments, measured in terms of energy functions that consider pairwise interactions. These include a divide-and-conquer algorithm employed in the PROSPECT threading program and an integer programming algorithm used in the RAPTOR program (Xu et al., 2003a,b). It was convincingly demonstrated, through applications of these programs at the CASP contests (Xu et al., 2001; Xu and Li, 2003), that threading programs with guaranteed global optimality do have an advantage over programs without this property. One particular advantage of such programs is that they can be used to rigorously benchmark a proposed energy function.
When programs without such a guarantee are used to assess an energy function, it is difficult to decide whether it is the energy function or the lack of rigor in the threading algorithm that has resulted in a subpar performance. The following provides some detailed information about three types of threading algorithms.

The basic idea of the divide-and-conquer algorithm in PROSPECT can be outlined as follows. The algorithm first divides the query protein sequence into two subsequences and also divides the template structure into two substructures by cutting at one of its loop regions. Then it tries to find the globally optimal threading alignments between the first subsequence/substructure pair and between the second subsequence/substructure pair, respectively. When calculating pairwise interaction energy for each "half" of the sequence-structure alignment problem, we might need information about which amino acids are assigned to which structural positions in the other "half" of the problem. To facilitate this calculation, the algorithm uses a simple data structure for each structural position L in the current substructure, which keeps a list of structural positions in the other substructure that are close enough to L to interact with the amino acid to be assigned to it. Figure 12.2 depicts schematically the situation where each of the two substructures has a number of structural positions that are close enough to structural positions in the other substructure that their alignments with amino acids need to be considered when calculating the interaction energy in the other substructure. These structural positions can be considered as extended parts of the other substructure (shown as extended arms for each substructure in Fig. 12.2).
The difference between these extended parts and the original positions of a substructure is that when doing alignment between the substructure and the corresponding subsequence, we do not have any knowledge about which amino acids are assigned to these positions in the other substructure. Hence, the optimal threading alignment for each substructure/subsequence pair depends on the optimal threading alignment for the other substructure/subsequence pair. This codependence relationship makes the problem challenging. In PROSPECT, this problem is overcome using the following strategy: consider all possible combinations of amino acids possibly assigned to these extended positions (in the other "half" of the problem), and then solve an optimal threading alignment problem for each substructure/subsequence pair under each possible combination of amino acid assignments to these extended structural positions. Assume that we can solve the optimal threading problem for each pair of (extended) substructure/subsequence and for each combination of such an assignment. Then it can be checked that the globally optimal threading alignment for the whole structure and sequence must be the union of two optimal threading alignments for the two extended subproblems, under one specific combination of the amino acid assignment to the extended part of each subproblem. This realization lays the foundation for the divide-and-conquer algorithm of PROSPECT, as it allows reducing a whole threading problem to two smaller threading problems. If we can solve the smaller threading problems, the whole threading problem can be solved by simply going through all combinations of the amino acid assignments to the extended parts for each subproblem to find the one that gives the overall best combined score. During the "conquer" step, the algorithm needs to make sure that the two optimal solutions to the subproblems that are to be combined are consistent.
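The divide, enumerate, and combine idea can be illustrated on a deliberately simplified toy model. This is a sketch under strong assumptions, not the PROSPECT algorithm: each structural position independently receives a label from a small alphabet, the order-preserving constraint of a real alignment is ignored, every interaction across the cut is charged to the left half, and each half is solved by brute force instead of recursively. All names below are hypothetical.

```python
from itertools import product

def solve(positions, fixed, single, pairs, pair_e, labels):
    """Brute-force the best labels for `positions`, with `fixed` labels held.

    Scores the singleton energy of `positions` plus every pair that touches
    `positions` and has both ends labelled. Pairs are (i, j) with i < j.
    """
    best_e, best = float("inf"), None
    for combo in product(labels, repeat=len(positions)):
        assign = dict(zip(positions, combo))
        known = {**fixed, **assign}
        e = sum(single[i][assign[i]] for i in positions)
        e += sum(pair_e[(known[i], known[j])] for i, j in pairs
                 if (i in assign or j in assign) and i in known and j in known)
        if e < best_e:
            best_e, best = e, assign
    return best_e, best


def divide_and_conquer(n, cut, single, pairs, pair_e, labels):
    """Split positions 0..n-1 at `cut` and combine the two optimal halves."""
    left = list(range(cut))
    # Link positions: right-half positions that interact across the cut.
    link = sorted({j for i, j in pairs if i < cut <= j})
    right_free = [p for p in range(cut, n) if p not in link]
    best_e, best = float("inf"), None
    for combo in product(labels, repeat=len(link)):
        arm = dict(zip(link, combo))
        # Energy that depends only on the link labels themselves.
        e_link = sum(single[p][arm[p]] for p in link)
        e_link += sum(pair_e[(arm[i], arm[j])] for i, j in pairs
                      if i in arm and j in arm)
        # The left half sees the link labels as its extended arm.
        e_left, a_left = solve(left, arm, single, pairs, pair_e, labels)
        # The right half is solved under the same link labels, so the two
        # partial solutions are consistent by construction.
        e_right, a_right = solve(right_free, arm, single, pairs, pair_e, labels)
        total = e_left + e_right + e_link
        if total < best_e:
            best_e, best = total, {**a_left, **a_right, **arm}
    return best_e, best
```

In this toy setting the number of label combinations tried is exponential only in the number of link positions, which is why the choice of where to cut matters so much for running time.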
To solve a smaller threading problem, we can apply the same divide-and-conquer strategy to reduce it to even smaller problems. This procedure can continue until the size of the problem is small enough that it can be solved using a brute-force exhaustive search strategy. The trick is how to make this algorithm run efficiently. Note that if not done carefully, the number of possible combinations of assignments that need to be considered could be very large. In PROSPECT, (sub)structures are divided in a way that minimizes this number of combinations, by cutting at the "weakest" link with the fewest interactions between two substructures (Xu et al., 1999). The overall computational complexity of the divide-and-conquer algorithm is dominated by the thickest link among all weakest links throughout the bipartitioning of a protein structure into a series of small structures during the whole divide-and-conquer scheme. Intriguingly, we found that this thickest link is generally a small number for the vast majority of the solved protein structures (Xu et al., 1999), making the actual computing time of PROSPECT threading practically acceptable. This observation has also raised an interesting question: "Is there something special about the topology of protein structures, which could be further and more rigorously exploited for efficient threading algorithm development?"

12.4.2 RAPTOR

RAPTOR uses a more general framework than PROSPECT to rigorously solve the threading problem (Xu et al., 2003a,b). It formulates a threading problem as a linear integer programming (LIP) problem. It uses an integer variable ("0" or "1") to represent whether a particular residue in the query protein sequence is assigned to a particular structural position. Then the set of all feasible solutions to a threading problem can be defined in terms of a set of equalities or inequalities (called constraints), each of which is defined in terms of the above and other integer variables.
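As an illustration of what such a formulation could look like, the block below writes out one possible 0/1 program. It is a sketch under stated assumptions and not RAPTOR's exact model: the symbols $x$, $y$, $a$, $b$, and $C$ are introduced here for illustration, every structural position is required to receive exactly one residue, and gap penalties are left out. Here $x_{r,p} = 1$ means that query residue $r$ is assigned to structural position $p$, $a_{r,p}$ is the singleton energy of that placement, $C$ is the set of position pairs within the distance cutoff, and $b_{r,s}$ is the pairwise energy between residues $r$ and $s$.

```latex
\begin{aligned}
\min \quad & \sum_{r,p} a_{r,p}\, x_{r,p}
  \;+\; \sum_{(p,q) \in C} \; \sum_{r \neq s} b_{r,s}\, y_{(r,p),(s,q)} \\
\text{s.t.} \quad
 & \sum_{r} x_{r,p} = 1 && \text{for every structural position } p, \\
 & \sum_{p} x_{r,p} \le 1 && \text{for every query residue } r, \\
 & x_{r,p} + x_{s,q} \le 1 && \text{whenever } r < s \text{ and } p > q, \\
 & y_{(r,p),(s,q)} \le x_{r,p}, \qquad y_{(r,p),(s,q)} \le x_{s,q}, \\
 & y_{(r,p),(s,q)} \ge x_{r,p} + x_{s,q} - 1, \\
 & x_{r,p},\; y_{(r,p),(s,q)} \in \{0, 1\}.
\end{aligned}
```

The third constraint keeps the alignment order-preserving, and the three constraints on $y$ are the standard linearization of the product $x_{r,p}\, x_{s,q}$, which is what keeps the pairwise energy term linear in the variables.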
The globally optimal threading problem is then defined as finding a feasible threading alignment that optimizes a given energy function. In general, a linear integer programming problem requires exponential computing time to find an optimal solution, and is hence intractable for large problems. Branch-and-bound is a popular technique for solving linear integer programming problems. Typically, an integer programming problem is first relaxed to a linear programming problem, i.e., variables are allowed to take real values as possible solutions. There are efficient algorithms for solving linear programming problems, as they are polynomial-time solvable (Papadimitriou and Christos, 1998). If by chance the solution to the relaxed linear programming problem is all integral, a solution to the original linear integer programming problem has been found. Otherwise the linear solution is used to constrain the search space, by fixing one variable to "0" or "1", and the algorithm iterates this process until all solutions have integral values. An interesting observation is that for the vast majority of threading problems, this relaxation procedure stops after a few iterations, solving the threading problem efficiently and also indicating that threading problems seem to have a special structure in terms of the integer programming formulation. Such special characteristics could possibly be utilized for developing more efficient threading algorithms, using more specialized algorithmic techniques.

One particularly interesting technique, which is being actively investigated by a number of researchers, is based on the idea of tree decomposition of an interaction graph representing possible alignments between a query sequence and a template structure (Song et al., 2005; Xu et al., 2005). In a sense this type of technique is a generalization of the divide-and-conquer algorithm outlined above.
The tree decomposition technique has been widely used for various graph-related optimization problems, for example finding the maximum independent set and dominating set (Arnborg and Proskurowski, 1989). We now provide a detailed description of one such algorithm for solving the protein threading problem. In a tree decomposition algorithm, both the template structure and the query sequence are represented as graphs; vertices denote core secondary structures (or simply cores) and edges represent interactions between cores (two cores are considered to interact if their shortest distance is within a predefined cutoff distance). A sequence-structure alignment problem essentially corresponds to finding an isomorphic mapping from the structure graph to a subgraph of the sequence graph. The efficiency of the alignment hinges on the tree width of the structure graph. Intuitively, the tree width of a graph measures how "treelike" the graph is. A graph can be represented as a "tree having thick trunks," where the "trunk thickness" is quantified by the tree width of the graph. This technique of "treelike" representation for graphs is called tree decomposition. In a tree decomposition of a graph, vertices of the graph are grouped into possibly overlapping subsets, each of which is associated with a node in the tree. The maximum size of such a subset, minus one, is the tree width of the tree decomposition. Given a tree decomposition of a structure graph with tree width $t$, a dynamic programming algorithm can be employed to find the optimal sequence-structure alignment in time $O(k^t N^2)$, for some small integer parameter $k$, with $N$ being the number of amino acids in the template structure. The alignment algorithm is very efficient since the tree width for such structure graphs is small in general (by the nature of protein structures).
For example, among 3890 protein tertiary structure templates compiled using PISCES (Wang and Dunbrack, 2003), only 0.8% have tree width t > 10 and 92% have t < 6, when using a 7.5-Å Cβ-Cβ distance cutoff for defining pairwise interactions [see Fig. 12.5(a)].

We now provide the details of a tree-decomposition-based threading algorithm. A sequence-structure alignment can be formulated as a generalized subgraph isomorphism optimization problem, for which both the template structure and the query sequence are represented as mixed graphs that contain both directed and undirected edges. We use V(G), E(G), and A(G) to denote the vertex set, the undirected edge set, and the directed edge (arc) set of a mixed graph G, respectively. The graph H for the template structure is constructed as follows: each vertex in V(H) represents a core, each undirected edge in E(H) represents the interaction between two cores, and each directed edge (arc) in A(H) represents the loop between two consecutive cores (from the N-terminal to the C-terminal). For technical convenience, both the N- and C-terminals are represented as vertices. Figure 12.3 gives a protein tertiary structure and its corresponding structure graph representation.

A query sequence is preprocessed so that for each core in the template, all substrings (called candidates) of the query sequence that align well with the core are identified (Xu et al., 2000). By representing each candidate as a vertex, a query sequence can also be represented as a mixed graph G. That is, each edge in E(G) connects a pair of candidates that may possibly interact but do not overlap in the sequence, and each arc in A(G) connects two candidates (from the N-terminal to the C-terminal) that do not overlap. As in a structure graph, both the N- and C-terminals are represented as vertices in the sequence graph. Figure 12.4 illustrates the sequence graph with a simple example.
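The structure graph construction can be sketched as follows, together with a simple way to obtain an upper bound on its tree width. The sketch makes several assumptions: each core is given as a list of 3D points, the terminal vertices are omitted, and the bound comes from the generic greedy min-degree elimination heuristic rather than from the decomposition method used in the cited work. Function names are hypothetical.

```python
import math
from itertools import combinations

def build_structure_graph(cores, cutoff=7.5):
    """Build interaction edges and loop arcs for a list of cores.

    cores: list of lists of 3D points, ordered from N- to C-terminal.
    Returns (edges, arcs) as a set of undirected pairs and a list of arcs.
    """
    def min_dist(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    # Two cores interact if their shortest distance is within the cutoff.
    edges = {(u, v) for u, v in combinations(range(len(cores)), 2)
             if min_dist(cores[u], cores[v]) <= cutoff}
    # One arc per loop between consecutive cores.
    arcs = [(u, u + 1) for u in range(len(cores) - 1)]
    return edges, arcs


def treewidth_upper_bound(n, edges):
    """Width of the decomposition produced by min-degree elimination."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width = 0
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))
        nbrs = adj.pop(v)
        # The bag {v} plus its neighbours has len(nbrs) + 1 vertices,
        # which corresponds to a width of len(nbrs).
        width = max(width, len(nbrs))
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}  # neighbours of v become a clique
    return width
```

Whether loop arcs are passed in alongside the interaction edges when estimating the width is a modelling choice left to the caller, for example `treewidth_upper_bound(len(cores), edges | set(arcs))`.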
The relationship between a core $v$ in the template and its candidates in the query sequence can be constrained using a mapping function $M$ such that $M(v)$ contains all possible candidates of $v$. The less restricted $M$ is, the more accurate the alignment is expected to be and the more time it may take to compute. Xu et al. (1998) used a similar approach in their divide-and-conquer threading algorithm, which can find all suitable candidates for each core. The maximum size $k = \max_v |M(v)|$ over all cores $v$ is called the map width of $M$, an important parameter for the alignment algorithm. Now a sequence-structure alignment problem can be formulated as a problem of finding an isomorphism mapping $f$ between the structure graph $H$ and a subgraph of the sequence graph $G$ such that the following sum of the alignment energy functions

$$\sum_{u \in V(H)} E_{core}(u, f(u)) + \sum_{(u,v) \in E(H)} E_{pair}(f(u), f(v)) + \sum_{\langle u,v \rangle \in A(H)} E_{loop}(f(u), f(v))$$

achieves the minimum, where $E_{core}$ is the alignment energy score (singleton type of energy) between a core $u$ in the template and its candidate $f(u)$ in the sequence, $E_{pair}$ represents the interaction energy (pairwise interaction energy) between residues $(f(u), f(v))$ assigned to cores $(u, v)$, and $E_{loop}$ is the alignment score between the loop $\langle u, v \rangle$ in the template and the corresponding [...]

[...] 1, especially when E > 1.2, there was a significant energy gap between the optimal alignment and the decoy alignments. Hence, this quality could be used as a measure for assessing the significance of a threading alignment. A good method for assessing the statistical significance of a threading score should not only allow comparing threading results on the same footing but also provide a way to indicate whether a particular fold is possibly the correct fold for a query sequence, without using other reference information.
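One common way to express the gap between an alignment's energy and a set of decoy energies is a z-score. The following is a minimal sketch, assuming that lower energy is better and that decoys are generated by shuffling the query sequence; both function names and the `score_fn` callback are hypothetical, and this is not the specific measure discussed in the text above.

```python
import random
import statistics

def z_score(native_energy, decoy_energies):
    """Standard deviations by which the alignment energy beats the decoys."""
    mean = statistics.mean(decoy_energies)
    sd = statistics.stdev(decoy_energies)
    return (mean - native_energy) / sd


def shuffled_decoys(query, score_fn, n=100, seed=0):
    """Energies of `n` shuffled copies of `query` under a supplied scorer."""
    rng = random.Random(seed)
    energies = []
    for _ in range(n):
        residues = list(query)
        rng.shuffle(residues)  # keeps composition, destroys residue order
        energies.append(score_fn("".join(residues)))
    return energies
```

Shuffling preserves the amino acid composition of the query, so a large z-score indicates that the order of the residues, rather than their overall composition, is responsible for the good fit.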
A typical threading program consists of four key components from the implementation perspective: (a) a database of template protein structures, (b) an energy function, often residue-based, (c) a threading algorithm that can find the optimal threading alignment between a query sequence and a template structure, and (d) a method for calculating "significance" scores of threading alignments.

From the prediction accuracy point of view, the larger a template structure database is, the more accurate we can expect the threading prediction to be. However, it might not always be realistic to use the whole PDB database as the template database, due to the amount of time required to thread a query sequence against each PDB structure. Often a template structure database consists of a representative subset of all the structures in PDB, say PDB-select, which consists of PDB structures with the "redundant" structures removed. Here "redundant" refers to structures that have high sequence similarities to other structures in the database. To make prediction more accurate, some threading programs employ a two-stage strategy: (a) thread a query sequence against a representative structure database and identify a few possible native-like structures, and (b) thread the query sequence against all family/superfamily members of the structures identified in (a). Certain preprocessing of the template database might be needed for some threading programs. For example, as we discussed in Section 12.4, protein structures need to be represented as structure graphs, as required by the threading algorithm.

The majority of the current threading programs employ the types of energy functions outlined in Section 12.3 or their variations. These energy functions are statistics-based rather than physics-based. They are used to distinguish correct structural folds from incorrect ones and to distinguish accurate alignments against the correct structural folds from inaccurate alignments.
Threading programs have been using these simple energy forms mainly because of considerations of computational efficiency, and also partly because limited structural data was available for more sophisticated energy forms when such statistics-based energy functions were first developed. As those simple energies began to reach their limits, we began to see more physics-based energy functions developed, as we discussed in Section 12.3. We expect that as the threading algorithms become more efficient, we will see more physics-based energy functions. We also expect that one particular type of extension to the existing energy functions will be to consider multibody interactions, which have been mostly ignored by existing threading programs. Recent studies have shown that multibody interactions could help to improve the performance of threading programs (Munson and Singh, 1997; Li and Liang, 2005), and hence should be considered.

Existing threading programs employ various algorithmic techniques for solving the sequence-structure alignment problem, including dynamic programming with enhanced heuristics (Westhead et al., 1995; Skolnick and Kihara, 2001; Zhang et al., 1997), divide-and-conquer algorithms, and integer programming (Xu and Li, 2003; Xu et al., 2003a,b). In this chapter, we presented a new class of threading algorithm based on a tree decomposition of sequence and structure graphs. While integer programming might represent the most general framework for handling sequence-structure alignments, particularly for threading problems considering multibody interactions, tree-decomposition-based algorithms could prove to be more popular down the road because of their conceptual simplicity and computational efficiency. We expect that a class of more general threading algorithms will begin to emerge to deal with more complex threading problems as the existing threading algorithms become faster and faster.
This general class of threading algorithms should be able to handle simultaneous backbone threading and side-chain packing problems, leading to significantly more accurate fold recognition and protein structure prediction.

Existing threading programs use various ideas and techniques to assess the "significance" of threading results. These methods include z-score calculation (Sommer et al., 2002), normalized threading scores using techniques such as support vector machines or neural networks (Xu et al., 2002; Ding and Dubchak, 2001), and [...] (Panchenko et al., 2000; Bryant and Altschul, 1995). While useful to some degree, none of these methods has reached a level of performance comparable to p-value calculations for BLAST sequence alignments (Altschul and Gish, 1996). This is possibly due to a combination of the inadequacy of the existing threading energy functions for accurate threading prediction and the lack of general understanding about distinguishing characteristics between correct and incorrect native folds and between correct and incorrect placements of amino acids into structural positions. Overall, compared to other areas of protein threading, this is a somewhat underdeveloped area. New ideas and techniques are clearly needed to fill the holes in this area.

Because of the importance of solving protein structures for functional studies and the power of threading techniques, many protein threading programs have been developed. Using these programs, a large number of protein structures have been predicted prior to the solution of their experimental structures, providing highly useful information for guiding experimental design in the investigation of these proteins. Examples of such predictions include an obese gene (Madej et al., 1995), vitronectin (Xu et al., 2001), and a SARS protein (von Grotthuss et al., 2003). Table 12.3 provides a list of popular threading programs and URLs for accessing these prediction tools.
We now summarize the highlights of some of these threading programs, which use different energy functions and different computational techniques, each of which has its strengths and limitations.

PROSPECT (Kim et al., 2003): The PROSPECT program employs a divide-and-conquer algorithm for rigorously solving the globally optimal threading problem. It uses a somewhat standard threading energy function, including a singleton energy term and a pairwise interaction energy term plus a secondary structure fitness score and a gap penalty score. For a typical threading problem, it can find the best alignments against a template structure database of 2000+ structures within a couple of days on a single CPU, while it can achieve a virtually linear speed-up using multiple CPUs when the number of CPUs is smaller than the number of structures in the template structure database. It achieves its computational efficiency by taking advantage of the fact that protein structures generally have small topological complexities (Xu et al., 1998) and by using a filtering procedure to filter out "improbable" alignment positions for each core secondary structure in the template structure. While this heuristic filtering works well for the vast majority of threading cases, it might filter out the correct alignment positions in some cases. PROSPECT normalizes the threading scores, along with various parameters of the template structure and the query sequence, using a support vector machine. Then a z-score is calculated based on the "normalized" threading scores.

RAPTOR (Xu and Li, 2003; Xu et al., 2003a,b): The RAPTOR program formulates a threading problem as a linear integer programming problem, and solves the problem using a branch-and-bound method plus a standard integer programming solver. RAPTOR employs the same energy functions as PROSPECT and uses an approach similar to that of PROSPECT for assessing the "statistical significance" of threading results.
A unique feature of the program is that its threading algorithm is more rigorous than that of PROSPECT, as it does not use a heuristic filter to filter out "improbable" alignment positions. For a typical threading problem, it takes minutes to hours to thread the query sequence against 2000 structures in its structure database. Since the program is data-parallelizable, its speed-up is virtually linear when running on multiprocessor computers.

GenTHREADER (Jones, 1999b) and an improved version, mGenTHREADER (McGuffin and Jones, 2003), use PSI-BLAST profiles (Altschul et al., 1997) and secondary structures predicted by PSIPRED (Jones, 1999a) for threading. The program employs a double dynamic programming strategy (Jones et al., 1992). The algorithm does not treat pairwise interactions rigorously, but its performance has been among the best of the threading programs, indicating the effectiveness of this strategy. A Web server for this program has been set up at http://bioinf.cs.ucl.ac.uk/psipred/. A user can in general expect a threading prediction to be returned within minutes. The prediction program runs fast enough that it can be used for genome-scale protein structure predictions.

PROSPECTOR (Skolnick and Kihara, 2001) recognizes native-like structural folds using a hierarchical strategy for obtaining sequence profiles. It uses two types of sequence profiles, one derived using close homologous sequences whose sequence identity lies between 35% and 90%, and another constructed using more remote homologous sequences with a FASTA E-value less than 10. Both types of sequence profiles are incorporated into a typical threading energy as described in Section 12.3 to screen a structural database. The program uses a dynamic programming algorithm to find the best threading alignment, and employs z-scores for assessing the significance of each threading alignment.
FUGUE (Shi et al., 2001) uses a typical threading energy function as described in Section 12.3, with some unique features: (1) its structural environment singleton term includes a term for hydrogen bonding status, and the singleton term was derived from structural alignments in the HOMSTRAD database (de Bakker et al., 2001; http://www-cryst.bioc.cam.ac.uk/homstrad/); (2) its gap penalties are structure-dependent, based on solvent accessibility, position relative to the secondary structure elements, and the conservation of the secondary structure elements; and (3) its alignment is based on multiple sequences against multiple structures to enrich the conservation/variation information. FUGUE uses dynamic programming as its threading algorithm and employs z-scores to assess the statistical significance of a threading result.

Since each of these threading programs has its own strengths and limitations, a popular strategy for predicting a protein structure is to use multiple prediction programs and combine their prediction results. Further discussion on this topic is given in Chapter 17.

We now use PROSPECT as an example to illustrate how to use a threading program for predicting structures of SARS-CoV proteins (Wan et al., 2005), which play a role in the development of the SARS disease. We used the PROSPECT pipeline to survey all of the 11 Open Reading Frames (ORFs) in SARS-CoV strain Urbani (GenBank ID: 30027617), one of the first SARS-CoV genomes. Among the 11 ORFs, the S and M proteins play a key role in the virus infection process. Interestingly, both the M protein and the S2 domain in the S protein are predicted to adopt the fold of the Ig-like beta sandwich. The structural similarity suggests that the S2 domain and the M protein may be evolutionarily related through gene fusion and duplication, although their sequences do not have significant similarity after a long period of evolution.
The threading results might explain how the M protein interacts with the S2 domain for virus assembly: since the S2 domain, with the fold of the Ig-like beta sandwich, can interact with the S1 domain, the M protein with the same fold could possibly interact with the S1 domain. This suggests that the S1 domain may act as an on/off switch between the S2 domain and the M protein. Such a mechanism may suggest that the M protein could also be involved in the virus-host cell interaction. This hypothesis was supported by a recent study of the murine hepatitis coronavirus, which showed that glycosylation of the M protein affected the interferogenic capacity of the virus (de Haan et al., 2004).

Threading programs have been used for genome-scale applications. A recent study (Guo et al., 2004) performed structure prediction for all of the ORFs of Pyrococcus furiosus, which is found in the marine sand surrounding sulfurous volcanoes and can grow at temperatures above 100 °C. The microbe utilizes peptides, proteins, and some carbohydrates as carbon sources. Its entire genome is about 2 Mb in length with 2195 annotated ORFs. Out of the total of 2195 ORFs, 540 are predicted to be membrane proteins, and structures can be predicted with high confidence for 753 proteins, among which 190 ORFs cannot be detected using PSI-BLAST.

Recent prediction results in CASPs indicate that even when the correct structural folds are identified, the threading alignments could often be off. From the same statistics, we found that the overall alignment accuracy has not improved over the past few CASPs. In addition, the alignment accuracy of the best CASP models using templates with <30% sequence identity ranges from 60% to 90% (Venclovas et al., 2003). All these statistics suggest that there is significant room for improvement in threading alignments.
The prediction accuracy of the current threading programs is mainly limited by the inaccuracy of statistics-based and residue-based threading energy functions. While improving the threading energy functions represents one direction to take for improving threading performance, other approaches may also help, including (a) application of partial experimental data as constraints in the threading process and (b) refinement of threaded structures using molecular dynamics and energy minimization. We refer the reader to Chapter 11 for related discussions.

Often partial structural data is available for a specific protein before its detailed structure is determined. These partial structural data might be in the form of (a) residue-residue distances, such as disulfides between specific cysteines, (b) specific residues involved in particular active sites, binding sites, or other functionally important sites, (c) specific residues known to be on the surface of a protein structure, or any other information providing geometric information about specific residues in a protein structure. In addition to such information about specific residues, there are experimental techniques that can be used to generate geometric information in a systematic manner. NMR represents one such technique. NMR methods solve a protein structure by generating either distance restraints [called NOE, or nuclear Overhauser effect, distances (Prestegard, 1998)] between different residues (or, more specifically, different atoms) or orientations of certain chemical bonds in a protein, called residual dipolar coupling (Tolman et al., 1995). Then a protein structure is solved by finding structural models that are consistent with the collected geometric constraints and have their energy minimized. Accurately solving a protein structure using NMR techniques typically requires 15-20 distance restraints per residue (Clore et al., 1993), which requires multiple NMR experiments.
Partial distance restraints could possibly be collected using fewer NMR experiments, possibly involving labeling of specific amino acid types. The same can be said about orientation information collected through residual dipolar coupling experiments. While these partial NMR data might not be sufficient for solving a protein structure accurately, they provide highly useful constraints for protein fold recognition and backbone structure prediction by a threading method.

Chemical cross-linking experiments provide another systematic approach to generating partial structural information for a protein. In such experiments, chemical cross-linkers with customized arm lengths are designed to link specific types of amino acids within a certain distance range (Cohen and Sternberg, 1980). Such experiments, followed by tandem mass spectrometry experiments and data interpretation, can provide distance information between certain amino acids. This approach has been used to collect structural data for both soluble and membrane proteins (Yan et al., 2005), and the resulting data have then been used for protein structure prediction (Young et al., 2000).

One way to use such structural information, such as distances or structural locations, in a threading program is to add a term to the threading energy function that measures the consistency between the collected structural data and a threading alignment. For example, if residues A and B are known to be within a certain distance, an energy term could be designed to penalize threading alignments that violate this particular geometric constraint. The energy term could be designed so that the bigger the violation, the larger the penalty.
Similarly, if a residue X is known to be on the surface of a protein structure, an energy term could be designed to penalize threading alignments that do not place X in a surface position in the template structure, with the size of the penalty reflecting the degree of violation. To deal with all geometric constraints, we can define a new energy term E_G, which is the sum of the individual penalty functions for all the specific geometric constraints. The overall scaling factor for this new energy term in a threading energy function could be determined empirically on a training data set for which actual partial experimental data are available.

The effectiveness of applying such partial structural data in a threading program has been documented in a number of studies (Xu et al., 2000b,c; Young et al., 2000; Qu et al., 2004a,b). Table 12.4 shows a systematic study on improving the threading performance by incrementally increasing the number of NMR/NOE distance constraints used in a threading process. It is seen that distance constraints help both fold recognition and threading alignment accuracy.

Table note: "Query" and "temp" represent the PDB codes of the query and template proteins, respectively. "RMSD/rank vs. percentage of assigned ss" gives the Cα-RMSD between the experimental structure and the structure predicted using MODELLER based on threading alignments, for the alignable portions in the structure-structure alignment between the query protein and the template, together with the rank of the correct template structure among 667 templates. 0%, 20%, 40%, ..., 100% represent the percentage of residues with secondary structure assignment. The highlighted numbers show improvement between using no secondary structure information and using full secondary structure assignments.
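As an illustration only, the constraint energy term E_G described above could be sketched as follows. This is not the energy function of any particular threading program: the class names, the linear form of the penalties, the use of a relative solvent exposure value per template position, and the choice to give unaligned residues zero penalty are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class DistanceConstraint:
    """Hypothetical record: query residues res_a and res_b lie within max_dist (in Å)."""
    res_a: int
    res_b: int
    max_dist: float


@dataclass(frozen=True)
class SurfaceConstraint:
    """Hypothetical record: query residue res is on the surface (exposure >= min_exposure)."""
    res: int
    min_exposure: float


def distance_penalty(c, alignment, coords):
    """Linear penalty that grows with the size of the distance violation."""
    a, b = alignment.get(c.res_a), alignment.get(c.res_b)
    if a is None or b is None:
        return 0.0  # assumption: residues not aligned to the template are not penalized
    return max(0.0, dist(coords[a], coords[b]) - c.max_dist)


def surface_penalty(c, alignment, exposure):
    """Linear penalty for placing a known surface residue at a buried template position."""
    pos = alignment.get(c.res)
    if pos is None:
        return 0.0  # same assumption as above
    return max(0.0, c.min_exposure - exposure[pos])


def constraint_energy(alignment, coords, exposure, dist_constraints, surf_constraints):
    """E_G: the sum of the individual penalties over all geometric constraints."""
    return (sum(distance_penalty(c, alignment, coords) for c in dist_constraints)
            + sum(surface_penalty(c, alignment, exposure) for c in surf_constraints))


def total_energy(base_energy, e_g, weight):
    """Threading energy plus the scaled constraint term; weight would be fit on training data."""
    return base_energy + weight * e_g


# Toy template: three positions with 3D coordinates and relative exposure values.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
exposure = [0.1, 0.5, 0.6]
# Alignment maps query residue index -> template position.
alignment = {5: 0, 9: 2}

e_g = constraint_energy(
    alignment, coords, exposure,
    [DistanceConstraint(5, 9, 6.0)],   # aligned positions are 10 Å apart: violation of 4 Å
    [SurfaceConstraint(5, 0.25)],      # aligned position has exposure 0.1: violation of 0.15
)
```

Here `alignment` maps query residue indices to template positions, so the penalties are evaluated on the template geometry implied by a candidate alignment. A real implementation would evaluate these terms inside the alignment search rather than after it, and might use a nonlinear penalty.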
In the best scenario, a threading program can provide an accurate prediction of the backbone atoms of a protein structure, which is still a long way from a detailed all-atom structure. In the most general situation, a threading program can provide a somewhat accurate structure for the backbone atoms in the core secondary structures, while predictions for the loop regions are often not accurate. The reason is that threading predicts a structure based on a known template structure. While the core secondary structures among homologous proteins are generally well conserved, loops often are not. Hence, template-based loop predictions are generally not accurate. Fortunately, existing methods for short loop prediction (< 14 residues) have reached a level at which the predicted loops can be as accurate as the predicted core structure. For example, the recent work by Jacobson et al. (2004) achieved prediction accuracies of 0.43 Å for 5-residue loops, 0.84 Å for 8-residue loops, and 1.63 Å for 11-residue loops, using an accurate all-atom energy function and a hierarchical refinement protocol.

Potentially the predicted full backbone structure, after adding loop structures, could be refined using an energy-based approach. To do this, one needs to put all the atoms, backbone and side chains, into a structural model. One can use alignments with the templates selected in fold recognition to produce a 3D atomic model through homology modeling tools such as MODELLER (Sali and Blundell, 1990), which runs a protocol of energy minimization and molecular dynamics simulation to refine a structural model. After a structural model is generated, one can apply structure assessment tools such as WHATIF (Vriend, 1990) and PROCHECK (Laskowski et al., 1993) to evaluate the packing and backbone conformations, the inside/outside occupancies of hydrophobic and hydrophilic residues, and the stereochemical quality of a predicted structure.
Based on this assessment, a user can pick the best among the multiple structures derived from an alignment.

While widely used, the potential of protein threading as a protein structure and function prediction technique is far from being fully realized. A number of factors have limited its wider application. First, fold recognition for structural analogues and some remote homologues is still challenging (Kinch et al., 2003; Sippl et al., 2001). Such proteins might account for about 40% of all proteins encoded in a typical genome according to our studies (Xu et al., 2003; Guo et al., 2004). These structures are theoretically modelable using comparative modeling techniques such as protein threading, but the predictions typically have low confidence and may be wrong. Second, even when a correct fold is identified, the accuracy of the threading alignment has been about 60-90% for proteins with less than 30% sequence identity to their template structures (Venclovas et al., 2003). Novel ideas and new techniques are clearly needed to make a major jump in the prediction capability of the existing threading methods; this has become quite clear from the slow and incremental improvements in threading performance in the past few CASP contests (Venclovas et al., 2003).

The current energy functions are generally coarse grained, mainly to achieve fast predictions. Given the significant advances in computer hardware and algorithm development, it may be time to use more sophisticated energy functions. For example, multibody interactions and more physical energy functions may help improve threading prediction accuracy. Although many theoretical studies have been carried out on threading algorithms, there is still significant room for improving the computational efficiency of threading programs.
Better search methods against structural templates using advanced database techniques have not been explored thoroughly. More work can be done at the implementation level, in a similar way to the implementation of BLAST, where many low-level operations were implemented in a highly efficient manner. In addition, algorithmic development needs to address new types of energy functions, for example through new threading algorithms that can handle simultaneous backbone threading and side-chain packing to take full advantage of more detailed energy function forms. Threading algorithms that can handle multibody interactions and energy functions capturing more global properties (e.g., compactness) of proteins are clearly underdeveloped.

Existing confidence assessments are either too time-consuming to compute or not sufficiently accurate. More rigorous and faster assessment techniques for threading are clearly needed to achieve performance comparable to that of BLAST. Assessments of different alignments using the same fold have basically not been studied. Furthermore, identification of "reliable" versus "unreliable" parts of a threaded structure, and quantitative assessment of the structural deviations, in terms of RMSD, for regions of predicted structures have not been achieved.

It has been found that using multiple fold recognition programs to build a consensus on the structural template is an effective way to increase prediction accuracy (Lundstrom et al., 2001). Furthermore, one can thread subdomain structures and use these substructures from different templates to build a new structure through a shotgun approach (Fischer, 2000, 2003). Currently a consensus is built by a simple majority vote. Much more statistical work can be done to build the consensus in a more rigorous way. In addition, how to piece different substructures together to form a global protein structure is another challenging issue.
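The simple majority-vote scheme mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of any specific consensus server: the function name, the program names, the template identifiers, and the decision to report no consensus on a tie are assumptions made for the example.

```python
from collections import Counter


def majority_vote(top_hits):
    """Pick the template named most often as the top hit across programs.

    top_hits maps a (hypothetical) program name to the identifier of its
    top-ranked template. Returns (template, support), where support is the
    fraction of programs voting for the winner; template is None on a tie.
    """
    if not top_hits:
        raise ValueError("at least one prediction is required")
    ranked = Counter(top_hits.values()).most_common()
    best, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    return (None if tied else best), votes / len(top_hits)


# Three hypothetical programs; two agree on the same template.
consensus, support = majority_vote(
    {"program_a": "1abc", "program_b": "1abc", "program_c": "2xyz"}
)
```

A vote of this kind ignores the confidence score each program attaches to its hit, which is one reason a more statistically grounded combination could do better.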
Further discussion on consensus building and subdomain threading can be found in Chapter 17.

As a structure prediction technique, threading potentially applies to at least 80% of all protein families. However, the application of threading to membrane proteins has been very limited due to the lack of available structural templates. Threading techniques have been widely used for various purposes in biological studies, including (a) functional studies of proteins and experimental design (e.g., targeted mutagenesis) (Madej et al., 1995; Xu et al., 2001; von Grotthuss et al., 2003), (b) genome annotation (Xu et al., 2003; McGuffin et al., 2004), (c) helping solve experimental structures (Ye et al., 2004), (d) modeling protein complex structures (Lu et al., 2003), (e) prediction of misfolded structures (see Chapter 9), and (f) protein design (Sorenson and Head-Gordon, 1999). To further increase the utility of threading techniques, so that genome-scale protein structure prediction can keep up with the rate of genome sequencing and gene prediction, we clearly need a new generation of threading energy functions, threading algorithms, methods for assessing the statistical significance of threading results, and methods for refining threaded structures.

There are several comprehensive reviews and books on various aspects of threading. Recent reviews related to threading include Fetrow et al. (2002) and Godzik (2003). A number of books also provide general coverage of threading and protein structure prediction (Tsigelny, 2002; Jiang et al., 2002; Bourne and Weissig, 2003). For scoring functions, readers can find more information in Chapters 2 and 3 of this book. For more information about general protein structure and structure-function relationships, we recommend Branden and Tooze (1999) and Lesk (2001).
PDP: protein domain parser
Local alignment statistics
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs
SCOP database in 2004: Refinements integrate structure and sequence family data
Domain combinations in archaeal, eubacterial and eukaryotic proteomes
An insight into domain combinations
Linear time algorithms for NP-hard problems restricted to partial k-tree
The Universal Protein Resource (UniProt)
Protein structure prediction and structural genomics
A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons
Flexible protein sequence patterns. A sensitive method to detect weak structural similarities
A linear time algorithm for finding tree-decompositions of small treewidth
Structural Bioinformatics
A method to identify protein sequences that fold into a known three-dimensional structure
Introduction to Protein Structure
Fundamentals of Algorithmics
Understanding protein structure: Using scop for fold interpretation
Statistics of sequence-structure threading
On the structural complexity of a protein
Fold recognition with minimal gaps
Exploring the limits of precision and accuracy of protein structures determined by nuclear magnetic resonance spectroscopy
On the use of chemically derived distance constraints in the prediction of protein structure with myoglobin as an example
A unifold, mesofold, and superfold model of protein fold use
HOMSTRAD: Adding sequence information to structure-based alignments of homologous protein families
Cleavage inhibition of the murine coronavirus spike protein by a furin-like enzyme affects cell-cell but not virus-cell fusion
SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence
Identification of homology in protein structure classification
Multi-class protein fold recognition using support vector machines and neural networks
The multiplicity of domains in proteins
Large macromolecular complexes in the Protein Data Bank: A status report
Multi-domain proteins in the three kingdoms of life: Orphan domains and other unassigned regions
A study of combined structure/sequence profiles
The protein folding problem: A biophysical enigma
Why do globular proteins fit the limited set of folding patterns?
Hybrid fold recognition: Combining sequence derived properties with evolutionary information
3D-SHOTGUN: A novel, cooperative, fold-recognition metapredictor
Protein fold recognition using sequence-derived predictions
Assessing the performance of fold recognition methods by means of a comprehensive benchmark
Assigning amino acid sequences to 3-dimensional protein folds
Planar graph decomposition and all pairs shortest paths
Structural genomics: Bioinformatics in the driver's seat
Prediction of transcription regulatory sites in Archaea by a comparative genomic approach
Can sequence determine function?
A structural census of genomes: Comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure
How representative are the known structures of the proteins in a complete genome? A comprehensive structural census
Comparing genomes in terms of protein structure: Surveys of a finite parts list
Fold recognition methods
PROSPECT-PSPP: An automatic computational pipeline for protein structure prediction
Selection of representative protein data sets
Mapping the protein universe
The FSSP database: Fold classification based on structure-structure alignment of proteins
A hierarchical approach to all-atom protein loop prediction
Current Topics in Computational Molecular Biology
Protein secondary structure prediction based on position-specific scoring matrices
GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences
A new approach to protein fold recognition
PROSPECT II: Protein structure prediction program for genome-scale applications
CASP5 assessment of fold recognition target predictions
The structure of the protein universe and genome evolution
PROCHECK: A program to check the stereochemical quality of protein structures
The protein threading problem with sequence amino acid interaction preferences is NP-complete
Introduction to Protein Architecture: The Structural Biology of Proteins
A unified statistical framework for sequence comparison and structure comparison
Emergence of preferred structures in a simple model of protein folding
Are protein folds atypical?
Designability of protein structures: A lattice-model study using the Miyazawa-Jernigan matrix
A distance-dependent atomic knowledge-based potential for improved protein structure selection
Geometric cooperativity and anti-cooperativity of three-body interactions in native proteins
Multimeric threading-based prediction of protein-protein interactions on a genomic scale: Application to the Saccharomyces cerevisiae proteome
Protein distance constraints predicted by neural networks and probability density functions
Pcons: A neural-network-based consensus predictor that improves fold recognition
Threading analysis suggests that the obese gene product may be a helical cytokine
Comparative genomics of the Archaea (Euryarchaeota): Evolution of conserved protein families, the stable core, and the variable shell
How many species are there on earth
Improvement of the GenTHREADER method for genomic fold recognition
The Genomic Threading Database: A comprehensive resource for structural annotations of the genomes from key organisms
Novel knowledge-based mean force potential at atomic level
Statistical significance of protein structure prediction by threading
Statistical significance of hierarchical multibody potentials based on Delaunay tessellation and their application in sequence-structure alignment
SCOP: A structural classification of proteins database for the investigation of sequences and structures
Protein superfamilies and domain superfolds
CATH-A hierarchic classification of protein domain structures
A local alignment method for protein structure motifs
Threading with explicit models for evolutionary conservation of structure and sequence
Combination of threading potentials and sequence profiles improves fold recognition
Combinatorial Optimization: Algorithms and Complexity
New techniques in structural NMR-anisotropic interactions
Protein fold recognition through application of residual dipolar coupling data
Protein structure prediction using sparse dipolar coupling data
The anatomy and taxonomy of protein structure
Graph minors. 2. Algorithmic aspects of tree-width
Protein fold recognition by prediction-based threading
Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming
An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction
FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties
Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins
Assessment of the CASP4 fold recognition category
Structural genomics and its importance for gene function analysis
Defrosting the frozen approximation: PROSPECTOR: A new approach to threading
Confidence measures for protein fold recognition
Tree decomposition based protein threading
Redesigning the hydrophobic core of a model beta-sheet protein: Destabilizing traps through a threading approach
The COG database: A tool for genome-scale analysis of protein functions and evolution
Protein structure alignment
Nuclear magnetic dipole interactions in field-oriented proteins: Information for structure determination in solution
Protein Structure Prediction: Bioinformatic Approach
Assessment of progress over the CASP experiments
mRNA cap-1 methyltransferase in the SARS genome
WHAT IF: A molecular modelling and drug design program
Application of computational biology in understanding emerging infectious diseases: Inferring the biological function for S-M complex of SARS-CoV. In Progress in Bioinformatics
PISCES: A protein sequence culling server
How many fold types of protein are there in nature
Protein fold recognition by threading: Comparison of algorithms and analysis of results
Nucleation, rapid folding, and globular intrachain regions in proteins
Model for the three-dimensional structure of vitronectin: Predictions for the multi-domain protein from threading and docking
Characterization of protein structure and function at genome scale with a computational prediction pipeline
Sequence-structure specificity of a knowledge based energy function at the secondary structure level
A tree decomposition approach to protein structure prediction
Assessment of RAPTOR's linear programming approach in CAFASP3
RAPTOR: Optimal protein threading by linear programming. J Bioinform
Protein threading by linear programming
A polynomial-time algorithm for a class of protein threading problems
Protein threading using PROSPECT: Design and evaluation
A computational method for NMR-constrained protein threading
Protein threading by PROSPECT: A prediction experiment in CASP3
Protein structure determination using protein threading and sparse NMR data
Protein domain decomposition using a graph-theoretic approach
A practical method for interpretation of threading scores: An application of neural network
An efficient computational method for globally optimal threading
A graph-theoretic approach for the separation of b and y ions in tandem mass spectra
Probabilistic cross-link analysis and experiment planning for high-throughput elucidation of protein structure
High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry
Similarities and differences between nonhomologous proteins with similar folds: Evaluation of threading strategies
Estimating the number of protein folds
Scoring function for automated assessment of protein structure template quality
Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction
Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments

This research was sponsored in part by the U.S. Department of Energy's Genomes to Life program (www.doegenomestolife.org) under project "Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling" (www.genomesllife.org). YX and ZJL's work was also supported in part by NSF/DBI-0354771, NSF/ITR-IIS-0407204, and a "Distinguished Cancer Scholar" grant from the Georgia Cancer Coalition. DX's work was also partially funded by NSF/EIA-0325386.