Submitted 30 June 2016 Accepted 30 October 2016 Published 5 December 2016 Corresponding author Alberto Pascual-García, alberto.pascual.garcia@gmail.com, apascual@ic.ac.uk Academic editor Rahul Shah Additional Information and Declarations can be found on page 33 DOI 10.7717/peerj-cs.100 Copyright 2016 Nido et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS Learning structural bioinformatics and evolution with a snake puzzle Gonzalo S. Nido1,2,*, Ludovica Bachschmid-Romano3,*, Ugo Bastolla4 and Alberto Pascual-García4,5 1 Department of Neurology, Bergen University, Bergen, Norway 2 Department of Clinical Medicine, Bergen University, Bergen, Norway 3 Department of Artificial Inteligence, Technische Universität Berlin, Berlin, Germany 4 Centro de Biología Molecular ‘‘Severo Ochoa,’’ Universidad Autónoma de Madrid, Madrid, Spain 5 Department of Life Sciences, Imperial College London, Ascot, Berkshire, United Kingdom * These authors contributed equally to this work. ABSTRACT We propose here a working unit for teaching basic concepts of structural bioinformatics and evolution through the example of a wooden snake puzzle, strikingly similar to toy models widely used in the literature of protein folding. In our experience, developed at a Master’s course at the Universidad Autónoma de Madrid (Spain), the concreteness of this example helps to overcome difficulties caused by the interdisciplinary nature of this field and its high level of abstraction, in particular for students coming from traditional disciplines. The puzzle will allow us discussing a simple algorithm for finding folded solutions, through which we will introduce the concept of the configuration space and the contact matrix representation. This is a central tool for comparing protein structures, for studying simple models of protein energetics, and even for a qualitative discussion of folding kinetics, through the concept of the Contact Order. It also allows a simple representation of misfolded conformations and their free energy. These concepts will motivate evolutionary questions, which we will address by simulating a structurally constrained model of protein evolution, again modelled on the snake puzzle. In this way, we can discuss the analogy between evolutionary concepts and statistical mechanics that facilitates the understanding of both concepts. The proposed examples and literature are accessible, and we provide supplementary material (see ‘Data Availability’) to reproduce the numerical experiments. We also suggest possible directions to expand the unit. We hope that this work will further stimulate the adoption of games in teaching practice. Subjects Bioinformatics, Computational Biology, Computer Education, Scientific Computing and Simulation Keywords Structural bioinformatics, Education, Protein folding, Statistical mechanics, Contact matrix, Protein structure alignment, Designability, Evolution, Contact order, Protein classification INTRODUCTION Scientific knowledge is becoming increasingly interdisciplinary, with life sciences being one of the most significant examples. This field has attracted experts from different areas and backgrounds, as foreseen by Schrödinger’s seminal book ‘‘What is Life?’’ (Schrödinger, 1992). In fact, life sciences offer a very interesting ground for the application of formal methods originated in other disciplines such as physics or informatics, allowing to address How to cite this article Nido et al. (2016), Learning structural bioinformatics and evolution with a snake puzzle. PeerJ Comput. Sci. 2:e100; DOI 10.7717/peerj-cs.100 https://peerj.com mailto:alberto.pascual.garcia@gmail.com mailto:apascual@ic.ac.uk https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.100 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://dx.doi.org/10.7717/peerj-cs.100 biological questions from different perspectives (Lazebnik, 2002). In recent years, the spectacular increase of biological data promoted by high-throughput technologies boosted the development of computational tools for its analysis and classification, an essential task in itself (Dougherty & Braga-Neto, 2006). We are witnessing the growth of a theoretical biology with formal parallelism with other disciplines and with the objective to identify the underlying general principles of biological systems. This scenario challenges the traditional educational programs (Gallagher et al., 2011), often reluctant to overcome the boundaries between established disciplines. In contrast, we observe a rapid growth of interdisciplinary publications in the scientific literature, which contributes to bolster and further extend the gap between research and education. As a result, it is common that students with different backgrounds meet in the same postgraduate courses, but they often lack the skills required to work in such an integrative environment. Together with the limited number of teaching hours, this situation constitutes a serious bottleneck to learning. Hence, it is of great importance to find tools that help to bridge the gap between different backgrounds and favour a learning convergence (Fox & Ouellette, 2013). Games can help students to assimilate abstract concepts and to address complex problems. There is a growing number of games related with topics as diverse as protein folding (Cooper et al., 2010), spin glasses (Hartmann, 2013), ecological networks (Fortuna et al., 2013), biological data integration (Schneider & Jimenez, 2012), geometry (O’Rourke, 2011) or scientific induction (Gardner, 2008), to name a few. Here we present an illustrative example where we employed a wooden snake puzzle in a structural bioinformatics course at the Universidad Autónoma de Madrid (Spain). This puzzle can be regarded as a coarse grained (toy) model of a polymer structure and it is strikingly similar to simplified mathematical models proposed in the protein folding literature (see, for instance, Šali, Shakhnovich & Karplus, 1994). We propose several exercises accessible to students with a graduate-level background in either biology or physics and notions in programming. Our goal consists in giving concreteness to the subjects presented in the course through a physical object, and our experience in this sense is very positive. This example allows us to discuss the first steps in the modeling process, i.e., the definition of the system and its epistemological and practical consequences, a discussion that is often neglected. Relying on a physical object allows us to provide a first intuitive contact with the different subjects that we treat, ranging from computational techniques, such as protein structure alignment algorithms, to theoretical concepts, such as protein folding and evolution. We argue that starting from these examples allows the lecturer to introduce more complex problems, in which real examples might be considered. We suggest along the unit different questions that may be followed to expand a course. One important aspect in which we focused in the development of these problems is the intimate relation between physics and evolution. Adapting the famous quote of Theodosius Dobzhansky, the properties of natural proteins only make sense in the light of evolution and, conversely, the properties of protein evolution only make sense considering the constraints imposed by protein physics (Liberles et al., 2012). Furthermore, there is a deep analogy between the statistical mechanics in the space of protein conformations, which Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 2/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 governs protein folding, and the statistical mechanics in the space of protein sequences, which emerges from evolution (Sella & Hirsh, 2005). The paper is organized as follows. Firstly, we present the analogy between the snake puzzle and the contact matrix representation of protein structures, and we propose a simple algorithm to solve the puzzle. The solutions are then adopted as a set of representative model structures, on which we propose computational exercises focused on evolutionary concepts. Throughout the paper we suggest several discussions that are stimulated by the analogy between the computational results and real proteins, proposing chosen references that are easy to handle by postgraduate students. Finally, we provide as Supplementary Material (see ‘Data Availability’) the input data needed to reproduce all the numerical experiments and the source code to execute some of them. PROPOSED EXERCISES Puzzle description: visualization and analogies with protein structures The snake puzzle consists of a wooden chain of N blocks that must be folded into a cube. Each element of the chain, except the first and the last one, is linked to two others. Three linked units are constrained to form either a straight line or a turn that extends into two dimensions. There are at least two versions of the puzzle in different sizes, available in retail stores and online shops. The first version is a 27-unit puzzle that folds into a cube of side 3, and the second one is a 64-unit puzzle that folds into a cube of size 4. In accordance with the nomenclature used in the literature on polymer physics, we will refer to the former as 27-mer model and as 64-mer to the latter. The exercises proposed in the following sections are based on the 64-mer puzzle, which is more complex while remaining computationally tractable. Our proposal is inspired by the similarity between this puzzle and the lattice models of protein structures studied in the literature (the relevant references will be presented all throughout the text). The number of possible polymer conformations on the cubic lattice is huge. Taking into account the self-avoiding condition that two units cannot occupy the same cell, each new unit can be placed in at most 5 different cells. Each time that the length increases the number of conformations is multiplied by a roughly constant factor, leading to an exponential increase. Numerical computations show that the number of self-avoiding walks on the cubic lattice scales as NγµN for large N , with µ≈4.68 (Bellemans, 1973), close to the maximum possible value, so that the number of conformations of a 64-mer is of the order of ∼1051. The enormousness of these numbers offers the opportunity to introduce the well known Levinthal’s paradox (Levinthal, 1969) as a starting point for a course focused on protein folding. Inlatticemodelsofproteinfoldingitisoftenassumedthatfoldedproteinsarerepresented by maximally compact conformations. In the case of the 64-mer, these are conformations that occupy the 4×4×4 cube. This number also increases exponentially, although with a smaller exponent whose asymptotic value can be computed analytically, for instance it is µ=5/e ≈1.84 on the cubic lattice (Pande et al., 1994). Thus the number is huge Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 3/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 Figure 1 Conformations of the 27-mer and 64-mer. (A): Maximally compact conformations. (B): Partly open conformation that illustrates the similarity between the model and two protein domains connected through a hinge. (C): Fully extended conformations, where one can see the consecutive rigid fragments of size 2 and 3 (for the 27-mer) and size 4 (64-mer). (of the order of 1019 for the 64-mer). An algorithm to generate exhaustively all maximally compact conformations of the 27-mer model was proposed in Shakhnovich & Gutin (1990). With respect to these numbers, the puzzle presents a huge reduction of the conformation space due to the constraints on the direction that each step can take. Indeed, linearly arranged consecutive units can be regarded as a single rigid fragment of size 2, or 3 (27-mer) and 2, 3, or 4 (64-mer). These rigid blocks can be easily seen in Fig. 1C. The number of fragments is much smaller than the number of units. In addition, two consecutive rigid fragments have only four self-avoiding conformations (two consecutive fragments cannot extend in the same dimension). Consequently, due to these constraints the 64-mer puzzle has access to only ∼1012 conformations. Nevertheless, it is still remarkably difficult to solve it without computational help. This reduction in the order of magnitude of the number of conformations offers an opportunity to discuss how the secondary structure of proteins limits the number of different folded structures, and how secondary structure prediction helps predicting protein structure from sequence. Furthermore, different fragments of the puzzle are sometimes assembled into higher order structures reminiscent of structures found in real proteins. For instance, the 64-merhas two consecutive blocks of size 4 that cannot be folded unless they are placed in parallel next to each other. This configuration is remarkably similar to the Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 4/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 naturally occurring beta sheet secondary structure. The constraints that this folding entails subdivide the puzzle into two structurally independent modules joined by what resembles a protein hinge (see Fig. 1B). On the other hand, there are long regions of consecutive blocks of length 2 that can provide the opportunity to discuss the role of more flexible regions in real proteins such as large loops or intrinsically unstructured regions. Solutions of the puzzle After this preliminary descriptive analysis of the puzzle we propose the first computational exercise, which consists in programming a search algorithm that generates all the maximally compact conformations (i.e., the solutions of the snake puzzle folded in a cube). The solution that we describe in ‘Materials and Methods’ starts from one end of the chain and exhaustively builds all the conformations of the blocks of the puzzle that fit into a cube of size 3 (for the 27-mer) or 4 (64-mer). We provide the source code along with the algorithm as Supplementary Material (see ‘Data Availability’) to the present paper. It has to be noted that the time taken by our search algorithm grows exponentially with the number of fragments, in accordance with a recent work that shows that the snake puzzle is an NP-complete problem (Abel et al., 2013), i.e., loosely speaking, a problem that cannot be solved exactly by any algorithm whose computation time grows only polynomially with system size. Thus, while it would be impossible to apply this algorithm to much larger systems, modern computers are readily capable of handling the calculations for the 64-mer. Stochastic algorithms can find solutions in shorter times at the expense of an exhaustive exploration, and they are of interest in the context of protein folding. To characterize the bottlenecks of the search algorithm, we monitor in which regions of the puzzle the exhaustive search spends more time. In Fig. 2 we plot the histogram of the number of times a fragment is visited during two exhaustive searches that start from the two ends of the chain. To understand the relationship between the search and the constraints imposed by the fragments, we depict in the histogram the position of the rigid fragments of size larger than two. We observe that there is an intense search around the regions where many consecutive small fragments accumulate. On the other hand, the search is drastically reduced when the algorithm finds large consecutive fragments (see, for instance, the two rigid fragments of size 4 separated by a fragment of size 2) because there are few possible conformations compatible with valid spatial coordinates. It is interesting to observe that the number of steps needed to find the solutions varies depending on the end of the chain from which the algorithm starts. This observation suggests that the puzzle may be solved faster if we start the algorithm from regions with more restrictive constraints. Through the exhaustive search, we find eight solutions for the 64-mer puzzle that we show in Fig. 3. These solutions will be labelled hereafter as S1, S2, ..., S8. The students can be requested to visually inspect the solutions through standard visualization software such as PyMOL (DeLano, 2002) or VMD (Humphrey, Dalke & Schulten, 1996). For a review on software for protein structure visualization see O’Donoghue et al. (2010). For this exercise, it is necessary to transform the format of the files with spatial coordinates. This gives us the opportunity to discuss the different file formats with their flaws and Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 5/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 Chain element number 0 0.5x108 1.0x108 1.5x108 0 10 20 30 40 2.0x108 N um be r o f v is its d ur in g fo ld in g Figure 2 Number of times that a given rigid fragment is visited by the folding algorithm. The search starts either from the first (blue histogram) or from the last fragment of the chain (red) and explores all maximally compact conformations. Positions with blocks of length larger than two are depicted over the histograms. advantages, and to stress the importance of format standardization in bioinformatics (Gibas & Jambeck, 2001). The contact matrix At this point, it is convenient to introduce a reduced representation of protein structures that arises naturally in the context of lattice polymers, and that will play an important role in the following: the contact matrix. This binary matrix has value Cij =1 if two units i and j contact each other in space in the polymer conformation, and Cij =0 otherwise: Cij = { 1 if dij ≤d0 0 otherwise (1) Bonded units (j = i ± 1) are excluded from this count because they contact each other in all conformations. In lattice polymers, the condition for contact is that two units are nearest-neighbours in the lattice, i.e., d0 is the lattice space. We provide the contact matrices of the eight solutions of the snake puzzle in the Supplementary Material (see ‘Data Availability’) since they are needed to perform the exercises that we describe in the following. Figure 4 shows the four different types of locations that a monomer can occupy within the maximally compact structure and the number of contacts that it has in each case. The same figure also shows an intermediate conformation of two similar solutions, S1 and S2, depicting only their common structure and the first different fragment. The associated contact matrices are represented below, highlighting the differences determined by the alternative positions of the last fragment. Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 6/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 Figure 3 The eight solutions of the 64-mer puzzle. Although the solutions cannot be superimposed, they represent four pairs of mirror images, analogous to naturally occurring enantiomers of chemical sub- stances. In the figure, each pair of related solutions are depicted with the same color (S1 & S4, S2 & S5, S3 & S6, S7 & S8). The figure was generated using coordinates files in PDB format, whose documentation can be obtained in http://www.wwpdb.org. The source code provided in the Supplementary Material (see ‘Data Availability’) outputs PDB files that can be explored using any standard protein structure visualiza- tion software. Contact matrices are also used as a simplified representation of real protein structures. Two residues i and j are considered in contact if their closest not-hydrogen atoms are closer than a threshold, typically d0 =4.5Å (the main reason to exclude hydrogen atoms is that they are often not reported in PDB files obtained with X-ray crystallography). Pairs are considered in contact only if |i−j|>2, since neighbours along the chain trivially fulfil the contact condition in all conformations. The contact matrix provides a simple visual representation that makes evident secondary structure elements (alpha-helices, appearing as lines of contacts (i,i+3) and (i,i+4) parallel to the diagonal; parallel beta sheets, (i,i+l) and anti parallel beta sheets, i+l =const, appearing as lines perpendicular to the diagonal). Importantly, the contact matrix is independent of the reference frame used to represent the coordinates. This point gives a good opportunity for a general discussion on the modelling process in biology and its epistemological and practical implications. Protein molecules are extremely complex. They are made of thousands of atoms bound together by quantum interactions. Although not covalently bound, the water solvent is essential in determining the properties of a protein (dynamics, thermodynamic stability, catalytic function...). If we want to make quantitative predictions, we have to reduce this complexity to a simplified model that is amenable to computation. In a statistical mechanical framework, a simplified (mesoscopic) model must be imagined as the result of integrating out some degrees of freedom of the system (the quantum interactions, the water molecules, the hydrogen atoms, the side chains...) and retaining only those that are either most relevant or simplest Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 7/38 https://peerj.com http://www.wwpdb.org http://dx.doi.org/10.7717/peerj-cs.100 Figure 4 Contact matrix representation. (A) Illustration of the four different types of locations where a monomer can be found in a maximally compact structure and the contacts that it will has in each case. A given monomer may have from one to four contacts, with the exception of the first and last monomers, which can present an additional contact up to a maximum of five. (B) Conformations and contact matri- ces of two similar solutions S1 and S2. In the conformations, for clarity we show only the part that is com- mon to both solutions and the first fragment that is different. The upper triangular matrix shows the con- tacts of S1 and S2, highlighting in red the contacts present in S1 that are absent in S2 (shown in the con- formation S1 in red as well). The grey shaded area in the matrix corresponds to the monomers not shown in the above conformations. The distance in sequence between the contacts can be seen in the arcs repre- sentation depicted in the left of the matrix. The lower triangular matrix represents the number of shared contacts among the four non-redundant solutions. to handle, allowing quantitative predictions. The contact matrix is one such mesoscopic representation. To define it, we need arbitrary choices (the definition of the distance between residues) and parameters (the threshold distance). This is an unavoidable (and indeed desirable) feature of the modelling process. Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 8/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 Contact energy function Nevertheless, although the modelling choices may seem plausible, it is important that they are tested a posteriori for their predicting ability, and that parameters are optimized by comparison with experimental data, if it is possible. In the case of protein contact matrices, predictions are obtained from statistical mechanical models that present a simple contact energy function: E(C,A)= ∑ iU(H,H)+U(P,P). For a given sequence, the ground state of this model is the contact matrix with the lowest effective energy among all physically feasible contact matrices that satisfy the strong conditions of chain connectivity, excluded volume (the self-avoiding interactions) and secondary structure (if we consider real protein structures). While the number of all possible contact matrices is huge (of the order of 2L 2 ), most of them are unfeasible since the number of self-avoiding walks increases only exponentially with the number of monomers L. As a model of feasible contact matrices, we can consider the contact matrices of compact self avoiding walks on the cubic lattice or the contact matrices of real protein structures. Here, for the sake of simplicity, we will limit our computations to the solutions of the snake puzzle. This is analogous to the toy model of protein folding based on the maximally compact structures on the cubic lattice introduced by Shakhnovich and coworkers in the 90’s (Shakhnovich & Gutin, 1990). As suggested by several works, we identify the native state of a protein sequence with the ground state of its effective energy (more realistic models also impose conditions on the stability with respect to alternative compact conformations, as we will see in the following). This allows us to define the designability of a given structure as the number of sequences for which this structure is the ground state. More designable structures are expected to be more resistant to sequence mutations. Mutational robustness has been proposed to be important to reduce the deleterious effects of erroneous protein translation in the ribosomes (Drummond & Wilke, 2009) and to mitigate the mutation load determined by unviable offsprings generated under strong mutation rate (Bloom et al., 2007; Lauring, Frydman & Andino, 2013). Furthermore, it has been shown that robustness favours the evolution of new protein functions (Soskine & Tawfik, 2010). Similar considerations motivated the computational study by Li et al. (1996), who considered as feasible states the contact matrices of the maximally compact self avoiding walks of 27 steps on the cubic lattice, whose number is ∼105 (Shakhnovich & Gutin, 1990) and evaluated their designability in the space of the 227 sequences of 27 H and P. Here we propose that the students reproduce this computation, adopting as feasible structures three of the solutions of the snake puzzle: S1, S3 or S7. We discard the solution S2 because it is very similar to S1 (see the Contact Overlap values in Table 1) and it should be considered part of its native valley rather than an alternative conformation. Figure 6 shows the results for the three structures considered for an increasing number of sequences (see ‘Materials and Methods’). Li et al. (1996) observed that the distribution for the number of structures with S sequences decays exponentially with S, which means that there are few structures with high designability and many with low designability. Of course three structures are too few to test this behaviour, but it is interesting that there are Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 15/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 10 100 1000 10000 Number of sequences generated 25 30 35 40 45 % S eq ue nc es a ss ig ne d Structure 1 Structure 3 Structure 7 Figure 6 Designability of selected structures for an increasing number of random sequences. We ob- serve significant differences between the structures that tends to a well defined limit when the number of sequences increases. important differences between the three structures, and S7 stands out as more designable than the rest. This exercise allows us to emphasize the importance to assess the statistical significance and statistical errors in computational studies. The differences that we observe are relevant only if they are statistically significant, i.e., if the probability that they arise by chance is lower than a predetermined threshold (typically, 0.05). We propose two methods to verify that they are indeed significant. The first one, more rigorous, is based on the binomial distribution: given any structure a, we can compute the probability that the number of sequences assigned to a, Sa over a total of S that were tested, is the result of S tests with success probability 1/3, as if all structures have the same probability. If all these probabilities are small, then the designabilities are significantly different. The second method, easier to apply in general, consists in computing the standard error of the mean of the designability pa, sa = √ pa(1−pa)/S, and performing an unpaired Z-test with Z = (pa−pb)/ √ s2a+s 2 b for all pairs of structures. With 0.05 as significance threshold, the binomial test finds the difference from equal frequencies not significant for S≤100 (except for S1) but significant for S≥ 1,000 for all structures. The Z test yields the same result (in this case, S1 is not significant for S=100). Misfolding stability and energy gap It has been noted in several works that the analogy between the polymer model and protein folding only makes sense if the ground state structure, identified as the native state, is Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 16/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 much more stable than alternative structures, which collectively represent the misfolded ensemble. One of the first parameters used to assess the stability of the putative native state has been the energy gap between the putative native state and the misfolded conformation with minimum energy among those not in the attraction basin of the native state (Sali et al., 1994). If this gap is small, the native state lacks thermodynamic stability, since a small change in the parameters of the effective energy function, corresponding to a change of temperature or pH, would produce a different ground state; it lacks stability against mutations in the sequence, for similar reasons; and it is very difficult to reach kinetically, since the dynamics of the model polymer can become trapped in low energy conformations outside the native basin. In their work cited above, Li et al. (1996) found that more designable structures have a larger energy gap, and that the energy gap averaged over the sequences assigned to a given structure clearly separates the maximally designable structures from the rest. In our toy model there is one ‘‘native’’ and only two ‘‘misfolded’’ conformations. We denote by E(Ci,Ak) the effective energy of sequence Ak in structure i, and we define the energy gap as the effective energy difference between the ’’native’’ structure and the alternative conformation with lowest energy δE(Ci,Ak)=E(Ci,Ak)−min j 6=i E(Cj,Ak). (6) This quantity has then to be averaged over all sequences Ak assigned to structure Ci, which we denote with an overline: δE(Ci)=E(Ci,Ak)−min j E(Cj,Ak)(j 6= i). (7) The values of the average energy gap are shown in Fig. 7. We can see that, consistently with the results reported by Li et al. (1996) the structure S7 has both the highest designability and the largest energy gap, and that these differences are significant. We can then ask students whether this correlation between designability and energy gap is surprising or it should be expected based on the definitions. Another measure of the stability of the misfolded ensemble, more standard under the point of view of statistical mechanics, is its free energy, which can be evaluated through some simplified statistical mechanical model. We assume that the protein can exist in three macroscopic states: the native state (the attraction basin of the contact matrix with minimum effective energy, which can be described for instance through the structure-based elastic network model mentioned above), the unfolded state (analogous to the self-avoiding-walk model, with large conformation entropy and effective contact energy close to zero), and the misfolded state (alternative compact conformations dissimilar enough from the native structure). We describe each of these states (native, misfolded and unfolded) by a free energy which, at constant pressure and volume, is given by the difference between the total energy E and the conformational entropy S multiplied by the absolute temperature, i.e., the Helmholtz free energy G=E−TS. Furthermore, the Helmholtz free energy is proportional to the logarithm Nido et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.100 17/38 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.100 1 3 7 Structure identifier 2 2.5 3 3.5 A ve ra ge e ne rg y ga p Figure 7 Average energy gap for the three chosen structures, S1 (green), S3 (orange) and S7 (pink). The structure S7 shows a significantly higher value than the others. of the partition function G=−kBT logZ of the canonical ensemble, an important equation that permits to relate macroscopic thermodynamic quantities with the microscopic details encoded in the partition function Z. In the following, we show how the free energies of the different states can be estimated, and how we can define the folding free energy of the model polymer 1G as the difference between the free energy of the native state and that of the unfolded plus misfolded state (the non-native state). This gives us the opportunity to stress that both unfolding and misfolding are important and should be taken into account. We approximate the free energy of the native state as its effective contact energy, Gnat(A)= ∑ i