Abstract
There is great interest in the creation of genetically modified organisms that use amino acids other than the naturally encoded ones. Unnatural amino acids have been incorporated into genetically modified organisms to develop new drugs, fuels and chemicals. When incorporating new amino acids, it is necessary to change the standard genetic code. Expanded genetic codes have so far been created without considering the robustness of the code. In this work, multi-objective genetic algorithms are proposed for the optimization of expanded genetic codes. Two different approaches are compared: weighted and Pareto. The expanded codes are optimized in relation to the frequency of replaced codons and two measures based on robustness (for polar requirement and molecular volume). The experiments indicate that multi-objective approaches make it possible to obtain a list of expanded genetic codes optimized according to combinations of the three objectives. Thus, specialists can choose an optimized solution according to their needs.
This work was partially supported by São Paulo Research Foundation - FAPESP (under grants #2021/09720-2 and #2013/07375-0), National Council for Scientific and Technological Development - CNPq (under grant #306689/2021-9), and Center for Artificial Intelligence - C4AI (supported by FAPESP, under grant #2019/07665-4, and IBM Corporation).
1 Introduction
Proteins are vital macromolecules in living organisms [3]. They are composed of amino acids joined by covalent bonds, forming chains of different lengths and compositions. Changes in the amino acid sequence can modify the three-dimensional structure of the protein and consequently its function. Each amino acid is encoded in the DNA (Deoxyribonucleic Acid) by a sequence of three nucleotides, called a codon. Sixty-one codons specify amino acids and three codons (stop codons) signal the end of the protein sequence during its synthesis. There are 20 types of amino acids that are generally used in proteins (natural amino acids). Since there are \(4^3 = 64\) possible combinations of the 4 nucleotides (A, C, G, T) in a codon, some amino acids are encoded by more than one codon.
Living beings share the same standard genetic code, with rare exceptions. There are approximately \(1.4 \times 10^{70}\) hypothetical genetic codes, i.e., ways to associate codons to the natural amino acids. When compared to all codes, the standard genetic code is very robust [5], which is explained by two main factors. First, when the organization of the standard genetic code is examined, one can see that many amino acids are encoded by similar codons (see Fig. 1). That is, small changes in the nucleotide sequence sometimes generate no change in the sequence of amino acids of a protein. Second, most of the modifications result in changing the amino acid to one with similar physical-chemical properties [14].
Recently, there has been great interest in creating genetically modified organisms that use unnatural amino acids, i.e., amino acids other than the 20 amino acids naturally encoded in the standard genetic code. These amino acids can be interesting for many reasons. For example, they may contain heavy atoms that facilitate some crystallographic studies involving X-rays. New amino acids have been incorporated into genetically modified organisms to produce drugs, fuels and chemicals of great economic interest [11].
When adding new amino acids to genetically modified organisms, it is necessary to modify the standard genetic code. Expanded genetic codes can be created by using codons with four nucleotides instead of three [1]. Another possibility is to add synthetic nucleotides to create new codons [15]. However, the most attractive way to modify the standard genetic code is to replace the association of some codons with their respective amino acids. In the new genetic code, these codons are associated with the new amino acids. In general, the codons chosen for reassignment are those that are least frequently used in the organism.
Robustness is not taken into account when creating expanded genetic codes. To the best of the authors’ knowledge, optimization methods have not been used for creating expanded genetic codes, with the exception of a previous paper from our group [13]. In [13], we proposed a single-objective Genetic Algorithm (GA) for optimizing expanded genetic codes. GAs have been previously employed for creating hypothetical genetic codes with the aim of investigating the optimality of the standard genetic code. In [12], single-objective GAs were used to find robust genetic codes considering only one robustness measure at a time. In [9, 10], more than one robustness measure is simultaneously optimized in a multi-objective evolutionary approach. That is, instead of comparing the codes using a single measure of robustness based on a given physical-chemical property, the codes are compared using two or more measures concurrently. Using more than one objective results in hypothetical codes that are more similar to the standard genetic code.
In this paper, we propose the use of multi-objective GAs for optimizing expanded genetic codes. The expanded codes are optimized in relation to the frequency of replaced codons and two measures based on robustness (for polar requirement and molecular volume). There are two main approaches for multi-objective GAs [6]: i) the weighted approach, where the multi-objective problem is transformed into a single-objective problem; ii) the Pareto approach, where a set of non-dominated solutions is considered. Here, the Pareto multi-objective approach is proposed for optimizing expanded genetic codes. The weighted approach was proposed in our previous work [13], but with only two objectives (robustness for polar requirement and frequency of replaced codons); here we consider the weighted approach with the same three objectives employed in the Pareto approach. An advantage of the Pareto approach over the weighted approach for this problem is that it is not necessary to set weights for each objective. In addition, the multi-objective Pareto approach can generate a list of solutions with different properties, rather than just one solution.
The rest of this paper is organized as follows. In Sect. 2, the proposed GAs for optimizing expanded genetic codes are presented. Section 3 presents experiments in which the GAs are compared on a problem where the genetic code should incorporate the codification for a hypothetical amino acid. Finally, the conclusions and a discussion of future work are presented in Sect. 4.
Fig. 1. Standard genetic code. The codon is a three-letter sequence from an alphabet with 4 nucleotides (A, C, G, T). The names of the amino acids are abbreviated in the figure (examples: Phe is Phenylalanine and Leu is Leucine). The numbers represent the frequency (F) of codons in E. coli. Source: adapted from [8].
2 Proposed Genetic Algorithms
In the proposed GAs, individuals represent hypothetical expanded genetic codes that incorporate the codification of a new unnatural amino acid. Here, we consider that only one new amino acid is incorporated. However, the algorithms can be easily adapted for problems where more than one amino acid is incorporated (see Note 1). The two approaches based on GAs for the optimization of expanded genetic codes are presented in Sects. 2.3 and 2.4. Before that, the elements common to all the proposed GAs are introduced in Sects. 2.1 and 2.2. Here, the optimization problem is treated as a minimization problem.
2.1 Codification and Operators
A binary codification is used to represent an expanded genetic code in the chromosome of an individual of the GA (Fig. 2). The chromosome has 61 elements, each one representing a specific codon of the genetic code. The three stop codons (UAA, UAG and UGA in RNA notation, i.e., TAA, TAG and TGA in the DNA alphabet of Fig. 1) are not represented in the chromosome. An element equal to 1 in the i-th position of the chromosome means that the respective codon in the standard genetic code is now associated with the new amino acid. An element equal to 0 means that the amino acid associated with the respective codon is not changed.
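The codification above can be illustrated with a minimal sketch. It assumes the DNA alphabet (T instead of U), the usual TCAG codon ordering, and one-letter amino acid symbols; the names `STANDARD_CODE`, `SENSE_CODONS` and `decode` are hypothetical and are not taken from the authors' implementation, whose codon ordering may differ.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
# The 61 sense codons, in a fixed order, index the chromosome positions.
SENSE_CODONS = [c for c, aa in STANDARD_CODE.items() if aa != "*"]


def decode(chromosome, new_aa="new"):
    """Return the expanded code (codon -> amino acid) for a 61-bit chromosome."""
    if len(chromosome) != len(SENSE_CODONS):
        raise ValueError("chromosome must have one bit per sense codon")
    code = dict(STANDARD_CODE)
    for bit, codon in zip(chromosome, SENSE_CODONS):
        if bit == 1:  # this codon is reassigned to the new amino acid
            code[codon] = new_aa
    return code
```

For instance, a chromosome with a single 1 in the first position reassigns only the first sense codon in this ordering (TTT) and leaves the remaining 60 codons and the stop codons untouched.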
All amino acids (natural and incorporated) must be present in the expanded genetic codes. For this reason, the initial population of the GA is created ensuring that all amino acids are present in the solutions (expanded genetic codes) associated with the individuals. In addition, when reproduction operators result in the removal of one of the amino acids from the expanded genetic code, the individual is penalized by adding a value of 10,000 to its fitness, making the selection of that individual very rare. The value of 10,000 was determined according to the theoretical maximum and minimum that can be obtained for each objective.
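A minimal sketch of the penalty described above is given next. The function name `penalized` and its arguments are hypothetical; only the value 10,000 comes from the text. The check simply asks whether every required amino acid still has at least one codon.

```python
PENALTY = 10_000  # penalty value stated in the text


def penalized(fitness, assigned_amino_acids, required_amino_acids):
    """Add the penalty when any required amino acid has no codon left.

    assigned_amino_acids: amino acids encoded by the sense codons of a code.
    required_amino_acids: the natural amino acids plus the new one.
    """
    missing = set(required_amino_acids) - set(assigned_amino_acids)
    return fitness + PENALTY if missing else fitness
```

In the standard genetic code, methionine and tryptophan are each encoded by a single codon, so setting the corresponding bit to 1 removes that amino acid from the code and triggers the penalty.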
Here, the two GAs use the same reproduction and selection operators. Tournament selection and elitism are used to select individuals. In tournament selection, the best of \(s_t\) randomly chosen individuals is selected. In elitism, the best individual of the population is copied to the new population. Two-point crossover and flip mutation are used to transform individuals. Crossover is applied with rate \(p_c\), while each element of the chromosome is transformed with rate \(p_m\).
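The operators named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the tournament draws its \(s_t\) contenders without replacement, fitness is minimized, and the function names are hypothetical.

```python
import random


def tournament(population, fitness, s_t, rng):
    """Return the best (lowest fitness) of s_t randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), s_t)
    return population[min(contenders, key=lambda i: fitness[i])]


def two_point_crossover(a, b, p_c, rng):
    """With probability p_c, swap the segment between two random cut points."""
    if rng.random() >= p_c:
        return a[:], b[:]
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def flip_mutation(x, p_m, rng):
    """Flip each bit of the chromosome independently with probability p_m."""
    return [1 - g if rng.random() < p_m else g for g in x]
```

With a 61-element chromosome and a small mutation rate, flip mutation changes on average less than one codon assignment per offspring, so most of the variation comes from crossover.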
2.2 Objectives
Three objectives are considered. Two objectives are based on robustness and one objective is the frequency of replaced codons. Here, the robustness of a genetic code C is defined as the inverse of \(M_{st}(C)\), which is the mean square change in the values of a given property under mistranslation and base position errors [7, 9]. The equation for \(M_{st}(C)\) is:
$$\begin{aligned} M_{st}(C) = \frac{\sum _{i,j} a(i,j) \left[ X(i,C) - X(j,C) \right] ^2}{\sum _{i,j} N(i,j)} \end{aligned} \tag{1}$$
where X(i, C) is the amino acid property value for the amino acid codified by the i-th codon of the genetic code C, N(i, j) is the number of possible replacements between codons i and j, and a(i, j) is a weight for the change between amino acids codified by the i-th and j-th codons. By minimizing \(M_{st}(C)\) for a given amino acid property, we select a genetic code that is more robust to mistranslation and base position errors.
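To make the role of X(i, C), N(i, j) and a(i, j) concrete, the sketch below computes a mean square change over all single-nucleotide substitutions. It relies on simplifying assumptions that are not taken from the paper: every ordered pair of neighbouring codons counts as one possible replacement, the default weight a(i, j) is 1, and substitutions to or from stop codons are ignored. The function name is hypothetical.

```python
from itertools import product


def mean_square_change(code, prop, weight=lambda ci, cj: 1.0, stop="*"):
    """Weighted mean square property change over single-nucleotide changes.

    code: dict codon -> amino acid; prop: dict amino acid -> property value.
    weight: stand-in for a(i, j); the default treats all changes equally.
    """
    alphabet = sorted({base for codon in code for base in codon})
    num = 0.0
    den = 0
    for ci, aa_i in code.items():
        if aa_i == stop:
            continue
        for pos, base in product(range(len(ci)), alphabet):
            cj = ci[:pos] + base + ci[pos + 1:]
            aa_j = code.get(cj, stop)
            if cj == ci or aa_j == stop:
                continue  # skip identity and changes involving stop codons
            num += weight(ci, cj) * (prop[aa_i] - prop[aa_j]) ** 2
            den += 1  # one possible replacement per neighbouring pair
    return num / den
```

A code in which every neighbouring codon encodes the same amino acid gives a value of zero, which is why reassigning many codons to the new amino acid tends to lower this measure.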
The three objectives are:
- \(f_{F}(\textbf{x})\): given by the sum of the frequency (codon usage) of the codons that encode the new amino acid in the genetic code C codified by chromosome \(\textbf{x}\), i.e.:
$$\begin{aligned} f_{F}(\textbf{x}) = \sum _{i=1}^{61} x(i) \phi (i) \end{aligned} \tag{2}$$
where \(\phi (i)\) is the codon usage (frequency) of the i-th codon, considering the organism E. coli. The codon usage of E. coli is given in Fig. 1. E. coli is a prokaryotic model organism that is very important in biotechnology applications. This objective is considered in order to avoid codes with many replacements, especially in codons that are often used. Additional codon replacements incur higher economic cost because new biological molecules must be designed and utilized. Besides, codon replacements can lead to unwanted biological effects.
- \(f_{PR}(\textbf{x})\): given by \(M_{st}(C)\) (Eq. 1) considering the polar requirement of the amino acids (Table 1). When a new amino acid is incorporated, the polar requirement of this amino acid is also used in Eq. 1. Polar requirement is a very important property of amino acids regarding the structure and function of proteins.
- \(f_{MV}(\textbf{x})\): given by \(M_{st}(C)\) (Eq. 1) considering the molecular volume of the amino acids (Table 1). Molecular volume is also an important property of amino acids regarding the structure and function of proteins.
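The frequency objective is a plain dot product between the chromosome and the codon usage vector. A minimal sketch follows; the usage values in the example are placeholders chosen for illustration and are not the E. coli frequencies of Fig. 1.

```python
def f_frequency(x, usage):
    """Sum of the usage of every codon reassigned to the new amino acid."""
    if len(x) != len(usage):
        raise ValueError("x and usage must have the same length")
    return sum(bit * phi for bit, phi in zip(x, usage))


def n_replacements(x):
    """Number of codons whose association was changed."""
    return sum(x)


# Placeholder usage values for a 3-codon toy example (not real data).
example = f_frequency([1, 0, 1], [2.5, 7.0, 0.5])
```

Because every term is non-negative, each additional replaced codon can only increase \(f_{F}\), and the increase is larger for frequently used codons.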
In the experiments, values of \(f_{F}\), \(f_{PR}\), and \(f_{MV}\) are presented. In the weighted approach, each objective is normalized by its maximum value. For simplicity, we use the same symbols here for the original and normalized values.
2.3 Weighted Approach
In this approach, each objective is evaluated separately. Then, weights are assigned according to the importance of each objective. Thus, the multi-objective problem is transformed into a single-objective problem. Here, the fitness of a solution (genetic code) codified by chromosome \(\textbf{x}\) is given by:
$$\begin{aligned} f(\textbf{x}) = w_{F} f_{F}(\textbf{x}) + w_{PR} f_{PR}(\textbf{x}) + w_{MV} f_{MV}(\textbf{x}) \end{aligned} \tag{3}$$
where \(w_{F}\), \(w_{PR}\), and \(w_{MV}\) are the weights respectively associated with \(f_{F}(\textbf{x})\), \(f_{PR}(\textbf{x})\), and \(f_{MV}(\textbf{x})\).
The GA used in the weighted approach is here called WGA. Three versions of WGA are tested, each one with different values for the weights (Table 2). In WGA1, all weights are equal, i.e., no objective is prioritized. In WGA2, \(w_{PR}\) is higher, i.e., \(f_{PR}(\textbf{x})\) is prioritized. In WGA3, \(w_{F}\) is higher, i.e., \(f_{F}(\textbf{x})\) is prioritized.
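The weighted fitness can be sketched as a weighted sum of objectives, each divided by its maximum value as described in Sect. 2.2. The weights, maxima and objective values below are placeholders for illustration; they are not the settings of Table 2, and the function name is hypothetical.

```python
def weighted_fitness(objectives, maxima, weights):
    """Weighted sum of objectives, each normalized by its maximum value."""
    return sum(weights[k] * objectives[k] / maxima[k] for k in weights)


# Placeholder values only: one objective (F) is prioritized over the others.
fitness = weighted_fitness(
    objectives={"F": 10.0, "PR": 2.0, "MV": 500.0},
    maxima={"F": 100.0, "PR": 4.0, "MV": 1000.0},
    weights={"F": 0.5, "PR": 0.25, "MV": 0.25},
)
```

Normalizing first keeps objectives with large raw magnitudes from dominating the sum, so that the weights alone express the priority given to each objective.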
2.4 Pareto Approach
This approach uses the concept of Pareto dominance in order to obtain a subset of non-dominated solutions to a multi-objective problem. According to the Pareto dominance concept, a solution \(\mathbf {x_A}\) dominates a solution \(\mathbf {x_B}\) if it is better in at least one of the objectives and not worse in any of the objectives. There are many multi-objective GAs that use the Pareto approach [2]. Here, the Nondominated Sorting Genetic Algorithm II (NSGA-II) [4] is used for two main reasons. First, NSGA-II performs well when the number of objectives is not high. NSGA-II has a worst-case time complexity of \(O(n m^2)\) per generation for problems with n objectives and population size m. In addition, it has a mechanism for maintaining population diversity. Second, it was used in [9] for investigating the genetic code adaptability, a problem similar to the expanded genetic code optimization problem. Despite this similarity, the chromosome codification used in [9] is different from the codification used here. The reproduction operators, objectives, and other characteristics are different too.
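The dominance relation defined above translates directly into code. The sketch assumes minimization of every objective, as stated in Sect. 2, and represents a solution by its tuple of objective values; the function name is hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    not_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return not_worse and strictly_better
```

Two solutions that each win on a different objective do not dominate one another, which is what allows a Pareto front to contain codes with few replacements alongside codes with high robustness.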
The NSGA-II algorithm used here can be summarized as follows [9]:
- i. A population P(0) is randomly generated and sorted in layers (fronts) according to the Pareto dominance. Thus, the first layer is formed by solutions which are not dominated by other solutions, i.e., the best Pareto optimal solution set found so far.
- ii. At iteration t, the population P(t) is transformed into population Q(t) by using selection and reproduction operators. The next step is to sort the union population, \(P(t)+Q(t)\), according to the Pareto dominance.
- iii. A new population \(P(t+1)\) is created by taking the layers of the sorted union of P(t) and Q(t) in order, starting from the first layer. When including the last layer would exceed the population size, the crowding distance is used to select the most diverse individuals of that layer.
Here, the objectives of NSGA-II are the same as those used in the weighted approach: \(f_{F}(\textbf{x})\), \(f_{PR}(\textbf{x})\), and \(f_{MV}(\textbf{x})\).
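The layer sorting and the crowding-distance selection summarized in steps i to iii can be sketched as below. This is a simplified illustration: the layers are found by repeated filtering, which is easier to read but slower than the fast non-dominated sorting procedure of NSGA-II [4], and all function names are hypothetical. Solutions are represented by their tuples of objective values.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sort_into_fronts(points):
    """Split the indices of points into successive non-dominated layers."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Crowding distance of each index in front (larger means more isolated)."""
    dist = {i: 0.0 for i in front}
    for k in range(len(points[0])):
        ordered = sorted(front, key=lambda i: points[i][k])
        lo, hi = points[ordered[0]][k], points[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # keep extremes
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][k] - points[prev][k]) / (hi - lo)
    return dist


def next_population(points, size):
    """Indices kept from the union P(t)+Q(t) to form P(t+1)."""
    chosen = []
    for front in sort_into_fronts(points):
        if len(chosen) + len(front) <= size:
            chosen.extend(front)
        else:
            dist = crowding_distance(points, front)
            by_diversity = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(by_diversity[: size - len(chosen)])
            break
    return chosen
```

Assigning an infinite distance to the extreme solutions of each objective means that, for example, the code with the lowest \(f_{F}\) in a layer is never discarded by the truncation step.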
3 Experiments
3.1 Experimental Design
Experiments were performed considering the insertion of one hypothetical amino acid, named here new. The values of polar requirement and molecular volume for the new amino acid were obtained by averaging all respective values for the natural amino acids (see Table 1). Experiments, not shown here, with other hypothetical amino acids were also done.
Most parameters of the GAs are equal to those used in [9]; preliminary tests were carried out in order to adjust the other parameters. The GAs (in the two approaches) have population size equal to \(m=100\) individuals. The number of runs is 10, each one with a different random seed. The number of generations is 120.
For all GAs, the same parameters for tournament selection and reproduction are used: \(s_t=3\), \(p_c=0.6\), and \(p_m=0.01\). The results of the best solutions (expanded genetic codes) obtained in the runs of the different versions of WGA are presented. For NSGA-II, the results presented are those of the Pareto front obtained by applying the dominance criterion to the union of the first-front solutions of all runs.
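The way the final NSGA-II front is assembled from several runs can be sketched as follows: the first fronts of all runs are pooled, duplicates are removed, and only the solutions not dominated by any other pooled solution are kept. The function names and the toy objective vectors are hypothetical.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def merged_front(fronts_per_run):
    """Non-dominated subset of the union of the first fronts of all runs."""
    union = {tuple(p) for front in fronts_per_run for p in front}
    return sorted(p for p in union if not any(dominates(q, p) for q in union))


# Toy objective vectors (f_F, f_PR, f_MV) from two hypothetical runs.
runs = [
    [(1, 5, 2), (3, 3, 3)],
    [(1, 5, 2), (2, 4, 1), (4, 4, 4)],
]
final_front = merged_front(runs)
```

In the toy data, the vector (4, 4, 4) is non-dominated within its own run but is dominated by (3, 3, 3) from the other run, so it is dropped from the merged front.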
3.2 Experimental Results
The results for the evaluation of the best solutions obtained by the 3 versions of WGA (Sect. 2.3) are presented in Table 3. When the different versions of WGA are compared, the best solutions for \(f_{F}\) are those obtained by WGA3, as expected. The best solutions for \(f_{PR}\) and \(f_{MV}\) were obtained by WGA1, where the weights are equal. It is interesting that the solution with the best \(f_{PR}\) is obtained by WGA1 and not by WGA2, which has a higher weight \(w_{PR}\). However, the difference in \(f_{PR}\) between the best solutions obtained by WGA1 and WGA2 is small.
The number of replacements, i.e., codons where the association to the amino acid changed, is high for the genetic codes obtained by WGA. Figure 3 shows the best solutions obtained by WGA1, WGA2, and WGA3. When the number of replacements is high, the robustness is also high. This occurs because most of the changes in the codons will result in the codification of the same amino acid, i.e., the new amino acid.
Unlike the weighted approach, NSGA-II makes it possible to obtain a list of expanded genetic codes. The evaluation of the subset of solutions obtained by applying the dominance criterion to the union of the first fronts obtained in different runs of NSGA-II is shown in Fig. 4. The best solutions obtained by the weighted approaches are also presented in the figure. One can observe that the best solutions obtained by the weighted approaches are in the Pareto front obtained by NSGA-II. Table 4 shows the evaluation of the best solutions for each objective obtained by NSGA-II. It is interesting to observe that, for the values of the properties of the new hypothetical amino acid, the best solutions for \(f_{PR}\) and \(f_{MV}\) are the same. In additional experiments, not shown here, with other values for the properties of the hypothetical amino acid, this does not necessarily happen.
One can observe that, while the genetic code with the best \(f_F\) obtained by NSGA-II replaces only one codon, 21 replacements are generated by WGA3. Codes that result in many replacements, especially in codons that are frequently used, incur higher economic cost and can lead to unwanted biological effects. It is important to observe that fewer replacements could be obtained by setting \(w_F\) to values much higher than the values of the other two weights in WGA. However, this would require manually testing many different settings for the weights. The Pareto approach is interesting because it is not necessary to define weights or priorities for each objective. Besides, it makes it possible to obtain a list of genetic codes, which could be offered to the specialist for selection in a given real-world application.
This advantage of NSGA-II is illustrated in Fig. 5, which shows the expanded genetic codes for three solutions of the Pareto front: the solution with the best \(f_{F}\), the solution with the best \(f_{PR}\) for 2 replacements, and the solution with the best \(f_{PR}\) for 3 replacements. The last two solutions are those of the Pareto front with the best robustness for polar requirement among those with two and three replacements. In this way, the specialist can request a list of solutions with the best robustness for a given amino acid property and with a given number of replacements.
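The kind of query described above, i.e., the best robustness for a given number of replacements, amounts to a filter followed by a minimum over the Pareto front. In the sketch below each solution is a dictionary holding its chromosome under the key "x" and its objective values under keys such as "f_PR"; this data layout, the function name and the toy values are assumptions made for illustration.

```python
def best_for_replacements(front, n_replacements, objective):
    """Front member with the lowest objective among those with n replacements.

    Returns None when no solution in the front has that many replacements.
    """
    candidates = [s for s in front if sum(s["x"]) == n_replacements]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s[objective])


# Toy front with 3-codon chromosomes and made-up objective values.
toy_front = [
    {"x": [1, 0, 0], "f_PR": 5.0},
    {"x": [1, 1, 0], "f_PR": 3.0},
    {"x": [0, 1, 1], "f_PR": 2.5},
    {"x": [1, 1, 1], "f_PR": 1.0},
]
choice = best_for_replacements(toy_front, 2, "f_PR")
```

The same function can be called with "f_MV" or "f_F" as the objective, so a specialist can browse the front along whichever property matters for the application.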
4 Conclusions
We propose the use of multi-objective GAs for the optimization of expanded genetic codes. Three objectives are considered: robustness regarding polar requirement, robustness regarding molecular volume, and frequency of use of replaced codons. Two approaches were investigated: weighted (WGA) and Pareto (NSGA-II).
Experiments with a hypothetical amino acid indicated that WGA found codes with many codon replacements. NSGA-II made it possible to obtain codes with only one replacement, while the best code obtained by WGA3 required 21 replacements. Replacing many codons is undesirable in many respects. Both approaches obtained robust codes. Another advantage of the Pareto approach is that a list of genetic codes is offered to the specialist, who can select a genetic code according to the characteristics of an application.
It is important to highlight that this is a theoretical work, which does not take into account restrictions that may arise from technological and biological points of view. In practice, more information about the biological application is necessary to choose an expanded genetic code. However, the work shows a computational approach for optimizing expanded genetic codes that can be useful, when used in conjunction with other strategies, for helping specialists.
A possible direction for future work is to investigate the introduction of new amino acids through the creation of synthetic nucleotides [15]. In this case, the standard genetic code is not modified; it is only expanded to accommodate the new codons related to the new synthetic nucleotides. For example, assuming that a synthetic nucleotide Y is created, we would have the possibility of associating the new codons that contain Y, i.e., \(AAY, ACY,\ldots , AYA, \ldots , YGG\), with the new amino acids. Usually, not all new codons are associated with amino acids. In this case, optimization via GAs is a promising approach.
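The set of new codons created by a single synthetic nucleotide Y can be enumerated directly, which gives a sense of the size of the search space in this scenario. The sketch assumes three-letter codons over the alphabet A, C, G, T, Y; the variable names are hypothetical.

```python
from itertools import product

NATURAL = "ACGT"
SYNTHETIC = "Y"

# All triplets over the 5-letter alphabet that contain at least one Y.
new_codons = [
    "".join(c) for c in product(NATURAL + SYNTHETIC, repeat=3) if SYNTHETIC in c
]
```

There are \(5^3 - 4^3 = 61\) such codons, so a chromosome that decides which of them encode the new amino acids would have the same length as the one used in this work.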
Another possible direction for future work, from a technological point of view, is to use other algorithms in the calculation of the Pareto set, such as SPEA2 (Strength Pareto Evolutionary Algorithm 2). Other optimization techniques could also be considered if the number of replacements is constrained. For example, if the maximum number of replacements is small, exhaustive search can be used to find the best codes. Finally, new objectives that may be interesting from a technological, experimental, and/or biological point of view could also be investigated.
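The exhaustive search mentioned above can be sketched for the case of a small maximum number of replacements. The sketch enumerates every chromosome with at most `max_replacements` ones and keeps the one with the lowest value of a scalar `evaluate` function supplied by the caller; using a single scalar objective is a simplification made for this illustration, and all names are hypothetical.

```python
from itertools import combinations
from math import comb


def n_candidates(n_codons, max_replacements):
    """Number of chromosomes with 1 to max_replacements replaced codons."""
    return sum(comb(n_codons, k) for k in range(1, max_replacements + 1))


def exhaustive_best(n_codons, max_replacements, evaluate):
    """Return (chromosome, value) minimizing evaluate over all candidates."""
    best_x, best_f = None, float("inf")
    for k in range(1, max_replacements + 1):
        for positions in combinations(range(n_codons), k):
            x = [0] * n_codons
            for p in positions:
                x[p] = 1
            f = evaluate(x)
            if f < best_f:
                best_x, best_f = x, f
    return best_x, best_f
```

With 61 codons and at most three replacements there are 37,881 candidate codes, which is small enough to enumerate; the count grows quickly as the limit is raised, which is where a GA becomes the more practical choice.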
Notes
1. The binary encoding of the chromosome is used here because we consider only one unnatural amino acid. If more than one new amino acid is considered, an integer encoding must be used, where integer \(i>0\) represents the i-th new amino acid. The only modification needed in this case is in the way the new chromosomes are generated and mutated.
References
1. Anderson, J.C., et al.: An expanded genetic code with a functional quadruplet codon. Proc. Natl. Acad. Sci. 101(20), 7566–7571 (2004)
2. Coello, C.A.C., Lamont, G.B.: Applications of Multi-objective Evolutionary Algorithms, vol. 1. World Scientific, London (2004)
3. Cox, M.M., Nelson, D.L.: Lehninger Principles of Biochemistry, vol. 5. WH Freeman, New York (2008)
4. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
5. Freeland, S.J., Hurst, L.D.: The genetic code is one in a million. J. Mol. Evol. 47(3), 238–248 (1998)
6. Freitas, A.A.: A critical review of multi-objective optimization in data mining: a position paper. ACM SIGKDD Explor. Newsl. 6(2), 77–86 (2004)
7. Haig, D., Hurst, L.D.: A quantitative measure of error minimization in the genetic code. J. Mol. Evol. 33(5), 412–417 (1991). https://doi.org/10.1007/BF02103132
8. Maloy, S.R., Stewart, V.J., Taylor, R.K., Miller, S.I.: Genetic analysis of pathogenic bacteria. Trends Microbiol. 4(12), 504 (1996)
9. Oliveira, L.L., Freitas, A.A., Tinós, R.: Multi-objective genetic algorithms in the study of the genetic code’s adaptability. Inf. Sci. 425, 48–61 (2018)
10. Oliveira, L.L., Oliveira, P.S.L., Tinós, R.: A multiobjective approach to the genetic code adaptability problem. BMC Bioinform. 16(1), 1–20 (2015)
11. Rovner, A.J., et al.: Recoded organisms engineered to depend on synthetic amino acids. Nature 518(7537), 89–93 (2015)
12. Santos, J., Monteagudo, Á.: Simulated evolution applied to study the genetic code optimality using a model of codon reassignments. BMC Bioinform. 12(1), 1–8 (2011)
13. Silva, M.C., Oliveira, L.L., Tinós, R.: Optimization of expanded genetic codes via genetic algorithms. In: Anais do XV Encontro Nacional de Inteligência Artificial e Computacional, pp. 473–484 (2018)
14. Yockey, H.P.: Information Theory, Evolution, and the Origin of Life. Cambridge University Press, Cambridge (2005)
15. Zhang, Y., et al.: A semi-synthetic organism that stores and retrieves increased genetic information. Nature 551(7682), 644–647 (2017)
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
de Carvalho Silva, M., Pereira, P.G.P., de Oliveira, L.L., Tinós, R. (2023). Multiobjective Evolutionary Algorithms Applied to the Optimization of Expanded Genetic Codes. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_1
DOI: https://doi.org/10.1007/978-3-031-45392-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science (R0)





