key: cord-0801982-cjclssbp
authors: Elsayed, Wesam M.; Elmogy, Mohammed; El-Desouky, B.S.
title: DNA Sequence Reconstruction Based on Innovated Hybridization Technique of Probabilistic Cellular Automata and Particle Swarm Optimization
date: 2020-09-02
journal: Inf Sci (N Y)
DOI: 10.1016/j.ins.2020.08.102
sha: 0ce4a1786b1eca71066dbf20540f4837f3524574
doc_id: 801982
cord_uid: cjclssbp

DNA sequence reconstruction is a challenging research problem in computational biology. The evolution of DNA is too complex to be characterized by a few parameters. Therefore, there is a need for a modeling approach for analyzing DNA patterns. In this paper, we propose a novel framework for DNA pattern analysis. The proposed framework consists of two main stages. The first stage analyzes the evolution of DNA sequences, whereas the second stage handles the reconstruction process. We utilize cellular automata (CA) rules for analyzing and predicting the DNA sequence. Then, a modified procedure for the reconstruction process is introduced, which is based on Probabilistic Cellular Automata (PCA) integrated with the Particle Swarm Optimization (PSO) algorithm. This integration makes the proposed framework more efficient and achieves optimum transition rules. Our model relies on the hypothesis that mutations are probabilistic events. As a result, their evolution can be simulated as a PCA model. The main objective of this paper is to analyze various DNA sequences to predict the changes that occur in DNA during evolution (mutations). We used a similarity score as a fitness measure to detect symmetry relations, which is appropriate for numerous extremely long sequences. Results are given for the CpG-methylation-deamination process, which acts on CpG sites, i.e., regions of DNA where a guanine nucleotide follows a cytosine nucleotide in the linear sequence of bases. The DNA evolution is handled as evolving colored patterns. Therefore, incorporating probabilistic components helps to produce a tool capable of predicting the likelihood of specific mutations. Besides, it shows their capabilities in dealing with complex relations.

Mathematical approaches and algorithms can model several biological problems. Therefore, it is very beneficial for both mathematics and biology to combine the results of their disciplines. One of the most critical problems is the analysis of deoxyribonucleic acid (DNA), which is described by a sequence of bases. A DNA sequence holds four nucleic acid bases, which are adenine (A), cytosine (C), guanine (G), and thymine (T). A and T are complementary to each other. Similarly, C and G are complementary to each other. DNA is a double-stranded molecule whose two strands are linked to each other. The linkage is made by hydrogen bonds between a pyrimidine on one strand and a purine on the other. A strand of DNA is formed by a sequence of these four bases, which is copied to yield another analogous sequence in the DNA replication procedure [27].

As DNA stores genetic information, it is well suited to phylogeny because the DNA of a complicated organism has some similarities with the DNA of a simpler one. Modeling DNA does not follow a specific procedure. Some modeling techniques need to incorporate concepts from multidisciplinary fields, such as chemistry, physics, thermodynamics, and computer science.

This study will concentrate on models that are established on the notion of probabilistic cellular automata (PCA). On the one hand, we used our suggested probabilistic rules to study DNA evolution. A DNA strand can be viewed as a row of cells where every cell holds one of the four bases (A, C, G, or T). This sequence is transcribed to yield another analogous sequence in the DNA replication procedure [7] . The challenge in modeling DNA using Cellular Automata (CA) is the representation of the problem in a way that maps it in a real scenario that follows CA rules. We used CpG methylation followed by deamination and mutation of CG/CG to TG/CA as an example for computing mutation rates. The obtained rules by CA during DNA modeling can give a beneficial vision about the neighboring base pairs' effects on the DNA sequences evolution [27] .

On the other hand, as CA has been shown to be a promising tool for modeling DNA, we use these probabilistic rules again to reconstruct the DNA sequence. For the extraction of optimum transition rules, we propose a hybridized mechanism of both PCA and Particle Swarm Optimization (PSO). This integration makes the proposed framework more efficient and achieves optimum transition rules. Therefore, this study aims to detect the neighborhood rules, which consider the effect of mutations that occurred throughout the evolution of the sequences. Besides, this study attempts to remove uncertainties in the intermediate sequences. This is achieved by introducing stochastic elements into our proposed technique, which leads to a more faithful simulation.

Also, the proposed model can be beneficial for understanding bacterial resistance to antibiotics, which represents a significant threat to public health. With thousands of known unique resistance genes, only the DNA analysis can give detailed genetic information. This retrieved information is required for an accurate evaluation of the existing resistance mechanism. Bacterial genomes are dynamic. They are exposed to diverse genetic events, including duplications, mutations, transpositions, inversions, recombination, insertion, and deletions. As the proposed model is a probabilistic evolution model, it will be convenient for the analysis of the DNA sequences of functional genes during phylogenesis.

The remainder of this article is organized as follows. Section 2 discusses some basic concepts. The design of DNA sequence modeling as a PCA procedure and a brief introduction to the PSO algorithm are presented. Section 3 discusses the literature review. Section 4 describes the proposed probabilistic model for studying the mutation of DNA sequences. Section 5 illustrates in detail how we merge the PCA and PSO algorithms and shows the innovative algorithm for evolution rules. Section 6 elucidates the simulation and experimental results. Section 7 investigates the conclusion of the study, along with further future expectations of this work.

In this section, some basic concepts will be introduced. First, we will discuss DNA modeling based on PCA. Then, the PSO technique will be elucidated.

In this section, DNA sequence modeling is treated as a stochastic procedure. This assumption follows from observations showing that mutations happen randomly. For example, Arndt et al. [2] deduced a neighbor-dependent impact in the mutagenesis process. CA are discrete-state systems consisting of a countable lattice of identical cells that communicate with their neighbors [1, 34, 10]. These lattices can take any number of dimensions, starting from a one-dimensional sequence of cells. The state of a CA is entirely determined by the values of the variables at each cell. Here, a one-dimensional CA is used for modeling the DNA sequence, which can be expressed as a line of adjacent bases [42, 36, 26, 16]. Its mathematical formulation can be defined as a 4-tuple, as presented in Eq. (1).

$$Z = (Q_d, S, N, f) \qquad (1)$$

where Z is a CA system, Q is the cell space, and d denotes the dimension of this space. S describes the states of all cells. N denotes the set of all cells within the neighborhood. Finally, f is the evolution rule, which determines how the state of a cell can change.

In a probabilistic CA, the local rules include a probabilistic element instead of dictating the state of an updated cell [5]. Therefore, a rule provides, for each updated cell, the probability that it will be in each state. PCA is constructed by introducing probabilistic elements into deterministic local CA rules. The CA grid is composed of similar cells, which are . . . respectively. The state of the i-th cell takes values from a predetermined discrete set, as presented in Eq. (2). Moreover, the system evolves in time through discrete time steps. The evolution of the probability $P_t(S)$ of state S at time step t is given by Eq. (3).

where $W(S \mid Ś)$ is the probability of transition from state Ś to state S, with the two properties listed in Eqs. (4) and (5).
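For reference, a standard form of the PCA evolution equation and of the two transition-probability properties that is consistent with the definitions above can be sketched as follows. The typeset Eqs. (3)-(5) are not recoverable from this text, so this is an assumed reconstruction, with $S'$ standing for the previous state (written Ś above).

```latex
\begin{align}
P_{t+1}(S) &= \sum_{S'} W(S \mid S')\, P_t(S') \tag{3} \\
W(S \mid S') &\ge 0 \tag{4} \\
\sum_{S} W(S \mid S') &= 1 \tag{5}
\end{align}
```

Properties (4) and (5) simply state that, for every previous state, the transition probabilities are non-negative and sum to one, so that $P_{t+1}$ remains a probability distribution.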

In the beginning, the PCA is simulated from any preliminary configuration [37, 39] . Then, a sequence of configurations is generated. Each configuration is attained from the previous state through an asynchronous update of all sites.

The PSO algorithm is a population-based stochastic optimization mechanism [8, 32, 38, 15]. It maintains a set of feasible solutions that evolve to achieve an adequate solution for a problem. Its main target is to find the global optimum of a real-valued function, called the fitness function, defined over a search space.

PSO is initialized with a set of arbitrary particles. Then, it searches for the optimum solution by updating successive generations. At every iteration, each particle is updated using two "best" values. The first one is the best solution that the particle itself has obtained so far. This value is called $P_i$ (or p-best). The other best value tracked by the PSO algorithm is the best value attained so far by any particle. This value is the global best, and it is denoted by $P_g$ (or g-best). After determining these two values, the particle updates its velocity according to Eq. (6) and its location according to Eq. (7).

where $V_i$ is the velocity of each particle, $X_i$ is the current location of each particle, $C_1$ and $C_2$ are acceleration constants, and $r_1$ and $r_2$ are random numbers in the range [0, 1]. $P_i$ is the best position of each particle, and $P_g$ is the best position of the swarm. The original PSO was modified by Shi and Eberhart [32], who introduced an inertia weight (ω) to balance exploitation and exploration by replacing Eq. (6) with Eq. (8).
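The update described above can be illustrated with a minimal sketch of the textbook inertia-weight PSO step. The typeset Eqs. (6)-(8) are not reproduced in this text, so the form below is the commonly used one and is an assumption; the function name `pso_step` and the default values of `w`, `c1`, and `c2` are illustrative only.

```python
import random

def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0, rng=random):
    """One velocity/position update for a single particle.

    x, v, p_best, g_best are equal-length lists of floats.
    w is the inertia weight; c1 and c2 are the acceleration constants.
    The default values are illustrative assumptions, not the paper's settings.
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        r1, r2 = rng.random(), rng.random()   # random numbers in [0, 1)
        # Inertia term + pull toward personal best + pull toward global best.
        v_next = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(v_next)
        new_x.append(xi + v_next)             # position update
    return new_x, new_v
```

Setting `w = 1` recovers the original update without inertia; smaller values of `w` damp the previous velocity and favor exploitation around the known best positions.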

The PSO algorithm is highly common because of its simple implementation and capability of fast convergence to a rationally good solution. On the other hand, it has some limitations, which can be summarized in the following points:

• Premature convergence.

• Tending to get stuck in local optima.

• Low solution precision.

To overcome the above limitations, this study proposes a hybridized model of PCA and PSO. PSO is a simple and easy-to-execute technique, and its simplicity implies that it is inexpensive in terms of memory requirements [31]. These reasons led us to hybridize PCA and PSO to obtain a better and faster optimization tool.

In this section, some prior studies are discussed, which have considered the evolution of DNA sequences by taking mutations into account. Afterward, the study investigates the efforts that represent the DNA sequence reconstruction process with different techniques.

The evolution of DNA sequences has been discussed in many studies [22, 35, 36, 30]. These studies analyzed the impact of neighboring bases on the occurrence of a mutation. For example, Bulmer [3] found an apparent increase in the frequency of transitions from C and G bases. He also concluded that neighboring bases have some impact on the frequencies of transitions from T and A bases. Finally, he determined the transition frequency from these bases. The transition frequency is decreased by having C on the left (or G on the right), and it is increased by having A on the left (or T on the right).

Arndt et al. [2] introduced a model for DNA sequence evolution that accounts for biases in mutation rates. These mutation rates rely on knowledge of the neighboring bases. They improved an evolution analysis model by adopting non-linear dynamics techniques. They concluded that phylogenetic analysis could be broadened to include neighbor-dependent effects. All the previous attempts showed that neighboring bases have some impact on the mutation process. However, none of these studies investigated the effects of the neighboring bases at each step of the evolution process. A CA model can capture this neighbor dependence during the evolution of DNA sequences.

Nowadays, reconstructing evolutionary history is still considered as one of the leading research issues. With the rise of molecular sequencing technologies, advanced computational approaches have been proposed to reconstruct phylogenies [28, 29, 6] . The programs described in [11] were designed to help in the assembly of long DNA sequences from the much shorter ones obtained as primary data. They detected the overlapping state between fragments and how to be oriented.

Peltola et al. [25] described the first program to control the DNA sequence reconstruction process. Their technique is implemented in three stages. The first stage calculates overlaps among fragment pairs. These overlaps are represented as edges in a directed graph, which has a vertex for each fragment. Second, their procedure selects overlaps from the graph. Finally, the third stage integrates these overlaps into a simultaneous alignment of the fragments, from which a sequence is extracted.

As our model is probabilistic, we formed a combination of PSO and PCA to extract CA rules properly. Several studies used this combination successfully. Fengxia and Gang [8] used CA integrated with PSO for simulation, which improved the ability to avoid premature convergence. Experiments showed that their algorithm had a powerful global search ability. Their method made some improvements but did not completely solve the premature convergence issue.

Pagel [24] used maximum likelihood models to deduce ancestral character states for discrete binary characters that have only two states. However, the generalization to more than two states demands no new concepts. He utilized a Markov model of binary character evolution on phylogenies to reconstruct ancestral states.

As indicated above, there are some limitations in the current related work. These algorithms contained the first formulation of the sequence reconstruction problem with errors, but they did not solve or approximate this problem. Most of the existing research considered the sequence reconstruction problem from the perspective of computational learning theory [9, 17, 41, 35, 21]. Also, these methods treated mutations as deterministic events in the DNA reconstruction process. However, it is commonly accepted that mutations occur randomly [2, 13, 3, 22, 40].

To overcome the limitations mentioned above, we combine PSO and PCA to develop a technique for DNA reconstruction problem. We proposed a new PCA-PSO algorithm, which integrates the cellular space, cells, and neighbors of CA with the PSO configuration. The PSO algorithm is applied to discover the optimal and convenient transition rules of CA for the reconstruction process.

In this section, the proposed DNA evolution model is discussed in detail. First, we postulate that the A, C, G, and T bases occur at different frequencies, as calculated in [2]. Neighbor-dependent mutation rates are calculated according to both these frequencies and our newly suggested rules [7]. Then, the six-parameter rate matrix of the general time-reversible (GTR) model is utilized to specify the calculated rates of each type of nucleotide change. We inserted a random sequence of 30 bases as the initial taxon.

The evolution of the DNA sequence can be represented as a colored graphical representation of the DNA bases. We get the graphical output by assigning a color to each base: red for A, green for C, yellow for G, and blue for T. To test the suggested rules, we used the processes ApC → ApA or GpT → TpT. Besides, we allow only a single transversion rate (i.e., p = 1). With these processes, the maximum-likelihood solution for the mutation rates became more credible, with $Q_{AG} = Q_{TC} = 3.10p$, $Q_{CT} = Q_{GA} = 3.78p$, $R_{CG \to CA} = R_{CG \to TG} = 43.02p$, and $R_{AC \to AA} = R_{GT \to TT} = 4.35p$. From these values, we form three matrices, as listed in Eqs. (9), (10), and (11).

where Q represents the rates of single-nucleotide changes, $R_l$ contains the rates for the left pair of neighboring nucleotides, and $R_r$ contains the rates for the right pair of neighboring nucleotides. Then, these matrices are substituted into the mutation probability $(Q + R)\,\Delta t$. The time increment ($\Delta t$) must be selected such that all non-diagonal transition probabilities in Eqs. (9), (10), and (11) are small (≪ 1). So, we get the mutation matrices shown in Fig. 1.
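The step from rates to per-step mutation probabilities can be sketched as follows, using the rate values quoted above. This is only an illustration under stated assumptions: $Q_{XY}$ is read as the rate of X → Y, all transversions are given the rate p = 1, only the single-nucleotide part Q is turned into a matrix, and the threshold `max_prob` used to pick Δt is a hypothetical choice. The names `rate_matrix`, `choose_dt`, and `transition_probabilities` are not from the paper.

```python
BASES = "ACGT"

# Neighbor-dependent rates quoted in the text (in units of p = 1).
NEIGHBOR_RATES = {("CG", "CA"): 43.02, ("CG", "TG"): 43.02,
                  ("AC", "AA"): 4.35, ("GT", "TT"): 4.35}

def rate_matrix(ag_tc=3.10, ct_ga=3.78, transversion=1.0):
    """Single-nucleotide rates q[x][y] for x -> y (off-diagonal entries only)."""
    transitions = {("A", "G"): ag_tc, ("T", "C"): ag_tc,
                   ("C", "T"): ct_ga, ("G", "A"): ct_ga}
    return {x: {y: transitions.get((x, y), transversion)
                for y in BASES if y != x}
            for x in BASES}

def choose_dt(q, neighbor_rates, max_prob=0.05):
    """Pick dt so that the largest rate times dt equals max_prob (assumed rule)."""
    largest = max(max(r for row in q.values() for r in row.values()),
                  max(neighbor_rates.values()))
    return max_prob / largest

def transition_probabilities(q, dt):
    """p[x][y] = q[x][y] * dt for x != y; the diagonal makes each row sum to 1."""
    p = {}
    for x in BASES:
        row = {y: rate * dt for y, rate in q[x].items()}
        row[x] = 1.0 - sum(row.values())
        p[x] = row
    return p

q = rate_matrix()
dt = choose_dt(q, NEIGHBOR_RATES)
p = transition_probabilities(q, dt)
```

Because the largest rate (43.02) dominates the choice of Δt, every off-diagonal single-nucleotide probability ends up well below the threshold, which is the "small (≪ 1)" condition stated above.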

The DNA mutation plays a significant role in the DNA sequence evolution [14] . Neighbor-dependent mutations influence the evolution of DNA sequences. We used the previously calculated mutation rates, as listed in Fig. 1 , where we utilized the calculated frequencies by [2] . Therefore, we got the mutation rates that produced our evolution. Fig. 2 presents the simulation results of a DNA sequence evolution. The simulation begins with an arbitrary DNA sequence that contains 30 bases. It generates the sequences for 30 consecutive time steps, where A is shown in red, C in green, T in blue, and G in yellow.

In this section, we introduce a mechanism for DNA sequence reconstruction based on a hybrid PSO-PCA technique. First, the search space is partitioned into cells by applying CA. At any time, a set of particles looks for a local optimum. Then, the best solution is found in their neighborhood cells. In the PSO algorithm, each particle updates its current location together with all particle states. In PCA, every cell is closely coupled with its neighbors, and their positions are governed by a specific rule. Its current state is updated with its neighbors' states. In our proposed PSO-PCA algorithm, the state of a cell is evaluated from its own state and its neighbors' states. The last velocity updates each cell's state. Then, to approach the optimal state, the current cell's state is updated once more with regard to both the current and optimal states. The PSO-PCA model is as follows:

• Cells: Each particle is interpreted as a single cell.

• Cellular space: The group of all cells in the space (1-dimensional CA is considered in this scenario).

• State space: The cell's state is defined as the location of the corresponding particle. The state of the i-th cell is calculated by Eq. (12).

• Neighborhood: All cells that may influence the variation of the state of the i-th cell are its neighbor cells. The neighborhood is presented by Eq. (13).

$$N_i = \{S_{i-r}, \ldots, S_i, \ldots, S_{i+r}\}, \quad r = 0, 1, 2, \ldots, m \qquad (13)$$

where r is the size of the neighborhood. Here, we use r = 1, so the neighborhood of the i-th cell comprises the cell itself and its direct right and left neighbors.

• Update rules: Every cell updates its state with its immediate location, velocity, individual optimal value, and its neighbors' optimal values. The state of the i th cell at a time step is a function of the states of its neighbors at the prior time step.

• Discrete time step: It represents the number of iterations in the PSO.
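The mapping listed above (particles as cells on a one-dimensional lattice with a neighborhood of size r) can be sketched as a small helper that finds, for each cell, the best-scoring member of its neighborhood. This is an illustrative sketch only: the periodic boundary and the function names `neighborhood` and `neighborhood_best` are assumptions, not details given in the paper.

```python
def neighborhood(i, n_cells, r=1):
    """Indices of cell i and its r left/right neighbors on a periodic 1-D lattice.

    The wrap-around at the ends is an assumption made for this sketch.
    """
    return [(i + k) % n_cells for k in range(-r, r + 1)]

def neighborhood_best(i, personal_best_scores, r=1):
    """Index of the cell in the neighborhood of i with the highest personal-best score."""
    candidates = neighborhood(i, len(personal_best_scores), r)
    return max(candidates, key=lambda j: personal_best_scores[j])

# Example: five cells with hypothetical personal-best fitness values.
scores = [1, 5, 2, 0, 9]
best_for_cell_2 = neighborhood_best(2, scores)   # looks at cells 1, 2, 3
```

With r = 1 each cell only sees itself and its two direct neighbors, so information about a good solution spreads through the lattice one cell per time step rather than instantly through a single global best.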

DNA may be modeled as a 1-dimensional CA. The four DNA bases stand for the possible states of a CA cell. The state of the i-th cell of this CA takes values from the discrete set that comprises the four bases.

The bases are coded with numbers as follows: A → 0, C → 1, T → 2, G → 3. Here, we take into account only the rules with a neighborhood of size 1. The transitions of base-pairs through evolution are clarified in Table 1. The right-hand side of each transition can be one of the four base-pairs.

Table 1: Transitions of base-pairs through evolution.

No.   Code    No.   Code    No.   Code    No.   Code
0     000     16    001     32    002     48    003
1     100     17    101     33    102     49    103
2     200     18    301     34    302     50    303
3     300     19    201     35    202     51    203
4     010     20    011     36    012     52    013
5     110     21    111     37    112     53    113
6     310     22    311     38    312     54    313
7     210     23    211     39    212     55    213
8     130     24    031     40    032     56    033
9     330     25    131     41    132     57    133
10    230     26    331     42    332     58    233
11    020     27    231     43    232     59    023
12    120     28    021     44    022     60    123
13    320     29    121     45    122     61    323
14    220     30    321     46    322     62    223
15    030     31    221     47    222     63    333

Here, CAs have four states for each cell. Hence, the number of all possible rules is $4^{4^3}$. The whole rule space must be investigated to detect the CA rules that are assumed to control the evolution of the DNA sequence. Here, PSO is applied in order to investigate the enormous CA rule space.
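The coding of bases and the size of the rule space can be checked with a short sketch. The base coding is the one given in the text; the enumeration order produced by `itertools.product` is an arbitrary choice for illustration and is not claimed to match the ordering of Table 1.

```python
from itertools import product

# Base coding given in the text: A -> 0, C -> 1, T -> 2, G -> 3.
CODE = {"A": 0, "C": 1, "T": 2, "G": 3}

def encode(sequence):
    """Convert a DNA string into its list of numeric cell states."""
    return [CODE[base] for base in sequence]

# All possible left-hand sides for a neighborhood of size 1:
# (left neighbor, cell, right neighbor), i.e. 4^3 = 64 triplets.
NEIGHBORHOODS = ["".join(t) for t in product("ACTG", repeat=3)]

# A deterministic rule picks one of 4 outputs for each of the 64 triplets,
# so the rule space contains 4^(4^3) = 4^64 rules.
N_RULES = 4 ** len(NEIGHBORHOODS)
```

Since $4^{64} = 2^{128}$, exhaustive enumeration is out of reach, which is why a heuristic search such as PSO is used to explore the rule space.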

Evolution is visualized with the aid of a phylogenetic tree that represents a group of connected organisms [23, 33, 4, 12, 19, 28]. A phylogenetic tree demonstrates the evolutionary relationships among diverse species or other organisms that are assumed to have a common ancestor. Several species, organisms, or genomic sequences are represented on the leaves of the tree. Our work seeks to detect the rules for neighbor-based mutations that may have resulted in the sequence evolutions. Here, we used linear rules, which represent a promising tool for analyzing mutation rates [32]. The linear evolution rule takes a matrix form, as listed in Eq. (17). Consequently, the velocity is updated using Eq. (18).

The states of all CA cells at time step t are represented by the column array on the right-hand side of Eq. (17). This array is multiplied by the evolution rule array (M). The array components $M_{i,j}$ may hold only two values (0 and 1). The output array on the left-hand side of Eq. (17) contains the states of all CA cells at time step t + 1.
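A linear rule of this kind can be sketched as a binary matrix applied to the coded state vector. Eq. (17) is not reproduced in this text, so the details below are assumptions: the product is reduced modulo 4 so that the result stays within the four coded states, and `tridiagonal_rule` is a hypothetical example matrix (each cell combines itself with its two direct neighbors), not a rule reported in the paper.

```python
def apply_linear_rule(m, state, modulus=4):
    """Return the next state vector M * state, reduced modulo `modulus`.

    m is a square list-of-lists with entries 0 or 1;
    state is a list of coded bases (A=0, C=1, T=2, G=3).
    The modulo-4 reduction is an assumption made for this sketch.
    """
    n = len(state)
    return [sum(m[i][j] * state[j] for j in range(n)) % modulus
            for i in range(n)]

def tridiagonal_rule(n):
    """Hypothetical example rule: M[i][j] = 1 when |i - j| <= 1, else 0."""
    return [[1 if abs(i - j) <= 1 else 0 for j in range(n)]
            for i in range(n)]

state_t = [1, 0, 2, 3]                                  # C, A, T, G
state_t1 = apply_linear_rule(tridiagonal_rule(4), state_t)
```

Restricting the nonzero entries of M to a band around the diagonal is what encodes the neighborhood: with a tridiagonal M, the state of cell i at time t + 1 depends only on cells i − 1, i, and i + 1 at time t.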

In this stage, the fitness function is evaluated after representing each cell. The optimal solution is obtained based on the resulting fitness value. In DNA sequence reconstruction, the optimum solution is the one with the highest matching score of the bases among the sequences at the previous steps. First, we have to compute the evolution of DNA sequences to extract the proper rules for the reconstruction process, as shown in Fig. 2. The matching score is evaluated by counting the matching nucleotides of the sequences. The matching score for a pair of sequences is computed by using Eq. (19).

where $\mathrm{Score}(S_{i,j}, \hat{S}_{i,j})$ is the matching score of two consecutive sequences, and i and j are the indices of the cell. After calculating the score of each pair of consecutive sequences, the total score is evaluated by Eq. (20).

where f(x) indicates the fitness value for individuals i and j of the PSO, and max indicates that our target is to get the maximum value of f(x). The optimal solution is the one that has the largest value of f(x). The fitness function is determined by summing all scores (Eq. (19)). The main goal is to find symmetry relations among various sequences. Algorithm 1 lists the proposed PSO-PCA algorithm for the selection of CA evolution rules.
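The matching score and the summed fitness can be sketched as follows. Eqs. (19) and (20) are not reproduced in this text, so this is an assumed reading: the score of a pair of sequences counts the positions at which the bases agree, and the fitness of a candidate is the sum of these scores over all compared pairs. The function names and the pairing of "predicted" with "observed" sequences are illustrative.

```python
def matching_score(seq_a, seq_b):
    """Number of positions at which two equal-length sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)

def fitness(predicted, observed):
    """Total matching score over corresponding pairs of sequences.

    `predicted` holds the sequences produced by a candidate rule and
    `observed` the given sequences at the same time steps (assumed pairing).
    """
    return sum(matching_score(p, o) for p, o in zip(predicted, observed))

# Example with two time steps of four bases each.
total = fitness(["ACGT", "AAAA"], ["ACGA", "AAAT"])
```

A PSO particle encoding a candidate rule would then be ranked by this total, with larger values indicating rules that reproduce the observed evolution more closely.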

This section deals with the experimental setup and the outcomes achieved in the simulation. The proposed system is implemented in Matlab 2019b [20]. Besides, we utilized the Mesquite program [18], which is an analysis tool for evolutionary biology. It offers a simple and powerful tool, which motivated us to use it for DNA simulation. For hardware specifications, we implemented the proposed system on a computer with an Intel Core i7 processor (8th generation, 1.8 GHz) and 16 GB of RAM. The DNA data used to support the findings of this study are available at [https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=48640]. Our proposed PSO-PCA algorithm starts by taking the following inputs to obtain the desired probabilistic rules:

• The native sequence of the DNA.

• The sequences of the DNA that belong to in-between evolution steps.

• The eventual sequences of the DNA.

• The maximum iteration numbers.

Non-uniform probabilistic rules are used for simulation based on the fact that not all the base-pairs will be altered at every time step. In our evolution, we attempt to dynamically generate a CA rule from a sequence obtained during the evolution and its next-step sequence. Fig. 3 clarifies the dynamic structure of a rule. The left-hand sides of the transitions are established by applying a rule on the current sequence. The next-step sequence provides the right-hand sides of the transitions. For instance, the left-hand side of a transition is given by the first three base-pairs of the current sequence, labeled CAT, as shown in Fig. 3. The right-hand side of the transition is given by the corresponding base-pair in the next sequence, 0.
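The construction of transitions from a sequence and its next-step sequence can be sketched as follows. Each interior position contributes one transition whose left-hand side is the triplet (left neighbor, cell, right neighbor) in the current sequence and whose right-hand side is the base at the same position in the next sequence. Skipping the two boundary cells and turning counts into relative frequencies are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def extract_transitions(current, nxt):
    """Count triplet -> next-base transitions between two consecutive sequences."""
    if len(current) != len(nxt):
        raise ValueError("sequences must have the same length")
    counts = defaultdict(Counter)
    for i in range(1, len(current) - 1):      # interior cells only (assumption)
        left_hand_side = current[i - 1:i + 2]
        counts[left_hand_side][nxt[i]] += 1
    return counts

def to_probabilities(counts):
    """Turn transition counts into relative frequencies per left-hand side."""
    return {lhs: {base: c / sum(outs.values()) for base, c in outs.items()}
            for lhs, outs in counts.items()}

counts = extract_transitions("CATCAT", "CATCGT")
probabilities = to_probabilities(counts)
```

In this toy example the triplet CAT occurs twice and is followed once by A and once by G, so a deterministic rule cannot reproduce both outcomes, whereas a probabilistic rule can assign each a probability of 0.5.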

The major aim of this study is to detect the most probable rules for mutations, taking into consideration the impact of the neighbor cells. Table 2 demonstrates some of the resulting rules that were applied to some of the phylogenetic tree branches obtained from random DNA sequences. We used sequences of 100 bases, as shown in Fig. 4. The rules obtained with the proposed technique are used directly for predicting next-step sequences and for constructing the phylogenetic tree itself. Fig. 4 shows the DNA evolution outcomes after 100 generations, which are generated by applying the CA rule in parallel. The resulting CA rule is demonstrated in Fig. 5. PSO-PCA proved to be a successful procedure for reconstructing the evolution pattern of the given DNA sequences. We developed an algorithm that efficiently elicits the CA rule that controls the evolution of the sequence. During these random experiments, the proposed procedure determined the possible rules that produced the given evolution pattern. The algorithm proved to be a successful simulation tool. As a result, we demonstrated that, given a series of DNA sequences that represent a set of evolution steps, this procedure can be utilized for generating the probabilistic rules of this evolution pattern. These rules are capable of reconstructing DNA sequences. Our attempt to incorporate probabilistic components produces a system capable of predicting the likelihood of particular mutations. Also, our technique proved to be a promising tool for simulating the evolution of large sequences, as in Fig. 6. First, the data shown in Fig. 6 is tested with the help of the Mesquite simulation software. Then, the mutation rates are applied together with the proposed PSO-PCA algorithm. Finally, the obtained results are shown in Fig. 7.

Figure 5: The CA rule used to reproduce the evolution pattern in Fig. 4.

Figure 7: The simulation results of the evolving DNA characters after using our rates and the probabilistic rules.

The protein, DNA, or RNA sequence can be used to classify the sequence into sets of analogous bases that share characteristics in terms of their function. It can be quite beneficial in recognizing the functions of new sequences. Also, it can be beneficial in phylogenetic prediction. Studying DNA evolution and the effect of mutation will help recognize bacterial species as well as potential antibiotic resistance mechanisms. It will lead to a chance to employ DNA sequence information to guide medication.

Finally, we presented a comparative survey on DNA sequence using our model and deterministic methods based on genetic algorithm. Our proposed technique is capable of the following:

• It can be used for analyzing the evolution of many various species. It helps in many practical applications, including drug detection, population surveillance, and management.

• Working with models such as PCA enables us to discover the influence of neighboring base-pairs on evolution.

• We were able to resolve uncertainties (i.e., detecting anonymous base-pairs in intermediate sequences and the number of time steps for evolution).

• The proposed model of evolution is a probabilistic model. Therefore, it will be convenient for the DNA sequence analysis of functional genes during phylogenesis.

On the other hand, other methods that consider only the deterministic direction focus on the neighbor-dependent mechanics of DNA sequence alteration without considering the processes of natural selection. Therefore, the deterministic methods are not suitable for the analysis of DNA sequences.

This paper introduced a novel technique based on PCA to investigate the rules for the neighbor bases regarding mutation effects. CA proved to be a robust system for analyzing DNA mutations. CA rules are generated by simulating DNA mutations. They can give us beneficial insights into the influence of neighboring base pairs on the evolution of DNA sequences. The proposed simulation tool is based on PSO, as it extracts the PCA evolution rules very efficiently. There are strong indications that neighboring base-pairs influence mutations of DNA base-pairs. We tried to reveal this correlation by modeling DNA as a CA model where the CA rules govern the DNA mutations. Due to the enormous rule space involved in our simulation, we adopted the PSO algorithm for extracting these rules efficiently. Then, we applied the resulting rules to predict sequences in phylogenetic trees. Simulating DNA as a PCA model facilitates viewing, analyzing, and comparing various sequences.

Also, our method is suitable for the comparison of many long sequences. Establishing powerful and reasonable hybridization strategies is required for creating beneficial and practical models for predicting most of the future changes that occur during the evolution of DNA sequences. It will greatly increase information on which mutations are most common in certain bacteria. Besides, PCA can reveal patterns in enormous amounts of gene expression data and discover groups of disease-related genes, which can support drug discovery.

The main contributions of this paper can be summarized in the following points:

• A methodology is developed to locate the impact of neighboring DNA basepairs on the mutation of a base-pair.

• The model presented here is based on the assumption that mutations are probabilistic events, and that their evolution can be modeled using PCA.

• A hybridized technique is developed to discover the optimal and proper transition rules of CA for the reconstruction task. This integration increases the performance of the algorithm.

• A modified method is proposed for the reconstruction of DNA sequences based on PCA integrated with the PSO algorithm.

In our future work, we will generalize this study to handle distinct rules for diverse neighborhood structures regarding mutation effects. In particular, larger neighborhood sizes will be discussed. Also, we can simulate the evolution and reconstruction of DNA sequences on small-world networks. We might be able to use this probabilistic model to predict possible mutations of viruses and other pathogens. For example, we may hopefully be able to explain the evolution of the coronavirus.

[1] Identification of Cellular Automata.
[2] DNA sequence evolution with neighbor-dependent mutation.
[3] Neighboring base effects on substitution rates in pseudogenes.
[4] Phylogenomics and the reconstruction of the tree of life.
[5] Nondeterministic cellular automata.
[6] Combining cellular automata and particle swarm optimization for edge detection.
[7] Evolutionary behavior of DNA sequences analysis using non-uniform probabilistic cellular automata model.
[8] The simulation and improvement of particle swarm optimization based on cellular automata.
[9] Reconstruction of DNA sequence information from a simulated DNA chip using evolutionary programming.
[10] A survey on cellular automata, Centre for High Performance Computing, Dresden University of Technology.
[11] Computer programs for the assembly of DNA sequences.
[12] Maximum likelihood inference of phylogenetic trees, with special reference to a Poisson process model of DNA substitution and to parsimony analyses.
[13] Wide variations in neighbor-dependent substitution rates.
[14] The application of a linear algebra to the analysis of mutation rates.
[15] A discrete binary version of the particle swarm algorithm.
[16] Regular biosequence pattern matching with cellular automata.
[17] Towards a DNA sequencing theory (learning a string).
[18] Mesquite: a modular system for evolutionary analysis, version 3.61.
[19] T-rex: reconstructing and visualizing phylogenetic trees and reticulation networks.
[20] The MathWorks Inc., Natick, Massachusetts.
[21] Reconstruction of DNA sequences using genetic algorithms and cellular automata: Towards mutation prediction?
[22] Variation in mutation dynamics across the maize genome as a function of regional and flanking base composition.
[23] fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood.
[24] The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies.
[25] Seqaid: A DNA sequence assembling program based on a mathematical model.
[26] Evaluation of methods for detecting recombination from DNA sequences: computer simulations.
[27] DNA methylation and gene function.
[28] The neighbor-joining method: a new method for reconstructing phylogenetic trees.
[29] The neighbor-joining method: a new method for reconstructing phylogenetic trees.
[30] Deep insight from simple models of evolution.
[31] Cellular particle swarm optimization.
[32] A modified particle swarm optimizer.
[33] Phylogenetic estimation of context-dependent substitution rates by maximum likelihood.
[34] Evolving uniform and non-uniform cellular automata networks.
[35] An algorithm for the study of DNA sequence evolution based on the genetic code.
[36] Adonios Thanailakis, and Ph Tsalides. A cellular automaton model for the study of DNA sequence evolution.
[37] An introduction to probabilistic automata.
[38] DNA sequence assembly using particle swarm optimization.
[39] A probability cellular automaton model for hepatitis B viral infections. Biochemical and Biophysical Research Communications.
[40] Neighborhood detection and rule selection from cellular automata patterns.
[41] Reconstruction of DNA sequencing by hybridization.
[42] Shihua Zhou, Bin Wang, Xuedong Zheng, and Changjun Zhou. Study and Application of DNA Cellular Automata Self-assembly, pages 654-658. Springer, 2014.

We thank Prof. Dr. E. Ahmed, Mathematics Dept., Faculty of Science, Mansoura, EGYPT, for his guidance and comments.

Algorithm 1: The proposed PSO-PCA rule extraction algorithm.

We used phylogenetic trees for the reconstruction process. Besides, they are used for representing the species samples that are used for simulation. For each branch, we applied a set of CA rules for transforming a predecessor sequence into the offspring sequence. First, the proposed technique compares the current sequence with the sequence produced at the next time step using a similarity score. We defined this score as the proportion of matching base pairs in two consecutive sequences. Then, the proposed algorithm randomly chooses a CA rule at each time step and applies it to the current sequence. Then, we observe the progression of the similarity score during evolution.