1 Introduction

Swarm Intelligence (SI) is a branch of Computational Intelligence in which collective behavior is exhibited by a group of decentralized, self-organized, simple reactive agents interacting with each other and with the environment. The interaction among them generates a collective adaptation that allows them to solve complex problems [7]. The simple reactive agents are represented by positions in the search space with simple historical memories; they explore the space while exchanging information about their experiences with other agents. Based on its own experience and the information received from other members of the swarm, each agent can adapt its behavior in the search space and, with enough time, find and refine reasonable solutions to the presented problems. Swarm-based algorithms have emerged as an alternative to classical optimization methods in high-dimensional optimization problems [1]. They are attractive for such problems because they can considerably reduce the computational cost and do not require a complete understanding of the characteristics of the search space.

Many SI algorithms are based on metaphors of animal social behavior, such as Ant Colony Optimization (ACO) [2], inspired by the behavior of ant colonies, Particle Swarm Optimization (PSO) [6], inspired by flocks of birds, and Artificial Bee Colony (ABC) [5], inspired by bee hives. Swarm-based meta-heuristics are applied to several problems, such as nuclear engineering [18] and disease diagnosis [14], among others. They have also been applied to tasks related to data science and to image and signal processing [13, 16, 19].

Many efforts have been made to improve the performance of swarm-based algorithms. In the case of PSO, Xu et al. [21] and Wu et al. [20] suggested applying reinforcement learning because of its ability to learn which actions are best for a given state. It allows the behavior of the PSO to be modified dynamically by adjusting the communication topology with a reinforcement learning agent, generating better results for complex problems and increasing the convergence speed. Recently, Lira et al. [8] proposed a self-adaptive metaheuristic that considers the real-time information acquired during execution. For the algorithm to adapt, a reinforcement learning agent collects information and chooses actions that modify the metaheuristic’s behavior. Despite the good preliminary results obtained by using reinforcement learning to create advanced swarm-based algorithms, it remains unclear whether behaviors learned in some scenarios can be adapted to other scenarios not experienced by the algorithms.

This paper evaluates the transfer learning capability of a reinforcement learning strategy when different topologies can be selected along the optimization with a PSO algorithm, seeking to understand how the changes influence the various observed metrics and providing information about the learning of reinforcement agents and possible patterns that can be found. We find that RL is able to transfer the knowledge from one function to the other functions, and that changing topologies can be more effective than using a dynamic topology for PSO.

This paper is divided as follows: Sect. 2 briefly describes Particle Swarm Optimization, Proximal Policy Optimization, and Interaction Networks. Section 3 describes the methodology and parameterization for the experiments. Section 4 presents our findings and results, and we finish in Sect. 5 with our conclusions.

2 Background

2.1 Particle Swarm Optimization

In 1995, Kennedy and Eberhart [6] proposed Particle Swarm Optimization (PSO) after observing the social behavior of flocks of animals, such as birds. PSO is one of the best-known and most widely used swarm intelligence metaheuristics in the literature [9]. It consists of a group of simple agents, called particles, that are scattered in a search space. Until a stopping criterion is reached, the particles update their velocities and positions at each iteration, keeping the information on the best solution found by each of them (\(\vec {p}_i\)) and the best solution found by its neighbors (\(\vec {n}_i\)). This information is used to calculate each particle's next movement within the search space. In each PSO iteration, the velocity (\(\vec {v}_i\)) and position (\(\vec {x}_i\)) of each particle are updated according to Eqs. 1 and 2. With enough iterations, the swarm will likely return a solution approaching the optimum position in the search space.

$$\begin{aligned} \vec {v}_i(t+1) = \chi \Big \{\vec {v}_i(t) + c_1\epsilon _1 [\vec {p}_i(t) - \vec {x}_i(t)] + {c}_2\epsilon _2[\vec {n}_i(t) - \vec {x}_i(t)]\Big \} \end{aligned}$$
(1)
$$\begin{aligned} \vec {x}_i(t+1) = \vec {v}_i(t+1) + \vec {x}_i(t), \end{aligned}$$
(2)

Here, \(c_1\) and \(c_2\) are the acceleration coefficients, \(\epsilon _1\) and \(\epsilon _2\) are uniform random numbers, and \(\chi \) is the constriction factor defined by \( \chi = \frac{2}{{\left| {2 - \varphi - \sqrt{\varphi ^{2} - 4\varphi } } \right| }}\), where \(\varphi = c_1 + c_2\).
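As a minimal sketch of Eqs. 1 and 2, the update of a single particle can be written in plain Python. The function names `constriction_factor` and `pso_step` are illustrative and not taken from the framework used in this paper, and drawing \(\epsilon _1\) and \(\epsilon _2\) independently for every dimension is an assumption, since the equations do not state whether they are scalars or vectors.

```python
import math
import random


def constriction_factor(c1: float, c2: float) -> float:
    """Constriction factor chi computed from phi = c1 + c2."""
    phi = c1 + c2
    if phi < 4.0:
        # The square root in the definition is real only for phi >= 4.
        raise ValueError("phi = c1 + c2 must be at least 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))


def pso_step(x, v, p_best, n_best, c1=2.05, c2=2.05, rng=random):
    """Apply Eq. 1 and Eq. 2 to one particle and return (new_x, new_v).

    x, v, p_best and n_best are lists of floats with the same length.
    """
    chi = constriction_factor(c1, c2)
    new_v = []
    for xd, vd, pd, nd in zip(x, v, p_best, n_best):
        # Assumption: epsilon_1 and epsilon_2 are drawn per dimension from U(0, 1).
        e1, e2 = rng.random(), rng.random()
        new_v.append(chi * (vd + c1 * e1 * (pd - xd) + c2 * e2 * (nd - xd)))
    # Eq. 2: the new position is the old position plus the new velocity.
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v
```

With \(c_1 = c_2 = 2.05\), \(\varphi = 4.1\) and the factor evaluates to roughly 0.7298, which damps the velocity at every update.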

The communication topology defines the neighborhood in PSO. It describes the relations among particles, influencing the way the swarm behaves. Global (gbest) and Local (lbest) are two well-known topologies for PSO. Global is a fully connected topology in which every particle communicates with the entire swarm, so the best solution found is shared quickly across the swarm. The Global topology therefore converges quickly but may not adequately explore the search space. In the Local topology, each particle is connected to its k immediate neighbors in the swarm; with k=2, this creates a ring-like communication structure. Each particle has information only from the particles next to it, so sub-swarms can independently converge on several optimal points. The Local topology converges more slowly but allows a better exploration of the search space [7]. However, each of these two topologies is more suitable for specific problems, which led Oliveira et al. [4] to develop a more balanced topology that operates adaptively. In this approach, stagnant particles look for better particles to communicate with. It adds a new attribute to each particle, called \(p_{k}\text {-}failure\), which is incremented in every iteration in which the particle does not improve its fitness. If the \(p_{k}\text {-}failure\) value exceeds a threshold (\(p_{k}\text {-}failure^{T}\)), the particle looks for a new neighbor to communicate with. The choice of this neighbor is probabilistic, based on roulette wheel selection, so that particles with better fitness have a higher chance of being chosen.
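The three neighborhood rules described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Oliveira et al. [4]: all function names are hypothetical, minimization is assumed, the roulette weights are one possible choice, and both resetting the failure counter and adding (rather than replacing) a neighbor are assumptions that the text does not specify.

```python
import random


def global_neighbors(i: int, n: int) -> list[int]:
    """gbest: every other particle in the swarm is a neighbor of i."""
    return [j for j in range(n) if j != i]


def ring_neighbors(i: int, n: int, k: int = 2) -> list[int]:
    """lbest: k/2 particles on each side of i on a ring (k assumed even)."""
    half = k // 2
    return [(i + off) % n for off in range(-half, half + 1) if off != 0]


def roulette_pick(candidates, fitness, rng=random):
    """Roulette wheel selection that favors lower (better) fitness values.

    Assumption: the weight of candidate j is worst - fitness[j] plus a small
    constant, so every candidate keeps a non-zero chance of being chosen.
    """
    worst = max(fitness[j] for j in candidates)
    weights = [worst - fitness[j] + 1e-12 for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def dynamic_rewire(i, improved, failures, neighbors, fitness, threshold=1, rng=random):
    """Update the failure counter of particle i and rewire it if stagnant.

    failures is a list of counters and neighbors a list of sets of indices.
    """
    # Assumption: the counter is reset whenever the particle improves.
    failures[i] = 0 if improved else failures[i] + 1
    if failures[i] > threshold:
        candidates = [j for j in range(len(fitness)) if j != i and j not in neighbors[i]]
        if candidates:
            # Assumption: the new neighbor is added to the existing ones.
            neighbors[i].add(roulette_pick(candidates, fitness, rng))
        failures[i] = 0
```

For example, `ring_neighbors(0, 5)` connects particle 0 to particles 4 and 1, which is the ring structure obtained with k=2.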

2.2 Reinforcement Learning

Reinforcement learning (RL) refers to learning which action should be taken in a situation (i.e., state) to achieve one or more goals [14]. At each iteration, a reinforcement learning agent receives observations of the current state and takes an action from the list of actions allowed in that problem. After the action, the agent receives a reward, which measures whether the action was beneficial. Through trial and error, the agent learns which actions obtained the best rewards for the observed states, seeking to choose the ones that maximize the cumulative reward.

In reinforcement learning, the agent’s strategy to map the relationship between the observed states and the actions that must be taken is called policy. The agent aims to find the best strategy (i.e., policy) within the environment that maximizes the reward function. Therefore, the agent can learn which policy works better for a given environment. Reinforcement learning has been used with optimization meta-heuristics, including PSO [14], seeking to improve the algorithm’s convergence speed. For applications in continuous and complex problems where mapping the set of states and actions is difficult, it is possible to use a deep reinforcement learning approach. The name deep reinforcement comes from deep learning because of the use of deep neural networks to map the set of states and actions.

Proximal Policy Optimization (PPO). Schulman et al. [17] proposed Proximal Policy Optimization (PPO), a policy gradient method that is more stable, more efficient, and simpler than predecessors such as Trust Region Policy Optimization (TRPO). PPO improves a policy through small, bounded modifications. Its main improvements are the clipped surrogate objective, value function clipping, reward and layer scaling, orthogonal initialization, and Adam learning rate annealing [3]. PPO performed well in multiple benchmark problems for Reinforcement Learning [17].
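To make the clipped surrogate objective concrete, the sketch below evaluates it for a single sample and averages it over a batch. It shows only the objective of Schulman et al. [17], not a full PPO training loop; the function names and the clipping range of 0.2 are illustrative choices and are not taken from the experiments in this paper.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate term for one sample.

    ratio is pi_new(a|s) / pi_old(a|s) and advantage is the estimated
    advantage of the action taken.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # The minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_objective(ratios, advantages, clip_eps: float = 0.2) -> float:
    """Average clipped surrogate over a batch (to be maximized)."""
    terms = [clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

When the advantage is positive, the term stops growing once the ratio exceeds 1 + `clip_eps`; when it is negative, it stops improving once the ratio drops below 1 - `clip_eps`. This is what keeps each policy update small.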

2.3 Interaction Network

Oliveira et al. [11] proposed the Interaction Network (IN) to better understand swarm dynamics. IN is a framework that assesses the flow of information generated by the agents’ interactions. It captures the exchange of information between agents, seeking to understand how the agents influence each other. Oliveira et al. [12] have demonstrated that interaction networks help to compare, for instance, the balance between exploration and exploitation across algorithms. An IN is represented as a graph in which each node represents an agent and the edges represent the interactions between agents. The edges can be modelled in many ways; here, we model them as shown in Eq. 3. In this network, we do not consider how much one particle influenced another, but only who influenced whom over time, which reflects the structure of the communication topology [10].

$$\begin{aligned} I_{t_{i,j}} = \left\{ \begin{array}{ll} 1 &{} \text {if } j \in \vec {n}_{i} \\ 0 &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(3)

An IN can be evaluated individually or through the accumulation of successive networks. Iterations can be accumulated using a Time Window (TW) to capture the social interactions among agents over a given number of iterations. The TW allows an analysis of which agents were neighbors of each other within an interaction interval, so that large time windows make the interaction networks show the interactions that are repeated most often. Short TWs contain the most recent interactions, while a TW of 1 contains only the instantaneous interactions.
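The accumulation over a time window can be sketched as follows, assuming that the neighborhood of each particle is available as a set of particle indices at every iteration. The names `interaction_matrix` and `accumulate` are hypothetical, and the matrix orientation (row i, column j is 1 when j is in the neighborhood of i) simply follows Eq. 3.

```python
def interaction_matrix(neighbors):
    """Eq. 3 for one iteration: entry [i][j] is 1 if j is a neighbor of i."""
    n = len(neighbors)
    return [[1 if j in neighbors[i] else 0 for j in range(n)] for i in range(n)]


def accumulate(snapshots, start, stop):
    """Sum the per-iteration matrices of iterations start .. stop - 1.

    snapshots[t] is the interaction matrix of iteration t; the result is the
    cumulative interaction network of one time window.
    """
    window = snapshots[start:stop]
    if not window:
        raise ValueError("empty time window")
    n = len(window[0])
    total = [[0] * n for _ in range(n)]
    for matrix in window:
        for i in range(n):
            for j in range(n):
                total[i][j] += matrix[i][j]
    return total
```

A window of length 1, such as `accumulate(snapshots, t, t + 1)`, returns the instantaneous network of iteration t, while a longer window gives larger entries to the pairs of particles that interacted repeatedly.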

3 Methodology

We used the reinforcement learning framework for swarm intelligence created by Lira et al. [8], which is based on the Python programming language and RLlib. We chose PSO for our experiments since it is a widely known and deployed swarm intelligence metaheuristic in the literature [9].

The simulation runs in time steps, allowing different swarm configurations. At the beginning of each time step, the RL agent acts, selecting the Local, Global, or Dynamic topology. After a preset number of PSO iterations, the RL agent evaluates the reward and the new simulation state before starting the same cycle again, until the stop criterion is reached. We set this number of iterations to 10 in this paper. At the end of the training stage, the agent should be able to recommend the best topologies for functions with similar characteristics. We used PPO [17] as the reinforcement learning agent to solve this problem. The RL agent is responsible for learning the topologies that best adapt to the tested functions and for modifying the topology of the PSO based on the characteristics learned during execution.
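The cycle described above can be outlined as a simple loop. This is a sketch and not the interface of the framework of Lira et al. [8]: `run_episode` and its four callables are hypothetical placeholders, and the state and reward definitions are left to the caller because the text does not specify them. The values of 10 iterations per time step and 1000 iterations in total are those of the experimental setup.

```python
TOPOLOGIES = ("local", "global", "dynamic")
STEP_SIZE = 10          # PSO iterations between two agent decisions
MAX_ITERATIONS = 1000   # stop criterion of the PSO simulation


def run_episode(policy, run_pso, observe, reward_fn,
                max_iterations=MAX_ITERATIONS, step_size=STEP_SIZE):
    """Alternate agent decisions with blocks of PSO iterations.

    policy(state) returns an index into TOPOLOGIES, run_pso(topology, n)
    advances the swarm by n iterations, observe() returns the current state,
    and reward_fn(old_state, new_state) scores the last block of iterations.
    """
    history = []
    state = observe()
    for _ in range(max_iterations // step_size):
        action = policy(state)                      # the agent picks a topology
        run_pso(TOPOLOGIES[action], step_size)      # the swarm runs with it
        next_state = observe()
        history.append((TOPOLOGIES[action], reward_fn(state, next_state)))
        state = next_state
    return history
```

With these settings, the agent takes 100 decisions per simulation, one every 10 iterations.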

For simplicity and as a first validation, we used simple and widely used functions to evaluate the algorithms in unimodal and multimodal scenarios. Although it is challenging to appropriately cover all the scenarios needed to evaluate a methodology [15], we argue that these scenarios are well explored in the literature with regard to performance [9]. Thus, we selected the unimodal functions Schwefel 2.21, Sphere, and Schwefel 2.22, and the multimodal functions Rastrigin, Griewank, and Schwefel. The search space chosen for these functions is based on Plevris and Solorzano [15] (Table 1). Then, two instances of RL agents were trained per scenario, one for each type of function. We trained with Rastrigin for multimodal functions and with Schwefel 2.22 for unimodal functions. We expect the RL agent to learn the characteristics of each type of function by training on only one example.

Table 1. Description of the benchmark functions used in the simulations.
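For reference, the six benchmark functions can be written in a few lines of Python. These are the definitions commonly given in the literature and are offered as a sketch; the search-space bounds of Table 1 are not reproduced here, and the function names are illustrative.

```python
import math


def sphere(x):
    """Unimodal: sum of squares, minimum 0 at the origin."""
    return sum(xi * xi for xi in x)


def schwefel_2_21(x):
    """Unimodal: largest absolute coordinate, minimum 0 at the origin."""
    return max(abs(xi) for xi in x)


def schwefel_2_22(x):
    """Unimodal: sum plus product of absolute coordinates, minimum 0 at the origin."""
    return sum(abs(xi) for xi in x) + math.prod(abs(xi) for xi in x)


def rastrigin(x):
    """Multimodal: regular grid of local minima, global minimum 0 at the origin."""
    return 10.0 * len(x) + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)


def griewank(x):
    """Multimodal: quadratic bowl modulated by a product of cosines."""
    total = sum(xi * xi for xi in x) / 4000.0
    product = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return 1.0 + total - product


def schwefel(x):
    """Multimodal: global minimum near 420.9687 in every coordinate."""
    return 418.9829 * len(x) - sum(xi * math.sin(math.sqrt(abs(xi))) for xi in x)
```

Unlike the other five functions, the Schwefel function has its global minimum far from the origin and close to the boundary of the usual search space.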

A given metaheuristic may perform well on a function with few dimensions and poorly on a function with many dimensions. This problem is known as the “Curse of Dimensionality”, a well-known problem in data science that refers to phenomena that arise when analyzing and organizing information in high-dimensional spaces and that do not occur in low-dimensional ones [15]. Due to this problem, two different scenarios were empirically chosen for the number of dimensions and particles: Scenario 1 with 20 particles and 50 dimensions, and Scenario 2 with 10 particles and 25 dimensions. In both scenarios, we used 1000 iterations as the stop criterion for the PSO simulations. These values were chosen to guarantee convergence for multidimensional problems [15], and we point out that the number of particles is smaller than usually used in the literature in order to make the problem more complex with a smaller dimensionality. Additionally, in PSO, we used the parameters \(c_{1}=c_{2}=2.05\), and \(p_{k}\text {-}failure^{T}=1\).

For the evaluation, we selected three metrics: (i) fitness, which is an indication of algorithm success; (ii) the distribution of the selected topology among Local, Global, and Dynamic, i.e., which actions the Reinforcement Learning agent recommended the most; (iii) interaction networks (IN) [10], which are used to observe the accumulated interactions of the agents during a defined time interval, allowing us to analyse the importance of the selected topologies over time. We also execute PSO without reinforcement learning to observe how each topology performs on each chosen function. This information serves as a baseline for the results obtained with reinforcement learning: we expect the topologies that perform best on the tested functions to make up the majority of the agent’s recommendations.

4 Results

We divided our results into three subsections. First, we show the fitness performance of the RL proposal compared with PSO using the Global, Local, and Dynamic topologies. Next, we focus on our proposal, evaluating the topologies chosen in each scenario. Finally, we analyze the Interaction Networks in the RL approach.

4.1 Evaluating the Fitness

In Figs. 1 and 2, we present the boxplots of the best fitness values found in 50 simulations in each scenario. We see that RL applied to PSO reached, in general, better performance than using the PSO with a specific topology for six functions.

For Scenario 1 (Fig. 1), we see that, for the unimodal functions, the RL approach performed as well as the best communication topology found for each function. RL was less efficient for the multimodal functions, and for the Schwefel function it reached the worst results. In Scenario 2 (Fig. 2), our approach reaches competitive results. However, the results again show that training on Rastrigin led RL to achieve poor results on Schwefel.

Fig. 1. Boxplot of the best fitness found in 50 simulations of each algorithm in Scenario 1.

We then compared the results using a Wilcoxon signed-rank test with a confidence level of 99.9% in Tables 2 and 3. ‘–’ indicates no statistical difference between the solutions, ‘\(\blacktriangle \)’ indicates that the RL approach achieved better results than the other algorithm, and ‘\(\blacktriangledown \)’ indicates that our proposal reached worse results than the algorithm compared. Based on the Wilcoxon test results, the RL approach appears capable of solving different functions with similar characteristics even when it is trained on only one of them. Only for the Schwefel function did it not work well.
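As an illustration of how the signed-rank statistic is obtained, the sketch below computes the sums of positive and negative ranks for two paired samples, such as the best fitness of two algorithms over the same set of runs. The function name is hypothetical, the pairing of runs is an assumption, and the p-value needed to decide significance at the 99.9% level is not computed here; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used for that.

```python
def signed_rank_statistic(a, b):
    """Return (W_plus, W_minus) of the Wilcoxon signed-rank test.

    a and b are paired samples of equal length; zero differences are dropped
    and tied absolute differences receive their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next absolute difference is tied.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

When minimizing, a large `W_minus` for the differences between the RL results and those of another algorithm means that the RL approach obtained the lower fitness in most pairs.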

Fig. 2. Boxplot of the best fitness found in 50 simulations of each algorithm in Scenario 2.

Table 2. Results of fitness values and Wilcoxon test with a confidence level of \(99.9\%\) comparing the RL with the other algorithms for Scenario 1.
Table 3. Results of fitness values and Wilcoxon test with a confidence level of \(99.9\%\) comparing the RL with the other algorithms for Scenario 2.

4.2 Evaluation of the Selected Topologies

We now analyse the distribution of the selected topologies by RL over time steps. We expect that even by training on one example of a benchmark function (Schwefel 2.22 or Rastrigin), the RL agent will be able to learn a good policy for solving similar functions.

Fig. 3. Percentage of times that a topology was selected by the agent, with its respective fitness evolution on the bottom of each plot for Scenario 1.

Fig. 4. Percentage of times that a topology was selected by the agent, with its respective fitness evolution on the bottom of each plot for Scenario 2.

We plot the percentage of times that each topology was chosen at each time step, together with the evolution of the best fitness over the iterations, in Figs. 3 and 4. We can observe that in the first phase, the “exploration phase”, the Global topology was chosen most of the time, whereas the choice in the “exploitation phase” varies across experiments. The Dynamic topology was the most frequently chosen in the “exploitation phase”, indicating that the swarm needs more diversity in its connections to improve the fitness. In the “exploration phase”, being widely connected is more important than having a diverse set of connections. Therefore, regardless of whether the function is unimodal or multimodal, diversity in the connections may be more beneficial once the swarm starts to exploit.
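The percentages shown in these plots can be computed as in the following sketch, assuming that the topology chosen at each time step has been logged for every simulation. The function name and the data layout (`runs[r][t]` is the topology chosen in run r at time step t) are illustrative.

```python
from collections import Counter

TOPOLOGIES = ("local", "global", "dynamic")


def selection_percentages(runs):
    """Percentage of runs that selected each topology at each time step."""
    n_runs = len(runs)
    n_steps = len(runs[0])
    result = []
    for t in range(n_steps):
        counts = Counter(run[t] for run in runs)
        # Counter returns 0 for topologies that were never chosen at step t.
        result.append({name: 100.0 * counts[name] / n_runs for name in TOPOLOGIES})
    return result
```

Each entry of the returned list corresponds to one time step, and its three values add up to 100.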

The fitness improvement is larger while using the Global topology, but we argue that this is not because this topology is more efficient for the swarm. Rather, it may be because improving is easier in an “exploration phase”. For the Schwefel function, on which the swarm did not converge, the most frequently chosen topology is the Local, corroborating the literature, which reports that this topology works better than the Global topology for complex multimodal functions.

Fig. 5. Interaction Network generated from the simulation with the best fitness for each function of Scenario 1.

4.3 Analysing the Interaction Network

We are also interested in understanding how the agents influence each other in their movement over the iterations. We use the cumulative Interaction Network (IN) to analyze the social interactions of the best simulation for each experiment in four time windows: (i) between 0 and 99 iterations, (ii) between 100 and 199 iterations, (iii) between 200 and 299 iterations, and (iv) between 300 and 999 iterations, shown in Figs. 5 and 6. Each line represents the intensity of the influence of one particle on the displacement of the other particles. Therefore, strong lines (yellow-red) indicate particles that strongly influence the swarm, and strong columns represent particles that are strongly influenced by the swarm. By analyzing the networks, we can identify which topology had the greatest impact in each time window. In Sect. 4.2, we saw which topologies were chosen most frequently across simulations; here, we can observe which topologies had the greatest impact on the movement in the best experiments. Strong diagonals, strong lines, and random points indicate that the Local, Global, and Dynamic topology, respectively, substantially affected the displacement.

We observe that for the best simulations in Scenario 1 (Figs. 3 and 5), for unimodal functions, the Global topology was chosen most often, combined with the Dynamic topology at the end of the simulation, which can also be observed in the networks. Nevertheless, the Local topology strongly affects the movement from the middle to the end of the simulation (see the diagonals for Sphere and Schwefel 2.22). For the multimodal functions, the Local topology also appears as an essential element of the best simulations, even though it was not the most frequently chosen one for the Rastrigin and Griewank functions.

In Scenario 2 (Figs. 4 and 6), we observe some similarities to Scenario 1, but the Dynamic topology is more present in the networks. The importance of the Dynamic topology is in line with its performance, reported in Table 3. Although the Global topology was frequently chosen for the multimodal functions, its effect on the networks was less substantial than that of the other topologies.

Fig. 6. Interaction Network generated from the simulation with the best fitness for each function of Scenario 2.

5 Conclusions

In this paper, we applied RL to PSO, allowing the swarm to change its communication topology over time. We evaluated the efficiency of RL when it is trained on one function and tested on two other functions with similar characteristics. We trained the agents on two well-established functions in the literature (Rastrigin and Schwefel 2.22) and tested them on two other multimodal and unimodal functions, respectively.

Using our simulated scenario, we demonstrated that applying Reinforcement Learning in Swarm Intelligence could be efficient across functions. We observed that RL could learn how to adapt to the environment even when not trained in the same function, indicating the capability of transferring learning among functions. Nevertheless, a more comprehensive set of experiments is still essential for drawing stronger conclusions.

Our work is a further step towards understanding how to automate the use of Swarm Intelligence for unknown problems and how to interpret the performance and patterns of SI. Swarm Intelligence still requires domain expertise, so it is not yet as straightforward to use as it could become.

In our future work, we aim to understand more clearly the reason for the topologies selected, seeking to understand why some functions, such as Schwefel, obtained worse results in Reinforcement Learning. It is also necessary to evaluate if the training in a single unimodal or multimodal function is enough for the agent to learn the characteristics presented by the functions. The RL agent may more accurately identify the characteristics of the observed functions using multiple functions with the same characteristics in the training phase.