Abstract
General Game Playing (GGP) is a challenging domain for AI agents, as it requires them to play diverse games without prior knowledge. In this paper, we develop a strategy to improve move suggestions in time-constrained GGP settings. This strategy consists of a hybrid version of UCT that combines Sequential Halving and \(\text {UCB}_{\sqrt{}}\), favoring information acquisition at the root node rather than overspending time on the most rewarding actions. Empirical evaluation using a GGP competition scheme from the Ludii framework shows that our strategy improves the average payoff over the entire competition set of games. Moreover, our agent makes better use of extended time budgets when available.
1 Introduction
General Game Playing (GGP) is a research area focused on developing intelligent agents capable of playing a wide variety of games without prior knowledge of any specific game being played [5]. GGP agents receive the rules of potentially unknown games and must play them effectively. This prevents the creation of game-specific heuristics.
The Upper Confidence for Trees (UCT) [7] algorithm has been effectively utilized in GGP environments. UCT is based on building a search tree using Monte Carlo Tree Search (MCTS). MCTS employs Monte Carlo simulations to iteratively build a game tree.
UCT guarantees asymptotic optimal convergence. However, it may spend too much time on the most promising choice so far, instead of attempting other less explored options to potentially find a better one. This results in UCT taking too long to produce high-quality recommendations in certain scenarios. A significant challenge in GGP is designing algorithms that can efficiently find solutions in a timely manner, particularly in competitive contexts, where the time required to find a solution is critical to the agent’s performance.
In this paper, we tackle the problem of GGP with scarce time resources. Specifically, we focus on the following question: Is UCT the best option for GGP environments with strict time constraints?
In response, we develop UCT\(_{{\sqrt{\text {SH}}}}\) (presented in Sect. 4). Our algorithm is based on a hybrid scheme for building game trees through MCTS [15], presented in Sect. 3. UCT\(_{{\sqrt{\text {SH}}}}\) aims to be more exploratory at the root node than UCT, avoiding repeatedly probing the currently most promising action and instead using the budget to find potentially better alternatives.
We conduct two distinct experiments (Sect. 5). First, we present the Prize Box Selection experiment, a simplified Multi-Armed Bandit (MAB) problem used to compare how selection policies allocate their resources in scenarios with high and low reward variance. The second experiment measures the agents’ performance relative to UCT under time constraints. For this purpose, we use the Ludii GGP environment [12]. Specifically, we use the Kilothon tournament, one of the tracks of Ludii’s GGP competition. Such international competitions have a crucial role in motivating GGP research [14].
The main contributions of this paper are as follows: (i) the UCT\(_{{\sqrt{\text {SH}}}}\) algorithm, a new decision-making method that attempts to use less budget on the greedy choice in favour of less-explored ones; (ii) the Clock Bonus Time (cbt) approach, which enhances the estimation of thinking time in a GGP environment; (iii) the Prize Box Selection experiment, which highlights the resilience of UCT\(_{{\sqrt{\text {SH}}}}\)’s allocation criteria on decision-making problems with high and low variance, compared to \(\text {UCB}_1\) and other selection policies (examined in Sects. 2 and 3); and (iv) empirical evidence of the improved performance of UCT\(_{{\sqrt{\text {SH}}}}\) over UCT in a GGP environment, suggesting its effectiveness as a selection policy.
2 Monte Carlo Methods
Monte Carlo techniques employ random sampling via simulations to acquire information to address problems that are otherwise intractable. In game playing, Monte Carlo methods can be used to evaluate a game-tree node by computing the expected outcome of its actions, sampling a sufficiently large number of random completions of the game, also called playouts. This section presents concepts regarding Monte Carlo methods that are used throughout the remainder of the paper.
2.1 Regret on Bandit Problem
Multi-Armed Bandit (MAB) problems [1] represent a category of decision-making scenarios in which the outcomes of chosen actions are unknown. Imagine a casino slot machine (bandit) with k distinct arms, each with its own reward and probability of winning. The gambler’s objective is to plan a strategy that maximizes their overall profit in a previously unknown bandit. The challenge lies in determining the number of times to pull each arm to maximize returns (exploitation) while learning rewards and probability distributions (exploration).
One way to measure performance in the Multi-Armed Bandit problem is via regret, defined as the difference in the reward obtained from the pulled versus the optimal arm. We use two important measures of regret adapted from the definitions of Pepels [10] and Bubeck [3]. Specifically, we use cumulative and simple regret from Definitions 1 and 2.
Definition 1
Cumulative regret is the regret accumulated over a number n of arm pulls. Let \(\mu ^\star \) be the expected reward of the best arm and \(\mu _i\) the expected reward of the arm pulled in the i-th trial. Then, the cumulative regret \(R_n\) can be defined as:

$$R_n = \sum _{i=1}^{n} \left( \mu ^\star - \mu _i \right) \qquad (1)$$
An alternative experimental setup involves just finding the optimal arm without the need to maximize the reward during this process. Then, the gambler makes a final pull in this arm. Simple regret then measures the sub-optimality of this final pull, as per Definition 2.
Definition 2
Simple regret is the difference between the expected reward of the best arm \(\mu ^\star \) and the expected reward of the arm chosen for the final pull \(\mu _n\):

$$r_n = \mu ^\star - \mu _n \qquad (2)$$
Bubeck et al. [3] showed that there is a trade-off between minimizing cumulative and simple regret. Specifically, they found that a smaller upper bound on cumulative regret (\(R_n\)) leads to a larger lower bound on simple regret (\(r_n\)), meaning that when an algorithm performs well in terms of cumulative regret (worst-case scenario), it is likely to have a higher minimum simple regret (best-case scenario), and vice-versa.
2.2 Upper Confidence Bound
Upper Confidence Bound (UCB) [1] is a selection policy in MAB problems and MCTS. The policy optimizes cumulative regret over time at a logarithmic rate over the number of trials performed. A widely adopted variant, \(\text {UCB}_1\), is favored for its simplicity and its ability to consistently deliver robust performance outcomes.
\(\text {UCB}_1\) establishes a statistical confidence interval for the estimated mean action-value.
The \(\text {UCB}_1\) equation, adapted from Auer et al. [1], is presented in Eq. 3:

$$\pi _{UCB_1}(s) = \mathop {\mathrm {arg\,max}}\limits _{a} \left[ Q_{s,a} + C \sqrt{\frac{\ln N_s}{n_{s,a}}} \right] \qquad (3)$$
Here, \(Q_{s,a}\) is the mean reward from action a, when selected from state s. \(N_s\) is the number of visits on state s, while \(n_{s,a}\) is the number of times action a has been selected in state s. The square-root term quantifies the upper-confidence bound, i.e., the uncertainty in the estimate of taking action a on state s. It serves as a bonus to encourage exploring less visited actions.
\(\text {UCB}_1\) offers a desirable property: the discovery process can be interrupted at any time, providing an estimate of each option’s quality based on collected samples. This anytime property allows for more flexibility in managing computational resources.
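To make the policy concrete, here is a minimal \(\text {UCB}_1\) selection loop in Python. The Bernoulli bandit, the function names, and the exploration constant are illustrative assumptions, not taken from the paper:

```python
import math
import random

def ucb1_select(q, n, total_pulls, c=math.sqrt(2)):
    """Return the index of the arm maximizing the UCB1 score.

    q[a] is the cumulative reward of arm a, n[a] its pull count,
    and total_pulls the total number of pulls so far (N_s in Eq. 3)."""
    for a in range(len(n)):  # pull every arm once before scoring
        if n[a] == 0:
            return a
    return max(range(len(n)),
               key=lambda a: q[a] / n[a]
               + c * math.sqrt(math.log(total_pulls) / n[a]))

def run_bandit(probs, pulls, rng):
    """Play a Bernoulli bandit for a number of pulls; return per-arm pull counts."""
    q = [0.0] * len(probs)
    n = [0] * len(probs)
    for t in range(pulls):
        a = ucb1_select(q, n, t + 1)
        q[a] += 1.0 if rng.random() < probs[a] else 0.0
        n[a] += 1
    return n
```

Because \(\text {UCB}_1\) is anytime, this loop can be interrupted after any pull and the arm with the best empirical mean recommended.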
2.3 Sequential Halving
Sequential Halving [6] is a flat, non-exploiting approach (see Note 3) to the MAB problem. The algorithm evenly distributes a pre-defined budget (representing the number of simulations to be conducted) among all actions and progressively eliminates the bottom half in terms of performance. While effective at reducing simple regret compared to \(\text {UCB}_1\), it reduces cumulative regret at a slower rate. Sequential Halving cannot be interrupted prematurely and offers less robust assurances regarding the quantity of suboptimal selections made [6].

Algorithm 1 depicts Sequential Halving, employing a tree-like structure for compatibility with subsequent algorithms. Each node, denoted as v, retains the following data: a state s, total reward Q, number of visits N, and a list of child nodes originated from v (produced by the \(\textsc {children}\) function if it is not a leaf node). We use k to restrict the quantity of possible actions, and the \(\textsc {head}\) function gives the first k children of \(v_{root}\). The Sequential Halving formula in Line 7 divides the budget \(\mathcal {B}\) by the number of times \(v_{root}\)’s children can be halved, given by \(\log _2{|\textsc {children}(v_{root})|}\). To distribute the budget over the remaining children in the current iteration, \(\mathcal {B}\) is also divided by k.
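A flat Sequential Halving round can be sketched in Python as follows, under the assumption that the budget is split evenly across the \(\lceil \log _2 k \rceil \) halving rounds (the function names are illustrative):

```python
import math

def sequential_halving(k, budget, pull):
    """Return the surviving arm after repeatedly halving the candidate set.

    k: number of arms (identified as 0..k-1); budget: total number of pulls;
    pull(a): samples a reward for arm a."""
    active = list(range(k))
    total = [0.0] * k
    count = [0] * k
    rounds = max(1, math.ceil(math.log2(k)))
    for _ in range(rounds):
        # spread this round's share of the budget evenly over surviving arms
        per_arm = max(1, budget // (rounds * len(active)))
        for a in active:
            for _ in range(per_arm):
                total[a] += pull(a)
                count[a] += 1
        # keep the better-performing half (always at least one arm)
        active.sort(key=lambda a: total[a] / count[a], reverse=True)
        active = active[:max(1, (len(active) + 1) // 2)]
        if len(active) == 1:
            break
    return active[0]
```

Note that, unlike \(\text {UCB}_1\), this procedure only yields a recommendation once the budget is exhausted.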
2.4 Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) [13] employs Monte Carlo simulations to iteratively build a game tree. MCTS is designed to progressively converge on the best action as it gathers more statistical information about the domain.
MCTS is based on two principles: (1) with sufficient time, the sampled average reward from random simulations converges to the true state value, and (2) previous samples can guide future searches.

Algorithm 2 outlines the MCTS process, which starts by instantiating a root node, denoted as \(v_{root}\). A node v consists of a state s, the list of applicable actions in s, the parent node, a list of children, and the Q and N values for cumulative reward and visit count, respectively.
The search process involves the following four steps:
- Selection: beginning at the tree root, the selection phase traverses the tree using a tree policy (\(\pi \)) that guides the search towards promising nodes, until finding a node with untried actions.
- Expansion: a node is expanded by applying a random untried action to its state, resulting in a new child node. This new node is initialized with the new state, a list of applicable actions, an empty child list, and its parent reference.
- Simulation: a playout evaluates the potential reward r of the new node. This is done by following a default policy (\(\pi _{\varDelta }\)), which usually applies random actions until reaching a terminal state.
- Backpropagation: each node from \(v_{k+1}\) up to the root is updated: its Q is increased by the reward r and its N increases by 1.
The search process continues until the algorithm uses up a specified resource \(\mathcal {R}\), which can be time or a number of iterations. The \(\textsc {recommend}\) function then selects a move according to one of three criteria: Max Child, with the highest Q value; Robust Child, with the highest N value; or Max-Robust Child, combining both Q and N values.
The most popular tree policy for the selection phase is the Upper Confidence Bound (\(\text {UCB}_1\), Sect. 2.2), which considers each node as an individual MAB problem. When used in MCTS, the algorithm is called Upper Confidence Bound Applied to Trees (UCT). MCTS and UCT exhibit an anytime property, allowing them to recommend useful actions even if the search execution is interrupted.
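To make the four phases concrete, below is a self-contained UCT sketch for a toy Nim game (players alternately take 1 or 2 stones; taking the last stone wins). The game, class layout, and constants are illustrative assumptions, not the paper's implementation:

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player          # player = side to move
        self.parent, self.move = parent, move
        self.children = []
        self.untried = [m for m in (1, 2) if m <= stones]  # applicable actions
        self.q, self.n = 0.0, 0                            # cumulative reward, visits

def uct_search(stones, iters, c=1.4, rng=random):
    root = Node(stones, player=0)
    for _ in range(iters):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal
        while not node.untried and node.children:
            node = max(node.children,
                       key=lambda ch: ch.q / ch.n
                       + c * math.sqrt(math.log(node.n) / ch.n))
        # 2. Expansion: apply one random untried action
        if node.untried:
            m = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - m, 1 - node.player, node, m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout; the player taking the last stone wins
        stones_left, player = node.stones, node.player
        winner = 1 - node.player if stones_left == 0 else None
        while winner is None:
            stones_left -= rng.choice([m for m in (1, 2) if m <= stones_left])
            if stones_left == 0:
                winner = player
            player = 1 - player
        # 4. Backpropagation: reward 1 for the player who moved into each node
        while node is not None:
            node.n += 1
            node.q += 1.0 if winner != node.player else 0.0
            node = node.parent
    return max(root.children, key=lambda ch: ch.n).move  # Robust Child
```

With 7 stones the winning strategy is to leave a multiple of 3, so the search should converge to taking 1 stone.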
3 Alternatives to UCT
Simple regret (SR) minimization is closely related to choosing a child node of the root at the recommendation phase, while cumulative regret (CR) is related to the search process. \(\text {UCB}_1\) has optimal bounds on cumulative regret, but it is penalized in terms of simple regret. At the root node, sampling in MCTS/UCT typically focuses on finding the best move with high confidence. Once \(\text {UCB}_1\) identifies such a move, it continues to spend time on it, possibly resulting in low information gain [15].
In time-sensitive situations, not considering other options and continuing with the current best choice is a potential flaw. Bubeck et al. [3] show that \(\text {UCB}_1\) exhibits a slow decrease in terms of simple regret, with the best-case scenario being a polynomial rate of decrease. This can be problematic when fast recommendations are required. Karnin et al. [6] suggest that more exploratory policies, like the Sequential Halving algorithm (Sect. 2.3), have better bounds on simple regret minimization.
3.1 \(\text {UCB}_{\sqrt{}}\) and SR+CR
Tolpin and Shimony [15] modify \(\text {UCB}_1\)’s policy into \(\text {UCB}_{\sqrt{}}\). This policy adjusts the \(\text {UCB}_1\) formula using a quicker-growing sublinear function, leading to a faster increase in the uncertainty bonus. The new policy changes the \(\ln N_s\) term in \(\text {UCB}_1\) to \(\sqrt{N_s}\), aiming to narrow the gap between the selections of non-greedy nodes.
Tolpin and Shimony also point out that nodes closer to the root and those deeper in the tree have different goals. The former is more crucial for move recommendations. As a result, the search strategy near the root should prioritize reducing simple regret more quickly, while nodes deeper in the tree should aim to match the value of taking the optimal path, aligning more with cumulative regret minimization.
The Simple Regret plus Cumulative Regret (SR+CR) scheme proposed by Tolpin and Shimony integrates two different policies to strike a balance between minimizing simple regret and cumulative regret. They introduced two specific algorithms, both of which combine the UCT policy with more exploratory strategies. The first one, \(\text {UCB}_{\sqrt{}} + \text {UCT}\), operates by applying the \(\text {UCB}_{\sqrt{}}\) policy at the root node and \(\text {UCB}_1\) at all nodes below the root. Their second algorithm, \(\frac{1}{2}\text {-greedy} + \text {UCT}\), selects a node at the root uniformly at random with 50% probability.
3.2 Hybrid Monte Carlo Tree Search
Hybrid Monte Carlo Tree Search (H-MCTS) [10] is a SR+CR algorithm that combines the Sequential Halving Applied on Trees (SHOT) algorithm [4] with UCT. SHOT is a recursive adaptation of Sequential Halving, used for constructing game-trees using a non-exploiting policy. In H-MCTS, SHOT is used not only at the root node but also deeper within the tree to address simple regret minimization.
The proposed method switches from UCT to SHOT when the computational budget spent in a node reaches a certain threshold, at which point changing the policy is considered safe (a subset of good moves has already been identified and evaluated). The algorithm shifts its focus from minimizing cumulative regret to minimizing simple regret once it begins expanding SHOT within UCT regions that have been sufficiently visited. Since the computational budget per node is initially small, the simple regret tree remains shallow; as SHOT eliminates nodes from selection, the budget spent increases, causing the SHOT tree to grow deeper.
H-MCTS outperforms UCT for various exploration coefficients [10] and is highly effective in games with large branching factors, as it prunes unpromising nodes and directs the search towards the most promising areas.
4 Improving UCT in Time-Restricted Scenarios
In this section, we present a new SR+CR algorithm, and a new method to calculate the amount of time the agent should spend on each move, assuming a time constraint for the entire game.
4.1 \(\text {UCT}_{{\sqrt{\text {SH}}}}\)
Although H-MCTS is promising at balancing simple and cumulative regret, it requires a predefined budget for the SHOT portion, which cannot be estimated for previously unknown domains. Furthermore, by neglecting the exploitation of nodes, the agent becomes prone to excessive resource allocation in unpromising regions.
To enhance the performance of UCT in a GGP environment with rigid time constraints, we propose a different SR+CR method, using Sequential Halving and \(\text {UCB}_{\sqrt{}}\), as shown in Algorithm 3.

UCT\(_{{\sqrt{\text {SH}}}}\) prioritizes simple regret minimization at the root node by combining \(\text {UCB}_{\sqrt{}}\) with Sequential Halving eliminations, while the cumulative regret component uses UCT. In UCT\(_{{\sqrt{\text {SH}}}}\), the aim of Sequential Halving is not to converge to the single best move, but rather to limit the number of children to search, which allows \(\text {UCB}_{\sqrt{}}\) to explore the most promising areas.
We set a lower limit, \(k_{min}\), on the number of child nodes necessary for the elimination process to occur. The halving operation is executed by dividing k by two. During the child selection process at the root, we use k to constrain the selection to the first k child nodes. Upon reaching a new elimination point, the algorithm arranges the root’s child nodes in descending order according to their anticipated reward.
We employ an iterative methodology (presented in line 11) to determine when to halve the number of children. We compute a ratio representing the fraction of halving stages already completed: the halve counter h divided by the maximum number of halving operations, \(\log _2{n}\). We compute the resource allocation required for the next halving operation by multiplying this ratio by the total resource \(\mathcal {R}\). Once the used resource r surpasses this value, we increment h by one. This method ensures that \(\mathcal {R}\) is divided equally across all halving stages.
A key distinction from traditional MCTS lies in the separate treatment of root selection. The root selection, depicted in line 13, iterates over the first k children by calling the \(\textsc {head}(ch, k)\) function. The selected child is the one that maximizes \(\pi _{UCB_{\sqrt{}}}\). Notice that, rather than eliminating moves based on the number of visits a node has, we determine when to apply eliminations based on \(\mathcal {R}\), which can be time or a number of simulations.
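The resource-based elimination schedule described above can be sketched as follows. This is a hypothetical helper, assuming the halving points split \(\mathcal {R}\) evenly across the \(\log _2{n}\) stages and that k never drops below \(k_{min}\):

```python
import math

def halving_points(R, n, k_min=2):
    """Return (resource threshold, new k) pairs for the root eliminations.

    R: total resource (time or simulations); n: initial number of children.
    After the used resource passes a threshold, only the top k children
    (sorted by estimated reward) remain eligible for selection."""
    stages = max(1, int(math.log2(n)))
    points, k = [], n
    for h in range(1, stages + 1):
        if k <= k_min:          # stop eliminating below the k_min floor
            break
        k = max(k // 2, k_min)  # the halving operation: k divided by two
        points.append((h / stages * R, k))
    return points
```

For example, with \(\mathcal {R}=900\) simulations and \(n=8\) children, the candidate set would shrink to 4 after 300 simulations and to 2 after 600, with no further halving once \(k_{min}\) is reached.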
4.2 Clock Bonus Time
GGP agents play games without prior knowledge, usually with a time limit for the entire game. Agents must allocate time for each move. A common strategy is to use a predefined fixed time budget, which can lead to inefficient time management, either by exhausting the time in long games or leaving unused time in shorter games. We propose a method for estimating the time to spend on each move in a GGP environment, using a certain number of simulations during the search to gather information about the game itself. Our model employs a minimum thinking time and, for games where the agent can spend more time, it provides a thinking time bonus. The Clock Bonus Time (cbt) formula is as follows:

$$\text {cbt} = \max \left( \tau _{min},\; \min \left( \tau _{max},\; \frac{G}{\overline{m}} \right) \right) - \tau _{min} \qquad (4)$$
In Eq. 4, G is the total time budget; \(\tau _{min}\) and \(\tau _{max}\) are the minimum and maximum allowed thinking times per move, respectively. The bonus is given by G divided by the estimated number of moves left to finish the game, \(\overline{m}\), which we compute using playouts. The max and min functions ensure that the agent performs at least the minimum thinking time and avoids overestimating the time it has. We then discount \(\tau _{min}\) because the result is a bonus added on top of the minimum thinking time. One way to integrate cbt with MCTS is to call cbt after half of \(\tau _{min}\) has passed, i.e., when \(r \ge \mathcal {R}/2\).
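A sketch of this computation with the experiment's values (\(\tau _{min}=0.5\) s, \(\tau _{max}=2\) s); the function name and the playout-based estimate of \(\overline{m}\) are assumptions:

```python
def clock_bonus_time(G, m_est, t_min=0.5, t_max=2.0):
    """Bonus thinking time per move: G / m_est clamped to [t_min, t_max],
    minus t_min, since the result is added on top of the minimum thinking time.

    G: remaining game time budget (s); m_est: estimated moves left,
    obtained from playouts during the search."""
    return max(t_min, min(t_max, G / m_est)) - t_min
```

For instance, with 60 s left and an estimated 40 moves remaining, the per-move share is 1.5 s, yielding a 1.0 s bonus on top of \(\tau _{min}\); a very long game (say 500 estimated moves) yields no bonus at all.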
5 Experiments
To evaluate UCT\(_{{\sqrt{\text {SH}}}}\), we conduct two experiments. First, we examine the agent’s decision-making in two different scenarios of reward variance. This design simulates game situations where making a suboptimal choice significantly affects the outcome (high variance), as well as those where suboptimal choices have a milder effect and are less harmful (low variance). In these latter scenarios, however, a series of misjudgments due to lack of confidence in the most rewarding decision can still lead to an overall loss.
The second experiment examines our agent’s performance in a practical context. We use a game competition called Kilothon, hosted within the Ludii environment, as a benchmark for UCT\(_{{\sqrt{\text {SH}}}}\)’s performance in a GGP environment.
5.1 Prize Box Selection Experiment
The Prize Box Selection Experiment is a simplified MAB where there are K boxes containing a deterministic amount of money. The money for each box is pre-selected from a Gaussian distribution \(N(\mu ,\sigma )\). We test different policies for a given number of trials and number of boxes, recording how often the policy selects each box during the experiment.
We compare \(\text {UCB}_1\), Sequential Halving, \(\text {UCB}_{\sqrt{}}\), and UCB\(_{\sqrt{\text {sh}}}\) (the root policy of UCT\(_{{\sqrt{\text {SH}}}}\)) in scenarios with high and low variance in the reward distributions. Both scenarios use 10000 trials distributed across 30 boxes. In the low variance case, we set \(\mu =0.3\) and \(\sigma = 0.05\), limited to [−0.5, 0.5]. In the high variance case, we set \(\mu = 0.3\) and \(\sigma = 0.5\), limited to [−1, 1].
Figure 1 depicts the low variance scenario, where the boxes are arranged in descending order, displaying only the 20 boxes with the highest rewards (i.e., boxes 1–10 have the lowest reward and are omitted).
In this scenario, \(\text {UCB}_{\sqrt{}}\) presents the most dispersed trials among all boxes, with \(\text {UCB}_1\) following a similar pattern. The use of eliminations in this scenario guides the policies that adopt them towards more focused selections: UCB\(_{\sqrt{\text {sh}}}\) and Sequential Halving concentrated a higher quantity of resources on a smaller subset of boxes than \(\text {UCB}_1\) and \(\text {UCB}_{\sqrt{}}\).
In the high variance test (Fig. 2), both \(\text {UCB}_1\) and UCB\(_{\sqrt{\text {sh}}}\) exhibit a stronger preference for the highest-rewarded box, while Sequential Halving does not change its selection distribution regardless of the reward distribution, due to its non-exploiting nature. \(\text {UCB}_{\sqrt{}}\) and Sequential Halving share more similar selection frequencies, indicating their preference for exploration over \(\text {UCB}_1\). UCB\(_{\sqrt{\text {sh}}}\) avoids overspending trials on the best box in high-variance scenarios, which can be desirable from the perspective of simple regret minimization, although it still shows a clear preference for the best-rewarding box.
Our analysis emphasizes that while exploiting (i.e. allocating budget to the most promising option) is valuable for reward maximization, exploration is crucial in game playing as it leads to the rapid discovery of beneficial moves. In this context, UCB\(_{\sqrt{\text {sh}}}\) displays a particularly desirable quality of enhanced exploration in our experiments.
Essentially, the primary objective is to identify and execute the optimal move in the game; the frequency of selecting the best move during the search is not of primary concern. Furthermore, UCB\(_{\sqrt{\text {sh}}}\) appears to be less sensitive to varying reward distributions. This resilience stems from its ability not to overlook exploitation in high-variance scenarios, and to focus resources in low-variance situations through the application of Sequential Halving eliminations.
5.2 GGP Competition Experiment
Ludii is a system for general game research which has contributed significantly to the field. Games are implemented using Ludii’s Game Description Language (GDL). Ludii’s GDL is robust and straightforward; it allows researchers and game designers to create new games and even reproduce historical ones [12].
Ludii hosted a GGP competition in which the chosen games were turn-based, adversarial, sequential, and fully observable, including deterministic and stochastic games. Kilothon was one of the competition tracks: participants play 1094 games against an implementation of the UCT algorithm native to Ludii (which we will refer to as the Kilothon agent). Each agent has a strict one-minute time limit to play each game in its entirety. When the one-minute time limit is reached, the agent must resort to random moves until the game concludes.
The Kilothon agent uses a fixed thinking time of 0.5 s per move and incorporates two modifications to the pure UCT algorithm: Tree Reuse, which enables the agent to store the search tree from previous plays and reuse it in the future, and Open Loop [11], for dealing with stochastic games.
Experimentation. As a baseline, we implemented our own UCT version, with neither tree reuse nor open loop, to compete against the Kilothon agent and compare results with UCT\(_{{\sqrt{\text {SH}}}}\). For both, we use 0.5 s of thinking time, and we distinguish agents that use the cbt method (which also uses 0.5 s as its \(\tau _{min}\) and 2 s as its \(\tau _{max}\)). We conduct 10 Kilothon trials for each agent, computing the average payoff of our agents to evaluate their effectiveness in Kilothon.
Table 1 presents the average payoff ± standard deviation of our tested agents across all games, along with the maximum payoff achieved by each of them. Our results highlight the performance of the UCT\(_{{\sqrt{\text {SH}}}}\) method over UCT, which achieves better scores than UCT both before and after adding the cbt method. UCT\(_{{\sqrt{\text {SH}}}}^{cbt}\) had the highest score, which would have achieved second place in the official competition, where the first place scored 0.231 and the second 0.031.
The board games category contains the vast majority of games in the Kilothon contest. Ludii board games are classified according to the following categories [2, 8]: hunt, where a player controls more pieces and aims to immobilize the opponent; race, where the first to complete a course, with moves controlled by dice or other random elements, wins; sow or mancala, where players sow seeds into specific positions and capture opponent seeds; space, where players place and/or move pieces to achieve a specific pattern, with the possibility of blocks and captures; and war, where the goal is to control territory and immobilize or capture all of the opponent’s pieces.
Figure 3 showcases the winning rate of our agents under the five board game categories. The win ratio is computed as win/(win+loss), not including draws.
UCT\(_{{\sqrt{\text {SH}}}}\) outperforms UCT in sow (+6%), space (+7%), and war (+4%) games. Against the Kilothon agent, UCT\(_{{\sqrt{\text {SH}}}}\) secures win ratios in hunt, race, sow, space, and war of 56%, 53%, 57%, 55%, and 51%, respectively. Sow and space games show the highest variability among agents, with the highest scores of 69% and 63% achieved by UCT\(_{{\sqrt{\text {SH}}}}^{cbt}\). Both categories display significant performance boosts via the cbt method for \(\text {UCT}^{cbt}\) and UCT\(_{{\sqrt{\text {SH}}}}^{cbt}\), both surpassing a 60% win rate. Our evaluations reveal that the UCT\(_{{\sqrt{\text {SH}}}}\) strategy, especially with the cbt method, outperforms baseline UCT. The UCT\(_{{\sqrt{\text {SH}}}}^{cbt}\) agent had the highest score, showcasing its improvement over the baseline.
GGP Subset: Five Board Games. While the Kilothon competition encompassed an extensive variety of games, we also examine a subset within the board game category. To this end, we selected games that were also part of a study conducted by Pepels [9].
We compare UCT\(_{{\sqrt{\text {SH}}}}\) vs. UCT, where both agents had 0.5, 1, and 2 s of thinking time per move, with each experiment running over 1000 matches. Table 2 showcases the results.
Our study reveals the advantages of UCT\(_{{\sqrt{\text {SH}}}}\) over UCT across various game domains. UCT\(_{{\sqrt{\text {SH}}}}\) makes better use of the time budget, as its win rate over UCT increases significantly when the budget increases. In Pentalath, the win rate began at 51.7% and rose to 66.8% as the thinking time increased, while NoGo achieved the highest final win rate of 79.6%. Both Breakthrough and Amazons exhibited a significant increase in win rates, escalating from less than 50% to 66.9% and 70.1%, respectively. The overall result shows that UCT\(_{{\sqrt{\text {SH}}}}\) improves its advantage over UCT as thinking time increases, achieving win rates from 53% up to 71%.
6 Conclusion
This work addresses two drawbacks of UCT, the base method of most general game playing (GGP) agents: (i) UCT’s exploitation factor guarantees asymptotic optimality but performs poorly in terms of simple regret minimization; (ii) a fixed per-move time budget may be an overestimate for long games and an underestimate for short ones.
For (i), we introduce UCT\(_{{\sqrt{\text {SH}}}}\), a new MCTS method, which foregoes the asymptotic optimality in exchange for a timely response. UCT\(_{{\sqrt{\text {SH}}}}\) uses the simulation budget more exploratively than traditional UCT, since during the search time the goal is to find the best possible move and return it to the game. For (ii), we present the Clock Bonus Time (cbt) strategy to better allocate the search time per move, given a fixed time budget to play the entire game.
We use two experiments to empirically evaluate UCT\(_{{\sqrt{\text {SH}}}}\) against UCT. The Prize Box Experiment indicates that UCT\(_{{\sqrt{\text {SH}}}}\) is less sensitive to changes in the reward distribution than \(\text {UCB}_1\) and \(\text {UCB}_{\sqrt{}}\). These latter two have a more spread-out allocation when rewards have minimal variation. However, when there is a lot of variation in the rewards, UCT\(_{{\sqrt{\text {SH}}}}\) tends to favor the best option, although not as consistently as \(\text {UCB}_1\), which almost always selects the top choice.
Although it may appear that constantly selecting the known best option would maximize rewards, Tolpin and Shimony [15] argue differently: policies that promote more exploration at the root level can lead to faster identification of better moves.
In the Kilothon GGP competition, our method exceeded the performance of the baseline UCT. The implementation of the cbt strategy more than doubled the scores of both agents. According to the final competition results, UCT\(_{{\sqrt{\text {SH}}}}^{cbt}\) would have achieved second place in the official Kilothon competition. This achievement is remarkable, considering our agent relies solely on Monte Carlo simulations and does not use any other enhancements or parallelism.
Our study also showed that UCT\(_{{\sqrt{\text {SH}}}}\) uses increased time budgets significantly better than UCT. Moreover, our findings suggest that the 0.5 s of thinking time used by the default UCT Kilothon agent might be too restrictive for agents to play well in many game domains.
Notes
- 3.
In contrast with exploiting policies, that allocate most resources to the most promising choice, non-exploiting policies allocate resources uniformly among choices, iteratively discarding the poorly-performing ones.
References
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2), 235–256 (2002)
Brice, W.C.: Review of “A History of Board-Games Other than Chess” by H.J.R. Murray (Oxford: Clarendon Press, 1952). J. Hellenic Stud. 74, 219 (1954). https://doi.org/10.2307/627627
Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in finitely-armed and continuous-armed bandits. Theoret. Comput. Sci. 412(19), 1832–1852 (2011)
Cazenave, T.: Sequential halving applied to trees. IEEE Trans. Comput. Intell. AI Games 7(1), 102–105 (2014)
Genesereth, M., Love, N., Pell, B.: General game playing: overview of the AAAI competition. AI Mag. 26(2), 62–62 (2005)
Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: International Conference on Machine Learning, pp. 1238–1246. PMLR (2013)
Kocsis, L., Szepesvári, C., Willemson, J.: Improved monte-carlo search. Univ. Tartu, Estonia, Tech. Rep. 1, 1–22 (2006)
Parlett, D.: The Oxford history of board games. Oxford University Press, Oxford (1999)
Pepels, T.: Novel selection methods for monte-carlo tree search. Master’s thesis, Department of Knowledge Engineering, Maastricht University, Maastricht, The Netherlands (2014)
Pepels, T., Cazenave, T., Winands, M.H., Lanctot, M.: Minimizing simple and cumulative regret in monte-carlo tree search. In: Workshop on Computer Games, pp. 1–15. Springer (2014)
Perez Liebana, D., Dieskau, J., Hunermund, M., Mostaghim, S., Lucas, S.: Open loop search for general video game playing. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 337–344 (2015)
Piette, É., Soemers, D.J.N.J., Stephenson, M., Sironi, C.F., Winands, M.H.M., Browne, C.: Ludii - the ludemic general game system. In: Giacomo, G.D., Catala, A., Dilkina, B., Milano, M., Barro, S., Bugarín, A., Lang, J. (eds.) Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020). vol. 325, pp. 411–418. IOS Press (2020)
Świechowski, M., Godlewski, K., Sawicki, B., Mańdziuk, J.: Monte carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, pp. 1–66 (2022)
Świechowski, M., Park, H., Mańdziuk, J., Kim, K.J.: Recent advances in general game playing. Sci. World J. 2015 (2015)
Tolpin, D., Shimony, S.: Mcts based on simple regret. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 26.1, pp. 570–576 (2012)
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Putrich, V.S., Tavares, A.R., Meneguzzi, F. (2023). A Monte Carlo Algorithm for Time-Constrained General Game Playing. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_7
Print ISBN: 978-3-031-45367-0
Online ISBN: 978-3-031-45368-7