A Regret Bound for Non-stationary Multi-Armed Bandits with Fairness Constraints
Shaarad A. R. and Ambedkar Dukkipati
December 24, 2020

Abstract. The multi-armed bandit framework is the most common platform for studying strategies for sequential decision-making problems. Recently, the notion of fairness has attracted a lot of attention in the machine learning community. One can impose the fairness condition that at any given point of time, even during the learning phase, a poorly performing candidate should not be preferred over a better candidate. This fairness constraint is known to be one of the most stringent and has been studied in the stochastic multi-armed bandit framework in a stationary setting, for which regret bounds have been established. The main aim of this paper is to study this problem in a non-stationary setting. We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying stochastic $k$-armed bandit problem. With this we present two results: (i) Fair-UCBe indeed satisfies the above mentioned fairness condition, and (ii) it achieves a regret bound of $O\left(k^{\frac{3}{2}} T^{1 - \frac{\alpha}{2}} \sqrt{\log T}\right)$, for some suitable $\alpha \in (0, 1)$, where $T$ is the time horizon. To the best of our knowledge, this is the first fair algorithm with a sublinear regret bound applicable to non-stationary bandits. We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.

Multi-armed bandits and other related frameworks for studying sequential decision-making problems have been found to be useful in a wide variety of practical applications. For example, bandit formulations have been used in healthcare for modelling treatment allocation (Villar et al., 2015; Durand et al., 2018), studying influence in social networks (Wu et al., 2019; Wen et al., 2017), recommendation systems (Zhou et al., 2017), etc. The present paper deals with incorporating fairness conditions in multi-armed bandit problems where the underlying environment is non-stationary.

How can a bandit algorithm be 'unfair'? In the classic stochastic $k$-armed bandit problem, at each time step, an agent has to choose one out of $k$ arms. When an arm is chosen, the agent receives a real-valued reward sampled from a probability distribution corresponding to the chosen arm. The goal of the agent is to maximize the expected reward obtained over some time horizon $T$. For this, the learning algorithm has to initially try out each arm to get an idea of its corresponding reward distribution. This is referred to as the exploration phase. Once the agent gathers enough information about the reward distribution of each arm, it can then make an informed decision and choose among the arms in such a way as to maximize the rewards obtained. The performance of the agent is usually measured with a notion of regret. This is defined as the expected difference in the rewards obtained if the agent follows the optimal policy of choosing the best arm versus the policy actually followed by the agent. In some socially relevant practical problems of sequential decision making, judging an algorithm solely based on regret may not be enough.
The reason is that regret only provides a picture of expected returns and does not capture how decisions are taken, especially during the initial learning phase. Is it OK to be 'unfair' to a better candidate just because the learning agent or algorithm is still trying to learn? For example, selecting a poorly performing arm a constant number of times does not affect the asymptotic regret achieved by the algorithm, but this behavior would still be unfair to better performing arms, which have been ignored either deliberately or due to carelessness during the learning phase. To deal with such issues, one can enforce well-defined fairness constraints on a learning algorithm in addition to the goal of minimizing regret.

Various definitions of fairness motivated by real-world applications have been studied in the context of stationary multi-armed bandit problems (Joseph et al., 2016; Li et al., 2020). Intuitively, these fairness conditions insist that any algorithm solving the bandit problem should consider all the arms and their rewards and should be 'fair' when selecting the arms. This might be an important requirement in many real-world applications, especially those involving humans. Learning algorithms should be designed such that they do not give rise to decisions that unduly discriminate between different individuals or groups of people. One notion of fairness is what could be considered equality of opportunity, in which arms with similar reward distributions are given a similar opportunity of being chosen at any given time. For instance, with high probability (at least $1 - \delta$), at each time step, an arm can be assigned a higher probability of being chosen than another arm only if the expected reward of the former is strictly greater than that of the latter (Joseph et al., 2016). This notion of δ-fairness is what is considered in this paper when referring to the fairness of the proposed algorithm.

Most standard solutions to stochastic multi-armed bandit problems assume that the rewards are generated independently from the reward distributions of the arms, and that these distributions remain fixed over the entire time horizon $T$. However, in many practical problems, the underlying environment cannot be expected to remain fixed. This leads to multi-armed bandit problems where the reward distributions may change at each time step, and it requires learning algorithms that are able to cope with different kinds of changes in the environment. For example, a bandit algorithm for a recommendation system should be able to handle a change in user preferences over time (Zeng et al., 2016). If there are no statistical assumptions or constraints on the rewards corresponding to any arm at any time step, the problem becomes what is referred to as the adversarial bandit problem (Auer et al., 1995). This problem is difficult to solve under the classic notion of regret, since any information obtained about the reward distribution of an arm at a certain time becomes useless at the next time step, which means that an arm which is optimal at one time step need not be optimal later. However, this setting can still be studied and solved with respect to a weaker notion of regret, in which the regret of an algorithm is measured against a policy that is restricted to choosing a single fixed arm during all time steps.
Since the adversarial setting is too general for obtaining good results with respect to the standard form of regret, other variations of non-stationary bandit problems have been extensively studied, which constrain how the reward distributions of the arms change as time passes. One possible constraint is a bound on the absolute change in the expected rewards of all arms at each time step (Wei and Srivastava, 2018). In this paper, we consider a variant of this slowly varying environment. Since existing fair algorithms assume a stationary environment, their fairness guarantees do not hold when the stationarity assumptions are no longer true. Hence, modifying these algorithms to respect the fairness constraints in a non-stationary environment is non-trivial. In this work, we address the problem of satisfying fairness constraints in a slowly varying non-stationary environment.

Contributions. In the literature, fair bandit algorithms have not been studied in the non-stationary setting. The main contribution of this paper is a fair UCB-like algorithm for solving a non-stationary stochastic multi-armed bandit problem. The environment considered is a slowly varying environment. We prove that the proposed algorithm is δ-fair (the fairness condition considered in (Joseph et al., 2016)) and achieves a regret of order $O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log\left(cT^{1+\frac{\alpha}{4}}\right)}\right)$ for some $\alpha \in (0, 1)$ and constant $c \in \mathbb{R}^+$. As the non-stationarity of the environment is reduced, this regret bound approaches that achieved by a fair algorithm in the stationary setting, up to logarithmic factors.

Consider a bandit with $k$ arms and a time horizon $T$. At time $t \in [T]$, let the reward distribution of arm $i \in [k]$ be $P^t_i$ on $[0, 1]$, with mean $\mu^t_i$. Here, $[T]$ denotes the set $\{1, \ldots, T\}$ and similarly $[k]$ denotes the set $\{1, \ldots, k\}$. Given a history $h^t \in ([k] \times [0, 1])^{t-1}$ of arms chosen and rewards obtained till time $t - 1$, the agent chooses an arm at time $t$ by sampling from the probability distribution $\pi^t_{\cdot|h^t}$, with the probability of choosing arm $i$ being $\pi^t_{i|h^t}$. Let $i_t$ be the arm chosen at time $t$, i.e., $i_t \sim \pi^t_{\cdot|h^t}$. Note that in the stationary case, the reward distribution and the mean remain constant, that is, $P^t_i = P_i$ and $\mu^t_i = \mu_i$ for all $t \in [T]$, $i \in [k]$.

In this paper, we consider multi-armed bandits in a non-stationary setting, and hence we assume that the means of the reward distributions change as time progresses. Our assumption can be stated as follows. We assume that there exists a known parameter $\kappa \in \mathbb{R}^+$ such that for all $t < T$ and all arms $i \in [k]$, $|\mu^{t+1}_i - \mu^t_i| < T^{-\kappa}$, where $\mu^t_i$ and $\mu^{t+1}_i$ are the means of the reward distribution of arm $i$ at times $t$ and $t + 1$ respectively. In other words, $\kappa$ controls how much the mean of the reward distribution of an arm is allowed to change at each time step. It is to be noted that the bound on the change in the mean depends only on the horizon $T$ and not on the current time step $t$.

In this paper, we consider the notion of δ-fairness that was introduced in Joseph et al. (2016). The intuition behind this definition of fairness is that at each time step, with a high probability of $1 - \delta$, arms with similar reward distributions should have a similar chance of being selected. In other words, at any point in time, for any pair of arms, the learning algorithm should give preference to one of the arms over the other only if it is 'reasonably' certain that its expected reward is strictly greater than that of the other. This can be stated as follows.
Definition 1 (δ-Fairness; Joseph et al., 2016). A multi-armed bandit algorithm is said to be δ-fair if, with probability at least $1 - \delta$ over the history of arms chosen and rewards obtained, for all $t \in [T]$ and all pairs of arms $i, j \in [k]$, $\pi^t_{i|h^t} > \pi^t_{j|h^t}$ only if $\mu^t_i > \mu^t_j$, where $\pi^t_{i|h^t}$ and $\pi^t_{j|h^t}$ denote the probabilities assigned by the algorithm to choosing arms $i$ and $j$ respectively at time $t$ given the history $h^t$ of arms chosen and rewards obtained till time $t - 1$, and $\mu^t_i$ and $\mu^t_j$ are the means of the reward distributions of arms $i$ and $j$ respectively at time $t$.

The dynamic regret achieved by a bandit algorithm is defined as $R(T) = \sum_{t=1}^{T}\left(\max_{j \in [k]} \mu^t_j - \mathbb{E}\left[\mu^t_{i_t}\right]\right)$, where $\mu^t_i$ is the mean of the reward distribution of arm $i$ at time $t$, and $i_t \in [k]$ is the arm selected by the algorithm at time $t$. Using the dynamic regret as defined above, the performance of a bandit algorithm is measured by comparing the expected reward of the arm selected at each time step against the expected reward of the optimal arm at that time step, taking into account the fact that the optimal arm changes with time. This is in contrast to the static regret considered in stationary and adversarial settings. In those settings, the performance of a bandit algorithm is measured against a single fixed optimal arm, which is the arm that gives the highest total expected reward over the entire time horizon $T$ when chosen at every single time step. Thus, dynamic regret is a stronger performance criterion than static regret.

Now we present some analysis that leads to our proposed algorithm. For satisfying the fairness constraint given by Definition 1, an arm should be preferentially chosen only if it is known that, with high probability, that arm indeed gives a strictly greater expected reward. To estimate this with high probability, confidence intervals are constructed for each arm, similar to the Upper Confidence Bound (UCB1) algorithm (Auer et al., 2002). However, instead of choosing the arm with the highest upper confidence bound, arms with high estimates of expected rewards are chosen in such a way that fairness is maintained, as described below.

Let $[a^t_i, b^t_i]$ be the confidence interval of arm $i$ at time $t$. Suppose it has been proved that, with probability at least $1 - \delta$, $a^t_i \le \mu^t_i \le b^t_i$ for all $i \in [k]$ and $t \in [T]$. At each time $t$, to minimize the regret, UCB1 deterministically chooses the arm with the highest upper confidence bound $b^t_i$, say $i^* \in [k]$. Fairness demands that $i^*$ be chosen preferentially only if $\mu^t_{i^*} > \mu^t_j$ for all $j \neq i^*$. However, if there exists an arm $j \neq i^*$ such that $b^t_j > a^t_{i^*}$, their confidence intervals overlap. There is then no guarantee that the expected reward of arm $i^*$ is greater than that of $j$, forcing any fair agent to assign both these arms an equal probability of being chosen. Now, with the above constraint, at time $t$, regret minimization requires the agent to choose arms $i^*$ and $j$ with probability 1/2 each and ignore all the other arms. However, this is fair only if all the other arms have expected rewards strictly less than those of $i^*$ and $j$, which is not true if the confidence interval of some other arm $i$ overlaps with either of those of $i^*$ or $j$. Arm $i$ should then also be assigned the same probability of being chosen as $i^*$ and $j$, and the same argument extends to other arms whose confidence intervals overlap with that of $i$, and so on. Let $B_t$ be the set of arms to be chosen at time $t$ with equal nonzero probability, and let arms in $[k] \setminus B_t$ be ignored. For optimality, $B_t$ should contain $i^*$. For fairness, any arm whose confidence interval overlaps with that of any arm in $B_t$ should be added, and this process should be repeated until no other arm can be added to $B_t$.
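To make this construction concrete, the following is a minimal Python sketch (our illustration, not the authors' code; the function and variable names are our own) of how the set of arms chained via overlapping confidence intervals to the arm with the greatest upper confidence bound can be computed.

```python
def active_set(lower, upper):
    """Return the indices of arms whose confidence intervals are chained via
    overlap to the interval with the greatest upper confidence bound.

    lower, upper: per-arm confidence interval endpoints, lower[i] <= upper[i]."""
    k = len(upper)
    # Start from the arm with the highest upper confidence bound.
    best = max(range(k), key=lambda i: upper[i])
    active = {best}
    # Repeatedly add any arm whose interval overlaps an interval already in the set.
    changed = True
    while changed:
        changed = False
        for i in range(k):
            if i in active:
                continue
            if any(upper[i] >= lower[j] and upper[j] >= lower[i] for j in active):
                active.add(i)
                changed = True
    return active
```

For example, with intervals [0.6, 0.9], [0.5, 0.7] and [0.1, 0.3], the first two arms form the active set because their intervals overlap, while the third is excluded since its entire interval lies below both of theirs.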
$B_t$ is called the active set of arms (Joseph et al., 2016), and at each time step an arm is chosen uniformly from the active set. The active set $B_t$ is defined recursively as the smallest set satisfying the following properties: (i) the arm with the greatest upper confidence bound belongs to $B_t$; (ii) any arm whose confidence interval overlaps with the confidence interval of an arm in $B_t$ also belongs to $B_t$. Intuitively, the active set of arms is the set of arms whose confidence intervals are chained via overlap to the confidence interval with the greatest upper confidence bound. For the algorithm to be fair, each arm in the active set should be assigned an equal probability of being selected.

Due to non-stationarity, as time progresses, older samples become less indicative of the current reward distribution. Therefore, at each time step, we choose only the latest $t^{\alpha}/k$ samples of each arm to estimate the expected reward and construct the confidence interval, for a suitably chosen $\alpha \in (0, 1)$. This progressive increase in the number of samples considered is similar to the use of a progressively increasing sliding window by Wei and Srivastava (2018). Now, as time progresses, due to the increased number of samples obtained, the confidence intervals shrink, and the active set becomes small. If the arms that are not in the active set are ignored and not sampled for a long time, due to the non-stationarity of the environment, their expected rewards can change such that they fall into the confidence intervals of arms that are in the active set. So, to ensure that the learning algorithm does not remain oblivious to the reward distributions of inactive arms, we propose, with some fixed probability at each time step, to choose uniformly from all arms. This exploration probability is chosen to be $T^{-\frac{\alpha}{2}}$, for suitable $\alpha \in (0, 1)$ to be specified. Due to this fixed exploration probability at each step, we refer to our proposed algorithm as Fair-UCB with exploration, or Fair-UCBe. The overall steps involved are listed in Algorithm 1. We present two results in this regard. First, we show that the proposed algorithm is indeed δ-fair. Then, we establish an upper bound for the regret.

Algorithm 1: Fair UCB with Exploration (Fair-UCBe)
Given: horizon $T$, arms $[k]$, parameters $\alpha, \epsilon \in (0, 1)$ with $\kappa > 2\alpha + \epsilon$ and $\alpha < 2 - \sqrt{2(\epsilon + 1)}$.
For each time step $t$:
  With probability $T^{-\alpha/2}$ (Explore): sample an arm $i_t$ uniformly from all arms.
  Else (Exploit): Active set $B \leftarrow$ all arms whose confidence intervals, built from the latest $t^{\alpha}/k$ samples, are chained to the interval with the highest upper confidence bound; sample an arm $i_t$ uniformly from $B$.
  Choose arm $i_t$ and append the observed reward $r_t$ to the sample sequence $S_{i_t}$.

Theorem 3.1. The Fair-UCBe algorithm is δ-fair, as defined in Definition 1, for $\delta \ge 2T^{-\frac{\alpha}{2}}$.

Theorem 3.2. The regret $R(T)$ achieved by Fair-UCBe satisfies $R(T) = O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log\left(cT^{1+\frac{\alpha}{4}}\right)}\right)$ for some constant $c \in \mathbb{R}^+$.

Remark. The above regret bound is non-trivial since it guarantees that this fair algorithm achieves sublinear regret even in the context of a non-stationary environment. The choice of the parameters $\alpha$ and $\epsilon$ in the algorithm is constrained by the inequality $\kappa > 2\alpha + \epsilon$, or equivalently $\alpha < \frac{1}{2}(\kappa - \epsilon)$. For large $T$, $\epsilon$ can be chosen close to 0, leading to the constraint $\kappa > 2\alpha$. Thus, when the non-stationarity in the environment is very high and $\kappa \to 0$, we have $\alpha \to 0$ as well. Similarly, when the non-stationarity is very low and the environment is almost stationary, $\kappa$ is large and $\alpha$ can be chosen close to 1. Joseph et al. (2016) showed that in a stationary environment, their δ-fair algorithm FairBandits achieves a regret of $O\left(\sqrt{k^3 T \log(Tk/\delta)}\right)$ in the limiting case of $\delta \to \frac{1}{\sqrt{T}}$. They also showed that FairBandits achieves the best possible performance in that setting. One can see that this bound is equivalent, up to logarithmic factors, to the regret of order $k^{\frac{3}{2}}\sqrt{T}$ achieved by Fair-UCBe in the limiting case of $\alpha \to 1$, which occurs for large $\kappa$ and $T$. In other words, as the non-stationarity of the environment is reduced, the performance of our algorithm remains consistent with the best performance possible in that setting.
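The following Python sketch (ours, not the authors' implementation) puts the pieces of Algorithm 1 together for a single run. It assumes the `active_set` helper sketched earlier and uses a plain Hoeffding-style half-width on the latest $t^{\alpha}/k$ samples; the analysis in Section 5 uses a slightly wider interval that also accounts for drift.

```python
import math
import random

def fair_ucbe(pull, k, T, alpha):
    """One run of a Fair-UCBe-style loop (illustrative sketch).

    pull(i, t) -> reward in [0, 1] from arm i at time t.
    Returns the sequence of chosen arms."""
    samples = [[] for _ in range(k)]  # reward history per arm
    choices = []
    for t in range(1, T + 1):
        if random.random() < T ** (-alpha / 2) or any(len(s) == 0 for s in samples):
            # Explore: choose uniformly among all arms (also covers unsampled arms).
            i_t = random.randrange(k)
        else:
            # Exploit: build intervals from the latest ceil(t^alpha / k) samples.
            window = max(1, math.ceil(t ** alpha / k))
            lower, upper = [], []
            for s in samples:
                recent = s[-window:]
                mean = sum(recent) / len(recent)
                half_width = math.sqrt(math.log(2 * k * t * t) / (2 * len(recent)))
                lower.append(mean - half_width)
                upper.append(mean + half_width)
            i_t = random.choice(sorted(active_set(lower, upper)))
        reward = pull(i_t, t)
        samples[i_t].append(reward)
        choices.append(i_t)
    return choices
```

The confidence half-width above is a generic Hoeffding term chosen for brevity; everything else follows the structure of Algorithm 1, with exploration probability $T^{-\alpha/2}$ and uniform selection from the active set during exploitation.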
When the change in the environment is high (i.e., κ is close to zero), the regret bound for Fair-UCBe is similar to that of SW-UCB# (Sliding Window Upper Confidence Bound) (Wei and Srivastava, 2018), which assumes a non-stationarity constraint similar to ours but does not maintain fairness. The regret bounds are almost the same in terms of $T$ up to logarithmic factors, with the difference being a factor of $T^{\epsilon/4}$, whose exponent goes to zero for large $T$.

4.1. On Exploration. One aspect of our algorithm that distinguishes it from other upper confidence bound algorithms is the incorporation of an explicit fixed probability of exploration, in addition to the implicit exploration present in other similar algorithms. This exploration probability $T^{-\frac{\alpha}{2}}$ depends on the non-stationarity of the environment through the constraint $\alpha < \frac{1}{2}(\kappa - \epsilon)$. Smaller values of $\kappa$ lead to a larger probability of exploration. This is intuitive in the sense that the more the environment varies, the greater the need to sample inactive arms via explicit exploration to keep track of changes in their reward distributions. Thus, there is a smooth trade-off between exploitation and exploration, depending on the degree of non-stationarity of the environment.

4.2. On Sublinearity. Even though the upper bound $O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log(CT)}\right)$ for the regret of the algorithm is seemingly sublinear in $T$ (since the exponent of $T$ is $1 - \frac{\alpha}{2} < 1$), the extra factor of $k^{\frac{3}{2}}$ may actually result in the regret being more than $T$. In order to achieve sublinear regret, ignoring logarithmic factors for simplicity, it is necessary that $k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} = O(T)$, or equivalently $T = \Omega\left(k^{\frac{3}{\alpha}}\right)$. Due to the constraint $\alpha < \frac{1}{2}(\kappa - \epsilon)$, $k^{\frac{3}{\alpha}}$ increases drastically for small values of $\kappa$, necessitating a very large value of $T$ to obtain sublinear regret. In other words, the regret of the algorithm is linear in the context of highly non-stationary environments. As the non-stationarity reduces, $\alpha \to 1$ and the constraint becomes $T = \Omega(k^3)$, which is identical to the constraint for FairBandits (Joseph et al., 2016).

5. Proofs for Theorems 3.1 and 3.2

5.1. Proof of Theorem 3.1. At each time step, all arms in the active set are assigned an equal probability of being chosen, say $p_{\text{act}}$, and all arms not in the active set are assigned an equal probability of being chosen, say $p_{\text{nonact}}$. Now, $p_{\text{nonact}} = \frac{1}{kT^{\alpha/2}}$, since $\frac{1}{T^{\alpha/2}}$ is the probability of exploration and $\frac{1}{k}$ is the probability of choosing a specific arm when exploring. For any arm in the active set, the probability $p_{\text{act}}$ of being chosen is $p_{\text{act}} = \frac{1 - T^{-\alpha/2}}{|B_t|} + \frac{T^{-\alpha/2}}{k}$. From this, we have $p_{\text{act}} \ge p_{\text{nonact}}$. Therefore, due to the definition of the active set, for the algorithm to be δ-fair it is sufficient to prove that, with probability at least $1 - \delta$, the expected rewards of all arms at all time steps fall in their confidence intervals. Now we proceed to prove these results.

At any time step, the probability of any given arm being chosen is at least $\frac{1}{kT^{\alpha/2}}$. So, the expected number of time steps required for obtaining at least one sample from an arm is at most $kT^{\frac{\alpha}{2}}$. This fact can be used to prove the following Lemma.

Lemma 5.1. Let the time interval $[0, T]$ be divided into intervals of size $kT^{\frac{\alpha}{2}+\epsilon}$, for $\epsilon \in (0, 1)$ as specified in Algorithm 1. Let $G$ be the event that each arm has at least one sample in each of these intervals. Then $P(G) \ge 1 - \delta_1$, for $\delta_1 = \frac{1}{T^{\alpha/2}}$.

The constraint on $\epsilon$ can be simplified by the following Lemma. The proofs of both these Lemmas are given in the Appendix.

Lemma 5.2.
$\frac{1}{\log T}\,\log\!\left(\frac{\log T}{2\log(18/11)}\right) \le \frac{1}{2e\log(18/11)}$.

From the above Lemma, $\epsilon = 0.3735$ is sufficient for arbitrary $T$. Moreover, for $T > 15$, the lower bound is a decreasing function of $T$, and thus $\epsilon$ can be chosen much smaller, with the value going to 0 for large $T$.

5.1.2. Sufficiency of samples. At each point $t$ in time, we wish to choose the latest $t^{\alpha}/k$ samples. But these many samples may not be available, especially if $t$ is small. From Lemma 5.1, we see that if $G$ is true, then each sample requires at most $kT^{\frac{\alpha}{2}+\epsilon}$ time steps. So, for the availability of a sufficient number of samples, we need $T^{\frac{\alpha}{2}+\epsilon} t^{\alpha} < t$. Suppose $M = T^{\frac{1}{1-\alpha}\left(\frac{\alpha}{2}+\epsilon\right)}$; then for $t > M$ we have $t > T^{\frac{1}{1-\alpha}\left(\frac{\alpha}{2}+\epsilon\right)}$, which implies $t^{1-\alpha} > T^{\frac{\alpha}{2}+\epsilon}$ and hence $T^{\frac{\alpha}{2}+\epsilon} t^{\alpha} < t$. So, if $G$ is true, after the initial $M$ time steps the number of samples is always sufficient to construct a good confidence interval, provided the exponent of $T$ is sensible. We add the constraint $\frac{1}{1-\alpha}\left(\frac{\alpha}{2}+\epsilon\right) < 1 - \frac{\alpha}{2}$, to be satisfied by the exponent. This will be useful in the regret calculation. This can be simplified to $\epsilon < \frac{(\alpha-2)^2}{2} - 1$, or $\alpha < 2 - \sqrt{2(\epsilon+1)}$, which is the constraint specified in Algorithm 1.

At time $t$, for arm $i$, consider the latest $\tau_i(t)$ rewards obtained, and let those rewards be $X_{i,1}, \ldots, X_{i,\tau_i(t)}$. The Hoeffding inequality gives
$P\left(\left|\frac{1}{\tau_i(t)}\sum_{s=1}^{\tau_i(t)} X_{i,s} - \frac{1}{\tau_i(t)}\sum_{s=1}^{\tau_i(t)} \mathbb{E}[X_{i,s}]\right| \ge A\right) \le 2e^{-2\tau_i(t)A^2}.$
We make use of the following Lemma, whose proof is provided in the Appendix.

Lemma 5.3. Let $\hat{\mu}^t_i$ be the empirical estimate of the expected reward of arm $i$ at time $t$, using the latest $\tau_i(t)$ samples obtained from that arm. Then, if $G$ is true, for a confidence half-width $c_i$ that adds to $A$ the maximum drift the mean can accumulate while those $\tau_i(t)$ samples are being collected, $P\left(|\hat{\mu}^t_i - \mu^t_i| > c_i\right) \le 2e^{-2\tau_i(t)A^2}$.

By letting $A = \sqrt{\frac{1}{2\tau_i(t)}\log\left(\frac{k\pi^2 t^2}{3\delta_2}\right)}$, the above probability (that the true mean is outside the confidence interval), summed over all arms $i$ and times $t \in [T]$, is bounded above by $\sum_{t=1}^{T}\sum_{i=1}^{k} \frac{6\delta_2}{k\pi^2 t^2} \le \delta_2$. Here, we use the fact that $\sum_{t \ge 1} \frac{1}{t^2} \le \frac{\pi^2}{6}$. Since the above analysis holds if $G$ is true, which happens with probability at least $1 - \delta_1$, the above confidence intervals hold with probability at least $1 - \delta_1 - \delta_2$, and the algorithm becomes δ-fair, where $\delta = \delta_1 + \delta_2$.

5.2. Proof of Theorem 3.2. The length $\eta_i(t)$ of the confidence interval of arm $i$ is twice the half-width $c_i$ of Lemma 5.3. The regret at any time step of exploitation is at most $k$ times the size of the largest confidence interval, and at most 1 (and also, with probability less than $\delta$, when any of the means fall outside their confidence intervals, it is bounded by 1). When $G$ is true, for the first $M$ time steps the number of samples may be insufficient and the regret is bounded by 1, while for later time steps the per-step regret of exploitation is at most $k \max_i \eta_i(t)$, so that
$R(T) \le M + \sum_{t=M+1}^{T} k \max_{i \in [k]} \eta_i(t) + T \cdot T^{-\frac{\alpha}{2}} + \delta_1 T,$
where the first two terms are due to exploitation epochs, the third term is due to exploration epochs, and the fourth term corresponds to the event $G^c$. Since $M = T^{\frac{1}{1-\alpha}\left(\frac{\alpha}{2}+\epsilon\right)}$, and since $\alpha < 2 - \sqrt{2(\epsilon+1)}$, or equivalently $\frac{1}{1-\alpha}\left(\frac{\alpha}{2}+\epsilon\right) < 1 - \frac{\alpha}{2}$, the first term is $O\left(T^{1-\frac{\alpha}{2}}\right)$.

Now, consider the second term in the regret bound. Let $C = \pi\sqrt{\frac{k}{3\delta_2}}$. When $G$ is true and $t > M$, $\tau_i(t) = t^{\alpha}/k$, and the term splits into a sampling-error part, coming from $A$, and a drift part, coming from the change of the means while the samples were collected. The drift part is $O\left(kT^{\frac{\alpha}{2}+\epsilon-\kappa+\alpha+1}\right)$, which is $O\left(kT^{1-\frac{\alpha}{2}}\right)$, since $\kappa > 2\alpha + \epsilon$. The sampling-error part is bounded by substituting $\log(Ct) = x$ and then $\sqrt{x} = u$ (so that $\frac{1}{2\sqrt{x}}\,dx = du$, $\sqrt{x}\,dx = 2x\,du$ and $x = u^2$), and applying integration by parts to the functions $u$ and $u e^{u^2}$; this yields a bound of $O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log(CT)}\right)$.

The overall regret bound therefore becomes $R(T) = O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log(CT)}\right)$, for $\kappa > 2\alpha + \epsilon$, $\epsilon \ge \frac{1}{\log T}\log\left(\frac{\log T}{2\log(18/11)}\right)$, and $\alpha < 2 - \sqrt{2(\epsilon+1)}$.

In this section, we present results from applying the proposed algorithm in a simulated environment. We consider a bandit with $k = 10$ arms. The initial expected rewards of all arms are chosen uniformly from $[0.05, 0.95]$.
The rewards at each time step are sampled from a Beta distribution. For non-stationarity, each arm is randomly assigned at the beginning to be drifting either upwards or downwards. An upward-drifting arm is more likely to drift upwards, with some fixed probability 0.8, and vice versa. At each time step, the drift in the expected value of each arm is sampled uniformly from $[0, T^{-\kappa}]$ and added to the expected reward, while also constraining it to remain within the original interval. The results are plotted in Figure 1a as the ratio between the regret $R(t)$ achieved by the algorithm and the derived regret bound.

[Figure 1: (a) Ratio of the cumulative regret $R(t)$ achieved by Fair-UCBe to the derived regret bound. (b) and (c) show the change in the confidence intervals of the two arms under Fair-UCBe and FairBandits respectively, in an environment with two arms, time horizon $T = 10^6$ and $\kappa = 1.0$, where the expected rewards of the arms continuously evolve in opposite directions. It can be observed that, due to lack of exploration, throughout most of the time horizon FairBandits fails to accurately estimate the reward distribution of whichever arm is not in the active set at that point in time and thus does not maintain fairness.]

The apparent linear increase in the regret with time is due to the constant exploration probability at each time step, which is necessary to ensure fairness. This does not contradict the sublinear regret bound, since the bound is on the cumulative expected regret achieved over the entire time horizon and does not constrain how the regret changes with time. It can be seen from Figure 1a that the cumulative regret achieved does not exceed the derived upper bound. The sharp changes in the slopes of some of the lines in the plot correspond to changes in the composition of the active set. Dropping a sub-optimal arm from the active set results in a drastic reduction in the expected regret at each time step of exploitation.

To further illustrate the necessity of a sliding window and exploration to deal with non-stationarity, we consider another experiment with two arms and parameters $T = 10^6$ and $\kappa = 1.0$. In this experiment, we let the first arm start with an expected reward of 0.95 and decrease it by $T^{-\kappa}$, the maximum amount possible, at each time step. Similarly, we let the second arm start with an expected reward of 0.05 and increase it by $T^{-\kappa}$ at each time step. We run Fair-UCBe and FairBandits in this environment and plot the upper and lower confidence bounds of the arms in Figures 1b and 1c respectively. Under Fair-UCBe, we can see the gradual shift of the confidence intervals of both arms as the underlying reward distributions change. In contrast, under FairBandits, we observe that as soon as the first arm becomes known to be better than the second arm, the latter is discarded from the active set, and the algorithm loses track of its reward distribution. Only much later does the estimated mean reward of the first arm become low enough for its confidence interval to overlap with that of the second arm. The algorithm then becomes aware of the change in the reward distribution of the second arm. After a few time steps, the active set again contains only one arm, this time the second arm, and the lack of exploration again leads to a biased estimation of the other arm, as seen from the lack of change in the confidence bounds of the first arm near the end of the horizon. Thus, we see that every aspect of our algorithm is crucial for dealing with a non-stationary environment while being fair to all arms.
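A minimal Python sketch of the drifting environment described above (our illustration with hypothetical names; the drift probability 0.8, the reward range [0.05, 0.95], the per-step drift bound $T^{-\kappa}$ and the Beta reward noise follow the setup in the text, while the variance parameter of the Beta distribution is an assumption):

```python
import random

class DriftingBetaBandit:
    """k-armed bandit whose means drift by at most T^(-kappa) per step."""

    def __init__(self, k, T, kappa, seed=0):
        self.rng = random.Random(seed)
        self.max_drift = T ** (-kappa)
        self.means = [self.rng.uniform(0.05, 0.95) for _ in range(k)]
        # Each arm is randomly assigned an upward or downward drift direction.
        self.up = [self.rng.random() < 0.5 for _ in range(k)]

    def step(self):
        """Drift every arm's mean by a random amount in [0, T^(-kappa)]."""
        for i in range(len(self.means)):
            drift = self.rng.uniform(0.0, self.max_drift)
            # An upward-drifting arm moves up with probability 0.8, and vice versa.
            direction = 1 if (self.rng.random() < 0.8) == self.up[i] else -1
            self.means[i] = min(0.95, max(0.05, self.means[i] + direction * drift))

    def pull(self, i):
        """Sample a reward from a Beta distribution with the current mean."""
        m = self.means[i]
        c = 5.0  # Beta(c*m, c*(1-m)) has mean m; c controls the variance.
        return self.rng.betavariate(c * m, c * (1.0 - m))
```

Coupled with the `fair_ucbe` sketch above through a small wrapper that calls `env.step()` once per time step before `env.pull(i)`, this reproduces the qualitative setup of the experiments described in this section.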
In this work, we considered a specific version of a slowly varying environment to study regret bounds for a fair algorithm solving a non-stationary multi-armed bandit problem. Several variations of non-stationary bandit problems have been extensively studied, which constrain in various ways how the reward distributions of the arms change as time passes. Garivier and Moulines (2011) studied a setting with a constraint on the number of times an arbitrary change in the reward distribution can occur, referred to as an abruptly changing environment. Alternatively, a change in the reward distribution could be allowed at every time instant, but the nature of each change is constrained, leading to a slowly varying or drifting environment like the one considered in this work. The constraint could be on the absolute change in the expected reward at each time step (Wei and Srivastava, 2018), or a stochastic change in the form of a known distribution (Slivkins and Upfal, 2008). Another extensively studied setting imposes a constraint on the total absolute variation, over all time steps, of the expected rewards (Besbes et al., 2019; Russac et al., 2019).

Our non-stationarity assumption is similar to that of Wei and Srivastava (2018). Their algorithm, SW-UCB# (Sliding Window UCB), was designed for dealing with a slowly varying environment in which the change in the expected mean of each arm's reward distribution is assumed to be $O(T^{-\kappa})$, which is a slightly weaker assumption than ours. Its use of an increasing sliding window of samples to estimate the current mean is similar to our algorithm: at each step, instead of considering all the reward samples obtained until then, the only rewards used are those obtained in the last $\lambda t^{\alpha}$ time steps, for suitable values of $\alpha$ and $\lambda$. However, their work differs from ours in terms of fairness, due to the lack of explicit exploration and the deterministic selection of the arm with the highest upper confidence bound at every time step. Another difference is that they use a sliding window over a certain number of time steps, whereas in our method a certain number of the latest samples is used, irrespective of how old those samples are.

A similar algorithmic choice for dealing with non-stationarity is the use of a fixed-size sliding window. Another technique is to discount older rewards when estimating the expected reward of an arm. This ensures that older samples affect the estimation less and reduces the bias in the estimation induced by the environment's non-stationarity. These two approaches have been studied by Garivier and Moulines (2011) for abruptly varying environments. EXP3 (Auer et al., 2002) is an algorithm for solving an adversarial multi-armed bandit problem. Besbes et al. (2019) repurposed this algorithm and showed that, by restarting it every $\Delta$ time steps for some suitable $\Delta$, it can be used to solve a stochastic non-stationary bandit problem and achieves the lower bound on regret achievable in that setting.

The notion of δ-fairness considered in this paper has been studied for classic and contextual bandits in Joseph et al. (2016). Another notion of fairness is to constrain the fraction of times an arm is chosen by a pre-specified lower bound (Li et al., 2020). However, the δ-fairness of Joseph et al. (2016) differs significantly from this notion, since the latter depends on an external lower bound specification that is independent of the reward distributions of the arms themselves.
In this work, we have studied the problem of designing a δ-fair algorithm for a stochastic non-stationary multi-armed bandit problem. Our non-stationarity assumption is that the absolute change in the expected reward of each arm is at most $T^{-\kappa}$ at each time step, for some known $\kappa \in \mathbb{R}^+$. We have shown that the proposed algorithm Fair-UCBe indeed satisfies the δ-fairness condition for $\delta \ge 2T^{-\frac{\alpha}{2}}$. We have also shown that it achieves a regret of $O\left(k^{\frac{3}{2}} T^{1-\frac{\alpha}{2}} \sqrt{\log\left(cT^{1+\frac{\alpha}{4}}\right)}\right)$, for some constant $c \in \mathbb{R}^+$.

References

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem.
Besbes, O., Gur, Y., and Zeevi, A. (2019). Optimal exploration-exploitation in a multi-armed bandit problem with non-stationary rewards.
Durand, A., et al. (2018). Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis.
Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems.
Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. (2016). Fairness in learning: Classic and contextual bandits.
Li, F., Liu, J., and Ji, B. (2020). Combinatorial sleeping bandits with fairness constraints.
Russac, Y., Vernade, C., and Cappé, O. (2019). Weighted linear bandits for non-stationary environments.
Slivkins, A. and Upfal, E. (2008). Adapting to a changing environment: the Brownian restless bandits.
Villar, S. S., Bowden, J., and Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science.
Wei, L. and Srivastava, V. (2018). On abruptly-changing and slowly-varying multiarmed bandit problems.
Wen, Z., Kveton, B., Valko, M., and Vaswani, S. (2017). Online influence maximization under independent cascade model with semi-bandit feedback.
Wu, Q., Li, Z., Wang, H., Chen, W., and Wang, H. (2019). Factorization bandits for online influence maximization.

Appendix A. Proofs of Lemmas 5.1-5.3

For the proof of Lemma 5.1, we need the following result: for $n \ge 1$ and $x \in [0, \frac{1}{2}]$, $\left(1 - \frac{x}{n}\right)^n \le 1 - \frac{7}{9}x$.

Proof. For $n \ge 1$, $f_n(x) = \left(1 - \frac{x}{n}\right)^n$ is a convex function on $[0, \frac{1}{2}]$. For $n = 1$, this is clear since it is linear, and for $n = 2$, the function becomes $\left(\frac{x}{2} - 1\right)^2$, which is clearly a convex quadratic function. For $n > 2$, $f_n'(x) = -\left(1 - \frac{x}{n}\right)^{n-1}$ and $f_n''(x) = \frac{n-1}{n}\left(1 - \frac{x}{n}\right)^{n-2} \ge 0$. Now, consider the sequence $f_n(1/2) = \left(1 - \frac{1}{2n}\right)^n$. Clearly, it is an increasing sequence, and it converges to $1/\sqrt{e}$, which implies $f_n(1/2) \le 1/\sqrt{e}$. So, for $x \in [0, \frac{1}{2}]$, $x = (1 - 2x)\cdot 0 + 2x \cdot \frac{1}{2}$, and by the convexity of $f_n$, $f_n(x) \le (1 - 2x) f_n(0) + 2x f_n(1/2) \le 1 - \left(2 - \frac{2}{\sqrt{e}}\right)x \le 1 - \frac{7}{9}x$, since $2 - \frac{2}{\sqrt{e}} = 0.7869...$ and $\frac{7}{9} = 0.777...$.

A.1. Proof of Lemma 5.1. The probability that more than $N$ time steps are required to obtain a single sample from a given arm is at most $\left(1 - \frac{1}{kT^{\alpha/2}}\right)^N$, the probability that the arm is not chosen in at least $N$ consecutive time steps. This is the probability that, for a specific arm, there is no sample in a single time interval of length $N$. We need to consider this failure probability for all intervals and all arms. So, setting the interval size $N = kT^{\frac{\alpha}{2}+\epsilon}$ and dividing the available failure probability $\delta_1$ into $k$ parts, one for each of the arms, we need the resulting union bound over all arms and all intervals to be at most $\delta_1$; the result above then yields the stated bound.

Proof of Lemma 5.2. The quantity $\frac{1}{\log T}\log\left(\frac{\log T}{2\log(18/11)}\right)$ attains a maximum of $\frac{1}{2e\log(18/11)}$, at $\frac{\log T}{2\log(18/11)} = e$, i.e., $T = e^{2e\log(18/11)}$, which is approximately $14.5669$. Since we considered the maximum value possible, this upper bound is a worst-case bound for the value of $\epsilon$, and for $T \ge 15$ it can be improved considerably, since the function decreases to 0 as $T$ increases.

Proof of Lemma 5.3. For the half-width of the confidence interval to be at least $A$, it is sufficient that the Hoeffding deviation bound holds for the windowed samples. Therefore, for $\hat{\mu}^t_i$, the empirical estimate of the expected reward of arm $i$ at time $t$ using the latest $\tau_i(t)$ samples obtained from that arm, and $c_i$ chosen as in the statement of the Lemma, the claimed bound follows.