Multi-armed Bandit Algorithms on System-on-Chip: Go Frequentist or Bayesian?

Multi-armed bandit (MAB) algorithms identify the best arm among multiple arms via an exploration-exploitation trade-off, without prior knowledge of arm statistics. Their usefulness in wireless radio, IoT, and robotics demands deployment on edge devices, and hence a mapping onto a system-on-chip (SoC) is desired. Theoretically, the Bayesian Thompson Sampling (TS) algorithm offers better performance than the frequentist Upper Confidence Bound (UCB) algorithm. However, TS is not synthesizable because it relies on the Beta function. We address this problem by approximating it via a pseudo-random number generator-based approach and efficiently realize the TS algorithm on the Zynq SoC. In practice, the type of arm distribution (e.g., Bernoulli or Gaussian) is unknown, and hence a single algorithm may not be optimal. We propose a reconfigurable and intelligent MAB (RI-MAB) framework. Here, intelligence enables the identification of the appropriate MAB algorithm for a given environment, and reconfigurability allows on-the-fly switching between algorithms on the SoC. This eliminates the need for a parallel implementation of the algorithms, resulting in large savings in resources and power consumption. We analyze the functional correctness, area, power, and execution time of the proposed and existing architectures for various arm distributions, word lengths, and hardware-software co-design approaches. We demonstrate the superiority of RI-MAB over TS-only and UCB-only architectures.


I. Introduction
Multi-armed bandit (MAB) algorithms are designed to identify the best arm among multiple arms without prior knowledge of the type of arm distribution or its statistics [1]-[5]. They achieve this by optimizing an exploration-exploitation trade-off over a finite horizon (i.e., a number of time slots) [4], [5]. Here, exploration refers to selecting every arm a sufficient number of times to learn its statistics accurately, and exploitation refers to selecting the best arm as often as possible. Their applications include online advertisement selection to increase the number of clicks [6], [7], clinical trials to identify the best drugs [6], [7], news personalization [6], [7], decision making in financial markets [6], [7], and resource selection in wireless networks [8]-[10], the internet of things (IoT) [11]-[14], and robotics [15], [16].
The performance metric for a MAB algorithm is the regret, which is proportional to the number of selections of suboptimal arms and should be as low as possible [1]-[5]. An optimal MAB algorithm guarantees logarithmic regret, which is the best one can achieve. The upper confidence bound (UCB) algorithm [4], Kullback-Leibler UCB (KL-UCB) [2], and Thompson Sampling (TS) [3] are the popular optimal MAB algorithms. The KL-UCB algorithm is computationally complex [2], [17].
This work is supported by the funding received from the core research grant (CRG) awarded to Dr. Sumit J. Darak from DST-SERB, GoI.
In wireless radio, IoT, and robotics applications, MAB algorithms are used for decision-making tasks in the media access control (MAC) layer. With the tight integration of the MAC and physical (PHY) layers, there is an option to accelerate the MAB and other MAC algorithms on hardware such as an ASIC or FPGA instead of using a software implementation on sequential processors. The hardware implementation exploits a parallel architecture, thereby offering lower execution time, i.e., latency. Such applications also demand deployment on edge devices, and hence a mapping of the MAB algorithms onto a system-on-chip (SoC) is desired. Recently, we discussed the implementation of the UCB and KL-UCB algorithms on the heterogeneous Zynq SoC from Xilinx, which consists of a dual-core ARM processor and a 7-series FPGA [17]. In this work, we focus on the TS algorithm, which has not been realized on an SoC yet.
Considering two types of distributions, Bernoulli and Gaussian, the frequentist UCB algorithm remains the same for both, while the KL-UCB algorithm and the Bayesian TS algorithm each have two variants, one for the Bernoulli and one for the Gaussian distribution [2]-[7]. When the arm distribution is known, we can select the appropriate variant, and both KL-UCB and TS then guarantee lower regret than the UCB algorithm [3]. When the arm distribution is unknown, the challenge is to decide the correct variant of the KL-UCB/TS algorithm, and an error in algorithm selection leads to significant degradation in performance. Specifically, the use of the Bernoulli variant of the KL-UCB/TS algorithm for a Gaussian distribution, or vice versa, leads to higher regret than the UCB algorithm [18], [19]. This demands intelligence to select the appropriate algorithm in an unknown environment.
In this paper, we design and implement a reconfigurable and intelligent architecture for MAB algorithms (RI-MAB) that can learn and select an appropriate algorithm in an unknown environment so as to minimize regret. The main contributions of this paper are summarized below:
1) We propose a synthesizable TS algorithm for the Bernoulli distribution (BTS) by approximating the Beta function via a pseudo-random number generator-based approach and map it on the Zynq SoC via hardware-software co-design. The architecture is optimized to reduce the computational complexity without compromising on the regret performance.
2) For an environment with unknown arm distribution, we propose the RI-MAB architecture. Here, intelligence enables the identification of the appropriate algorithm (UCB or BTS) for a given environment, and reconfigurability allows anytime on-the-fly switching between the UCB and BTS algorithms via dynamic partial reconfiguration (DPR). The DPR on the FPGA eliminates the need for a parallel implementation of the algorithms, resulting in large savings in resources and power consumption.
3) The functional correctness, resource requirement, power consumption, and execution time of the proposed BTS and RI-MAB architectures are analyzed for various arm distributions, word lengths, and hardware-software co-design approaches.
4) We also demonstrate the superiority of the proposed RI-MAB architecture over BTS-only and UCB-only architectures.
The rest of the paper is organized as follows. The MAB problem setup and a review of the relevant works are presented in Section II, followed by the synthesizable TS algorithm in Section III. The improved version of the TS algorithm and its architecture on the SoC are presented in Section IV. Section V provides an in-depth performance analysis and a comparison with the UCB algorithm. The RI-MAB algorithm and its architecture are discussed in Section VI, followed by its performance analysis in Section VII. Section VIII concludes the paper. Please refer to [20] for the source code and a tutorial to reproduce the results presented in this paper.
arXiv:2106.02855v1 [eess.SY] 5 Jun 2021

II. MAB Algorithms and State-of-the-art Review
In this section, we discuss the MAB setup, review state-of-the-art MAB algorithms, and discuss their feasibility on the SoC.
In the MAB setup, each experiment consists of a horizon of $N$ sequential slots, indexed by $n \in \{1, 2, \ldots, N\}$, and the aim is to select the optimal arm from $K$ arms, indexed by $k \in \{1, 2, \ldots, K\}$, as often as possible. Let $I_n$ denote the arm selected in slot $n$, and $R_n$ the reward received from the selected arm $I_n$ in slot $n$. The reward of an arm $k$ is generated from a distribution with mean $\mu_k$. The mean rewards are unknown and the performance metric, regret, is given as [2]-[7]
$$\mathcal{R} = \sum_{k=1}^{K} (\mu^* - \mu_k)\, \mathbb{E}[T_k] \qquad (1)$$
where $\mu^*$ is the mean reward of an optimal arm and $T_k$ is the number of times arm $k$ is selected in an experiment of horizon size $N$. Note that the distribution of arm rewards is unknown but fixed over a horizon. In this paper, we focus on the Bernoulli distribution (reward $R_n$ is either 0 or 1) and the Gaussian distribution (reward $R_n$ is between 0 and 1), though the discussion can be extended to Exponential and Poisson distributions.
As discussed in Section I, the UCB, KL-UCB, and TS algorithms are the popular regret-minimization MAB algorithms with logarithmic regret guarantees. In the case of the UCB and KL-UCB algorithms, the first phase is initialization, where each arm is selected once in the first $K$ time slots. Thereafter, in each slot, a quality factor (QF), $Q(k, n)$, is calculated for each arm. For UCB, the QF, $Q_u(k, n)$, is given by [4]
$$Q_u(k, n) = \frac{X(k, n)}{T(k, n)} + \sqrt{\frac{\alpha \log(n)}{T(k, n)}} \qquad (2)$$
where
$$X(k, n) = X(k, n-1) + R_{n-1}\, \mathbb{1}_{\{I_{n-1} = k\}} \qquad (3)$$
$$T(k, n) = T(k, n-1) + \mathbb{1}_{\{I_{n-1} = k\}} \qquad (4)$$
and $\mathbb{1}_{cond}$ is an indicator function that is equal to 1 (or 0) if the condition $cond$ is TRUE (or FALSE). The parameter $X(k, n)$ is the total reward received from arm $k$, which has been selected $T(k, n)$ times in a total of $n$ time slots. The parameter $\alpha$ is the exploration factor that quantifies how aggressively the UCB algorithm explores all arms; theoretically, it lies between 0.5 and 2. Then, the arm with the highest QF is selected and it is denoted by $I_n$:
$$I_n = \arg\max_{k} Q(k, n) \qquad (5)$$
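The UCB steps described above (initialization, QF calculation, arm selection, and parameter update) can be sketched in a few lines of Python. This is a minimal software illustration under stated assumptions, not the SoC implementation: the function names, the Bernoulli reward simulator, and the common form of the exploration term, sqrt(α log(n) / T(k, n)), are choices made for this sketch.

```python
import math
import random


def ucb_quality_factors(X, T, n, alpha=2.0):
    """Return the UCB quality factor of every arm in slot n.

    X[k] is the total reward collected from arm k and T[k] is the number
    of times arm k has been played in the slots before n.
    """
    return [X[k] / T[k] + math.sqrt(alpha * math.log(n) / T[k])
            for k in range(len(T))]


def run_ucb(means, horizon, alpha=2.0, seed=0):
    """Play Bernoulli arms with the given means; return the play counts."""
    rng = random.Random(seed)
    K = len(means)
    X = [0.0] * K
    T = [0] * K
    for n in range(1, horizon + 1):
        if n <= K:
            arm = n - 1  # initialization: play each arm once
        else:
            q = ucb_quality_factors(X, T, n, alpha)
            arm = max(range(K), key=lambda k: q[k])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        X[arm] += reward  # total reward of the played arm
        T[arm] += 1       # play count of the played arm
    return T


def regret(means, T):
    """Mean-reward gap to the best arm, weighted by the play counts."""
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, T))
```

For example, `run_ucb([0.2, 0.5, 0.8], 2000)` returns one play count per arm, and `regret` turns those counts into the regret of that single experiment; averaging over many seeds would approximate the expectation.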
After the arm $I_n$ is played, its parameters, $T$ and $X$, are updated using the received reward, $R_n$, as shown in Eq. 3 and Eq. 4. The KL-UCB algorithm is similar to UCB except for the calculation of the QF, which is denoted as $Q_{kl}(k, n)$ [2]. As shown in Eq. 6, this QF is computationally complex due to the underlying optimization function [2], [17]. The term $d(p, q)$ in Eq. 8 denotes the KL divergence between $p$ and $q$. The TS algorithm does not need an initialization phase and it uses the Beta function for QF calculation. It is discussed later in Section III.
All these algorithms have been extended to various other settings. The multi-play setting is the same as above except that the aim is to identify the best L arms instead of one arm [21], [22]. In a time-limited pure exploration setting, the aim is to identify the best arm within a given number of time slots, and the regret incurred during these slots is not considered, i.e., it is a pure exploration phase [23]-[25]. In a confidence-driven pure exploration setting, the aim is to identify the best arm with a given confidence and in as few time slots as possible [23]-[25]. In the delayed and complex setting, the reward of the arm selected in time slot n is delayed by a few time slots and the delay is not deterministic [26]-[28]. Also, the received reward might be a function of the arms selected in multiple time slots instead of a separate reward for each slot [26]-[28]. Existing works mainly focus on the design and performance analysis of these algorithms, while the focus of this work is on the efficient mapping of MAB algorithms on the SoC. Since all these extensions are based on the UCB/KL-UCB/TS algorithms, an efficient implementation of these three algorithms is the first and an important step towards the realization of all other algorithms on the SoC.
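To make the KL divergence term and the cost of the KL-UCB QF more concrete, the following Python sketch evaluates the Bernoulli KL divergence d(p, q) and approximates the KL-UCB index with a fixed number of bisection steps. It assumes the commonly used index definition (the largest q whose divergence from the empirical mean, scaled by the play count, stays within log(n) + c log(log(n))). The bisection, the parameter names, and the default values are assumptions of this sketch and are not taken from the authors' implementation.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(mean, plays, n, c=0.0, iterations=20):
    """Approximate the largest q >= mean with plays * d(mean, q) <= budget."""
    log_n = math.log(n)
    budget = log_n + c * math.log(max(log_n, 1.0))
    low, high = mean, 1.0
    for _ in range(iterations):  # fixed iteration count instead of a solver
        mid = 0.5 * (low + high)
        if plays * kl_bernoulli(mean, mid) <= budget:
            low = mid   # mid is still inside the confidence region
        else:
            high = mid  # mid is outside, shrink from above
    return low
```

Each index evaluation needs several logarithms and divisions per iteration, which illustrates why this QF is far more expensive than the single square root used by UCB.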
In [29], we discussed the mapping of the UCB algorithm and its extensions on Zynq SoC via a hardware-software codesign approach. In [17] we proposed the modified KL-UCB algorithm by replacing optimization function in Eq. 6 with finite-iteration based synthesizable function. Though KL-UCB offers lower regret, the resource, latency, and power consumption of the KL-UCB is high compared to the UCB. To reduce the latency and power consumption without compromising on the regret performance, we proposed reconfigurable KL-UCB architecture that enables on-the-fly switch from KL-UCB to light-weight UCB after initial exploration [17]. In the proposed architecture, UCB QF calculation is accomplished using the KL-UCB QF blocks and hence, parallel implementation of two architectures is not needed. Since the TS is the most popular MAB algorithm, efficient mapping of the TS on SoC and performance analysis for different word-length is an important research problem. Furthermore, an intelligence to identify the appropriate algorithm in an unknown environment is critical to get optimal regret performance. The work presented in this paper offers innovative solutions to these challenges.

III. Synthesizable Thompson Sampling Algorithm for Bernoulli Distribution (SBTS)
The frequentist modeling-based UCB and KL-UCB algorithms assume that the mean reward of an arm is proportional to the average reward in repeated plays of a given experiment [2]-[7]. On the other hand, the Bayesian modeling-based TS algorithm assumes that the mean reward of an arm is proportional to a degree of belief that the arm is optimal [3]. These beliefs are updated based on the observations from the environment via Bayes' rule, which takes a prior belief as an argument and returns a posterior belief for a given likelihood. Since the arm statistics are unknown, the uncertainty about arm optimality is modeled as probabilities, and the arm with the highest probability of being optimal under the posterior distribution is selected [3].
In the MAB setup, the posterior belief becomes the prior in subsequent time slots, and the distributions which exhibit such behavior are known as conjugate priors. For example, the Beta distribution is a conjugate prior for the Bernoulli likelihood function [3]. Similarly, the Gamma and Pareto distributions are conjugate priors for the Poisson and uniform likelihoods, respectively. Thus, the Bayesian approach needs the prior beliefs to be specified explicitly upfront in the form of a distribution, and hence each likelihood distribution has a specific variant of the TS algorithm. None of these TS variants has been realized on the SoC yet, and the proposed work on the implementation of the BTS algorithm on the SoC is the first contribution in this direction.
The mapping of the BTS algorithm on the SoC consists of three steps: 1) parameter update (Eq. 3 and Eq. 4), 2) QF calculation, and 3) arm selection (Eq. 5). Since steps 1 and 3 are identical to those of the UCB and KL-UCB algorithms, we request readers to refer to [17], [29] for in-depth understanding and implementation. Due to page constraints, the discussion is focused only on Step 2, the QF calculation, though our implementation and tutorials include all three steps. In the BTS algorithm, the QF for each arm, denoted by $Q_{ts}(k, n)$, is calculated by drawing a random sample from the Beta distribution with parameters $\alpha = X(k, n)$ and $\beta = T(k, n) - X(k, n)$. For the Bernoulli distribution, $\alpha$ refers to the number of successes and $\beta$ refers to the number of failures. The probability distribution function (PDF), $y_{beta}$, of the Beta distribution is given by [3]
$$y_{beta} = \frac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}\, I_{[0,1]}(x) \qquad (9)$$
where $B(\cdot)$ is the Beta function. The indicator function $I_{[0,1]}(x)$ ensures that only values of $x \in (0, 1)$ have nonzero probability. The QF calculation of the BTS algorithm involves two steps:
1) Integration of the PDF, $y_{beta}$, given in Eq. 9, over $x$ to generate the cumulative distribution function (CDF), $F(x|\alpha, \beta)$, and computation of its inverse, $F^{-1}(x|\alpha, \beta)$.
2) Generation of a uniformly distributed random number $x$ and its substitution into the inverse CDF. The value obtained is the random number sampled from the Beta distribution and it is considered as the QF of the arm.
The implementation of the above steps is computationally intensive and not well suited for hardware implementation due to the need for gamma random generators followed by the division of random numbers. Please refer to the built-in Matlab function betarnd for more details. In the proposed approach, we approximate the betarnd function using an alternative synthesizable function. In each time slot $n$, we generate $T(k, n)$ uniform random numbers for each arm $k$. These random numbers are sorted in ascending order and the $X(k, n)$-th random number in the sorted array is taken as the QF of arm $k$. This approach needs the generation of random numbers in hardware, and we implement the popular Mersenne Twister pseudo-random number generator (PRNG) due to its high throughput [30]-[32]. We refer to this as the synthesizable BTS (SBTS) algorithm.
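The sorting-based QF can be illustrated with a short Python sketch. It relies on the standard order-statistic result that the x-th smallest of t independent uniform numbers follows a Beta(x, t - x + 1) distribution, which is close to, but not identical to, the Beta(X, T - X) parameterization named in the text. The function names and the chosen values of x and t are assumptions of this sketch; Python's `random.betavariate` stands in for Matlab's betarnd, and Python's `random` module itself uses a Mersenne Twister generator.

```python
import random


def sbts_quality_factor(x, t, rng):
    """Draw t uniform numbers, sort them, and return the x-th smallest."""
    samples = sorted(rng.random() for _ in range(t))
    return samples[x - 1]  # x is 1-based, as in the text


def mean_of(draw, trials):
    """Average of `trials` independent calls to draw()."""
    return sum(draw() for _ in range(trials)) / trials


rng = random.Random(1)
x, t = 3, 10  # illustrative: 3 successes in 10 plays
sorted_mean = mean_of(lambda: sbts_quality_factor(x, t, rng), 20000)
beta_mean = mean_of(lambda: rng.betavariate(x, t - x + 1), 20000)
print(round(sorted_mean, 2), round(beta_mean, 2))
```

Both averages settle near x / (t + 1), which is the mean of Beta(x, t - x + 1), so the sort-and-pick step reproduces a Beta-distributed sample without gamma generators or divisions.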
In Fig. 1, the functionality of the proposed SBTS algorithm is verified by comparing its regret performance with that of the BTS algorithm realized using the betarnd function. We consider K = 6, N = 10000, and 150 experiments. In each experiment, the arm statistics are chosen randomly, and the plots in Fig. 1 show the cumulative regret averaged over the 150 experiments together with its standard deviation (shown as a shaded region). We observed that the proposed approach selected the best arm 9436 times on average (≈ 94%) compared to 9445 times (≈ 94%) for the betarnd-based BTS algorithm. The regret of both algorithms is nearly identical, validating the correctness of the proposed QF calculation approach.

IV. Improved SBTS Algorithm For Efficient Mapping on SoC
The proposed SBTS QF calculation approach is synthesizable on the SoC but it suffers from two drawbacks:
1) In each time slot, a large number of random numbers need to be generated. For K arms, we need to generate $\sum_{k=1}^{K} T(k, n) = n$ random numbers in each time slot n. This is followed by K sorting operations, one for each arm. In the worst case (when n = N, i.e., at the end of the horizon), we need to generate and sort N random numbers in a slot. Thus, the time required to generate random numbers is not fixed and increases with n. This is not desirable for most applications.
2) Even if each random number is represented using fewer bits, say 8 bits, we need a total storage of N bytes (for example, 10 Kbytes when N = 10000), which is not feasible given the cost and area constraints of the majority of embedded applications.
Ideally, the MAB execution time should be fixed and as small as possible. The time taken by the MAB algorithm to select the arm affects the time available for subsequent tasks. For instance, in wireless radio, communication is time-slotted, which means that arm (channel) selection is followed by transmission in each time slot. The higher the time taken for channel selection, the lower the time available for actual data communication, resulting in lower throughput. In the SBTS algorithm, the time required to calculate the QF for all arms increases with time due to the increase in the number of random numbers and the subsequent sorting tasks. Please refer to Section for the detailed resource requirement and execution time comparison. To overcome the above drawbacks of the SBTS algorithm, we present further improvements that simplify the sorting operation and minimize the number of random numbers generated in each slot.

A. SBTS-ES: SBTS Algorithm with Efficient Sorting
In the MAB setup with K arms, the SBTS algorithm involves sorting of K arrays consisting of random numbers between 0 and 1. For arm k, array size is T (k, n) with its maximum value as N. After sorting, random value at the X(k, n) th index of the sorted array is considered as QF of the arm. For accurate QF calculation, floating-point representation of random numbers is preferred which results in computationally complex sorting operation.
In the proposed SBTS-ES algorithm, the sorting operation is simplified by grouping the random numbers into pre-defined ranges and keeping track of the number of random numbers generated in each range. For illustration, consider an array, $\beta_k$, of size 10 such that $\beta_k(1)$ is the number of random numbers, out of $T(k, n)$, that lie between 0 and 0.1, $\beta_k(2)$ is the number that lie between 0.1 and 0.2, and so on, up to $\beta_k(10)$, which is the number that lie between 0.9 and 1. For the $k$-th arm with $T(k, n) = 4$ and $X(k, n) = 2$, four random numbers are generated in time slot $n$. Let us assume these random numbers are {0.342, 0.012, 0.753, 0.553}. Then, $\beta_k$ = {1, 0, 0, 1, 0, 1, 0, 1, 0, 0}. The $X(k, n)$-th smallest random number falls in the $\beta_k(4)$ range and hence the QF of the arm is equal to the midpoint of the $\beta_k(4)$ range, i.e., $(0.3 + 0.4)/2 = 0.35$. For the random numbers {0.342, 0.012, 0.083, 0.553}, we have $\beta_k$ = {2, 0, 0, 1, 0, 1, 0, 0, 0, 0} and hence the QF of the arm is equal to the midpoint of the $\beta_k(1)$ range, i.e., $(0 + 0.1)/2 = 0.05$.
In Fig. 2, we consider K = 3 arms. In the first K time slots, each arm is selected once. In time slot 4, $T(k, n)$ random numbers are generated for the $k$-th arm; since each arm has been played once, one random number is generated separately for each arm. After updating the respective $\beta_k$, the QF is calculated for each arm and the third arm is selected due to its highest QF. As shown in Fig. 2, the received reward is 0 in time slot 4 and hence only $T(3, 5)$ is updated, i.e., $T(3, 5) = T(3, 4) + 1$. The rest of the parameters do not change. Note that $\beta_k$ of all arms is initialized to zero at the beginning of each time slot. In time slot 5, one random number is generated for each of the first two arms and two random numbers are generated for the third arm. The same process of QF calculation, arm selection, and parameter update is repeated in each time slot till the end of the horizon. The advantages of the proposed SBTS-ES are:
1) Instead of $\sum_{k=1}^{K} T(k, n)$, i.e., at most N, floating-point random numbers, the storage of only $|\beta| K$ integer numbers with a word length of $\log_2 N$ is needed. Here, $|\beta| = |\beta_1| = |\beta_2| = \ldots = |\beta_K|$. SBTS needs $32KN$ bits while SBTS-ES needs $|\beta| K \log_2(N + 1)$ bits. For N ≥ 100 and 2 ≤ |β| ≤ 100, the SBTS memory requirement is at least 3 kilobytes (KB) higher than that of SBTS-ES. For N > 1000 and N > 2000, the difference is at least 50 KB and 80 KB, respectively.
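The binning idea of SBTS-ES can be reproduced with a minimal Python sketch that uses the two sample sets from the illustration in this subsection. The function names are hypothetical, and the sketch uses 0-based list indices, so `counts[3]` corresponds to the fourth range (0.3 to 0.4).

```python
def bin_counts(samples, num_bins):
    """Count how many samples in [0, 1) fall into each equal-width bin."""
    counts = [0] * num_bins
    for p in samples:
        index = min(int(p * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


def qf_from_bins(counts, x):
    """Return the midpoint of the bin holding the x-th smallest sample."""
    num_bins = len(counts)
    seen = 0
    for index, count in enumerate(counts):
        seen += count  # samples at or below this bin
        if seen >= x:
            return (index + 0.5) / num_bins
    raise ValueError("x exceeds the number of samples")


first = bin_counts([0.342, 0.012, 0.753, 0.553], 10)
second = bin_counts([0.342, 0.012, 0.083, 0.553], 10)
print(first, qf_from_bins(first, 2))    # [1, 0, 0, 1, 0, 1, 0, 1, 0, 0] 0.35
print(second, qf_from_bins(second, 2))  # [2, 0, 0, 1, 0, 1, 0, 0, 0, 0] 0.05
```

A running sum over the bin counts replaces the floating-point sort, so finding the QF takes at most |β| integer additions and comparisons per arm.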

B. SBTS-ESSR: SBTS-ES Algorithm with Single Random Number Sample
In the SBTS-ES algorithm, a total of $\sum_{k=1}^{K} T(k, n)$ floating-point random numbers need to be generated in each time slot. This means that a total of N random numbers will be generated in the last time slot of the horizon. Generating such a large number of random numbers is time-consuming, memory-intensive, and inefficient. Since only one arm is selected in each time slot, the parameter $T(k, n)$ of all arms except the selected arm remains unchanged. Though $T(k, n)$ random numbers are needed to calculate the QF of an arm $k$, the SBTS and SBTS-ES algorithms discard the previously generated random numbers. This is inefficient since, instead of generating all random numbers in each slot, we can reuse the random numbers from previous slots. Furthermore, separate random number generators for each arm can be avoided.
In the proposed SBTS-ESSR approach, $\beta$ is not re-initialized at the beginning of each slot. In each time slot, a single random number is generated, followed by an update of the parameter $\beta$ for all arms. To incorporate the new random number, we discard one of the entries in $\beta$ and replace it with the newly generated random number. For illustration, consider two arms with $\beta_1$ = {0, 1, 0, 0, 1, 1, 0, 0, 1, 1} and $\beta_2$ = {0, 1, 0, 1, 1, 0, 0, 1, 0, 0} in time slot 9. Then, $T(1, 9) = 5$ and $T(2, 9) = 4$. Assume $X(1, 9) = 2$ and $X(2, 9) = 3$. In the SBTS algorithm, 5 random numbers are generated for arm 1, followed by sorting and selection of the $X(1, 9)$-th random number as the QF of the arm. The same process is repeated for arm 2 with 4 random numbers. In the SBTS-ES algorithm, the existing $\beta$ is discarded and 9 random numbers are generated. The parameters $\beta_1$ and $\beta_2$ are updated using these random numbers, followed by QF calculation as discussed in Section. In the SBTS-ESSR algorithm, the first step is to randomly remove one sample from each of $\beta_1$ and $\beta_2$ instead of discarding them completely. Then, a single random number is generated, which is used to update $\beta_1$ and $\beta_2$. After that, the QF is selected using the same approach as in the SBTS-ES algorithm. It is important to note that SBTS-ESSR is functionally equivalent to SBTS-ES: SBTS-ES generates $T(k, n)$ random numbers in a single time slot, while SBTS-ESSR uses $(T(k, n) - 1)$ random numbers generated in the previous time slots and generates only one random number in the current time slot. Compared to SBTS-ES, SBTS-ESSR reduces the number of random numbers generated in each time slot, as well as the number of comparisons, by a factor of N, i.e., from $2KN|\beta|$ to $2K|\beta|$. Furthermore, the execution time of the SBTS-ESSR algorithm is the same in each time slot, whereas in the SBTS and SBTS-ES algorithms the execution time of a slot increases with the index of the time slot.
The SBTS-ESSR algorithm is given in Algorithm 1. In the beginning, all elements of X and T are initialized to 1 assuming an initial uniform prior i.e. all arms have equal probability of being optimal. For clarity of notations, the subscript n is removed in X and T . In each time slot of the horizon, the QF of each arm is calculated (Line 2) and the arm with the highest QF is selected (Line 3). The selected arm is played and the algorithm receives the reward from the environment (Line 4). Based on the reward, parameters X and T are updated (Line 5).
The QF generation is explained using Subroutine 1. For a given L = |β| and K, β is a matrix where each column belongs to one arm. In the first time slot, β is initialized in the same way as T (Lines 1-3). Otherwise, β is updated for the arm selected in the previous time slot (Lines 4-6) by generating a single random number. Then, one sample is removed from each column of β, and the corresponding row index is selected randomly (Lines 9 -10). Next, a new random number is generated and β of all arms is updated (Lines 11-13). Using updated β and parameter X, the QF is calculated for each arm (Lines 14-15).
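The flow of the SBTS-ESSR algorithm and its QF subroutine can be sketched in Python as follows. This is an illustrative software model under several assumptions that are not stated in the text: the class and method names are hypothetical, the sample to be removed is chosen uniformly among the stored samples of an arm, a fresh random number is drawn per arm inside the loop, and the extra sample for the previously played arm is added in `update` rather than at the start of the next QF call.

```python
import random


class SbtsEssr:
    """Bin-count state for K arms with L bins per arm (hypothetical helper)."""

    def __init__(self, num_arms, num_bins, seed=0):
        self.rng = random.Random(seed)
        self.L = num_bins
        self.X = [1] * num_arms  # successes, uniform prior as in the text
        self.T = [1] * num_arms  # plays
        # beta[k][b] = number of stored samples of arm k that fall in bin b
        self.beta = [[0] * num_bins for _ in range(num_arms)]
        for k in range(num_arms):
            self._add_sample(k)  # one stored sample per arm, matching T = 1

    def _add_sample(self, k):
        p = self.rng.random()
        self.beta[k][min(int(p * self.L), self.L - 1)] += 1

    def _remove_sample(self, k):
        # Assumption: drop one stored sample chosen uniformly at random.
        target = self.rng.randrange(sum(self.beta[k]))
        for b in range(self.L):
            if target < self.beta[k][b]:
                self.beta[k][b] -= 1
                return
            target -= self.beta[k][b]

    def quality_factors(self):
        qf = []
        for k in range(len(self.T)):
            self._remove_sample(k)  # swap one old sample ...
            self._add_sample(k)     # ... for one new sample
            seen = 0
            for b in range(self.L):
                seen += self.beta[k][b]
                if seen >= self.X[k]:  # bin of the X-th smallest sample
                    qf.append((b + 0.5) / self.L)
                    break
        return qf

    def update(self, arm, reward):
        self.X[arm] += reward
        self.T[arm] += 1
        self._add_sample(arm)  # keep the column sum equal to T[arm]


def run(means, horizon, num_bins=20, seed=0):
    """Simulate Bernoulli arms and return the agent after `horizon` slots."""
    env = random.Random(seed + 1)
    agent = SbtsEssr(len(means), num_bins, seed)
    for _ in range(horizon):
        qf = agent.quality_factors()
        arm = max(range(len(means)), key=lambda k: qf[k])
        reward = 1 if env.random() < means[arm] else 0
        agent.update(arm, reward)
    return agent
```

The work per slot is bounded by the number of arms times the number of bins, independent of the slot index, which is the property that makes the execution time constant across the horizon.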
The environment generates the reward in each slot based on the reward distribution of the selected arm.

Algorithm 1: SBTS-ESSR
1: for n = 1, 2, ..., N do
2: Calculate $Q_{ts}(:, n)$ = QF_SBTS_ESSR($X$, $T$, $K$, $n$, $I_{n-1}$)
3: Select and play arm, $I_n = \arg\max_k Q_{ts}(:, n)$.
4: Receive the reward $R_n$ from the environment.
5: Update X and T: $X(I_n) = X(I_n) + R_n$, $T(I_n) = T(I_n) + 1$
6: end for
7: Calculate regret using Eq. 1.

Subroutine 1: QF_SBTS_ESSR
5: Generate a random number, $p$, between 0 and 1.
6: Update $\beta(\beta_{index}, I_{n-1}) = \beta(\beta_{index}, I_{n-1}) + 1$.
7: end if
8: for k = 1, 2, ..., K do
9: Generate an integer random number, $s$, between 0 and $L$.
11: Generate a random number, $p$, between 0 and 1.

In Fig. 3, we repeat the experiments similar to Fig. 1 and compare the regret of the three proposed algorithms. As expected, the regret of SBTS-ESSR is the highest, followed by SBTS-ES and SBTS. However, the difference in the regret is less than 12 for a horizon size of N = 10000. On average, the best arm was selected 9421 (94.2%), 9346 (93.5%), and 9271 (92.7%) times by SBTS, SBTS-ES, and SBTS-ESSR, respectively. These results validate the functional correctness of the proposed algorithms in the MAB setup, i.e., their ability to identify and select the best arm as many times as possible.
The proposed algorithms are mapped on the Zynq SoC (ZSoC) platform and the corresponding architecture is shown in Fig. 4. The architecture is designed and implemented using Vivado 2019.1. We have explored various other configurations via hardware-software co-design. Also, the word length (WL) of the various blocks realized in the FPGA is carefully chosen so as to optimize the resource utilization and power consumption without compromising on the regret performance. The corresponding results are presented in Section V. The section of the proposed architecture realized on the FPGA is made reconfigurable via DPR. Specifically, the number of active arms, K, and |β| can be dynamically configured via the processor configuration access port (PCAP) using the partial bitstreams pre-loaded in the SD card [33], [34]. The required bitstreams are sent to the FPGA, through the bare-metal application deployed on the ARM processor, for reconfiguration using the device configuration (DevC) direct memory access (DMA). Please refer to [20] for the source code and a tutorial explaining the building blocks of the proposed architectures.

V. Performance Analysis: SBTS Algorithms
In this section, we verify the functional correctness of the proposed SBTS algorithms on the Zynq SoC and compare their regret performance with that of the state-of-the-art UCB algorithm for different WLs. The rewards are assumed to have a Bernoulli distribution. All results are obtained after averaging over 100 different experiments to account for the non-deterministic nature of online machine learning algorithms. Later, the resource utilization, power consumption, and execution time of these algorithms are analyzed. MAB algorithms such as KL-UCB, UCB_v, and UCB_t are not considered since UCB offers regret which is close to that of these algorithms with significant savings in resources, power consumption, and execution time [17], [29].

A. Regret Comparison
Similar to Fig. 1 and Fig. 3, we repeat the experiments for K = 4 and K = 8 for the algorithms realized on the ZSoC with single-precision floating-point (SP-FP) WL. In Fig. 5, we consider four algorithms: 1) UCB, 2) SBTS, 3) SBTS-ES (|β| = {10, 20}), and 4) SBTS-ESSR (|β| = {10, 20}). It can be observed that the proposed SBTS algorithms offer significantly lower regret than UCB. This is expected since the TS algorithm has been shown to outperform the UCB algorithm in analytical and simulation results. It can also be observed that the regret of the SBTS-ES and SBTS-ESSR algorithms decreases with an increase in |β|. This is because a higher |β| allows a more accurate calculation of the QF, leading to a reduction in the selection of sub-optimal arms. The optimal arm in each case is highlighted in bold font. Compared to $\mu_1$ and $\mu_3$, the difference between the arm statistics is small in $\mu_2$ and $\mu_4$. This makes the learning and identification of the optimal arm challenging. As shown in Fig. 6, the regret of the SBTS, SBTS-ES, and SBTS-ESSR algorithms is lower than that of the UCB algorithm for all arm distributions. In each case, it is verified that the optimal arm is chosen the highest number of times by all algorithms. The appropriate selection of |β| is important as it affects the precision of the QF. Too small a |β| results in multiple arms with identical QF values and hence frequent selection of suboptimal arms. For instance, |β| = 10 is not sufficient for $\mu_2$ as it offers high regret, as shown in Fig. 6 (d). Based on extensive performance analysis, 15 ≤ |β| ≤ 20 leads to a higher number of selections of the optimal arm, i.e., lower regret, and the gain in performance is not significant for |β| > 20. The proposed architecture in Fig. 4 allows on-the-fly selection of |β| via DPR. Next, we compare the effect of WL on the regret performance of the SBTS-ESSR (|β| = 20) and UCB algorithms. In Fig. 7, we compare the regret of these algorithms at the end of a horizon of size N = 10000 for $\mu_1$ and $\mu_3$.
We compare the regret for SP-FP and fixed-point implementations with a total WL of 27, 11, and 6 bits. In each case, the number of bits to represent integer and fractional parts are chosen carefully so as to minimize regret. It can be observed that the regret degrades with the decrease in WL. However, degradation is not significant till WL=11. For WL=6, regret is high and this happens due to insufficient bits to represent QF which in turn leads to the selection of sub-optimal arms. Thus, the selection of appropriate WL is important since lower WL leads to significant savings in resources but it should not come at the cost of regret performance.
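The effect of insufficient bits on arm selection can be illustrated with a small Python sketch that truncates QF values to a given number of fractional bits. The QF values and the bit widths below are illustrative assumptions; the integer/fraction split used in the actual architecture is not specified in the text.

```python
def quantize(value, frac_bits):
    """Truncate a value in [0, 1) to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return int(value * scale) / scale


qf_a, qf_b = 0.52, 0.53  # hypothetical QFs of two competing arms
for frac_bits in (4, 9):
    print(frac_bits, quantize(qf_a, frac_bits), quantize(qf_b, frac_bits))
```

With 4 fractional bits both QFs collapse to 0.5, so the two arms become indistinguishable and the tie is broken arbitrarily; with 9 fractional bits they remain distinct (0.51953125 and 0.529296875).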

B. Complexity Comparison
In Table I, we compare the resource utilization (LUT, FFs, DSP and BRAM), and power consumption of four different architectures realized on ZSoC. These architectures correspond to SBTS, SBTS-ES, SBTS-ESSR, and UCB algorithms. Each architecture is made reconfigurable via DPR at the arm level i.e. the number of arms can be dynamically configured (i.e., the number of arms can be tuned to any values less than or equal to K max . Each architecture is realized with three different WLs: SP-FP, fixed point WL of 27, and 11 bits. The reconfigurable architecture is compared with non-reconfigurable Velcro approach-based architecture with fixed K max arms and SP-FP WL. As shown in Table I, resource utilization and power consumption of DPR based architecture depend on the number of active arms, K compared to the Velcro approach, which corresponds to architecture with K max arms, i.e., all blocks are active all the time compared to dynamic activation and deactivation of arms in the DPR based architecture. It can be observed that the SP-FP version of the DPR-based architecture offers around savings of 18-39% in LUTs, 17-38% in FFs, 25-44% in DSP48Es, and 20-40% in BRAM over the Velcro approach for K < K max . Further, they offer a 19-40% reduction in the dynamic power consumption over the Velcro approach for K < K max . Using the fixed-point implementation with WL=27, one can achieve up to 83%, 79%, 100%, 100% savings in the consumption of LUTs, FFs, DSP48Es, and BRAM 18Ks respectively for K < K max . This can be achieved with almost identical performance as that of the SP-FP architecture. With WL=11, further improvement in savings of up can be achieved with a slight degradation in the regret. In terms of the dynamic power consumed, the architectures with fixed-point WLs offer up to 94% savings. Among SBTS algorithms, the SBTS-ESSR algorithm offers significant savings in resource utilization as well as power consumption. 
Compared to the UCB algorithm, SBTS-ESSR is computationally efficient and offers lower regret and power consumption. This makes the proposed SBTS-ESSR a superior alternative to the state-of-the-art UCB algorithm for environments with Bernoulli rewards.
Execution time of the algorithm is an important performance metric that depends on efficient implementation and the underlying architecture. In Table II, we consider three different configurations obtained by realizing the algorithm using: 1) PS and PL with optimal partitioning, 2) only PS (ARM + NEON), and 3) only ARM. It can be observed that the first approach offers the lowest execution time due to the parallel execution of the QF function in PL compared to sequential execution in PS. Furthermore, the gain improves as K increases. Among the various TS algorithms, SBTS-ESSR offers the lowest execution time, as expected. In applications like wireless networks, MAB algorithms are realized in upper layers (MAC/Network), i.e., in ARM or other processors, while the PHY is present in the SoC. The proposed architecture enables shifting of the MAB algorithms from the MAC to the PHY layer, thereby resulting in an acceleration factor of 465.6-776 for UCB and 2.5-33.6 for TS. For K > 20, the acceleration factor will be significantly higher. Between the UCB and TS algorithms, UCB is faster on ZSoC due to hardware-friendly arithmetic operations compared to the PRNG in SBTS, but UCB incurs high regret. On the PS-only architectures (i.e., ARM and ARM+NEON), SBTS-ESSR offers the lowest execution time.
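To illustrate how the per-arm computation differs between the two families, the following is a minimal sketch of the textbook UCB1 index and a textbook Beta-Bernoulli Thompson sample. It is not the paper's SBTS approximation or its hardware mapping; the exploration constant `alpha`, the function names, the example counts, and the use of Python's `random.betavariate` are assumptions made for illustration.

```python
import math
import random

def ucb_index(x: float, t: int, n: int, alpha: float = 2.0) -> float:
    """Textbook UCB1 index: empirical mean plus an exploration bonus."""
    return x / t + math.sqrt(alpha * math.log(n) / t)

def ts_sample(x: int, t: int, rng: random.Random) -> float:
    """Textbook Thompson sample for Bernoulli rewards: draw from Beta(x+1, t-x+1)."""
    return rng.betavariate(x + 1, t - x + 1)

def select_arm(scores):
    """Return the index of the arm with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-arm statistics: X = accumulated reward, T = number of selections.
X = [3, 9, 5]
T = [10, 12, 10]
n = sum(T) + 1                      # current time slot
rng = random.Random(0)

ucb_choice = select_arm([ucb_index(x, t, n) for x, t in zip(X, T)])
ts_choice = select_arm([ts_sample(x, t, rng) for x, t in zip(X, T)])
print("UCB picks arm", ucb_choice, "and TS picks arm", ts_choice)
```

The UCB index needs only a division, a logarithm, and a square root, whereas the Thompson sample requires a random draw per arm in every slot.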

VI. Reconfigurable and Intelligent MAB (RI-MAB)
The SBTS-ESSR algorithm is well-suited only for arm rewards with Bernoulli distribution and hence, it may not outperform the UCB algorithm when the arm distribution is unknown and random [3]. For example, in wireless radio, the distribution of an arm (i.e., a wireless channel) is usually Bernoulli at high signal-to-noise ratios (SNRs) and Gaussian at medium and low SNRs. In IoT and robotics applications, the type of arm distribution is unknown. Such applications demand an architecture that can learn, identify, and deploy the appropriate MAB algorithm.
Very few works have addressed this problem, and [18] is one of the recent works in which the aggregator algorithm selects the arm chosen by one of the candidate MAB algorithms in each time slot. The aim is to identify an optimal algorithm for a given unknown environment, and this is achieved by learning from the past performance of the various algorithms, i.e., an algorithm exploration-exploitation trade-off in addition to the arm exploration-exploitation trade-off. The major problem with [18] is the need for hardware implementation of all candidate MAB algorithms in parallel (referred to here as Velcro MAB). Such an architecture is area and power inefficient. The proposed RI-MAB algorithm, augmented with the DPR-based reconfigurable architecture, overcomes this problem.
The proposed RI-MAB algorithm is given in Algorithm 2. The number of candidate algorithms, the number of arms, and the horizon size are $A$, $K$, and $N$, respectively. The RI-MAB algorithm maintains a probability distribution over all candidate algorithms to indicate their optimality. Since all algorithms are equally likely to be optimal at the beginning of each experiment, the prior belief, $\pi_0$, is initialized as a uniform distribution (Line 2). To update the belief and identify the optimal algorithm, the RI-MAB algorithm performs epoch-based exploration in the initial $N_{learn}$ time slots. In this phase, each algorithm is selected for $2^e$ time slots before incrementing the parameter $e$ by 1 (Lines 12-17). This allows a sufficient number of learning samples of each algorithm before finalizing the algorithm for the rest of the horizon, i.e., after $N_{learn}$ (Line 20). Such an approach allows only one MAB algorithm to be active in hardware in each time slot, and the increasing epoch length reduces the number of algorithm switches. Even though only one algorithm is active, the parameters $X$ and $T$ are common across all candidate algorithms, which means there is no compromise on the arm-learning aspects of the MAB setup.

Algorithm 2 RI-MAB Algorithm
...
8: Update the learning rate, $\eta_n = \log A/(n \times K)$
9: Compute the unbiased reward using the belief of the selected algorithm, $\hat{R}_n = R_n / \pi_{n-1}(alg)$
10: Update the belief of the selected algorithm: $\pi_n(alg) = \exp(\eta_n \hat{R}_n) \times \pi_{n-1}(alg)$
11: Normalize the belief of all algorithms: $\pi_n(i) = \pi_n(i) / \sum_{a=1}^{2} \pi_n(a)$, $i \in \{1, 2\}$
...
Update $X$ and $T$: $X(I_n) = X(I_n) + R_n$, $T(I_n) = T(I_n) + 1$
27: end for
28: Calculate regret using Eq. 1.
During algorithm exploration (Lines 6-18), RI-MAB updates the belief, $\pi_n$, in each time slot based on the selected algorithm and the received reward. Similar to [18], we obtain the unbiased estimate of the received reward using the probability of the selection, i.e., the belief of the selected algorithm (Line 9). Then, the belief of the selected algorithm is updated using the exponential multiplicative factor (Line 10) and the learning rate (Line 8) [18]. In the end, the beliefs of all algorithms are normalized (Line 11).
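The belief update described above (Lines 8-11 of Algorithm 2) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `update_belief` is hypothetical, the learning rate follows the expression as printed in Algorithm 2, and forming the unbiased reward by dividing the reward by the belief of the selected algorithm is an assumption based on the description of Line 9.

```python
import math

def update_belief(belief, alg, reward, n, num_arms):
    """One RI-MAB belief update for time slot n (n >= 1).

    belief   : list of current beliefs, one per candidate algorithm (sums to 1)
    alg      : index of the algorithm that was active in this slot
    reward   : reward R_n received for the arm chosen by that algorithm
    num_arms : number of arms K
    """
    num_algs = len(belief)
    eta = math.log(num_algs) / (n * num_arms)    # Line 8, as printed
    unbiased = reward / belief[alg]              # Line 9, assumed importance weighting
    updated = list(belief)
    updated[alg] *= math.exp(eta * unbiased)     # Line 10, multiplicative update
    total = sum(updated)
    return [b / total for b in updated]          # Line 11, normalization

# Hypothetical example: two candidate algorithms (0 = UCB, 1 = SBTS-ESSR), K = 8.
belief = [0.5, 0.5]
belief = update_belief(belief, alg=0, reward=1, n=1, num_arms=8)  # UCB rewarded
belief = update_belief(belief, alg=1, reward=0, n=3, num_arms=8)  # no reward, no change
print(belief)
```

A reward of 0 yields a multiplicative factor of exp(0) = 1, so the beliefs are left unchanged, while a reward of 1 shifts probability mass toward the active algorithm.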
The functionality of the RI-MAB algorithm is explained using Fig. 8. In the beginning, the prior belief is $\pi_0 = \{0.5, 0.5\}$. The algorithm starts by selecting UCB for the first $2^1 = 2$ slots. It should be noted that we simulate the algorithm with Bernoulli arm rewards for ease of understanding. In slot 1, UCB selects the arm $I_1 = 4$ and receives a reward $R_1 = 1$. This increases the belief of UCB by a factor of $\exp(\eta_1 \times 1)$. The same happens in slot 2, when the belief of UCB increases further after receiving $R_2 = 1$ for the selected arm $I_2 = 2$. In slots 3 and 4, the algorithm selects SBTS-ESSR, which receives rewards $R_3 = 0$ and $R_4 = 0$ on arms 4 and 2, respectively. Hence, the belief of SBTS-ESSR is not updated in either slot (as $\exp(\eta_n \times 0) = 1$). At this point, the parameter $e$ is incremented by 1. Skipping over to slots 19 and 24, the algorithm selects UCB, which receives rewards $R_{19} = 1$ and $R_{24} = 1$, respectively, improving the belief of UCB. In slots 128 and $N_{learn} = 500$, the selected candidate algorithm, SBTS-ESSR, receives a reward of 1, which improves its belief. In summary, the belief of the selected candidate algorithm increases when it receives a reward $R_n = 1$ and is not updated when $R_n = 0$. After slot $N_{learn} = 500$, the algorithm finalizes SBTS-ESSR for the rest of the horizon, owing to its higher belief in slot $N_{learn}$.
The proposed RI-MAB algorithm is mapped on the SoC and the corresponding architecture is shown in Fig. 9. Compared to Fig. 4, an additional algorithm selection unit is included in the ARM processor and the QF calculation block is reconfigured via DPR depending on the selected candidate algorithm.
In Fig. 10 (a), we compare the regret of the four algorithms at different instances of the horizon. The regret is averaged over 20 independent experiments. The final regret (i.e., the regret at the end of the horizon) for each experiment is shown in Fig. 10 (c). In all 20 experiments, SBTS-ESSR offers the lowest regret while UCB incurs the highest regret, though both are able to identify the optimal arm. Velcro MAB offers performance [...] where it takes more time to learn that TS is better than UCB. On the other hand, RI-MAB selects SBTS-ESSR in all experiments. Despite that, the regret of RI-MAB is higher than that of Velcro MAB in all experiments except 16 and 20. This happens due to the initial $N_{learn}$ period of exploration needed to choose between the UCB and SBTS-ESSR algorithms. Still, the averaged regret of RI-MAB and Velcro-MAB is nearly identical, as shown in Fig. 10 (a), since the regret incurred by Velcro-MAB is very high in experiments 16 and 20. Such high-regret events are avoided in the RI-MAB algorithm.

VII. Performance and
In Fig. 10 [...]. Though RI-MAB incurs higher average regret in experiments where SBTS-ESSR is optimal, it avoids the wrong selection of the algorithm as well as of the optimal arm in the rest of the experiments. It can be observed that Velcro-MAB and RI-MAB offer lower regret than UCB and SBTS-ESSR in some experiments. This happens because the learning gained by switching between the two algorithms helps to avoid the wrong selection of the optimal arm in SBTS-ESSR. These experiments demonstrate the need for careful algorithm switching and selection for a given environment, since a single algorithm does not offer optimal performance in all experiments.
Next, we select the arm statistics randomly in each of the experiments instead of the fixed µ 3 in Fig. 10. We consider N = 10000, K = 8, and two types of arm statistics such that the minimum difference between two arm statistics is 1) 0.07 (Case 1), and 2) 0.025 (Case 2). The corresponding results are shown in Fig. 11 (a)-(d). It can be observed that SBTS-ESSR incurs significantly higher regret in almost 25% of the experiments. Though UCB incurs a higher average regret overall, it successfully identifies the optimal arm in all experiments, leading to better performance than SBTS-ESSR. The proposed RI-MAB offers similar performance to UCB but with lower average regret. On the other hand, Velcro-MAB fails to identify the appropriate algorithm, and hence the optimal arm, in some experiments, while in other experiments it offers lower regret than RI-MAB. On average, both offer nearly identical regret, but RI-MAB eliminates high-regret events.
To gain further insights into the regret performance, we study various statistical properties of the final regret over 100 experiments in Fig. 12. Using the Boxplot feature in Matlab, we consider the median (central red line) and the percentiles (the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively). The outliers are plotted individually using the '+' symbol. It can be observed that SBTS-ESSR and Velcro-MAB have large boxes, indicating large variation in regret performance, and a larger number of outliers, indicating wrong selection of the algorithm or arm. On the other hand, UCB and RI-MAB guarantee the selection of the optimal arm in all experiments, and RI-MAB offers the lower regret of the two.
Next, we compare the resource utilization and power consumption of the RI-MAB and Velcro-MAB architectures in Table III. Both architectures are on-the-fly reconfigurable in terms of the number of arms and the type of algorithm. The proposed RI-MAB architecture with SP-FP WL offers around 44-65%, 36-59%, 27-56%, and 15-49% savings in LUTs, FFs, DSPs, and BRAMs, respectively, over the Velcro-MAB architecture. Similarly, it offers 44-65% lower dynamic power consumption. The savings improve further when the WL is optimized via fixed-point representation. With an increase in the number of arms (i.e., for K max > 20), the proposed architecture and DPR approach offer further improvements in resource utilization. In addition, the savings will increase further with an increase in the number of candidate algorithms.
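The box-plot statistics used for Fig. 12 (median, 25th and 75th percentiles, and individually plotted outliers) can be approximated with the sketch below. It assumes the common rule that points farther than 1.5 times the interquartile range from the box edges are outliers; the helper `box_stats` and the regret values are hypothetical, and the percentile interpolation in Python's `statistics.quantiles` may differ slightly from the one used by Matlab.

```python
import statistics

def box_stats(final_regret):
    """Summarize final-regret values the way a box plot does."""
    q1, median, q3 = statistics.quantiles(final_regret, n=4, method="inclusive")
    iqr = q3 - q1                                   # height of the box
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # assumed outlier fences
    outliers = [r for r in final_regret if r < low or r > high]
    return {"median": median, "q1": q1, "q3": q3, "outliers": outliers}

# Hypothetical final regret of one algorithm over seven experiments;
# the last experiment mimics a wrong algorithm or arm selection.
stats = box_stats([10, 12, 11, 13, 12, 11, 90])
print(stats)
```

A tall box (large `q3 - q1`) corresponds to large variation in regret across experiments, and each entry in `outliers` corresponds to a high-regret event of the kind RI-MAB is designed to avoid.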

VIII. Conclusions and Future Works
In this paper, we present a synthesizable and reconfigurable architecture of the Thompson Sampling (TS) multi-armed bandit (MAB) algorithm for arms with Bernoulli distribution. The functional correctness, execution time, resource, and power consumption comparisons demonstrate its superiority over the upper confidence bound (UCB) algorithm. For an environment with unknown arm distribution, a reconfigurable and intelligent MAB (RI-MAB) algorithm is proposed along with its architecture. The RI-MAB offers significant savings in resources and power consumption without compromising on the regret performance. In the future, we plan to integrate the proposed RI-MAB with wireless radio and analyze the gain in throughput using real radio signals. Other possibilities include the extension of the RI-MAB architecture to an environment where the number of arms, as well as the arm statistics, are not fixed, i.e., a non-stationary environment in which the optimal arm changes over time.