Federated Reinforcement Distillation with Proxy Experience Memory

This paper was presented at the 28th International Joint Conference on Artificial Intelligence (IJCAI-19), 1st Workshop on Federated Machine Learning for User Privacy and Data Confidentiality (FML'19), Macau, August 2019.


Introduction
Recent advances in mobile computing power have led to the emergence of intelligent autonomous systems [Park et al., 2018; Shiri et al., 2019], ranging from driverless cars and drones to self-controlled robots in smart factories. Each agent therein interacts with its environment and carries out decision-making in real time. Distributed deep reinforcement learning (RL) is a compelling framework for such applications, in which multiple agents collectively train their local neural networks (NNs). As illustrated in Figure 1(a), this is often done by: (i) uploading every local experience memory to a server, (ii) constructing a global experience memory at the server, and (iii) downloading and replaying the global experience memory at each agent to train its local NN [Rusu et al., 2016]. However, a local experience memory contains all local state observations and the corresponding policies (i.e., action logits), and exchanging it may violate the privacy of its host agent.
To obviate this problem, we propose a distributed RL framework based on a proxy experience memory. In this work, we consider an actor-critic RL architecture comprising two separate NNs, i.e., policy (actor) and value (critic) NNs, and study how to construct the local and global proxy memories, how often the proxy memories are exchanged, and finally how to update each agent's local NN using the global proxy memory.
Related Works. Distributed deep RL has been investigated as policy distillation [Rusu et al., 2016] and advantage actor-critic (A2C) [Mnih et al., 2016] algorithms, under policy-NN and actor-critic based RL architectures, respectively. Both algorithms rely on exchanging actual experience memories. For classification tasks, distributed machine learning via exchanging NN outputs has been proposed as federated distillation (FD) in our preceding work [Jeong et al., 2018]. In FD, the outputs are quantized based on the classification labels to maximize communication efficiency. FRD leverages and extends this idea to distributed RL scenarios, with the aim of preserving privacy rather than improving communication efficiency. It is noted that federated learning [McMahan et al., 2017] is another promising enabler of private distributed RL via exchanging NN model parameters, which has recently been studied as federated reinforcement learning (FRL) in [Zhuo et al., 2019]. In view of this, we conclude this paper by comparing FRL and our proposed FRD in the last section.
Background: Distributed Reinforcement Learning with Experience Memory

We consider an episodic Markov decision process with discrete state and action spaces, with state space S, action space A, and reward at each time slot denoted by r_t ∈ R. The policy is stochastic and denoted by π_θ : S → P(A), where P(A) is the set of probability measures on A. The parameters of the local model are denoted by θ ∈ R^n, and π_θ(a|s) is the conditional probability of action a when the state is s. The RL agent interacts with the environment without any prior knowledge about the environment.
In policy distillation as presented in [Rusu et al., 2016], the agents i = 1, ..., U construct a dataset called the experience memory for training the local model θ_i. The experience memory M = {(s_k, π(a_k|s_k))}_{k=0}^{N} consists of tuples of the state s_k and the policy vector π(a_k|s_k), where a = (a_1, ..., a_{|A|}) is the action vector. As illustrated in Figure 1(a), the experience memory M is collected through the following procedure.

• Each agent i records its local experience memory M_i = {(s_k, π(a_k|s_k))}_{k=0}^{N_i} during E episodes. The size N_i of the local experience memory equals the number of learning steps. In this paper, we assume that all agents wait until the last agent completes its episodes.
• After all the agents complete the E episodes, the server collects M_i from each agent.
• Then, the server constructs a global experience memory M = {(s_k, π(a_k|s_k))}_{k=0}^{N}, where N = Σ_{i=1}^{U} N_i and π denotes the arbitrary policy of the agents.
After the global experience memory M is constructed, the agents update their local models θ_i through the following procedure.
• To reflect the knowledge of other agents, the agents download the global experience memory M from the server.
• Similar to the conventional classification setting, each agent i fits its local model θ_i by minimizing the cross-entropy loss L_i(M, θ_i) between the policy of the local model π_{θ_i}(a_k|s_k) and the policy π stored in the global experience memory M:

L_i(M, θ_i) = -Σ_{k=0}^{N} π(a_k|s_k) log π_{θ_i}(a_k|s_k).    (1)

Unfortunately, directly exchanging the agents' local experience memories raises privacy leakage issues. The server obtains all the information about the states visited by each host agent and the corresponding policy of that agent. To utilize policy distillation, privacy leakage is thus inevitable.
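As a concrete illustration, the cross-entropy fitting step above can be sketched as follows. The state encoding, action count, and probability values are hypothetical placeholders, not the paper's actual setup.

```python
import numpy as np

def distillation_loss(memory, local_policy):
    """Average cross-entropy between the policies stored in the global
    experience memory and the local model's policies for the same states."""
    loss = 0.0
    for state, target_pi in memory:
        local_pi = local_policy(state)  # probability vector over actions
        loss -= np.sum(target_pi * np.log(local_pi + 1e-12))
    return loss / len(memory)

# Hypothetical 2-action example: if the local model already reproduces the
# stored policies, the loss reduces to the average entropy of the targets.
memory = [(0, np.array([0.9, 0.1])), (1, np.array([0.2, 0.8]))]
stored = dict(memory)
print(distillation_loss(memory, lambda s: stored[s]))
```

In practice the gradient of this loss with respect to θ_i would be taken through the local policy network rather than a lookup table; the sketch only shows the quantity being minimized.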

Federated Reinforcement Distillation (FRD) with Proxy Experience Memory
In this section, we introduce the novel federated reinforcement distillation (FRD) method, which provides communication-efficient, privacy-preserving distributed RL. Agents utilizing FRD construct a novel dataset called the proxy experience memory M^P. As illustrated in Figure 1(b), the proxy experience memory M^P is formed through the following procedure.
• Each agent categorizes its policy π_{θ_i}(a|s) along the states s included in each proxy state cluster, i.e., s ∈ C_j.
• After all the agents complete the E episodes, each agent calculates the local average policy π^p_{θ_i}(a_k|s^p_k) by averaging the policies π_{θ_i}(a|s) within each proxy state cluster C_j, and forms the local proxy experience memory M^P_i = {(s^p_k, π^p_{θ_i}(a_k|s^p_k))}_{k=0}^{N^P_i}. The size N^P_i of the local proxy experience memory equals the number of proxy state clusters visited by the agent. Note that π^p_{θ_i}(a_k|s^p_k) is not generated directly by the agent's local model.
• When the local proxy experience memories of every agent are ready, the server collects M^P_i from each agent.
• Then, the server constructs the global proxy experience memory M^P = {(s^p_k, π^p(a_k|s^p_k))}_{k=0}^{N^P} by averaging the local average policies of the local proxy experience memories along each state cluster. The size N^P of the global proxy experience memory equals the number of proxy state clusters visited by the entire set of agents.
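A minimal sketch of the local and server-side averaging steps above, assuming states are already mapped to cluster indices by some `cluster_of` function (a hypothetical stand-in for the pre-arranged clustering):

```python
import numpy as np
from collections import defaultdict

def build_local_proxy_memory(records, cluster_of):
    """Local step: average the recorded policy vectors over each visited
    proxy state cluster, yielding {cluster index: average policy}."""
    buckets = defaultdict(list)
    for state, policy in records:
        buckets[cluster_of(state)].append(policy)
    return {j: np.mean(pis, axis=0) for j, pis in buckets.items()}

def build_global_proxy_memory(local_memories):
    """Server step: average the local average policies cluster by cluster,
    over the agents that visited each cluster."""
    buckets = defaultdict(list)
    for memory in local_memories:
        for j, avg_pi in memory.items():
            buckets[j].append(avg_pi)
    return {j: np.mean(pis, axis=0) for j, pis in buckets.items()}
```

Note that the server never sees raw states: each key j identifies only a cluster, and the associated value is a policy averaged over all states the agent visited inside that cluster.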
As in the policy distillation case, the agents utilizing FRD update their local models θ_i through the following procedure.
• Following the distributed RL procedure, the agents download the global proxy experience memory M^P from the server.
• Each agent i fits its local model θ_i by minimizing the cross-entropy loss L^P_i(M^P, θ_i) between the policy of the local model π_{θ_i}(a_k|s^p_k) and the global average policy π^p(a_k|s^p_k):

L^P_i(M^P, θ_i) = -Σ_{k=0}^{N^P} π^p(a_k|s^p_k) log π_{θ_i}(a_k|s^p_k).    (2)

We note that the loss is calculated with the policy produced by the local model taking the proxy state as input.
As mentioned above, the size of the global proxy experience memory is much smaller than that of the experience memory due to state clustering. When memory sharing occurs through a wireless channel, the payload size is a key factor in the feasibility of sharing. From this point of view, FRD provides a communication-efficient distributed RL framework.
Furthermore, exchanging the proxy experience memories preserves the privacy of the agents. The server only knows a group of states that an agent visited, and the policy of the host agent is concealed because each agent's policy is shared in the form of a policy averaged over the proxy state cluster.

FRD under Actor-Critic Architectures
The advantage actor-critic (A2C) algorithm [Mnih et al., 2016] consists of two parts, the actor and critic NNs. The actor generates an action a ∈ A according to the policy π_θ, and the critic evaluates how much more beneficial the selected action is than other actions with respect to the expected future reward. Since the actor and the critic have no prior knowledge of the environment, the actor-critic pair has to interact with the environment and learn the optimal policy to obtain the maximum expected future reward. By adopting a neural network structure, the actor and the critic effectively learn the optimal policy π*.
The advantage function [Wang et al., 2016] is a metric for evaluating the action generated by the actor. If the value of the advantage function is positive, the selected action is better than the average action in that state; if it is negative, the selected action is worse. The advantage function A is defined as follows:

A(s, a) = Q^π(s, a) - V^π(s),    (3)

where the action-value function Q^π(s, a) = E[r^γ_0 | s_0 = s, a_0 = a; π], the value function V^π(s) = E[r^γ_0 | s_0 = s; π], r^γ_t denotes the discounted return from step t, and r(s_t, a_t) is the instant reward at learning step t. As we can see in equation (3), we can estimate the advantage function with only the value function. As a result, the neural network of the critic approximates the value function and estimates the advantage function at every update step of the policy network.
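For intuition, the critic-only estimate implied above can be written as the one-step bootstrapped advantage; the numeric values below are purely illustrative.

```python
# One-step advantage estimate used in A2C-style training:
#   A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t),
# which requires only the critic's value function V, as noted above.
def advantage(reward, value_s, value_next, gamma=0.99, done=False):
    bootstrap = 0.0 if done else gamma * value_next  # no bootstrap at episode end
    return reward + bootstrap - value_s

# Illustrative numbers (hypothetical critic outputs): a reward of 1 from a
# state valued at 0.5, landing in a state valued at 1.0.
print(advantage(1.0, 0.5, 1.0))  # 1.0 + 0.99*1.0 - 0.5 = 1.49
```

A positive result, as here, signals that the sampled action outperformed the critic's expectation and should be reinforced.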
Under the A2C algorithm, we have to select which model to train using the FRD framework: only one of the two models, or both? As mentioned in Sections 2 and 3, the policy network forms the experience memory with the policy π. Similarly, the value network forms a value memory that consists of state and corresponding value pairs. In the FRD case, the average policy is then replaced by the average value.
In Figure 2, we present the performance comparison for each case: both networks, the policy network only, and the value network only. The three cases show similar performance in terms of the number of episodes until the mission is completed. Unlike the other two cases, exchanging only the policy network yields stable learning results, i.e., the variation of mission completion time is smaller than in the other two cases. For this reason, we select the policy network for applying the FRD framework. In the rest of the paper, we utilize the FRD framework with the experience memory made from the output of the policy network unless otherwise mentioned.

Experiments
The group of RL agents shares the output of the policy network to construct the proxy experience memory M^P, utilizing federated reinforcement distillation under the advantage actor-critic algorithm. In this paper, we implement the proposed federated reinforcement distillation framework in the Cartpole-v1 environment of OpenAI Gym to evaluate its performance. We evaluate the performance of the proposed FRD framework in terms of the number of episodes until the group of agents completes the mission. The mission of the group of agents is defined as achieving an average standing duration of the pole over 10 episodes that exceeds a predetermined time duration. We assume that the group of agents completes the mission if just one of the agents in the group completes it. Each agent adopts the advantage actor-critic model as its local model, with the policy network size presented in Table 1. Note that the model size of the value network is identical to that of the policy network.
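The mission criterion described above can be sketched as a simple check; the threshold and episode durations below are hypothetical values, not the experiment's actual parameters.

```python
from collections import deque

def mission_complete(episode_durations, threshold, window=10):
    """An agent completes the mission once the average standing duration
    over its last `window` episodes exceeds the threshold."""
    recent = deque(episode_durations, maxlen=window)
    return len(recent) == window and sum(recent) / window > threshold

def group_complete(all_agents_durations, threshold):
    """The group completes the mission as soon as any single agent does."""
    return any(mission_complete(d, threshold) for d in all_agents_durations)
```

The number of episodes elapsed when `group_complete` first returns True is the performance metric reported in the figures.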
Before implementing federated reinforcement distillation, pre-arranging the state clustering is needed. The state of the Cartpole environment consists of four components: the position of the cart, the velocity of the cart, the angle of the pole, and the velocity of the pole tip. We evenly divide each component into S subsections. Then, we form the state clusters as the combinations of the divided components. As a result, the number of state clusters |C| is identical to S^4. The proxy state of each state cluster is defined as the middle value of each subsection of the components. For example, the proxy state of the state cluster C_j = {[0, 1), [−1, −0.5), [0.5, 1), [0, 0.1)} is s^p = [0.5, −0.75, 0.75, 0.05].

Figure 3: Simulation results in the Cartpole environment. The x-axis of all graphs is the number of agents in the cooperative group, and the y-axis is the number of episodes until the agent group completes the mission. The mission of the group of agents is to achieve an average standing duration of the pole over 10 episodes that exceeds a predetermined time duration. We assume that the group of agents completes the mission if one of the agents in the group completes it. The agents make the proxy experience memory with the output of the policy network only.
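The clustering and proxy-state construction above can be sketched as follows; the per-component clipping bounds are assumptions chosen so that the intervals match the paper's worked example, not values taken from the paper.

```python
import numpy as np

def proxy_state(state, lows, highs, S):
    """Map a raw Cartpole state to its proxy state: each of the four
    components is placed into one of S evenly spaced subsections, and the
    proxy state is the midpoint of that subsection in every dimension."""
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    state = np.clip(state, lows, highs - 1e-9)  # keep edge states in range
    width = (highs - lows) / S                  # subsection width per dim
    idx = np.floor((state - lows) / width)      # subsection index per dim
    return lows + (idx + 0.5) * width

# With assumed bounds [-2,2] x [-1,1] x [-1,1] x [-0.2,0.2] and S = 4, the
# state below falls into the cluster {[0,1), [-1,-0.5), [0.5,1), [0,0.1)},
# whose proxy state is the worked example [0.5, -0.75, 0.75, 0.05].
print(proxy_state([0.3, -0.6, 0.7, 0.03], [-2, -1, -1, -0.2], [2, 1, 1, 0.2], 4))
```

Every raw state inside the same hyper-rectangle maps to the same proxy state, which is what allows policies to be averaged per cluster.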
We perform simulations with the various hyperparameter settings presented in Table 1; the corresponding results are presented in Figure 3. We investigate the impact of each hyperparameter in terms of the number of episodes until the agent group completes the mission. Each box in Figure 3 spans the 25th to 75th percentiles of the data, the blue star represents the mean, and the red line represents the median.
Impact of the Proxy State Size. Comparing Setting 2 and Setting 4, we can observe the impact of the proxy state size on the performance of FRD. When multiple agents cooperate, the performance of Setting 4 is better than that of Setting 2 in terms of the average number and variance of episodes. As the number of agents increases further, the relation is reversed. Because the policy resolution of a smaller proxy state space is lower, the agents' knowledge is blurred compared to that of a larger proxy state space. Nevertheless, the multi-agent case of Setting 4 achieves better performance even though its proxy state size is 16 times smaller than that of Setting 2. This means the group of agents can choose the proxy state size to reduce the payload size of the exchanged information. If the agents cooperate through a wireless channel, they can select a proper state cluster size while sacrificing a small amount of learning performance.
Impact of Memory Exchange Period. In Setting 5, the performance of the agent group worsens as the number of agents increases. Too frequent memory exchanges and local model updates bring no benefit from increasing the number of agents. As shown in Setting 6, a moderate frequency of memory exchange brings a stable performance enhancement.
Impact of Initial Learning Time. If there is no initial learning time before exchanging the experience memory, the performance of FRD is degraded as well as unstable. In Setting 5, the absence of initial learning time results in performance degradation as the number of agents increases.
The local models of the agents are not trained enough before exchanging their proxy experience memories. Furthermore, too long an initial learning time is also detrimental to the performance of FRD. Because a long initial learning time may allow an individual agent's local model to learn a bad policy, the cooperation of the agents can then worsen the training of each agent's local model. Comparing Setting 2 and Setting 3, the performance of Setting 2 is better than that of Setting 3. As a result, the initial learning time should be selected properly to achieve higher performance.
Impact of Neural Network Model Size. As we can see in Setting 1 and Setting 2, the smaller NN has better performance in terms of the number of episodes until the agent group completes the mission. Because we measure how fast the agent group completes the mission, the bigger NN is at a disadvantage in terms of convergence duration. In future work, the advantage of a big NN compared to a small NN can be evaluated in more complicated, score-pursuing environments like the Atari games in OpenAI Gym. On the other hand, a too-small NN obtains only a marginal gain from FRD. In Setting 6 and Setting 7, the performance enhancement from increasing the number of agents is limited to within a certain average-value boundary.

Discussion and Concluding Remarks
In this paper, we introduce a privacy-preserving distributed reinforcement learning framework, termed federated reinforcement distillation (FRD). The key idea is to exchange a proxy experience memory comprising a pre-arranged set of states and time-averaged policies. This makes it possible to conceal the actual experiences and additionally has the benefit of a reduced memory size. When distributed learning is conducted in a communication-constrained situation, e.g., over a wireless channel, the proposed FRD framework has an advantage over existing policy distillation.
Based on the advantage actor-critic (A2C) algorithm, we evaluate the performance of FRD for various proxy memory structures and different memory exchange rules. First, we investigate the impact of the proxy memory structure, i.e., which network is used for FRD in the A2C algorithm: the policy network, the value network, or both. Second, based on the first investigation, we implement policy-network-based FRD and evaluate its performance under various settings of the memory exchange rules: when, how often, and how large.
As future work, a performance comparison between federated learning and FRD is promising. We evaluate the performance in a simple setting where multiple agents collaborate. The performance in terms of the average number of episodes until the group of agents completes the mission is fairly equivalent. However, in terms of variation, FRD performs better than federated learning. The performance difference is due to the amount of noise when knowledge transfer occurs; this indicates that the noise of FRD is less than that of federated learning.

Figure 1: Comparison between (a) a baseline distributed reinforcement learning (RL) framework, policy distillation with experience memory [Rusu et al., 2016], and (b) the proposed federated reinforcement distillation (FRD) with proxy experience memory.

Figure 2: Performance comparison according to the exchanged model. The case of exchanging the policy network is better than the other cases in terms of performance variation.

Figure 4: Performance comparison between federated learning and federated reinforcement distillation when multiple agents collaborate.
where s^p denotes the proxy state and π^p(a_k|s^p_k) denotes the average policy. The proxy state is the representative state of the state cluster C_j ∈ C. Note that the union of the proxy state clusters is the state space S, i.e., S = ∪_{j=1}^{|C|} C_j, and no two state clusters intersect, i.e., C_i ∩ C_j = ∅ for i ≠ j.

Table 1: Hyperparameters of federated reinforcement distillation.