Scalable Deep Reinforcement Learning-Based Online Routing for Multi-Type Service Requirements

Emerging applications raise critical QoS requirements for the Internet. Improvements in flow classification technologies, software-defined networking (SDN), and programmable network devices make it possible to quickly identify users' requirements and control the routing of fine-grained traffic flows. However, the problem of optimizing the forwarding paths of traffic flows with multiple QoS requirements in an online fashion has not been sufficiently addressed. To address this problem, we propose DRL-OR-S, a highly scalable online routing algorithm using multi-agent deep reinforcement learning. DRL-OR-S adopts a comprehensive reward function, an efficient learning algorithm, and a novel deep neural network structure to learn appropriate routing strategies for different types of flow requirements. To enhance generalization and scalability, we propose a novel graph-based actor-critic network architecture and a carefully designed input state for DRL-OR-S. To accelerate the training process and guarantee reliability, we further introduce an NN-simulator for efficient offline training and a safe learning mechanism to avoid unsafe routes during the online routing process. We implement DRL-OR-S under the SDN architecture and conduct Mininet-based experiments using real network topologies and traffic traces. The results validate that DRL-OR-S can well satisfy the requirements of latency-sensitive, throughput-sensitive, latency-throughput-sensitive, and latency-loss-sensitive flows at the same time, while exhibiting great adaptiveness and reliability under link failures, traffic changes, unseen large topologies, and partial deployment.

is throughput-sensitive, for which a greedy algorithm chooses path S-1-3-T. Then, the second flow arrives, which has a data rate of 1 Mbps from node S to T and is latency-sensitive. The greedy algorithm has to choose path S-1-2-3-T, which is longer and has a larger latency. This degrades the QoS of the second flow.
We tackle the online routing optimization problem for multi-type service requirements by exploring the potential of deep reinforcement learning (DRL). DRL is well suited to online decision-making problems: it optimizes not only current but also future rewards and can adapt to dynamic environments. There have been studies on reinforcement learning-based routing [12], [13], [14], [15], [16], [17]. However, most existing studies treat all flows identically and do not consider differentiated QoS requirements. To address the problem, we have to overcome a set of challenges. First, flows with different QoS requirements are not independent; they share limited network resources. The routing decisions need to consider all requirements concurrently while preserving the capability to satisfy future requirements. Second, the approach should be scalable with respect to network size, flow number, and service type. That is to say, the routing algorithm should generalize to different network scales at a small additional cost, and unknown QoS requirements that may arise in the future should also be supported. Last but not least, routes should be stable and should not change frequently, even in a dynamic environment. We require that once a path is computed for a flow, it never changes unless a link on the path fails. Meanwhile, the paths computed for newly arriving flows should adapt to the current network status, including traffic bursts and link failures.
We develop our scheme under the SDN architecture. However, using a single agent in a centralized node, e.g., an SDN controller, to make routing decisions may greatly harm the scalability of the scheme, because the action space grows dramatically with the size of the network and becomes intractable for efficient training and accurate inference [16]. To address the issue, we decompose the route generation process into a multi-agent DRL problem. In particular, each router/switch is controlled by a DRL agent that selects the next hop to form routes according to each flow's destination address and QoS requirement. In such a scheme, the agents need to cooperate to achieve the global optimum with limited message exchange, and unsafe routes (e.g., routing loops) caused by the random exploration of DRL should be avoided.
We propose a highly scalable multi-agent DRL-based online routing mechanism for multi-type service requirements. We normalize the flow performance metrics, such as latency and packet loss, in a self-adaptive manner and combine the normalized metrics to obtain utility functions for different service types. Based on the proposed utility functions, we model the route generation process as a multi-agent Markov decision process (MAMDP) and develop an efficient learning algorithm for our scheme. To realize the agents, we propose a novel partially unique actor network structure, which has a common feature extraction layer for all flows and a specialized output layer for each service type. We also develop a safe learning mechanism to avoid unsafe routes during the online routing process. This basic DRL-based online routing scheme is called DRL-OR, which first appeared in our previous study [18].
In this paper, we substantially improve the adaptiveness and scalability of DRL-OR and propose DRL-OR-S. First, we apply a graph-based neural network mechanism [19], [20] and refine the input state of the actor-critic model. By doing so, the generalization ability of our agents is improved, enabling great adaptiveness to dynamics in network environments, including link failures and traffic demand changes. In particular, we can train a general model for DRL-OR-S offline and deploy it for online routing directly. Such a general model performs well even under a dynamically changing network environment and requires no online learning, which DRL-OR needs to adapt to network dynamics. Meanwhile, with such a general model, all agents of DRL-OR-S can share the same set of parameters, which greatly improves scalability and reduces the cost of real-world implementation. Moreover, we develop a neural network to estimate network performance when a routing decision is made in the offline training process. This neural network replaces the packet-level simulated network environment to achieve efficient offline training and is called an NN-simulator. Our NN-simulator, trained with collected real-world data, can accurately predict flow QoS metrics and is more than 600x faster than Mininet [21].
We evaluate DRL-OR-S through Mininet-based experiments using real-world network topologies and traffic traces. The results validate that DRL-OR-S can learn a generalizable and scalable routing policy to satisfy multi-type flow requirements and has great adaptiveness and reliability under link failures, traffic changes, and unseen large network topologies. Compared with traditional rule-based routing algorithms, DRL-OR-S shows better performance in different network scenarios. Additional evaluation of incremental deployment validates that partially deploying DRL-OR-S can also improve flow performance significantly, which exhibits the flexibility and scalability of DRL-OR-S.
The remainder of the paper is organized as follows: We give an overview of DRL-OR-S in Section II and present the design of DRL-OR-S in detail in Section III. Then, Section IV describes the training process and deployment of our model and introduces the construction of the NN-simulator. In Section V, the experimental setup and the evaluation results are presented in detail. Finally, we provide a brief overview of related work in Section VI and conclude the paper in Section VII.

II. OVERVIEW
We consider a network where any node may originate a flow terminated at another node at any time. Each flow has certain requirements on a set of performance metrics, including latency, throughput, and packet loss rate. The requirements can be clustered into a few categories, namely service types. We specify four service types shown in Table I. For each service type, we propose a utility function. The utility of a flow reflects to what extent the flow's requirements are satisfied. Our target is to compute the next hop for each flow at each node traversed to generate a route, such that 1) each flow can be forwarded from the source node to the destination node successfully, i.e., there is no routing loop or blackhole; and 2) the total utility of all flows is maximized.
Formally, the network is modeled as a directed graph $G(V, E)$, where node set $V$ represents the collection of switches/routers, and edge set $E$ represents the collection of links. Each edge has a finite capacity. A flow request emerges in the network at some time point with a fixed bandwidth demand. Without loss of generality, we assume that the flow requests arrive in sequence. Recall that we restrict the route of a flow to be stable once computed. Thus, the network status will remain stable until the next flow arrives or some existing flows finish. For the convenience of modeling, we assume a discrete-time model where time is divided into contiguous timeslots $t = 1, 2, \ldots$, and the $t$-th flow request arrives at the beginning of timeslot $t$. Let $src_t$ be the source node of the $t$-th flow and $dst_t$ be the destination node. Let $K$ be the number of service types and $\eta_t$ be the service type of the $t$-th flow ($1 \le \eta_t \le K$). Let $U_{\eta_t}(\cdot)$ be the utility function of service type $\eta_t$, so $U_{\eta_t}(t)$ is the utility of the $t$-th flow. The objective of our problem is then

$$\max \sum_{t} U_{\eta_t}(t). \quad (1)$$

We propose a multi-agent DRL-based online routing scheme, as shown in Fig. 2. Instead of using only one agent for the whole network, which results in an exponential action space that greatly increases the complexity and degrades the accuracy of routing decision inference [16], our scheme deploys one DRL agent for each node. The agent just needs to decide the next hop of an upcoming flow to form a route. In particular, an agent makes a fast routing decision inference on the arrival of a new flow request, and the results are downloaded into the local data plane of the router so that the packets can be forwarded accordingly. Such a scheme can be realized with the help of the SDN architecture, where the DRL agents are placed in a centralized controller and control routing nodes as described before, while the hop-by-hop route generation process is also suitable for distributed deployment.
A good routing decision is made based on proper inputs, i.e., the agent input state. We assume that the input state can be obtained instantly after a flow request arrives, i.e., in each timeslot. For both DRL-OR and DRL-OR-S, a few features of the current flow request are needed, including the source node $src_t$, destination node $dst_t$, service type $\eta_t$, and maximum data rate demand $demand_t$. For DRL-OR, we need the network topology and link attributes (e.g., capacities and loss rates). These inputs are relatively static and can be obtained once and for all, unless there is a topology change such as a link failure. We also need the current network status, e.g., the residual capacity of each link, which reflects the congestion degree. However, collecting all the features above brings a lot of overhead, and using such a large amount of state features directly as model input makes it difficult to efficiently learn a general and robust routing policy for each agent.
In DRL-OR-S, we adopt a more compact input state design, in which the network status is embedded. In particular, for each node, we compute candidate routes towards the destination node, and use attributes of the routes as input state. Thus, the input state of any node over different network topologies could be in the same form, and as a result, we can design a graph-based neural network model for learning a general routing policy for different agents under different scenarios.
Unlike traditional model-based solutions, our model-free multi-agent DRL routing scheme learns from the interaction with a network environment. To this end, feedback is returned to the agents after an action (routing decision) is taken. In the field of DRL, this feedback is called a reward. In particular, when agent $i$ observes state $s^i_t$ for the $t$-th flow, action $a^i_t$ is taken and reward $r_t$ is returned. In general, the objective of DRL is to maximize the discounted cumulative reward $\sum_{t=1}^{T} \gamma^{t} r_t$, where $\gamma \in [0, 1]$ is the discount factor. In this paper, reward $r_t$ is the utility value of the $t$-th flow.
To train the agents, DRL-OR uses online learning in the running network environment. However, the exploration-and-exploitation procedure of DRL may be time consuming, during which network performance degradation may be incurred. In addition, obtaining the performance metrics of flows is nontrivial. A traditional simulated network environment also suffers from this problem. To address the issue, we develop a deep learning model to construct a training environment and use only offline training in DRL-OR-S. This deep learning model, called an NN-simulator, is trained using data collected from realistic network environments. The NN-simulator accurately infers the flow QoS performance given the flow features and network status. In this way, we can obtain the performance metrics and compute the utility once the routing decision is made in the offline training phase.
In Section V we show that DRL-OR-S can be deployed under the SDN architecture. Moreover, with its highly distributed multi-agent DRL scheme, DRL-OR-S also has great potential for distributed deployment.
III. THE DRL-OR-S SCHEME

In this section, we describe the details of the scalable DRL-based online routing algorithm for multi-type service requirements (DRL-OR-S). First, we present the utility function to evaluate the quality of routing paths for different types of flows. Then, we model the route generation process as a multi-agent Markov decision process and present the learning algorithms of the agents. Finally, we show a novel actor-critic network structure based on a graph neural network (GNN) for approximating the policy functions of the agents.

A. Utility Function
To evaluate the routing paths for flows with different service types, we need to tailor the utility functions properly. We propose a general utility function for flows with various service types by combining multiple performance metrics while also considering fairness among flows. In particular, we consider three performance metrics: throughput, latency, and packet loss ratio. Let $x_t$, $y_t$, and $z_t$ be the throughput, latency, and packet loss ratio of the $t$-th flow, respectively. Our proposed utility function is then

$$U_{\eta_t}(t) = w_1 \log(x_t) - w_2\,(y_t)^{\alpha_1} - w_3\,(z_t)^{\alpha_2}, \quad (2)$$

where $w_1$, $w_2$, and $w_3$ are non-negative scalars that indicate the importance weight of each performance metric. For latency-sensitive flows, $w_1$, $w_2$, and $w_3$ are 0, 1, and 0, respectively. For throughput-sensitive flows, they are 1, 0, and 0. For latency-throughput-sensitive flows, they are 0.5, 0.5, and 0. For latency-loss-sensitive flows, they are 0, 0.5, and 0.5. To promote fairness among flows, we apply the function $\log(\cdot)$ to throughput, and the functions $(\cdot)^{\alpha_1}$ and $(\cdot)^{\alpha_2}$ to latency and packet loss ratio, respectively. The use of $\log(\cdot)$ encourages a balance of throughput among flows. Larger hyper-parameters $\alpha_1$ and $\alpha_2$ (both $\ge 1$) help avoid heavy latency and packet loss and thereby ensure fairness among flows. This fairness-enhanced utility function aligns with previous works such as [15], [22]. Our findings show that setting $\alpha_1 = 1$ and $\alpha_2 = 1$ is sufficient to ensure fairness of flow latency and packet loss in this work.
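For illustration, the following Python sketch shows how the utility in (2) could be evaluated for each service type; the weight table and exponents follow the description above, while the sign convention, function, and variable names are illustrative assumptions rather than our reference implementation.

```python
# Illustrative sketch of the utility in (2); names and the exact sign
# convention are assumptions, not the authors' reference code.
import math

# (w1, w2, w3) per service type, as listed in the text.
SERVICE_WEIGHTS = {
    "latency":            (0.0, 1.0, 0.0),
    "throughput":         (1.0, 0.0, 0.0),
    "latency_throughput": (0.5, 0.5, 0.0),
    "latency_loss":       (0.0, 0.5, 0.5),
}

def utility(service_type, x, y, z, alpha1=1.0, alpha2=1.0):
    """x: throughput ratio, y: latency, z: packet loss ratio
    (all normalized as described later in this subsection)."""
    w1, w2, w3 = SERVICE_WEIGHTS[service_type]
    # Reward throughput (log for fairness), penalize latency and loss.
    return w1 * math.log(max(x, 1e-6)) - w2 * y ** alpha1 - w3 * z ** alpha2
```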
However, the collected raw values of the performance metrics (i.e., $x_t$, $y_t$, and $z_t$) cannot be input into the utility function directly. Different performance metrics, such as throughput and latency, have very different scales. Further, in real networks, the same performance metric (e.g., latency) depends on many factors (e.g., the traffic load of the network) and may change greatly over time. It is not easy to adjust the weights (i.e., $w_1$, $w_2$, and $w_3$) to combine different performance metrics properly. As a result, the scale of the computed utility values also changes greatly over time, which makes it difficult for learning algorithms to converge well.
To solve the problem, we normalize the raw performance metrics before inputting them into the utility function.
The throughput value of a flow can be directly normalized by the flow's maximum data rate demand $demand_t$. We define the throughput ratio as the raw throughput value divided by the corresponding demand $demand_t$. For the latency and packet loss ratio, we apply a self-adaptive normalization mechanism. Specifically, we normalize the raw performance metric value of a target flow by the average performance metric value of the recent flows between the same source and destination nodes. Let $q_t$ ($q_t \in \{y_t, z_t\}$) be the raw performance metric value of the $t$-th flow and $\hat{q}_t$ be the normalized value. Formally, $q_t$ can be normalized by

$$\hat{q}_t = \frac{q_t}{\bar{q}_t}, \quad (3)$$

where $\bar{q}_t$ denotes the average performance metric value of the recent flows with the same source and destination nodes as the $t$-th flow. For convenience, $\bar{q}_t$ is approximated by

$$\bar{q}_t = \varepsilon\, \bar{q}_{t-1} + (1 - \varepsilon)\, q_t, \quad (4)$$

where $\varepsilon \in [0, 1]$ is a constant parameter controlling the update speed of $\bar{q}_t$. Selecting a larger value for $\varepsilon$ can result in a stable average value $\bar{q}_t$ that is less influenced by sudden changes in the network environment, such as traffic bursts causing congestion.
On the other hand, choosing a smaller value for $\varepsilon$ can increase the rate at which the average value is updated. Note that although the update speed of the average value $\bar{q}_t$ may be slower than the network dynamics, the average value of the performance metrics gradually plateaus as the routing strategy model converges. Our findings show that setting $\varepsilon = 0.99$ is sufficient to adapt to network changes and provide a stable average value approximation for good performance metric normalization and model convergence in practice.
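A minimal Python sketch of this self-adaptive normalization, assuming one running average per source-destination pair as in (3)-(4), is shown below; the class and attribute names are illustrative.

```python
# Sketch of the self-adaptive normalization of (3)-(4); class and attribute
# names are illustrative assumptions.
class MetricNormalizer:
    def __init__(self, eps=0.99):
        self.eps = eps          # update speed epsilon in (4)
        self.avg = {}           # running average per (src, dst) pair

    def normalize(self, src, dst, q):
        key = (src, dst)
        prev = self.avg.get(key, q)           # bootstrap with the first sample
        # Exponential moving average of the raw metric, as in (4).
        self.avg[key] = self.eps * prev + (1.0 - self.eps) * q
        # Normalized metric fed into the utility function, as in (3).
        return q / self.avg[key] if self.avg[key] > 0 else 0.0
```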

B. Multi-Agent Learning Algorithm
As described previously, we compute a route for each flow in a hop-by-hop manner to decompose the routing problem into several sub-problems with much smaller state and action spaces. When a flow request reaches a node, the node (also the agent) takes an action to select the best next hop for the flow. The following nodes act in the same way until the flow request reaches the destination node. Finally, we obtain the route for the flow (i.e., a sequence of ordered nodes).
We model the route generation process as a typical MAMDP where each node is an agent. The MAMDP is characterized by a tuple $(S, \{A^i\}_{i \in N}, P, R)$. $S$ is the global state space shared by all the agents $i \in N$, and $A^i$ denotes the action space of agent $i$. The joint action space of all agents is denoted as $A = \Pi_{i \in N} A^i$. Then, $P: S \times A \times S \to [0, 1]$ is the state transition probability of the MAMDP, and $R: S \times A \to \mathbb{R}$ is the function for computing global rewards. We take the utility function of (2) as the reward function. Note that the state space and reward values can be observed globally by each agent in such a MAMDP. Now, the route of a flow can be determined by action $a \in A$ after all the agents take a local action based on the current state $s \in S$ (for notational simplicity, we omit the subscript $t$). Actually, only the agents on the path need to make a decision, and the decision making of an agent depends on the decisions made by the prior agents on the same path. Assume that $\{src, a^1, \ldots, a^{n-1}, dst\}$ is a route determined by action $a$. Then, the process of making action $a$ based on state $s$ can be expressed as

$$\Pr(a \mid s) = \Pr(a^1 \mid s)\,\Pr(a^2 \mid s, a^1)\,\Pr(a^3 \mid s, a^1, a^2)\cdots\Pr(a^n \mid s, a^1, a^2, \ldots, a^{n-1}),$$

where $a^i$ denotes the $i$-th agent's action, which determines the $(i+1)$-th node of the routing path.
To describe the relations between the local action $a^i$ and the actions taken before $a^i$, we introduce a conditional state in addition to the global state $s$. Particularly, the conditional state of an agent records the nodes already used before the current node with respect to a specific flow. In our design, the conditional state takes the form of a zero-one vector of length $|V|$, i.e., the $i$-th item in the vector is 1 if the $i$-th node has been selected to construct the routing path, and 0 otherwise. When making an action for a coming flow request, the agent takes the corresponding conditional state together with the global state as the actual input.
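For illustration, a minimal sketch of constructing this zero-one conditional state is given below; the function name is our own.

```python
# Minimal sketch of the zero-one conditional state: node indices already
# on the partial path are marked with 1.
import numpy as np

def conditional_state(num_nodes, visited_nodes):
    c = np.zeros(num_nodes, dtype=np.float32)
    c[list(visited_nodes)] = 1.0
    return c

# e.g., nodes 0 (source) and 3 already selected on the path:
# conditional_state(6, {0, 3}) -> [1., 0., 0., 1., 0., 0.]
```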
Considering that the state space and the action space in real networks are very large, we adopt neural networks to approximate the policies of agents, which map state input to action output. Formally, the local policy network of agent $i$ can be denoted as $\pi^i_{\theta_i}(a^i \mid s, c^i)$, where $\theta_i$ is the corresponding parameters of the neural network and $c^i$ is the conditional state. Essentially, $\pi^i_{\theta_i}(a^i \mid s, c^i)$ is used to approximate the target conditional probability distribution $\Pr(a^i \mid s, a^1, a^2, \ldots, a^{i-1})$.
In our MAMDP model, the agents need to collaboratively find policies $\pi^i_{\theta_i}$ that maximize the globally averaged long-term reward over the network routing problem. Let $J(\theta)$ be the global objective that needs to be maximized:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \gamma^{t} r_t\right], \quad (5)$$

where $\theta$ is the collection of every agent's policy parameters $\theta_i$ and can be updated by computing the gradient of $J(\theta)$.
The most commonly used policy gradient estimator for reinforcement learning [23] has the form

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, A(s, a)\right], \quad (6)$$

where $\tau$ denotes the state-action pair $(s, a)$ sampled through $\tau \sim p_{\theta}(\tau)$ with respect to $\pi_{\theta}$, and $A(s, a)$ is an advantage value estimator. $A(s, a)$ evaluates how much better taking action $a$ is than taking actions randomly under policy $\pi_{\theta}$ with respect to state $s$ [24]. Particularly, a positive (resp. negative) value of $A(s, a)$ means taking action $a$ is better (resp. worse) than randomly taking actions with respect to the current policy $\pi_{\theta}$. According to (5) and (6), we can derive the policy gradient estimator for each agent's policy parameters as

$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[m^i\, \nabla_{\theta_i} \log \pi^i_{\theta_i}(a^i \mid s, c^i)\, A(s, a)\right]. \quad (7)$$

Compared with the policy gradient estimator in (6), a mask variable $m^i$ is used in (7) to indicate whether node $i$ is on the path. $m^i = 0$ results in $\nabla_{\theta_i} J(\theta) = 0$, which means that node $i$ does not update its policy parameters $\theta_i$ at this step.
Agents can update their policy parameters using the gradients computed by (7) in the same way as typical policy gradient methods. However, policy gradient methods are sensitive to the choice of update step size, which makes it challenging to obtain good results. Besides, they often have very poor sample efficiency and take millions of time steps even for simple tasks. In this paper, we adopt the Proximal Policy Optimization (PPO) algorithm [23], one of the most popular reinforcement learning approaches. In contrast to plain policy gradient methods, PPO has been shown to have better data efficiency and lower training variance. We introduce how to extend PPO to the multi-agent deep reinforcement learning (MADRL) framework of DRL-OR and DRL-OR-S below.
First, PPO reduces sample complexity via so-termed importance sampling. Unlike policy gradient methods that sample the state-action pair $\tau \sim p_\theta(\tau)$ using the newest policy parameters $\theta$, PPO efficiently reuses samples based on old policy parameters $\theta_{old}$ to obtain better data efficiency. By assuming that the state distributions under $\theta$ and $\theta_{old}$ are similar, we can derive the gradient estimator of agent $i$ as

$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta_{old}}(\tau)}\left[m^i\, r(\theta_i)\, \nabla_{\theta_i} \log \pi^i_{\theta_i}(a^i \mid s, c^i)\, A(s, a)\right], \quad (8)$$

where the state-action pair $\tau$ is sampled through $\tau \sim p_{\theta_{old}}(\tau)$, and $r(\theta_i) = \pi^i_{\theta_i}(a^i \mid s, c^i) / \pi^i_{\theta_{i,old}}(a^i \mid s, c^i)$ is the importance weight of samples, i.e., the ratio of the probability under the new and old policies. In the case where $\theta_{i,old}$ equals $\theta_i$, $r(\theta_i)$ becomes 1. The corresponding local objective function of (8) has the form

$$J(\theta_i) = \mathbb{E}_{\tau \sim p_{\theta_{old}}(\tau)}\left[m^i\, r(\theta_i)\, A(s, a)\right]. \quad (9)$$

Note that the distribution of $\pi^i_{\theta_i}(a^i \mid s, c^i)$ should not be too different from $\pi^i_{\theta_{i,old}}(a^i \mid s, c^i)$; otherwise, we need to draw a large number of samples so as to have a good knowledge of the current $J(\theta_i)$.
Second, PPO limits the difference between $\theta_i$ and $\theta_{i,old}$ so that $\theta_i$ does not update too fast. Following PPO, we add a clip function $\mathrm{clip}(\cdot, \cdot, \cdot)$ and a min function $\min(\cdot, \cdot)$ to (9). Then, the new local objective function becomes

$$J(\theta_i) = \mathbb{E}_{\tau \sim p_{\theta_{old}}(\tau)}\left[m^i \min\big(r(\theta_i)\, A(s, a),\ \bar{r}(\theta_i)\, A(s, a)\big)\right], \quad (10)$$

where $\bar{r}(\theta_i) = \mathrm{clip}(r(\theta_i), 1 - \epsilon, 1 + \epsilon)$ and $\epsilon$ is a hyperparameter, set to 0.1 in this paper. Using the gradients of (10), the probability of a good action with $A(s, a) > 0$ won't increase too much and the probability of a bad action with $A(s, a) < 0$ won't decrease too much. In this way, the agents can update their routing policies at a consistent and stable speed. Note that (10) is easy to optimize with gradient descent algorithms.
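As a concrete illustration, the PyTorch sketch below shows how the clipped local objective in (10) could be turned into a training loss for one agent; the function signature, batching, and tensor names are our own illustrative assumptions rather than the exact training code.

```python
# Hedged sketch of the clipped local objective (10) for one agent (PyTorch);
# tensor names and batching are illustrative assumptions.
import torch

def ppo_clip_loss(new_logp, old_logp, advantage, mask, clip_eps=0.1):
    """new_logp/old_logp: log pi_theta(a|s,c) under new/old parameters,
    advantage: A(s, a) estimates, mask: m^i (1 if the agent is on the path)."""
    ratio = torch.exp(new_logp - old_logp)                 # r(theta_i)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the clipped surrogate -> minimize its negative.
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    return -(mask * surrogate).mean()
```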
In the above analysis, the global state is used for decision making (i.e., as one of the inputs of $\pi^i_{\theta_i}$) as well as for the advantage value estimator (i.e., $A(s, a)$). However, for the routing problem in this paper, the global state alone is insufficient for an agent to select the best next hop. Particularly, processed information about neighbor nodes (e.g., the distances and available bandwidth from the neighbor nodes to the destination node) can help an agent select the best next hop, but combining this information into a global state input would induce high computational overhead and make the policy learning algorithm of the agents difficult to converge, especially in large networks. Thus, we use a partial state for each agent (we describe the partial state concretely in Section III-C). The policy function of agent $i$ can then be denoted as $\pi^i_{\theta_i}(a^i \mid s^i, c^i)$, where $s^i$ represents the partial state of agent $i$. Accordingly, the advantage value estimator becomes $A(s^i, a)$, using the partial state $s^i$. Since flow utilities are broadcast to all the agents and the partial state $s^i$ contains enough network status, we can expect $A^i(s^i, a)$ to be an unbiased local estimate of the advantage value $A(s, a)$, computed using generalized advantage estimation (GAE) [25]. Note that with the local advantage value estimator $A^i(s^i, a)$, the scheme has better potential for extended local reward designs to speed up the training process.

C. Actor-Critic Network Design
In DRL-OR-S, we apply a deep learning model to approximate the policy function $\pi^i_{\theta_i}$, and the model parameters can be updated using the gradient estimator described above. To map state input to action output, a direct design is to use one deep neural network (DNN) as the policy function for all types of flows. However, such a simple model design faces two challenges. First, the optimal policy parameters for different types of flows can be highly diverse. Using a completely shared set of policy parameters for different types of flows will lead to conflicting parameter updates, especially when the network topology is large; in such a case, the policy network is difficult to converge. An alternative approach is to use a specialized policy neural network for every type of flow, which, however, results in a huge number of policy parameters and slow convergence. Second, the simple DNN structure is tightly coupled with not only the network topology but also the agent's location in the network. Such a scenario-coupled model greatly limits the generalization and scalability of the learned routing strategy. We now show how DRL-OR-S addresses these two challenges.
Partially Unique Policy Network Structure: We propose a partially unique policy network structure to deal with the high diversity among the optimal policies of flows while still achieving fast model convergence. The policy network structure consists of three parts: an input layer, a general feature extraction mechanism, and a group of specialized policy layers for different types of flow requests. The policy network takes the local state $s^i$ and conditional state $c^i$ of agent $i$ as input and extracts a feature vector. The extracted features are then fed into the specialized policy layer corresponding to the flow's destination and service type, which outputs the action for the corresponding type of flow. The proposed policy network for agent $i$ can be formally expressed as

$$\pi^i_{\theta_i}(a^i \mid s^i, c^i) = h^i_{\eta}\big(g^i(s^i, c^i)\big),$$

where $g^i$ denotes the general feature extraction layers shared by all types of flows, and $h^i_{\eta}$ denotes the specialized policy layer for flow requests with service type $\eta$.
In the earlier version of this work [18], DRL-OR uses a DNN-based partially unique policy network structure for the actor network and a plain DNN for the critic network. In DRL-OR-S, we apply a graph-based actor-critic structure to enable better generalization and scalability. We first present the design of the state input (including the partial state $s^i$ and conditional state $c^i$) and the action output (i.e., $a^i$) of a general agent $i$ for DRL-OR-S. Then we show a novel graph-based actor-critic network structure that enhances the model generalization and scalability.
State Input: In our model design, we assign a state vector to each routing node. The node state is composed of the demand of the coming flow request, the attributes of the possible candidate paths towards the destination node, and a conditional state indicating whether the node has been traversed. In this work, we select the link loss ratio, hop number, and predicted available bandwidth of the shortest path, widest path, and capacity-constrained shortest path [26] as the node state features. For each agent $i$, we add extra indicators to the states of its neighbor nodes, which tell the agent whether the node could be the next hop of a shortest path, widest path, or capacity-constrained shortest path. With the extra indicators, the agent will quickly learn a passable strategy and then explore for a better strategy starting from this baseline. We also normalize the input features into the same scale to make the model easy to converge. Note that this form of input state is suitable for the GNN approach, and we omit the redundant full information of the network topology to reduce the decision-making difficulty, which is critical for a GNN-based MADRL algorithm to learn a general and robust routing policy.
Action Output: The action output for a coming flow request is a vector representing the probability distribution over the neighbor nodes to be selected as the next hop.
Actor-Critic Network Structure: The actor-critic model architecture is shown in Fig. 3. A raw DNN architecture coupled with the network topology cannot adapt to dynamically changing network topologies and lacks scalability as the scale of the network increases. In this paper, we design a novel actor-critic model that generalizes to different network topologies based on the graph attention mechanism [19] and the graph convolutional network [20]. With a simple graph attention mechanism, the agent can select the best next hop according to the state features of its neighbor nodes and itself. A graph convolutional network (GCN) helps estimate the state value $V(s^i)$ from the given local network state features, which is then used to approximate the advantage value $A(s^i, a)$ with the GAE algorithm. Combined with the partially unique network structure, the working process of the actor-critic model is as follows. For each coming flow, agent $i$ first uses the DNN structure $g^i$ with shared parameters to extract the critical features of the input node states:

$$e_j = g^i(s^j, c^j), \quad j \in \{i\} \cup \mathcal{N}(i),$$

where $\mathcal{N}(i)$ denotes the neighbors of node $i$. Then, with the extracted node features of itself and its neighbors, we apply a type-specific graph attention mechanism $h^i_{\eta}$ to decide the next hop for this agent, which is formulated as

$$\alpha_{i,j} = \underset{j \in \mathcal{N}(i)}{\mathrm{softmax}}\Big( (w^i_{\eta})^{\top}\, \sigma\big(\Theta^i_{\eta}\, [e_i \,\|\, e_j]\big) \Big),$$

where $w^i_{\eta}$ and $\Theta^i_{\eta}$ are type-specific parameters of $h^i_{\eta}$ for different types of flow requests, and $\alpha_{i,j}$ is the weight with which agent $i$ selects node $j$ as the next hop.
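The PyTorch sketch below illustrates the idea of a shared feature extractor combined with type-specific attention heads that score each neighbor as a next-hop candidate; the layer sizes and the exact scoring form are illustrative assumptions, not the precise architecture of Fig. 3.

```python
# Minimal sketch of the actor: a shared feature extractor g and a
# type-specific attention head h_eta scoring each neighbor as the next hop.
# Layer sizes and the scoring function are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionActor(nn.Module):
    def __init__(self, state_dim, hidden_dim, num_service_types):
        super().__init__()
        # Shared feature extraction g (common to all service types).
        self.g = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # One attention scorer per service type (partially unique layers).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, 1) for _ in range(num_service_types)])

    def forward(self, self_state, neighbor_states, service_type):
        h_self = self.g(self_state)              # [hidden]
        h_nbrs = self.g(neighbor_states)         # [num_neighbors, hidden]
        pair = torch.cat([h_self.expand_as(h_nbrs), h_nbrs], dim=-1)
        scores = self.heads[service_type](pair).squeeze(-1)
        # alpha_{i,j}: probability of choosing neighbor j as the next hop.
        return F.softmax(scores, dim=-1)
```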
For the critic network, we apply a graph convolutional network [20] to extract the high-level features of the network state and a plain DNN as the readout layer to obtain a local estimate $V(s^i)$ of the state value $V(s)$, and then use GAE to calculate the approximated advantage value $A(s^i, a)$. Note that in this paper, we share the actor-critic network model parameters among agents, i.e., $\pi^i_{\theta_i} = \pi_{\theta}$ and $A^i = A$, to give the model better generalization and scalability. We also note that sharing model parameters ensures a general and consistent policy among the agents.
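For reference, the sketch below shows a standard GAE computation [25] from per-timeslot rewards and critic value estimates; it illustrates the general technique and is not necessarily the exact variant used in our implementation.

```python
# Sketch of generalized advantage estimation (GAE) [25]; a standard
# formulation written in plain Python, with illustrative default parameters.
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t]: reward per timeslot; values[t]: critic estimate V(s_t),
    with one extra bootstrap entry values[T] at the end."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages.append(gae)
    return list(reversed(advantages))
```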
We note that the proposed policy network structure has good generalization and scalability. A new type of flow request can be accommodated by simply adding one more specialized routing layer, and the well-trained feature extraction layers allow the parameters of the newly added layer to be optimized quickly. Compared to the previous version of this work, the agents can share parameters since the new actor-critic model uses the same architecture for different topologies and agents. Thus the agents can generalize to topology changes under failures and other real-world events. Moreover, under the new model and state design, the model does not employ distinct routing decision layers for various destination nodes. In summary, the new GNN-based architecture not only reduces the number of parameters significantly but also generalizes to network changes and naturally has good scalability.

IV. TRAINING PROCESS AND DEPLOYMENT
Although we have proposed a DRL routing scheme with good generalization and scalability, deploying such a DRL routing algorithm in the real world is still challenging. We note that the key requirement for a routing algorithm is reliability, i.e., the routing algorithm should always avoid fatal routing decisions like routing loops. However, the exploration process during training will inevitably generate some dangerous routes. In addition, obtaining the QoS metrics of each coming flow for the DRL reward is also a challenging task. To reduce the training cost and accelerate the training process, we use a deep learning mechanism to design an NN-simulator. The NN-simulator is trained on data collected from the network environment and is used to estimate the flow QoS metrics given flow features and network states. We apply the NN-simulator to build a simulation environment for the DRL agents' offline training process. To ensure the reliability of the routing scheme, we propose a reliable learning scheme for DRL-OR-S. Finally, we discuss the incremental deployment of DRL-OR-S to show the flexibility of the algorithm. We give an overall picture of the training process for DRL-OR-S in Fig. 4.

A. NN-Simulator
As mentioned, the NN-simulator is used to perform a flow-by-flow prediction of latency, throughput, and loss rate based on the coming flow features (i.e., demand and path information) and the network state. In this section, we introduce the design of the NN-simulator in detail, including the internal model, the selection of the loss function, and the working process.
Internal Model Design: The internal model of the NN-simulator uses a sequential neural network structure to map the given information of a certain flow and the current network state to the QoS metrics of the flow. The input of the neural network consists of the states of the links on the flow path and the flow features, both encoded as fixed-size vectors. The link state vector is constructed from the path information of the flow and the current topology characteristics, including the link capacities, link losses, and traversed traffic volume on the link, while the path state vector is initialized from the flow demand. The output of the network is the QoS metrics of the input flow, including latency, throughput, and loss rate. As Fig. 5 shows, the network structure consists of three parts, and the whole model operates flow by flow. The first part is an input and preprocessing layer, which translates the input topology and traffic information into fixed-size feature vectors of link state and path state. The second part is a feature extraction layer, which works much like the hop-by-hop routing process: it updates the path state through a gated recurrent unit (GRU) [27] according to the newly entered link state, hop by hop, following the path of the flow. The third part outputs the predicted flow features and then generates the QoS metrics with the help of a readout layer based on a plain DNN.
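A minimal PyTorch sketch of this hop-by-hop path-state update and readout is shown below; the feature dimensions, readout sizes, and class interface are illustrative assumptions rather than the exact internal model.

```python
# Hedged sketch of the NN-simulator's internal model: a GRU cell updates the
# path state with each traversed link's state, and a readout DNN maps the
# final path state to (latency, throughput, loss). Dimensions are assumptions.
import torch
import torch.nn as nn

class NNSimulator(nn.Module):
    def __init__(self, link_dim, path_dim):
        super().__init__()
        self.cell = nn.GRUCell(link_dim, path_dim)
        self.readout = nn.Sequential(nn.Linear(path_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 3))   # latency, tput, loss

    def forward(self, path_state, link_states):
        # path_state: [1, path_dim], initialized from the flow demand.
        # link_states: [num_hops, link_dim], ordered along the flow's route.
        h = path_state
        for link in link_states:
            h = self.cell(link.unsqueeze(0), h)          # hop-by-hop update
        return self.readout(h)
```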
Loss Function: We need to select an appropriate training loss function to train the internal model of the NN-simulator to model the relationship between the input state and the different target QoS metrics. In the medium-load and heavy-load scenarios we tested, we found that congestion may lead to sudden bursts of latency, which makes it difficult to learn a precise model for the QoS metrics. In our experiments, using the L1 loss function in the training stage largely ignores these abrupt values, while using the L2 loss function makes the model overemphasize these outliers. Therefore, we use the smooth L1 loss function for the NN-simulator as a trade-off:

$$\ell(\hat{u}, u) = \begin{cases} 0.5\,(\hat{u} - u)^2, & \text{if } |\hat{u} - u| < 1, \\ |\hat{u} - u| - 0.5, & \text{otherwise,} \end{cases}$$

where $\hat{u}$ is the predicted QoS metric and $u$ is the ground truth.

Working Process: To train the NN-simulator, we first collect data from the network environment, which includes the flow features (i.e., flow demand and route), the corresponding network state, and the QoS metrics of the flow. Then we apply the collected data to train the internal model of the NN-simulator. Finally, we maintain and update the network state (e.g., link usage) and apply the trained model to estimate the QoS metrics for each coming flow, which forms a flow-level simulator to accelerate the offline training process of DRL-OR-S. It is important to recognize that the characteristics of real-world networks can vary over time as facilities are reconfigured or updated. In order to enable the DRL-OR-S model, trained offline, to adapt to long-term network dynamics (e.g., network facility updates and reconfigurations), which are typically under the control of the network operator, we propose periodically collecting network feature data. This data is used to retrain and update the NN-simulator model parameters. Subsequently, the updated NN-simulator model can be employed to incrementally train the DRL-OR-S agent and update its parameters. This training process generally takes a few hours, which is shorter than the cycle of network facility updates and reconfigurations. By regularly training and updating the models in this manner, we can ensure that the DRL-OR-S agent makes routing decisions based on the current characteristics of the network.
We note that the number of model parameters of our proposed NN-simulator stays the same as the topology scale increases and places no limit on the scale of the flow demand, which indicates better scalability compared to the packet-level simulator.

B. Reliable Learning Mechanisms
It is well known that reinforcement learning follows an exploration-and-exploitation paradigm during the learning stage. That is to say, agents may try some exploratory actions randomly so as to obtain better policy parameters. However, such explorations may cause unsafe routing decisions (e.g., routing loops) and thus lead to terrible routing performance. Moreover, even a trained model can make terrible routing decisions in situations it has not learned. To enhance the reliability of our system and accelerate the training process, we propose two mechanisms.
A DRL agent learning from scratch usually needs millions of iterations to learn a stable policy, and a multi-agent system with a GNN-based, parameter-sharing agent makes the learning process more difficult. To reduce the computational overhead during the offline training stage, we pre-train agents to learn a simple QoS routing strategy with designed local rewards. Then in the formal offline training process, the agents learn from the interaction with the NN-simulator and converge faster based on the prior knowledge obtained by pre-training.
The pre-training stage makes our system converge faster, but it cannot fully guarantee the reliability of our system. Unsafe routing decisions like routing loops can still happen due to action exploration or unexpected changes in the network environment. Inspired by [28], we propose a safe routing mechanism to avoid unsafe routes. Fig. 6 illustrates our proposed safe routing mechanism. After an action is made, a safe routing component checks whether the generated routing path is a legal path (i.e., without loops). If the routing path is safe, it is installed in the data plane directly. If not, the safe routing component turns to the fallback policy to generate a legal routing path for the flow request. Particularly, in this paper, the fallback policy computes the shortest path. In the offline training stage, after the path computation of the fallback policy, the routing nodes are informed of the new routing rules. To discourage unsafe actions output by agents, an extra penalty value is added to the reward. In this way, the model learns to avoid generating unsafe routes without experiencing the terrible congestion caused by bad routing decisions. We also note that the safe routing mechanism is deployed in the online routing stage as well, but without the learning process. With the safe routing mechanism, the model reliability is guaranteed, and our system model can still converge to the optimal policy.
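The following Python sketch illustrates the safe routing check and the fallback to the shortest path; the use of networkx and the function interface are illustrative assumptions rather than our deployed implementation.

```python
# Minimal sketch of the safe routing check: accept an agent-generated path
# only if it is loop-free and connects src to dst, otherwise fall back to
# the shortest path. networkx usage is an illustrative assumption.
import networkx as nx

def safe_route(graph, agent_path, src, dst):
    loop_free = len(agent_path) == len(set(agent_path))
    valid = loop_free and agent_path and agent_path[0] == src and agent_path[-1] == dst
    if valid:
        return agent_path, False           # no fallback triggered
    # Fallback policy: shortest path (a penalty is added to the reward
    # during offline training when this branch is taken).
    return nx.shortest_path(graph, src, dst), True
```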

C. Incremental Deployment
In practice, deploying DRL-OR-S requires both software and hardware updates on routers. Thus, deploying it on a real running network all at once may be difficult and introduce much overhead. Alternatively, we can deploy DRL-OR-S incrementally, which benefits from our distributed system design. Some key routers can be controlled by DRL agents, while the other routers execute original routing algorithms (e.g., OSPF). In this way, the remaining problem is to find the minimal key router set that achieves the maximum network performance improvement. Existing incremental deployment approaches [29], [30], [31] can be easily applied in our scenario. In Section V we show a simple example of incrementally deploying DRL-OR-S on real-world network topologies.
We also note that the observation state of each agent remains unchanged when DRL-OR-S is deployed incrementally. It is assumed that a complete network observation, the same as the one used in full deployment, can be obtained through network monitoring techniques, while only part of the routing nodes can be controlled by DRL-OR-S agents. Therefore, although the controllable routing nodes and candidate paths are limited when DRL-OR-S is deployed incrementally, the routing policy for each agent to select the best next hop remains similar.

V. PERFORMANCE EVALUATION
In the evaluation phase, we aim to answer the following questions:
- How well does the NN-simulator replace the real network environment for MARL (Section V-B)?
- How robust is DRL-OR-S to failures (Section V-C)?
- How does DRL-OR-S perform under different network scenarios (Section V-D)?
- How does DRL-OR-S perform under incremental deployment (Section V-E)?

A. Experiment Setup
We use the SDN architecture for the convenience of constructing the experimental network and implementing the approaches for comparison. We construct an experimental network using Ryu [32] and Mininet [21], which simulates packet-level queuing delay. DRL-OR-S is implemented with PyTorch based on the PPO algorithm [33]. The source code of DRL-OR-S is available at https://github.com/netlab-lcy/DRL-OR-S. In our experiments, an agent needs only 3 ms to select the next hop given an input state.
In the evaluation, we use two real-world network topologies, i.e., Abilene [34] and GEANT [35], with real-world traffic matrices, and a large-scale network topology from the Topology Zoo [36], DialtelecomCz, with a traffic matrix generated by the gravity model [37]. Abilene has 11 nodes and 14 bidirectional links. GEANT is a medium-sized topology with 23 nodes and 37 bidirectional links. DialtelecomCz is a large-scale topology with 106 nodes and 119 bidirectional links. Since the data rate of the platform is limited, the link capacities are set smaller than in reality. In particular, most links in Abilene have a data rate of 10 Mbps, while 1 bottleneck link has a data rate of 2.5 Mbps. We further add a 10% random packet loss to the bottleneck link in Abilene. In GEANT, there are 17 links with a 2.5 Mbps data rate and 20 links with a 10 Mbps data rate, and we select one link of each type to add 10% packet loss. In DialtelecomCz, all the links have a 10 Mbps data rate with no link loss. Each link has a latency of 5 ms.
The flow requests are generated randomly according to the corresponding traffic matrices. In particular, the probability of selecting a source-destination pair is proportional to the traffic volume between the pair in the traffic matrix. The service type of a flow is determined according to the following distribution: a flow is latency-sensitive (type I) with a probability of 0.2, throughput-sensitive (type II) with a probability of 0.3, latency-throughput-sensitive (type III) with a probability of 0.3, and latency-loss-sensitive (type IV) with a probability of 0.2. The sending rates of type I-IV flows are fixed to 100 Kbps, 1500 Kbps, 1500 Kbps, and 500 Kbps, respectively. For Abilene, we specify a light-load scenario where the flow duration is 10 timeslots and a heavy-load scenario where the flow duration is 50 timeslots. For GEANT, we set the flow duration to 15 timeslots. For DialtelecomCz, we set the flow duration to 300 timeslots. The shortest path routing achieves an average max link utilization over time of 0.71 in the light-load scenario on Abilene. A sketch of this flow generation process is given below.

We choose four typical routing algorithms for comparison. First, we evaluate the shortest path routing (SPR), which always uses the route with the least hop number for each flow request. Second, we evaluate a load balancing routing (LBR), which uses the route with the largest residual capacity. Third, we evaluate a QoS routing (QoSR). QoSR uses SPR for a latency-sensitive flow, uses LBR for a throughput-sensitive flow, and computes a capacity-constrained shortest path [26] for a latency-throughput-sensitive flow. In order to optimize network packet loss, composed of both random packet loss on links and congestion packet loss at routers, QoSR uses the capacity-constrained shortest path avoiding lossy links for a latency-loss-sensitive flow. Fourth, we compare our proposed algorithm, DRL-OR-S, with a state-of-the-art traffic engineering algorithm called MARL-GNN-TE [38]. MARL-GNN-TE utilizes graph neural networks and multi-agent reinforcement learning to optimize traffic engineering. To conduct a fair comparison, we implemented and trained the MARL-GNN-TE algorithm using the open-source code and dataset. We then used the trained model to generate optimal link weights for the shortest path routing algorithm, based on the traffic matrices. In addition, we compare DRL-OR-S with DRL-OR [18], the previous version of this paper. We evaluate the latency, throughput ratio, and packet loss ratio of each service type when each approach is used in our experiments.
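The following Python sketch illustrates how flow requests with the above type distribution and sending rates could be sampled; the probabilities and rates follow the text, while the data structures and function names are illustrative assumptions.

```python
# Illustrative sketch of flow request sampling in the experiments; the
# probabilities and rates follow the text, the code itself is an assumption.
import random

SERVICE_TYPES = ["I", "II", "III", "IV"]
TYPE_PROBS    = [0.2, 0.3, 0.3, 0.2]
RATE_KBPS     = {"I": 100, "II": 1500, "III": 1500, "IV": 500}

def sample_flow(traffic_matrix):
    """traffic_matrix: dict mapping (src, dst) -> traffic volume."""
    pairs, volumes = zip(*traffic_matrix.items())
    # Source-destination pair chosen proportionally to the traffic volume.
    src, dst = random.choices(pairs, weights=volumes, k=1)[0]
    service = random.choices(SERVICE_TYPES, weights=TYPE_PROBS, k=1)[0]
    return {"src": src, "dst": dst, "type": service, "rate_kbps": RATE_KBPS[service]}
```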
Before the evaluation phase, we first collect data from the experimental network, where we set SPR as the routing algorithm. Second, we train the NN-simulator using the collected data for the offline training phase. Finally, we use the NN-simulator to train a general DRL-OR-S model on the Abilene topology with changing traffic load levels for the following evaluations in different scenarios.

B. Performance of NN-Simulator
In this section, we evaluate the performance of our NN-simulator in predicting flow-level QoS metrics on the Abilene and GEANT topologies. We collect 100,000 flow samples under shortest path routing on both topologies from the Mininet testbed and then train NN-simulator internal models separately for the two network environments. To evaluate the NN-simulator, we further collect 10,000 flow samples from the Mininet testbed for each network topology. We randomly select 100 sequentially generated flows to illustrate the capability of the NN-simulator in predicting the flow QoS metrics and show the results in Figs. 7 and 8. We find that the NN-simulator accurately predicts the delay, throughput, and packet loss for most of the input flows. We note that because the real network is composed of a set of queues along the flow path and behaves more like a black box, it is difficult to accurately predict the QoS metrics of all flows from the limited input network state. To this end, we design the internal model to learn a distribution similar to the QoS metrics of the real network environments. We further evaluate the NN-simulator over the whole test dataset and show the results in Table II. We find that the mean absolute error (MAE) and mean relative error (MRE) remain low for both Abilene and GEANT. Moreover, the mean value of the predicted results is quite close to the ground truth, which indicates a good similarity to the ground-truth distribution and is important for our DRL-OR-S algorithm.
Finally, we note that our model achieves a speedup of more than 600x on both Abilene and GEANT compared to the packet-level simulator implemented with Mininet. With the help of the NN-simulator, the offline training time is reduced by roughly 30x, from 5 days to 4 hours. We also note that the training processes of the agents are independent of each other and can be easily accelerated by parallelization, which can further reduce the training time and make the benefit of the NN-simulator more pronounced.

C. Robustness
In this section, we evaluate the robustness of DRL-OR-S under single link failure scenarios. We simulate all the single link failure scenarios of the Abilene topology under light-load traffic in the experimental network and evaluate the performance of DRL-OR-S under such failure scenarios. We record the fallback policy trigger ratios of DRL-OR-S under single link failures in Table III. We find that DRL-OR-S avoids routing loops in most failure cases and keeps the fallback policy trigger ratio below 1% even in the worst case. The low fallback policy trigger ratios indicate that the trained model is still reliable when a failure occurs. We also measure the latency, throughput, and packet loss of the four types of flow requests under such failure scenarios and show the results in Fig. 9. The evaluation for each failure scenario takes a few hours to run. The results for each scenario are averaged over 10,000 flow requests. We observe that under certain link failure scenarios, the average flow QoS experiences slight degradation. This occurs because flows that traversed the failed link need to be steered to alternative, non-failed links, resulting in congestion. However, as shown in Fig. 9(b), DRL-OR-S effectively prevents congestion and maintains a throughput ratio exceeding 98% for all four types of flow requests, even during single link failure scenarios. Fig. 9(a) illustrates that flow latency increases when link failures occur, as some flows are forced to choose longer paths to bypass the failed link. Additionally, we observe that packet loss may occur when traffic demands are steered towards congested or lossy links, as depicted in Fig. 9(c). Nevertheless, DRL-OR-S consistently delivers low latency for incoming flows and effectively avoids packet loss for loss-sensitive (type-IV) flows. The evaluation results indicate that DRL-OR-S can still make appropriate routing decisions balancing the different QoS requirements under failures.

D. Generalization
We evaluate the performance metrics of DRL-OR-S in different network scenarios. As mentioned in Section V-A, we first use the general model trained on Abilene and evaluate the performance of DRL-OR-S over different traffic load levels on Abilene. The evaluation of each method under a network scenario takes a few hours to run. Tables IV and V show the results, which are averaged over 10,000 flow requests. We note that QoSR can be seen as an optimal routing strategy under such a light-load scenario. We see that DRL-OR-S learns a near-optimal routing policy in the Abilene light-load scenario, achieving good performance on latency, throughput, and packet loss ratio and causing no routing loops, without the extra online learning process used in DRL-OR. LBR avoids using bottleneck links, leading to higher flow latency. SPR and MARL-GNN-TE sometimes overload bottleneck links, causing link congestion and leading to the worst performance. In the Abilene heavy-load scenario, DRL-OR-S outperforms SPR, LBR, QoSR, and MARL-GNN-TE in terms of every performance metric, which illustrates the advantages of DRL-OR-S in avoiding congestion and satisfying different QoS requirements in a complicated network environment. We suspect that MARL-GNN-TE did not perform effectively in this task because it was designed to optimize overall network congestion for a given traffic matrix, which may not accurately reflect the actual traffic demands in the network at a specific timeslot. In other words, there may be deviations between the traffic matrix and the real traffic demands. We can also find that DRL-OR-S shows a performance gain over DRL-OR on most performance metrics, indicating good generalization of DRL-OR-S. Note that in the heavy-load scenario, LBR is a good routing policy to avoid network congestion, but it is still beaten by DRL-OR-S because such greedy methods do not consider possible future flows.
To further analyze the performance of DRL-OR-S, we present the distribution of QoS metrics under the Abilene heavy-load scenario in Fig. 10. We find that DRL-OR-S provides latency below 100 ms (50 ms) for 95% (85%) of latency-sensitive flows, a throughput ratio over 0.8 (0.9) for 78.6% (67.1%) of flows, and a packet loss ratio below 0.2 (0.1) for 89.2% (80.4%) of flows, outperforming all the baseline algorithms. Our findings show that DRL-OR-S effectively reduces network congestion and improves QoS for flow requests by selecting alternative paths that are not necessarily the shortest. As a result, DRL-OR-S significantly enhances overall network performance with minimal impact on the transmission latency of some flows under heavy-load scenarios. Moreover, our analysis indicates that DRL-OR-S also reduces packet loss ratios for loss-sensitive flows.
We also evaluate the generalization of the trained general DRL-OR-S model on GEANT, an unseen medium-sized topology with many more bottleneck links than Abilene. Even under such a challenging application scenario, DRL-OR-S still performs well in optimizing the critical QoS metrics, as shown in Table IV: it is significantly better than SPR, LBR, MARL-GNN-TE, and DRL-OR, and close to QoSR.
To further evaluate the generalization and scalability of DRL-OR-S, we also apply the trained general agent model parameters to DialtelecomCz, a large-scale network topology with more than 100 nodes. We find that DRL-OR-S still avoids network congestion and best satisfies the requirements of loss-sensitive flows.
We note that in the previous version of this paper, we found that it takes more timeslots for DRL-OR to converge on GEANT since more agents are engaged in routing decisions, and the input state space for GEANT is much larger than Abilene. However, DRL-OR-S could achieve much better performance with much less training overhead using NN-simulator and a highly generalized and scalable model design, which shows the great potential to be deployed in the real world.

E. Effectiveness Under Partial Deployment
To evaluate the potential of incremental deployment, we partially deploy the DRL-OR-S agents on a few critical routing nodes and repeat the experiments. In particular, we consider a simple case in which agents only control the nodes adjacent to the bottleneck links in Abilene and the nodes with degree larger than 4 in GEANT. This way, we only need to deploy two agents in the Abilene scenario and six in the GEANT scenario. In this section, we download the trained general model parameters to the deployed agents and use shortest path routing (SPR) as the routing policy for nodes without agents. The results of partially deployed DRL-OR-S are shown in Tables IV and V (i.e., DRL-OR-S part.). We find that deploying agents to control only the critical routing nodes can already improve performance significantly. Such a simple partial deployment can achieve performance similar to the fully deployed DRL-OR-S in both the light-load and heavy-load scenarios on the Abilene topology. In the GEANT topology, some flows do not traverse the nodes controlled by DRL agents, so the partial deployment is less effective, but it still significantly outperforms the fallback routing policy (i.e., SPR).

VI. RELATED WORK
With the rapid progress of deep learning, machine learning (ML) technologies have been regarded as a promising solution to sophisticated network performance modeling and optimization problems, including TCP congestion control [39] and adaptive video streaming [40]. We focus on ML-based routing in this paper, and briefly classify existing studies based on the ML technology used.
Supervised Learning for Route Generation: Zhuang et al. [41] propose a graph-aware deep learning-based algorithm for route generation in SDN-controlled scenarios. Mao et al. [42] use a deep belief network (DBN) to enable packet-level control and shortest path routing. Geyer et al. [43] propose a distributed routing algorithm using a graph neural network (GNN) to learn shortest path and max-min routing strategies. Kato et al. [44] indicate that generating or selecting routes for the whole network needs a huge and complicated neural network, since the output space increases exponentially with the growth of the network topology. The routing approaches based on supervised learning can learn routing policies for specific networks, but their adaptiveness and reliability under dynamically changing network status have not been validated.
Reinforcement Learning for Routing: Several studies have used model-based Q-learning to design distributed routing algorithms that optimize specific objectives, such as packet delay [12], [45], network lifetime [13], and transmission reliability [14]. Although these approaches have shown good performance in certain scenarios, model-based Q-learning is not capable of solving more complex routing optimization problems. Recent studies have employed reinforcement learning to select routing paths from pre-calculated candidate paths. For example, Rischke et al. [46] use Q-learning to choose the best routing paths in SDN networks, while Casas-Velasco et al. [47] use deep Q-learning. There are also studies that use deep reinforcement learning to determine link weights and then employ conventional routing algorithms, such as shortest path routing [48], [49], [50], [51], [52]. However, existing DRL-based routing algorithms typically train a single agent to make routing decisions, which limits scalability and makes it challenging to generalize to changes in network topologies. To address this limitation, we propose a multi-agent reinforcement learning-based routing algorithm that decomposes the path generation process into hop-by-hop routing decisions. Our graph-based agent model design allows us to share the model parameters among all agents and train the agents to learn a generic and scalable routing policy.
Reinforcement Learning for Traffic Engineering: Recent studies have also utilized deep reinforcement learning to optimize network traffic assignments based on traffic demand matrices in Wide Area Networks (WANs). For example, Xu et al. [15] use DDPG to generate flow split ratios over candidate paths of each flow demand to minimize the maximum link utilization, while Valadarsky et al. [17] use TRPO to determine the best link weights for a load balancing objective. Xu et al. [16] investigate scalability and robustness issues of explicit path-based routing and indicate that the large state and action spaces make it difficult to generate routes for upcoming flows explicitly. More recent works combine graph neural networks and deep reinforcement learning to better abstract the features of the network and learn a generalizable policy for network planning [53] and traffic engineering [38]. However, existing studies usually employ deep reinforcement learning agents in a centralized manner and implicitly control traffic load, which lacks interpretability and may not be suitable for complex network optimization problems. To overcome these limitations, our proposed approach also employs deep reinforcement learning but decomposes the path generation process into hop-by-hop routing decisions. This allows us to tackle more complex routing optimization problems, such as satisfying multi-type service requirements, while enabling distributed and incremental deployment.
Safe Online Reinforcement Learning: There has been much research on the development of safe reinforcement learning. Mao et al. [28] propose a safe online reinforcement learning framework, called training wheels, to solve the load balancing problem for web service requests. Miryoosefi et al. [54] aim at RL tasks with a wide range of constraints and present an algorithmic scheme for these tasks. Achiam et al. [55] propose constrained policy optimization for constrained RL. In this paper, we also introduce a safe online learning framework to make DRL-OR safe and reliable in distributed routing scenarios.

VII. CONCLUSION
In the previous version of this paper [18], we proposed the DRL-OR algorithm, in which each routing node is controlled by a DRL agent and the agent quickly selects the best next hop for coming flow requests. To achieve such a scheme, we designed a general utility function with normalized performance metrics, modeled the route generation process as a multi-agent Markov decision process and developed an efficient learning algorithm, proposed a novel policy network structure, and provided a safe learning mechanism to enhance the reliability of the system. However, DRL-OR inevitably becomes inefficient in large-scale networks and lacks adaptiveness to network dynamics. To solve these problems, in this paper we proposed DRL-OR-S, a highly scalable online routing algorithm using multi-agent deep reinforcement learning for multiple service requirements.
To enhance the scalability of the agent model, we propose a graph-based actor-critic structure, which can not only reduce the parameter size significantly by sharing the same set of parameters among the agents, but also enhance the generalization under network dynamics.
To improve the training efficiency and reduce the cost of real-world implementation, we propose the NN-simulator based on a simple deep learning mechanism and replace the packet-level simulator Mininet with the trained flow-level NN-simulator. We train DRL-OR-S offline and obtain a general model for the online routing process.
Experiment results show that DRL-OR-S with the safe learning technique can well satisfy multiple types of service requirements and has great adaptiveness and reliability in different network scenarios.
Finally, we note that the current DRL-OR-S is still not a flawless solution that generalizes to every application scenario in real-world networks. However, the evaluation results in this paper have shown the great potential of graph-based deep learning models to learn a highly generalizable model for routing scenarios with the help of deep reinforcement learning.