Runtime Adaptation Framework for Fulfilling Availability and Continuity Requirements of Network Services

The Network Function Virtualization (NFV) framework is an enabler for the automation of Network Service (NS) management. In the context of NFV, an NS is realized by interconnecting Virtual Network Functions (VNF) using Virtual Links (VL). Availability and continuity are among the important characteristics of an NS. These characteristics depend on the availability of the VNFs and VLs composing the NS, which are usually selected at NS design time. VNFs and VLs utilize the resources of the underlying infrastructure, and their availability (partially) depends on the availability of these resources. To design an NS to fulfill availability and continuity requirements, the availability required from the resources is constrained at design time. However, the characteristics of these resources may change at runtime due to the dynamicity of the NFV infrastructure, impacting the availability of the VNFs and the VLs, which in turn may impact the availability and continuity of the NS. To fulfill these requirements at runtime despite the changes in the infrastructure, the NS should be adapted. In this paper, we propose a framework for the runtime adaptation of NSs that reacts to changes and adapts the NS configuration so that it can fulfill its availability and continuity requirements during the NS lifetime. We also propose a method to develop machine learning models that are used within the framework to determine the required adjustments at runtime. We implemented the proposed framework, the method for developing the machine learning models, a testbed, and NSs to assess the feasibility and validity of our approach through experiments.


Siamak Azadiabad, Ferhat Khendek, Member, IEEE, and Maria Toeroe

Index Terms—Network service (NS), network function virtualization (NFV), virtual network function (VNF), service availability, service continuity, runtime adaptation, machine learning, deep learning, neural networks.

I. INTRODUCTION
IN THE context of Network Function Virtualization (NFV), a Network Service (NS) consists of Virtual Network Functions (VNF) interconnected by Virtual Links (VL) [1]. A VNF is a software implementation of a network function that utilizes virtual resources (e.g., computing resources) of the underlying infrastructure [2], [3]. The NFV framework manages the virtual resources to support the VNFs and VLs that compose NSs [1]. The NFV framework manages the lifecycle of the VNFs and NSs (e.g., instantiating, scaling, and terminating VNFs and NSs) [4].
In [5], we proposed a design-time approach consisting of analytical methods that refine a given NS design already fulfilling performance requirements, such as throughput, by also translating availability and continuity requirements into appropriate configuration parameters. The availability and continuity of an NS depend on the availability of its constituent VNFs and VLs. The availability of a VNF, in turn, depends on the availability of its application and the availability of the underlying resources. For a given NS design, to fulfill the additional availability and continuity requirements, the availability of the VNF applications and the availability of the VLs, along with the characteristics of the infrastructure resources, such as the availability of the hosts, the virtual machines (VM), and the hypervisors, are taken into account to determine the appropriate values for the different configuration parameters (i.e., the number of standby instances for VNFs and VLs and the health-check and checkpointing intervals of VNFs) to generate a deployment configuration [5]. Thus, at the NS level, the fulfillment of availability and continuity requirements of NSs depends (partially) on the characteristics of the resources at the infrastructure level. As long as the infrastructure resources used for the NS instance and their characteristics remain the same as the selected deployment options, the deployed NS instance will fulfill the availability and service continuity requirements. That is, the characteristics of the selected deployment options serve as constraints for the infrastructure.
However, at runtime, the characteristics of infrastructure resources provided to the NS constituents may change due to different reasons such as failovers, aging, load redistribution, or upgrades. A change of resource characteristics at runtime may affect the availability of VNFs and/or VLs. If a change at runtime deteriorates the availability of some VNFs and/or VLs, the deployed configuration of the NS may not satisfy the same availability and continuity requirements anymore. If the availability of some VNFs and/or VLs improves at runtime, it is also considered a change, even though the NS will satisfy the requested availability and continuity requirements. This is because the deployment configuration may no longer be cost-efficient in fulfilling these requirements. This could be the case when there is no failure for a long enough time, and therefore the failure rate of some/all VNFs and/or VLs decreases. In such a case, the number of standby instances for those VNFs and/or VLs may be in excess. Thus, at runtime, the NS configuration should be adjusted to adapt to the changes in the characteristics of the resources, to continue to fulfill the availability and continuity requirements, and at the same time to avoid unnecessary overprovisioning of resources.
To determine the required adjustments at runtime (i.e., the new values for the configurable parameters of the NS), the analytical methods of the design-time approach described in [5] can be used. However, the execution time of these analytical methods to determine the optimal configuration values for VNFs may not be tolerable at runtime, particularly for large NSs with stringent availability requirements (e.g., ultra-high availability cases). Thus, if we need to adjust the NS quickly after a change is detected, runtime adaptation requires a lightweight method.
In this paper, we propose a runtime approach to maintain the fulfillment of the availability and continuity requirements of NSs that avoids resource overprovisioning. This approach includes an adaptation framework which reacts to changes in the characteristics of underlying infrastructure resources and adjusts the NS as necessary. The approach also includes a method of creating Machine Learning (ML) models to determine the new configuration values at runtime in a timely manner.
The rest of this paper is organized as follows. Section II introduces the context and defines the problem we aim to solve. Section III introduces the proposed framework for runtime adaptation and discusses its compliance with the NFV specifications. Section IV presents our method of machine learning model development for runtime adaptation. Section V discusses the testbed we developed to perform experiments and evaluate our solutions. In Section VI, we discuss and analyze the results of the experiments. In Section VII, we review related work before concluding in Section VIII.

II. CONTEXT AND PROBLEM DEFINITION
In this section, we provide as background a brief overview of the NFV framework and ML approaches. We also elaborate on the problem we aim to solve.

A. NFV Framework
The NFV framework provides VNFs and NSs with virtual resources, but it is not aware of the application/functionality of the VNFs and NSs [1]. The NFV reference architecture proposed by the European Telecommunications Standards Institute (ETSI) is depicted in Fig. 1 [1]. This architecture is composed of three main sets of entities: the NFV Infrastructure (NFVI), the VNFs, and the NFV Management and Orchestration (MANO) [1]. The Operations Support Systems/Business Support Systems (OSS/BSS) and the Element Managers (EM) interact with the ETSI NFV framework, but they are not part of it.
Fig. 1. ETSI NFV reference architecture [1].

MANO: It is responsible for managing and orchestrating the resources that support the NSs and their VNFs and VLs. It consists of three functional blocks: the NFV Orchestrator (NFVO), the VNF Manager (VNFM), and the Virtualized Infrastructure Manager (VIM) [1].
The NFVO is responsible for onboarding NSs and VNFs and managing the lifecycle of NSs [4]. The deployment template of an NS is described by an NS Descriptor (NSD) [6]. An NS is instantiated by the NFVO based on an NS Deployment Flavor (NsDF) [6]. Each NsDF references an NSD and specifies the deployment characteristics of the NS (e.g., the number of instances of VNFs and VLs at different scaling levels of the NS) [6]. The topology of an NS is described as a VNF Forwarding Graph (VNFFG) descriptor which references VNFs [6]. A VNFFG contains one or more Network Forwarding Paths (NFP). An NFP defines an ordered list of connection points associated with VNFs and VLs that form a sequence of network functions [6], [7]. Different NFPs may have some VNFs and/or VLs in common, while not all VNFs and/or VLs of the NS may be involved in every NFP. Thus, an NS may provide multiple functionalities. We assume that each NS functionality is provided through a different NFP.
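To make these descriptor relationships concrete, the following is a minimal, hedged sketch of how an NsDF with its scaling levels and an NFP could be represented; this is an illustrative Python structure with hypothetical identifiers, not the ETSI descriptor schema of [6].

```python
# Illustrative only: a simplified NsDF-like structure (not the ETSI NSD schema).
ns_df = {
    "nsdId": "nsd-example",          # the NSD this deployment flavor references
    "nsDfId": "df-initial",
    "nsScalingLevels": [             # total instances per NS scaling level
        {"level": 1, "vnfInstances": {"VNF1": 2, "VNF2": 4}, "vlInstances": {"VL1": 1}},
        {"level": 2, "vnfInstances": {"VNF1": 3, "VNF2": 8}, "vlInstances": {"VL1": 2}},
    ],
    "vnffg": {
        # each NFP is an ordered list of connection points, i.e., one NS functionality
        "nfps": [["cp-vnf1", "cp-vnf2"]],
    },
}
```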
The VNFM is responsible for the lifecycle management of VNF instances, including instantiating, scaling, and terminating the managed VNFs [4].
The VIM orchestrates the allocation, upgrade, release, and reclamation of the NFVI resources [4]. The VIM also collects performance and fault information on hardware, software, and virtual resources of the NFVI [4].
NFVI: It includes the hardware resources (i.e., general-purpose computing, networking, and storage hardware) at the bottom as well as the virtual resources at the top which can be assigned to VNFs and NSs. It also includes the virtualization layer (e.g., hypervisor, container engine) that supports the management of virtual machines/containers (e.g., create, delete, or resize virtual resources) [1].
VNF: A VNF is a software implementation of a network function that can run on the NFVI [1]. VNFs are the building blocks of NSs. They are interconnected through VLs and consume infrastructure resources [2], [3]. The VNF Descriptor (VNFD) of the VNF determines the VNF deployment and operational requirements [3]. A VNF profile specifies the instantiation information for a specific VNF Deployment Flavor (VnfDF) [3]. A VnfDF references a VNFD and specifies a particular deployment of the VNF [3]. A VNF is composed of at least one VNF Component (VNFC) and zero or more Internal VLs (IntVL) [2], [3]. VNFCs are the actual consumers of infrastructure resources [2], [3]. The VnfDF of a VNF defines the scaling levels, which indicate the number of instances for each VNFC and IntVL of the VNF at the different VNF scaling levels [3].
EM: It manages the application and the functionality of its managed VNF(s). This includes fault, configuration, accounting, performance, and security management [8].
OSS/BSS: In general, the OSS is capable of requesting from the NFVO to onboard, instantiate, alter, or terminate NSs for which it provides NSDs as input. The OSS also manages VNF applications and their functionalities through EMs. The BSS includes systems like billing and customer management to support business management [8].

B. Machine Learning Approaches
There are three main ML approaches [9]: supervised, unsupervised, and reinforcement learning. Each approach targets solving certain types of problems.
For supervised learning, the machine is given a training dataset to learn a model from. Supervised ML models can solve classification problems [9]. For example, an ML model can classify patients into recovered and unrecovered groups based on their symptoms. To do so, the classifier model is trained (before making any predictions) with a set of data showing the recovery condition for different symptoms. Supervised ML models can also solve regression problems in which the output has a numerical value with natural ordering [9]. For example, a model can predict the price of land based on its area, location, and neighborhood. The Artificial Neural Network (ANN) is one of the most used and powerful algorithms to solve classification and regression problems [10]. An ANN is composed of one input layer, one output layer, and one or more hidden layers. Each layer has one or more nodes/neurons. A node in a hidden layer applies a function to the input values it receives through weighted vectors from the nodes of the previous layer [11]. When an ANN is trained, the weight of each vector is adjusted so that for a given input, the ANN can predict an output with an acceptable accuracy [9]. An ANN with one hidden layer is also called a shallow ANN, which can solve simple problems [11]. Deep Neural Networks (DNN) can solve more complicated problems. A DNN is an ANN with more than four hidden layers, and ML with DNNs is called Deep Learning (DL) [11]. To create a DL model, the following steps are taken [12]:
1) Data Collection: Training data can be collected from a real system or generated synthetically [13]. Each record of the dataset has two parts: data or input, and label or output.
2) Data Analysis: The data analysis includes selecting the data features that have the most effect on the target outputs and preprocessing the data [12]. Scaling and normalization are steps of data preprocessing.
3) Model Construction: Model construction includes selecting the hyper-parameter values and training the model [12]. The number of hidden layers and nodes are examples of hyper-parameters.
4) Model Validation: Once the model is created, we should validate its accuracy. We can use some sample data (used or unused during the training phase) to compare the model predictions with the labels of the sample data. If the result is unsatisfactory, we may go back to the first step and try to increase the model's accuracy by adding more training data, selecting better features, and/or tuning hyper-parameters.

In the case of unsupervised learning, the machine is trained using unlabeled/raw data. Thus, it tries to find patterns/structures within the given set of data [9]. The two main unsupervised ML techniques are clustering and dimensionality reduction [9].
A reinforcement learning agent learns how to act/react in a dynamic environment to achieve a goal by receiving feedback for its actions from the environment [9]. Usually, a reinforcement learning agent learns in a real/near-real environment, and it takes longer for a reinforcement learning agent to learn a model compared to the other two approaches.

C. Background and Problem Definition

1) Design-Time Configuration of Network Services to Fulfill Availability and Continuity Requirements:
The availability of a service is defined as the fraction of a given period during which the service is provided [14]. Tenants often express the Required Availability (RA) for an NS functionality as a percentage. For example, an RA of 99.9995% for an NS functionality means that the overall outage time of the NS functionality in a year is required not to exceed 157.68 seconds. Service continuity is another important characteristic of NSs. Service continuity depends on service availability. However, for stateful services, service continuity also depends on the service disruption caused by failures. In [5], we have provided quantitative definitions for service disruption to enable tenants to express their acceptable service disruption requirements and to help NS designers measure them. Service Disruption Time (SDT) is the amount of time for which the service data is lost due to all service outages in a period [5]. Tenants can express their Acceptable SDT (ASDT) for an NS functionality in terms of seconds (e.g., 31.5 seconds per year, equal to 0.0001% of a year). Service Data Disruption (SDD) for an NS functionality is the maximum amount of data lost due to one failure [5]. In other words, it is the maximum service data lost during the time between a failure and the latest committed checkpoint. Tenants can express the Acceptable SDD (ASDD) for an NS functionality in terms of bits (e.g., 1024 b per failure).
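The arithmetic behind these figures is straightforward; the following snippet reproduces the numbers above, assuming a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def yearly_outage_budget(required_availability: float) -> float:
    """Maximum outage time per year implied by an RA value."""
    return SECONDS_PER_YEAR * (1.0 - required_availability)

print(yearly_outage_budget(0.999995))  # 157.68 s for RA = 99.9995%
print(31.5 / SECONDS_PER_YEAR)         # ~1e-06 of a year, i.e., ~0.0001%
```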
The availability and continuity of an NS are impacted by the failure rates, the failure detection time, and the failure recovery time of its constituents [5], [15]. The failure detection time depends on the configured health-check interval, while the failure recovery time depends on the recovery mechanism, which can be a restart recovery or a failover [5], [15]. To perform a failover, some redundancy with a certain number of standby instances is needed [5], [16]. The failure rate of VNFs is derived from the failure rates of the VNF applications and of the utilized resources (i.e., hardware, software, and virtual resources) [5]. In addition to the aforementioned parameters, the continuity of an NS depends also on the checkpointing intervals of stateful VNFs and the networking delay in checkpointing [5]. Usually, an NFVI offers different types/options for hosts, networks, hypervisors, and VM flavors that can be utilized by VNFs and VLs [17]. In [5], we proposed a design-time approach using analytical methods that refines a given NS design to fulfill the tenants' availability and continuity requirements in addition to the other requirements already satisfied. Fig. 2 depicts the overall picture of this approach.
As shown in Fig. 2, the design-time approach takes as input a designed NS, the data rate of each functionality of the NS, and the availability and continuity requirements for each NS functionality. It also requires the estimated failure rate and availability of the VNF applications, the VNF scaling levels and anti-affinity policies, the checkpointing method of the VNFs, the estimated failure rate, availability, and cost of the available host types, and the availability of VLs. The goal of this approach is to determine the optimal NS configuration to meet the availability and continuity requirements. This configuration includes the optimal value of the health-check and checkpointing intervals of the VNFs and the optimal number of standby instances of the VNFs and VLs. This approach also determines the optimal host type for each VNF and puts constraints on the availability and failure rate of the hosts (i.e., physical host, VM, and hypervisor), VLs, and VNF applications. These constraints are called "availability constraints" in [5] and in this paper. For example, the design-time approach may indicate five 9s of availability (i.e., 99.999%) and 2 failures per year as the availability constraints for a physical host.
It is shown in [5] that determining the optimal configuration for VNFs has exponential time complexity in terms of the number of VNFs of the NS when a complete search is used. The complete search determines multiple configurable parameter values of VNFs (e.g., health-check and checkpointing intervals) while it optimizes the resource cost (e.g., network messaging cost). Simulation results have shown that the complete search is not tolerable for NSs with more than six VNFs. Therefore, a heuristic search is proposed in [5]; it determines a near-optimal configuration with quadratic time complexity. It is also shown through simulations that for an NS with 120 instances of VNFs of 20 different types, the execution time of the heuristic search to determine the near-optimal configuration is about 10 minutes. VLs have no configurable parameters, and the design-time approach needs only to determine the number of (standby) instances. Determining the minimum required number of VL instances requires no optimization, since it can be calculated using one mathematical formula, and it can be performed with constant time complexity. In other words, the design-time approach determines, for each VL, the minimum required number of instances such that their overall availability is higher than or equal to the expected availability of the VL. This can be calculated using one mathematical formula [5] which is independent of the size of the NS.
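The formula itself is not reproduced in this paper; a standard parallel-redundancy form consistent with the description above, assuming independent VL instances each with availability a, gives a combined availability of 1 - (1 - a)^n, from which the minimum n follows directly (a sketch under that independence assumption):

```python
import math

def min_vl_instances(a: float, required: float) -> int:
    """Smallest n such that 1 - (1 - a)**n >= required; O(1) in the NS size."""
    return max(1, math.ceil(math.log(1.0 - required) / math.log(1.0 - a)))

print(min_vl_instances(0.9995, 0.999999))  # -> 2 VL instances in total
```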
2) Problem Definition: When an NS is designed to fulfill given availability and service continuity requirements, it will do so upon deployment as long as the availability constraints are respected. At design time, the availability of resources is approximated. As a result, the actual availability may differ at runtime. For example, the failure rate of a host changes over time and increases as the host ages. Therefore, the design-time configuration generation process uses an average as the estimate. In addition, the resources provided to VNFs and VLs can also change at runtime due to different reasons, such as hardware/software upgrades, migration, and failovers. Although the NFVI is expected to respect the availability constraints when a change (e.g., upgrade) is planned, unpredictable changes (e.g., failover due to a failure) may cause a violation at runtime. In these cases, the NS may no longer fulfill its availability and/or service continuity requirements with its current configuration. Therefore, if the actual availability of some resources at runtime is less than the estimated value used at design time, a violation of the availability constraints occurs, and the NS may need to be adapted by adjusting its configuration parameters.
Resource changes can also improve the availability of VNFs and/or VLs (e.g., due to failover to resources of better characteristics). Also, no failure may happen for a long time, which reduces the actual failure rate of resources below the rate that was estimated. In these cases, we may be able to adjust the NS configuration such that the resource cost is decreased (e.g., reducing the number of standby instances), while still fulfilling the availability and service continuity requirements. Some changes at runtime, like hardware/software aging, need periodic adjustments since the change is predictable throughout the NS lifetime. But other changes, like failovers, are not predictable, and the adjustments should be performed as soon as the changes happen. To determine the adjustments at runtime, the same methods of the design-time approach can be used, particularly for periodic adjustments. However, for some NSs (e.g., a large NS with an ultra-high availability requirement), we may need a swifter approach, especially for adjustments after unpredictable changes. As mentioned earlier, the heuristic search proposed in [5] has reduced the time complexity of the design-time approach from exponential to quadratic complexity and made the approach affordable for design time. But for runtime adjustment, this can still be too long for large NSs. At runtime, constant time complexity (i.e., independent of the size of the NS) is preferable, particularly since the size of the NSs in the NFV environment can be considerably large and may also change with scaling.

III. RUNTIME ADAPTATION FRAMEWORK
The proposed runtime adaptation framework consists of a procedure supported by actors in the NFV reference architecture and an Adaptation Module (AM). In this section, we discuss this procedure considering the configurable parameters used to adapt the NS at runtime and propose a placement for the AM in the NFV reference architecture. We also discuss the roles of the actors in the NFV reference architecture with respect to the runtime AM.

A. Configurable Parameters
The runtime adaptation is achieved by adjusting configuration parameters as needed. In this work, we consider that at runtime, the health-check interval, the checkpointing interval, and the number of standby instances can be adjusted to compensate for availability constraint violations or to reduce resource consumption. To determine the new values for these configurable parameters when a resource characteristic changes at runtime, whether due to an upgrade, a failure, or a lack of failures for a prolonged time, we can use the analytical methods of the design-time approach we proposed in [5] or the ML models we propose in this paper.
As in [5], the assumption in this paper is also that health-checks are performed at the VNF application level, and their intervals are configurable. Also, checkpointing is performed at the VNF application level, but the checkpointing intervals may or may not be configurable. In addition, the role of the VNF instances, i.e., whether an instance is active or standby, can be set at the VNF application level, and accordingly, the external VL instances can also be active or standby. The MANO functional blocks are not aware of these application-level parameters and roles.
Although the number of standby instances for the VNFs or VLs is not visible to the MANO, it is aware of the total number of instances (i.e., the sum of all the active and standby instances). The (total) number of instances of VNFs and VLs for each NS scaling level is (pre)defined in the NsDF at design time [6].
The approach proposed in [5] (and the ML models proposed in this paper, which mimic the approach in [5]) determines the optimal NS configuration, which can be used for the design-time configuration or the runtime adaptation. At design time, the optimal configuration can be applied for the application-level parameters (i.e., health-check and checkpointing intervals of the VNFs) and for the number of VNF and VL instances. At runtime, the application-level configuration parameters can be reconfigured with the optimal values determined by the design-time approach, since the MANO is not aware of these parameters and does not restrict them. However, the number of VNF and VL instances of a running NS instance in the context of NFV can be changed only by NS scaling, which may not result in an optimal configuration, or by switching the NS instance to another NsDF, which can result in an optimal configuration. In either case, the MANO only changes the number of instances according to the applicable NsDF. The NS scaling or NsDF change can be triggered by the OSS or by the NFVO itself. The total number of instances for each scaling level of an NsDF is determined at NS design time by the NS designer from the required number of active and standby instances. In the rest of this section, we explain the two possible methods (i.e., NS scaling and NsDF change) to alter the number of standby instances of VNFs at runtime. We also discuss an optimality analysis of these two methods. To alter the number of standby instances of VLs in the context of NFV, similar methods can be used, and a similar optimality analysis applies.
For some availability constraint violations, scaling the NS up to the next scaling level in the NsDF can provide the required number of standby instances that compensate for the violation. The purpose of scaling in this case is to increase the availability of the NS by adding more standby instances, as opposed to increasing the number of active instances to increase the performance. For example, let us assume an NS with two VNFs and three scaling levels as shown in Table I. To meet the performance expectations at different scaling levels, the number of required active instances of each VNF for this NS is set in an initial NsDF as shown in Table I. Note that the NsDF of an NSD only includes the values of the "Total" column.
Furthermore, let us assume that to fulfill an availability expectation, the NS designer has determined the number of required standby instances for each VNF at each scaling level, using the analytical method proposed in [5], and updated the NsDF (or created a new one) by adding the required number of standby instances as shown in Table II.
For this example, the numbers of active and standby instances are configured according to Table II to adapt to performance changes. Now, let us assume that the NS is running at scaling level 2, and an availability constraint is violated. To compensate for this constraint violation, our analytical method [5] determines that the number of standby instances of VNF1 should be increased by one, while it should be increased by two for VNF2. Meaning, in total, VNF1 needs four instances and VNF2 needs thirteen instances. In this case, scaling up to level 3 can provide the required number of instances for both VNFs, and the role of the VNF instances can be set according to the new calculations outside of the scope of the NFVO (or MANO in general). This, however, means that if the NS needs 10 active instances of VNF2 for performance, scaling level 3 cannot provide it under the current circumstances, which require 6 standbys for 8 active instances.
There are also cases for which the predefined scaling levels of a given NsDF do not provide the number of instances that can compensate for an availability constraint violation. For example, consider again the NS of the previous example at scaling level 2. Assume again that an availability constraint is violated and the analytical method [5] determines that five more standby instances should be added to VNF2. So, in total, sixteen instances of VNF2 are needed; this is not supported by the current NsDF. This could also be the case where the NS is at the highest scaling level and we need to add more standby instances, but there is no higher scaling level to switch to. For these cases, the NS can be switched to a different NsDF if one is available for the NS.
Thus, different NsDFs can be designed at design time and onboarded for a given NS. The difference between these NsDFs is in the number of standby instances implied for the same scaling levels of the NS due to the difference in the availability constraints implied towards the underlying virtual resources.
In summary, if runtime adjustment is needed for the number of standby instances of VNFs, at design time the NS designer has two options to support this adjustment:
1) update the NsDF and add the required (standby) VNF instances to the scaling levels, or
2) create one or more NsDFs with the same number of active VNF instances, but a different number of standby instances for different cases of probable adjustments at runtime.
Creating multiple NsDFs is a better solution, since different NsDFs can be created to support all the possible changes at runtime with the exact required number of instances without resource overprovisioning. Scaling an NS in order to adjust the number of standby instances at runtime may not always be an optimal solution. For example, if for the NsDF of Table II the NS is at scaling level 2 and there is a need to add one more standby instance to VNF2 at runtime, scaling to level 3 will add three standby instances. Two of these added standby instances are not necessary. Creating multiple NsDFs with the exact number of required standby instances for different possible adjustments at runtime will avoid this resource overprovisioning. However, not all the existing implementations of the MANO support NsDF change for a running NS, while NS scaling at runtime is currently supported by all of them. The sketch below illustrates the selection logic.
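The following sketch, with illustrative instance counts (Tables I and II are not reproduced here), picks the lowest scaling level that covers the required active-plus-standby totals and falls back to an NsDF switch when none does.

```python
def pick_scaling_level(levels, needed_totals):
    """Lowest scaling level whose per-VNF totals cover the needed instances;
    None means the current NsDF cannot compensate and an NsDF switch is needed."""
    for level in sorted(levels):
        if all(levels[level].get(v, 0) >= n for v, n in needed_totals.items()):
            return level
    return None

# Illustrative totals: after a violation at level 2, VNF1 needs 4 and VNF2
# needs 13 instances in total; level 3 covers both, so scaling up compensates.
levels = {1: {"VNF1": 2, "VNF2": 6}, 2: {"VNF1": 3, "VNF2": 11}, 3: {"VNF1": 5, "VNF2": 14}}
print(pick_scaling_level(levels, {"VNF1": 4, "VNF2": 13}))  # -> 3
print(pick_scaling_level(levels, {"VNF1": 4, "VNF2": 16}))  # -> None (switch NsDF)
```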
Runtime adjustment may also avoid the overprovisioning of resources while meeting the availability and service continuity requirements. In particular, when resources are assigned to compensate for an availability constraint violation, they may become unnecessary once the violation is resolved and can then be released.

B. Runtime Adaptation Procedure
To perform the runtime adaptation, we need to perform the following steps (a minimal sketch of the resulting loop is given at the end of this subsection):
• Step 1: Monitor changes (including resource changes, failures, and/or violations of availability constraints).
• Step 2: Notify the AM about each change or violation.
• Step 3: Determine the new values for the adjustable configuration parameters if there was a change, or periodically even if there was no change. If the new values differ from the current values, there is a need for an adjustment.
• Step 4: Perform the reconfiguration as needed for the adjustment.

Monitoring of changes and events is performed continuously across the whole system in the NFV architecture. However, the MANO does not monitor availability constraint violations, since this is not a functionality required by the current specifications.
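As a rough illustration, the four steps combine into a loop of the following shape; the AM interface used here is a hypothetical placeholder, as the actual interactions are defined in Sections III-C and III-D.

```python
# Sketch of the adaptation loop; am and its methods are hypothetical placeholders.
def adaptation_loop(am, period_s: float):
    while True:
        change = am.wait_for_notification(timeout=period_s)  # Steps 1-2 (None on timeout)
        new_values = am.determine_parameter_values(change)   # Step 3, also run periodically
        if new_values != am.current_parameter_values():
            am.reconfigure(new_values)                       # Step 4
```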

C. Runtime Adaptation Module and Its Placement
As mentioned earlier, the AM can use the analytical methods of the design-time approach described in [5] to determine any required adjustments. In this case, it needs the same inputs as the design-time approach, such as the availability and continuity requirements to be fulfilled and the availability and failure rate of the different types of resources and of the VNF applications. In addition, the AM should be aware of the number of active and standby instances of VNFs and VLs at the different NS scaling levels. This information is generated at design time and can be given to the AM at NS deployment.
Alternatively, the AM can use ML models, as we will propose later in this paper, to determine the required adjustments. In this case, it needs all the inputs mentioned above except the availability and continuity requirements, since the requirements are already implied in the models.
To determine a good placement for the AM in the NFV reference architecture, we need to consider the steps of the runtime adaptation procedure and the capabilities/functionalities of the different entities of the architecture.
Steps 1 and 2 of the runtime adaptation procedure can be performed by existing actors of the NFV system. The AM needs to perform Step 3; no other actor has such a functionality. The AM could also perform Step 4 by itself and/or in collaboration with existing actors of the NFV system. To do so, the AM should be able to perform the following activities.
• Activity 1: Analyze the NS availability and service continuity fulfillment status and determine any required adjustments when there is an availability constraint violation or a possibility to reduce resource consumption.

Therefore, the NFVO and the OSS are suitable candidates to host these two activities. Between the NFVO and the OSS, the OSS is a better candidate, since the OSS is aware of the application level of VNFs and has direct access to the EMs according to the NFV reference architecture. Therefore, we consider the AM as part of or placed within the OSS.

D. Notification and Adaptation Operation Flows
The first step of the four-step runtime adaptation procedure, i.e., monitoring changes in the NFVI resources, can be performed by the VIM, since it is responsible for managing the resources provided to VNFs and VLs. If a change is detected/carried out by the VIM at the hardware or virtualization layer, it can also determine the impacted virtual resources (i.e., the virtual resources assigned to VNFs or VLs).
For the second step, we need a notification flow from the VIM to the AM placed as part of the OSS. For changes that impact virtual resources assigned to VNFs (e.g., VMs and VLs interconnecting VNF components within a VNF instance), the notification flow is depicted in Fig. 3.
In this case, the VIM reports the change to the corresponding VNFM by sending a notification through the Vi-Vnfm reference point (depicted in Fig. 6) if the VNFM has subscribed for such notifications. This notification includes the impacted resource(s) and the change. Based on the impacted resource(s), the VNFM identifies the VNF(s) impacted by the change and reports them together with the change to the NFVO by sending a notification through the Or-Vnfm reference point. Then, the NFVO can determine the impacted NS(s) and report them and the change to the AM through the Os-Ma reference point, which in turn determines for each impacted NS if adjustments are necessary.
Fig. 4 shows the notification flow for changes that impact VLs at the NS level. In this case, the VIM reports the change and the impacted VLs directly to the NFVO through the Or-Vi reference point, which in turn determines the impacted NS(s) and notifies the AM.
As shown in Fig. 6, failures of VNFs, their components, and internal VLs can also be monitored and reported by the EM directly to the AM. Also, hardware resource failures can be monitored by the OSS itself.
In any case, the AM collects information on failures and estimates the actual failure rate of infrastructure resources, VNFs, and VLs. Thus, at any point in time, the AM is able to evaluate whether the availability of an NS and/or its constituents is unchanged, deteriorated, or improved.
According to Step 3 of the adaptation procedure, once the AM identifies a change, it determines the applicable new values for the adjustable configuration parameters. It compares these new values with the current values to identify any adjustments required. Then, according to Step 4 of the adaptation procedure, if changing the scaling level or the NsDF is needed, the AM sends a request to the NFVO.

E. Compliance With the Standards
The communication and messaging between the different entities of the NFV architecture are performed through the reference points shown in Fig. 6. Resource monitoring is supported by the MANO according to the ETSI NFV specifications, and the required APIs and information elements for the change notifications are also defined.
To receive the change notifications for an NS, the OSS needs to subscribe with the NFVO. In turn, the NFVO subscribes with the VIM and the VNFM for their change notifications. The VNFM also subscribes with the corresponding VIM. Note that multiple VIMs and VNFMs may be involved in the same way simultaneously.
The VIM monitors resource changes and notifies the NFVO through the Or-Vi reference point, and the VNFM through the Vi-Vnfm reference point, about the changes they have subscribed for [18], [19].
The information element VirtualisedResourceChangeNotification carries the information about the change, e.g., a failure or an upcoming upgrade. The virtualisedResourceId and the virtualisedResourceGroupId attributes indicate the impacted virtualized resource, and the changedResourceData attribute provides the change information.
In case of a change impacting a VNF, from the above information, the VNFM determines the impacted VNFs and notifies the NFVO about them and the change. The AlarmNotification information element can be used to send an alarm from the VNFM to the NFVO [20]. This includes attributes for the impacted VNF instance ID (i.e., managedObjectId) and the alarm details (i.e., faultDetails).
When the NFVO receives a notification from the VNFM or the VIM, it determines the impacted NS(s) and notifies the OSS and, thus, the AM, which is part of the OSS. The NFVO also uses the AlarmNotification information element to carry the information through the Os-Ma reference point [21].
The OSS can request the NFVO to scale the NS using the ScaleNsRequest operation or to switch it to a different NsDF using the NsUpdateRequest operation of the NS lifecycle management interface [21]. The information element used to provide the details is ScaleNsData in the first case and ChangeNsFlavourData in the second.
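As an illustration, an OSS-side request switching the NS to a different NsDF could carry a payload of the following shape; only the operation and information element names cited above come from the specifications, so the attribute spelling and values shown here are assumptions.

```python
# Hedged sketch of an NS update request switching the NS to another deployment
# flavor; attributes beyond the cited information element names are assumptions.
ns_update_request = {
    "nsInstanceId": "ns-instance-42",          # hypothetical NS instance id
    "updateType": "ChangeNsDf",
    "changeNsFlavourData": {
        "newNsFlavourId": "df-more-standbys",  # an NsDF prepared at design time
    },
}
```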
The communication between the OSS and the EMs is not within the scope of the NFV specifications. However, the configuration of the VNF is part of the activities supported by the OSS and EMs [8].

IV. MACHINE LEARNING MODELS FOR RUNTIME ADAPTATION
As mentioned earlier, the execution time of the analytical methods used at design time introduced in [5] may not be tolerable for large NSs when the approach is used at runtime to determine the required adjustments. Reference [5] shows that the time complexity of determining the optimal NS configuration to meet availability and continuity requirements is exponential in terms of the number of VNFs of the NS (i.e., O(c^n), where c denotes the number of configurable parameters of the NS and n denotes the number of VNFs of the NS). Exponential time complexity is not tolerable for large NSs even for design-time configuration. Therefore, a heuristic algorithm is proposed in [5] which reduces the time complexity to quadratic complexity (i.e., O(n^2), where n denotes the number of VNFs of the NS). The heuristic algorithm solves the execution time problem for the design-time configuration. However, for runtime adaptation, and particularly for ultra-high availability use cases in which very quick adjustment is needed, the execution time of the heuristic algorithm is not tolerable. It is shown in [5] that the execution time of the heuristic algorithm for large NSs is several minutes, which is well above the maximum downtime allowed in one year in the case of ultra-high availability.
A possible solution is to replace those time-complex algorithms/methods with ML models which mimic the design-time approach. Making a prediction at runtime using an ML model has constant time complexity, which makes it independent of the size of the NS and, hence, an ideal solution in terms of time complexity. Thus, the AM can use, for example, a combination of a lightweight analytical method for the redundancy calculation of VLs and ML model(s) to determine the required adjustments for the VNFs.

A. Problem Formulation
The first step of applying ML to networking problems is formulating the problem, which means selecting the ML approach and algorithm that can solve the problem [12]. In our case, we need an ML model capable of predicting numerical values at runtime. A suitable model would determine the adjustments at any point in the NS lifetime.
Considering the three main ML techniques:
• Unsupervised ML is not suitable, as the nature of the problem is neither pattern recognition nor data structuring.
• A reinforcement learning agent is usually trained at runtime. During training, the predictions are of low quality, while we need a solution that can make good predictions from the moment the NS is instantiated. In addition, the reward and value function calculations are likely to be as heavy as (if not heavier than) the analytical model we want to replace.
• Supervised ML can replace the heavy analytical methods with regression models. Supervised models can be constructed at design time and used for prediction at runtime as soon as the NS is instantiated.

Considering the complexity of the analytical methods of [5], DNN-based DL models are considered. The details of the DL runtime adaptation model (DL-RAM) creation method are explained using a sample NS. We first introduce this NS, next we discuss the model construction for it, and then we summarize the process in a method for DL model creation applicable to any NS. We assume that the example NS needs to satisfy, for each of its functionalities, the requirements shown in Table IV.

B. Sample Network Service
This information is an input for creating the DL-RAM together with the details of the NS, including the NFPs, VNFs, VLs, mapping of functionalities to NFPs, NS scaling levels, and the maximum service data rate of each NFP. Some of this information is part of the initial NsDF designed only to meet the performance requirements. The NS scaling levels of the NsDF are shown in Table V.
Information about each VNF is available in the form of the VnfDF, and the characterization of the VNF application and its internal reliability features. As part of the VnfDFs, the VNFCs, IntVLs, and VNF scaling levels of the different VNFs are as follows:

TABLE IV FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS OF THE SAMPLE NS

TABLE V SCALING LEVELS OF THE SAMPLE NS
The application-level information of each VNF of the example NS is provided in Table VII.
For the VNFC applications of each VNF, the availability and Average Failure Rate (AFR) are shown in Table VIII.
The maximum availability and failure rate of VLs (including IntVLs) that the infrastructure can provide are 0.9995 of availability and one failure per year. Finally, the infrastructure available for the deployment is characterized by the different host and network options it offers.

C. Deep Learning Models for the Sample Network Service
The first step to create a DL model is to collect/generate the training dataset and select the data features. For our problem, no data can be collected from a production system before the NS is deployed. We can collect the data in a testbed, or we can generate the training dataset using the analytical methods of the design-time approach in [5]. Ensuring that the data collected from a testbed covers a sufficient range of behaviors, and performing the corresponding evaluations, is a challenge in itself, so we proceed with the second option. Our purpose in creating the DL-RAM is to mimic the behavior of the design-time approach at runtime while taking into account the actual characteristics of the resources.
To generate the training dataset, random values can be generated, among others, for the constrained characteristics of the infrastructure, that is, for the host availability and failure rates, the VL availability, and the network latency and bandwidth. Then, for each set of values (i.e., input feature values), the corresponding adjustable configuration parameter values (i.e., outputs/labels) are determined using the analytical methods so that the availability and continuity requirements are met.
Once we have generated the training dataset, we need to select the data features and preprocess the data. Considering the input and output parameters of the design-time approach, the features shown in Table XI are considered to determine the output parameters shown in Table XII. The two tables also show a sample record of the dataset. The input features considered for the training data structure are the NS scaling level and, for each VNF, the AFR, the Network Latency (NL), and the Bandwidth (BW) of the network used for checkpointing. The outputs are the health-check interval (HI), the checkpointing interval (CpI), and the number of standby instances (SB) for the VNFs. We do not include the availability of VNFs in the input features, because availability and AFR convey the same information: the availability decreases if the AFR increases and vice versa. Having correlated features does not improve the training, it only slows it down. The service availability and continuity requirements are constant for all cases, so there is no need to include them as input features.
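To illustrate this data structure with hypothetical values (Tables XI and XII are not reproduced here), one record for a two-VNF NS could look as follows:

```python
# One hypothetical training record for a two-VNF NS (values are illustrative).
features = {"scaling_level": 2,
            "VNF1_AFR": 2.0, "VNF1_NL_ms": 5.0, "VNF1_BW_mbps": 100.0,
            "VNF2_AFR": 1.5, "VNF2_NL_ms": 8.0, "VNF2_BW_mbps": 100.0}
labels = {"VNF1_HI_ms": 200, "VNF1_CpI_ms": 400, "VNF1_SB": 1,
          "VNF2_HI_ms": 150, "VNF2_CpI_ms": 50,  "VNF2_SB": 2}
```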
The data structure of Table XI and Table XII can be used to solve our problem. However, there might be a possibility for improvement by applying some domain knowledge. According to [5], to determine the number of standby instances of a VNF, the health-check and checkpointing intervals need to be determined first. Hence, instead of a single DL model predicting all output parameters at once, two chained DL models can be used: the first predicts the HI and CpI of the VNFs, and the second uses these predictions as additional inputs to predict the number of standby instances. We expect that applying the domain knowledge this way results in more accurate predictions, since the chained DL models learn the behavior of the analytical methods better. To assess the DL models and the achieved improvement, we generate a training dataset with 75,000 records for the data structure of the single DL model and for the data structures of the two chained DL models.
To simulate runtime changes in the infrastructure, the following input feature changes are considered for the generation of the training datasets:

1) Network Delay and Bandwidth
• To simulate switching to a different network option, for single VNFs (as in the case of a single network interface failover) or for all the VNFs using a given option (as in the case of a router failover)
• To simulate link congestion and router overload: variations in the delay and bandwidth (up to 50%)

2) Host Availability and AFR
• To simulate failover/migration of single VNFCs or all the VNFCs to another host type
• To simulate the change of AFR and availability (changes within the range of the last digit) of single VNFCs or of all the VNFCs using the same host type

3) VL Availability and AFR
• To simulate the change of AFR and availability (changes within the range of the last digit) of VLs and IntVLs

To generate a record of input feature values, a random number of features are selected and changed randomly within the above ranges. Then, the analytical methods are used with this input data to generate the output portion of the record. This process is repeated to generate the target 75,000 records.
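A sketch of this generation loop, using the hypothetical feature names introduced earlier and a stand-in function in place of the analytical methods of [5]:

```python
import random

baseline = {"VNF1_AFR": 2.0, "VNF1_NL_ms": 5.0, "VNF1_BW_mbps": 100.0}  # nominal values (hypothetical)

def design_time_optimal_config(features):
    """Stand-in for the analytical methods of [5], which label each record."""
    return {"VNF1_HI_ms": 200, "VNF1_CpI_ms": 400, "VNF1_SB": 1}  # placeholder output

def perturb(features):
    """Randomly change a random subset of the input features within the ranges above."""
    f = dict(features)
    for name in random.sample(sorted(f), k=random.randint(1, len(f))):
        if name.endswith(("_NL_ms", "_BW_mbps")):
            f[name] *= random.uniform(0.5, 1.5)   # congestion/overload: up to 50%
        else:
            f[name] += random.uniform(-0.1, 0.1)  # AFR: change within the last digit
    return f

dataset = []
for _ in range(75_000):
    x = perturb(baseline)
    dataset.append((x, design_time_optimal_config(x)))
```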
Once all the records are generated using the design-time approach, the dataset is pre-processed according to the methodology described in [11], including encoding categorical data, data scaling, and normalization. Then, all duplicates are removed to ensure there is no overlap between the training and the validation sets. Finally, 10% of the remaining data is set aside for model validation.
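One possible realization of this preprocessing, continuing the sketch above; pandas and scikit-learn are our choice here, and the column names refer to the hypothetical record structure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Merge inputs and labels into one table and drop duplicates so the training
# and validation sets cannot overlap; categorical inputs such as the NS
# scaling level would be encoded first (e.g., with pd.get_dummies).
df = pd.DataFrame([{**x, **y} for x, y in dataset]).drop_duplicates()

feature_cols = ["VNF1_AFR", "VNF1_NL_ms", "VNF1_BW_mbps"]
label_cols = ["VNF1_HI_ms", "VNF1_CpI_ms", "VNF1_SB"]

X = MinMaxScaler().fit_transform(df[feature_cols])  # scaling/normalization
y = df[label_cols].to_numpy()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10)
```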
The construction and validation of the DL models is our next step. We use TensorFlow to create the DL models. TensorFlow is a free and open-source machine learning library widely used for DL model construction [9]. First, we determine the hyper-parameter values to train the DL models. The goal is to achieve a fast-learning model while avoiding overfitting and underfitting [12]. We use the random search method to determine the hyper-parameter values [22]. For the DL models of both data structures, the result is:
• Number of hidden layers = 19
• Number of nodes in each hidden layer = 35
• Learning rate = 0.00003

The number of nodes in the output layer equals the number of output parameters of the data structure of each DL model. In the case of the single DL model, it is three times the number of VNFs in the NS. In the case of the chained DL models, it is twice the number of VNFs for the first DL model and the number of VNFs in the NS for the second DL model. The activation function selected for the hidden layers is the Rectified Linear Unit (ReLU) function, and the output layer uses a linear function since the problem is a regression [9]. The loss function is the mean squared error, while the optimizer is the ADAM (adaptive moment estimation) algorithm [9].
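A minimal TensorFlow/Keras construction matching these hyper-parameters; this is a sketch rather than the prototype code, and the input/output sizes shown assume a hypothetical three-VNF NS.

```python
import tensorflow as tf

def build_dl_ram(n_inputs: int, n_outputs: int) -> tf.keras.Model:
    """DNN with the hyper-parameters found above: 19 hidden layers of 35 ReLU
    nodes, a linear output layer, MSE loss, and Adam with learning rate 3e-5."""
    layers = [tf.keras.Input(shape=(n_inputs,))]
    layers += [tf.keras.layers.Dense(35, activation="relu") for _ in range(19)]
    layers.append(tf.keras.layers.Dense(n_outputs, activation="linear"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00003),
                  loss="mse")
    return model

# Single-model variant for a hypothetical 3-VNF NS: 3 outputs (HI, CpI, SB) per VNF.
model = build_dl_ram(n_inputs=10, n_outputs=9)
# model.fit(X_train, y_train, epochs=20_000, validation_data=(X_val, y_val))
```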
To evaluate the DL models and to compare the two proposed training data structures, we generate datasets for both the single DL model and the chained DL models. The prototypes are implemented in Python using the TensorFlow library. 75,000 records are generated for each case, 90% of which are used for training and 10% for validation. The DL models are trained for 20,000 epochs.
Table XV shows the standard deviation between each output parameter predicted by the prototype of the chained DL models and the respective output value in the validation set (i.e., the optimal value determined by the analytical methods). Considering the average value of each output parameter, the corresponding standard deviation indicates a good prediction. The table also shows that the chained DL models can predict the VNF2_CpI and VNF3_SB values with 100% accuracy. This usually happens if a parameter's value is constant for all the records of the training (and validation) dataset. In this example, the CpI of VNF2 is not configurable (see Table VII), so its value for all the records is 50 ms. Also, the number of standby instances for VNF3 is the same for all records (i.e., one instance) as generated by the design-time approach.
To compare the single DL model and the chained DL models, we focus on the number of standby instances. The validation results for the number of standbys are shown in Table XVI. While the average values are the same for the two solutions, the standard deviations are lower (better) for the chained models, as expected. Therefore, using the single DL model is more likely to result in violating the service availability and continuity requirements and/or in higher resource costs.
Comparing the two data structures for HI and CpI prediction, the results of the chained DL models are slightly better; however, the difference is not statistically significant.

D. Method for Deep Learning Model Construction
When DL models are used for runtime adjustments, e.g., if swift adaptation is required for a large NS, dedicated models need to be constructed for each NS instance, as the models depend on the NS design, the given infrastructure, and the service availability and continuity requirements. These runtime adjustment models include two DL models for adjusting the configuration of the VNFs. In addition, the runtime adjustment models include an analytical model for the redundancy adjustment of VLs. Calculating the number of required standbys for VLs using analytical methods has constant time complexity. Therefore, there is no need for an additional DL model to replace it.
In the previous subsection, we have presented the creation of the DL models for a sample NS. Here we summarize the process for the construction of DL models for any given NS design to be deployed on a given infrastructure to satisfy given availability and service continuity requirements. Starting with the analytical methods used to design such an NS, the steps are the following:
• Step 1: To select the input features for the DL models, identify the input parameters of the analytical methods that can change at runtime and the range within which they can change. Possible input parameters are the NS scaling level and the AFRs, NLs, and BWs of the VNFs. Some of these parameters may not change during runtime. For example, the bandwidth of a network link can be guaranteed by a dedicated link (i.e., a dedicated physical network is used for the link). If an input parameter does not change during runtime, its value will be the same for all records of the training dataset. Therefore, it needs to be removed from the DL models' data structure.
• Step 10: Apply the validation datasets to their respective DL models to validate the models. If the model predictions diverge unacceptably from the values of the output parameters in the validation dataset, the models need to be discarded and the DNNs refined, starting with changing the parameters determined in Steps 6 and 7.

V. PROTOTYPES, CASE STUDIES, AND TESTBED
To evaluate our approaches through experiments, we have prototyped the design-time approach and the runtime adjustment model creation presented here. We have also developed VNFs for two NSs comprising two case studies and created a cloud environment with a prototype of the runtime adaptation framework. In this section, we present the prepared prototypes, case studies, and the testbed.

A. Prototypes
The prototype of the design-time approach was developed in Java and used for simulations as described in [5]. To this prototype, we have added a data generator function to generate a training dataset for the DL models. This function generates different random input values in the range of possible changes (as described in Section IV-C). Then, it determines the values for the corresponding output parameters (i.e., the optimal configuration). The input parameters include random values for the availability and failure rate of hosts, VNFCs, VLs, and IntVLs, as well as for the bandwidth and latency of the networks used for the VNFs' checkpointing. The outputs are the new values for the HI and CpI of the VNFs, and the number of standby instances for the VNFs and VLs.
For the runtime adjustment model creation, we have prototyped the DL model creation method in Python. These DL models are created at design time with the purpose of being used at runtime by the AM. The details of the implementation are described in Sections IV-C and IV-D.
We have also prototyped the AM in Python. The AM receives failure and change notifications from the MANO and the EM, evaluates the availability and service disruption fulfillment of the NS, determines any required adjustments using the DL models, and requests the MANO and the EM to reconfigure the NS if needed. The AM interacts with the MANO and the EM through RESTful APIs.
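The AM code is not reproduced in this paper; as a hedged illustration of its REST-facing shape, a Flask-style notification endpoint could look as follows, where the route, payload fields, and helper function are all assumptions.

```python
from flask import Flask, request

app = Flask(__name__)

def determine_adjustments(notification):
    """Placeholder for DL-RAM inference plus the VL redundancy calculation."""
    return None  # None means no adjustment is needed

@app.route("/am/notifications", methods=["POST"])
def on_change_notification():
    notification = request.get_json()   # change/failure report from the MANO or EM
    new_config = determine_adjustments(notification)
    if new_config is not None:
        pass  # request reconfiguration via the MANO/EM RESTful APIs (omitted)
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```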

B. Case Studies
We have developed different VNFs for the two NSs of the case studies of our experiments. Here, we introduce the NSs, their VNFs, and the infrastructure characteristics.

1) Video Streaming Case Study:
The first NS provides a video streaming functionality. In this case study, we assume that the NS needs to satisfy certain ASDT requirements. We consider two different values of ASDT for two sets of tests. The first ASDT is 120 seconds, i.e., the service disruption time of the NS should remain below 120 seconds. The second ASDT is 180 seconds. The period of each test (i.e., the NS lifetime) is 24 hours. For these requirements, we want to be able to compare the actual SDT of the NS to the ASDT after each test run.

TABLE XVII SCALING LEVELS OF THE VIDEO STREAMING NS
Fig. 9 shows this NS, which has two VNFs and one VL. The decoder VNF decodes a video and sends the stream to the multi-caster VNF. The end users connect to the multi-caster IP address to receive the stream.
We assume this NS has two scaling levels that satisfy certain workloads (i.e., in the initial NsDF). Table XVII shows the required numbers of (active) instances of the VNFs for each NS scaling level.
The decoder and multi-caster VNFs have one VNFC and zero IntVL. Also, each VNF has one VNF scaling level with one VNFC instance. We have created VM images for the VNFCs of both VNFs using Ubuntu Server 20.04. Ubuntu is a distribution of Linux available at [23].
The decoder VNF uses the FFmpeg software to provide its functionality. FFmpeg is an open-source software to record, convert, and stream video and audio [24]. FFmpeg supports multiple communication protocols for sending the video stream to one or more predefined receivers. In our experiments, the Real-Time Messaging Protocol (RTMP) is used. Each instance of the decoder VNF can send the stream to one or two multi-caster VNFs. Note that FFmpeg does not send the stream directly to the end users because the list of end users is not predefined and can change anytime. An illustrative invocation is sketched below.
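For illustration, a decoder instance might launch FFmpeg from Python roughly as follows; the input file, codec choices, and RTMP URL are assumptions, not the exact command of our prototype.

    import subprocess

    # Launch FFmpeg to decode a local file and push it over RTMP.
    subprocess.Popen([
        "ffmpeg", "-re", "-i", "input.mp4",   # -re: read input at native rate
        "-c:v", "libx264", "-c:a", "aac",     # re-encode video and audio
        "-f", "flv", "rtmp://MULTICASTER_IP/live/stream",
    ])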
The multi-caster VNF uses the Nginx software to make the video stream available to the end users. Nginx is an open-source software used as a (generic) proxy server [25]. We configure Nginx to receive a stream and multicast it on its IP address using RTMP. End users can use video player software that supports RTMP (e.g., the VLC media player) to connect to an instance of the multi-caster VNF and receive the video stream.
To add the capability of N+M redundancy to both VNFs, we developed a Python program called HA (High Availability). All instances of each VNF run this program, which is responsible for:
• Health-check messaging between the instances of the same VNF (sketched below). The HA program uses client/server TCP sockets to implement this functionality. Each instance sends a health-check message to all other instances of the VNF every HI. The value of HI is stored in a configuration file and is reconfigurable at runtime.
• Virtual IP handling. The IP address of the active instances of the multi-caster VNF should be known and accessible to the external network (e.g., for the end users). The primary IP address of a VNF instance may change during the NS lifetime (e.g., due to a failover). Therefore, a constant virtual IP is assigned as a secondary IP address to an active VNF instance, and this address is removed when the instance becomes standby. The external network knows the list of virtual IPs.
The decoder VNF is stateful and uses checkpointing. We also developed a Python program for this purpose. It:
• Uses client/server TCP sockets to send checkpoint data from the active instances to the standby instances.
• Stores the checkpoint data in a file on reception.
• Restores the latest checkpoint when a standby instance becomes active.
The checkpointing program reads the FFmpeg state (i.e., the checkpoint data) every CpI and sends it to the standby instances. The value of CpI is stored in a configuration file and is reconfigurable at runtime. The multi-caster VNF is stateless; therefore, it uses no checkpointing.
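The following is a minimal sketch of the HA program's health-check sender, assuming the HI is read from a plain-text configuration file; the peer list, file name, and message format are hypothetical.

    import socket
    import time

    CONFIG_FILE = "ha.conf"                              # hypothetical file name
    PEERS = [("10.0.0.11", 5000), ("10.0.0.12", 5000)]   # other VNF instances

    def read_hi(default=1.0):
        # Re-read the HI from the configuration file on every cycle so that it
        # stays reconfigurable at runtime, as described above.
        try:
            with open(CONFIG_FILE) as f:
                for line in f:
                    if line.startswith("HI="):
                        return float(line.strip().split("=", 1)[1])
        except OSError:
            pass
        return default

    while True:
        for peer in PEERS:
            try:
                with socket.create_connection(peer, timeout=1) as s:
                    s.sendall(b"HEALTH_CHECK")
            except OSError:
                pass   # missed messages are how peers detect a failure
        time.sleep(read_hi())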
Table XVIII shows the application-level information of the decoder and multi-caster VNFs. The availability and AFRs of the VNFC applications of each VNF are shown in Table XIX.
The infrastructure available for our experiments provides one host type and one network option. We use dedicated physical hosts to provide resources to the VNFs. To change the availability and failure rates of the different VNFs in each experiment, we inject host failures during the different test runs. The estimated availability of the hosts used at design time is 0.9999, and their AFR is 2 per 24 hours.
The physical network of the infrastructure is not dedicated to our experiments; therefore, intentional failures are not permitted. Hence, we assume that the availability of the network is 100%.
First, using the prototype of the design-time approach, we determine the optimal configuration for the NS instantiation given the ASDT requirements. For the set of tests in which the ASDT requirement is 120 seconds, the optimal configuration is shown in Table XX.
For the set of tests where the requirement is ASDT=180s, the optimal configuration is in Table XXI.
We also create new NsDFs with three NS scaling levels (compared to the scaling levels of the initial NsDF). Table XXII shows the number of VNF instances for each NS scaling level of the new NsDF for ASDT=120s. Table XXIII shows the same information for the NsDF for ASDT=180s.
In our experiments, the NS is instantiated using the new NsDF supporting the required ASDT.

2) Web Service Case Study:
The second case study is a Web-based ad-post network service. This NS provides a Web interface for users to post their ads or search for published items on the website. To experiment with a different type of tenant requirement, satisfying the RA is considered in this second case study. Again, two different RA values are selected for two different sets of experiments: RA=0.9995 and RA=0.999.

TABLE XXIV SCALING LEVELS OF THE WEB SERVICE NS
The period of each test run is again 24 hours. The goal of this case study is to be able to compare the availability of the NS achieved in each test run with the RA. Note that service continuity is not part of the tenant requirements in this case study.
Fig. 10 shows this NS, which has two VNFs and one VL. The webserver VNF provides the Web interface, stores the posted ad data in the database VNF, and searches for items in the database if a search request is received from an end user. We have also developed a traffic generator for this case study simulating the end users, which sends HTTP requests to the webserver VNF (with configurable intervals) to post ads or search for items.
Table XXIV shows the scaling levels of the NS with the numbers of (active) instances of each VNF that satisfy certain performance requirements (i.e., in the initial NsDF).
Similar to the NS of the first case study, both VNFs of this NS have one VNFC and zero IntVL. Also, each VNF has one VNF scaling level with one instance of the respective VNFC. Ubuntu Server 20.04 is used to create a VM image for the VNFC of each VNF.
The database VNF uses MariaDB, an open-source relational database management software [26]. The webserver VNF uses the Apache software to provide its functionality [27]. The Web interface is developed in PHP and is accessible using a Web browser.
To add the capability of N+M redundancy to both VNFs, we use the same Python program (i.e., HA) introduced in the first case study. The only difference is that in this second use case (i.e., the Web service NS), the HA program assigns virtual IP addresses to the active instances of both VNFs. This is because the external network needs to know the (secondary) IP address of the webserver VNF instances (as in the first use case), and, in addition, the (secondary) IP address of the database VNF instances needs to be predefined for the webserver VNF instances.
Each instance of the database VNF can serve one or two instances of the webserver VNF. However, each webserver VNF instance has access to the database VNF as a whole. When there are multiple instances of the database VNF, they need to synchronize their records. Galera is used to create a cluster of MariaDB database instances that supports data synchronization between the multiple instances [28].
Table XXV shows the application-level information of the database and webserver VNFs for this case study. Since ASDT and ASDD are not part of the requirements, checkpointing is not used in these experiments. Therefore, the related characteristics are not included in the table.
The availability and AFRs of the VNFC applications of the VNFs are shown in Table XXVI. For this case study, we use the same infrastructure as for the first case study.
We determine the optimal configuration for the NS instantiation again using the prototype of the design-time approach. The optimal configuration for RA=0.9995 is shown in Table XXVII.
For RA=0.999, the optimal configuration is shown in Table XXVIII.
We also create new NsDFs with three NS scaling levels for each of the RAs. Table XXIX shows the number of VNF instances for each NS scaling level of the new NsDF for RA=0.9995. For RA=0.999, no scaling level needed to be added, as shown in Table XXX. However, the initial NsDF is updated to add the required number of standby instances to each NS scaling level.

C. Testbed

To perform our validation experiments, we need an NFV-compliant testbed to instantiate and manage the NSs. The testbed includes an EM, an NFV cloud (i.e., NFVI with MANO), a fault injector, local monitors, and local log collectors. In this sub-section, we introduce these components.
1) Element Manager: We have created an EM, which is a VNF itself and has one VNFC. The EM application is prototyped in Python. Ubuntu Server 20.04 is used to create its VM image.
The EM can (re)configure the VNF instances by updating the values of the HI, the CpI, and the number of active (N) and standby (M) instances stored in the configuration file of the different VNF instances. Each VNF instance periodically checks for changes in the content of its configuration file and acts accordingly. The interval of this check is one HI by default; however, it is reconfigurable. Changing the value of N or M can cause a change of the roles of the VNF instances. The EM provides RESTful APIs for the (re)configuration of the VNFs, sketched below. It also monitors VNF failures and notifies the AM when they happen. The monitoring capability of the EM is based on Zabbix, an open-source software for monitoring network nodes, servers, and applications [29].
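A minimal sketch of such a reconfiguration endpoint, assuming Flask and a hypothetical per-instance configuration file path and field names:

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/vnf/<instance_id>/config", methods=["PUT"])
    def reconfigure(instance_id):
        cfg = request.get_json()   # e.g., {"HI": 2.0, "CpI": 5.0, "N": 2, "M": 1}
        # Rewrite the instance's configuration file; the instance polls the file
        # (every HI by default) and applies the new values.
        with open(f"/var/em/{instance_id}/ha.conf", "w") as f:
            for key in ("HI", "CpI", "N", "M"):
                if key in cfg:
                    f.write(f"{key}={cfg[key]}\n")
        return {"status": "updated"}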
2) Fault Injector: To test the runtime adaptation framework, we need to change the failure rate of hosts and VNFs during runtime. A fault injector program is developed in Python, which can inject a predefined or random number of failures into any host and/or VNF. The fault injector reboots the operating system of a host to fail the host, or the operating system of the VNFC of a VNF instance to fail the VNF instance, as sketched below.
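A minimal sketch of the injection mechanism, assuming remote reboot over SSH; the target names and credentials are hypothetical.

    import random
    import subprocess

    TARGETS = ["host-1", "host-2", "vnfc-decoder-1"]   # hypothetical targets

    def inject_failure(target=None):
        # Rebooting the target's operating system emulates a host or VNFC
        # failure; the recovery is then handled by the NS itself.
        target = target or random.choice(TARGETS)
        subprocess.run(["ssh", f"admin@{target}", "sudo", "reboot"], check=False)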
3) NFV Cloud: We use OpenStack and Tacker to create a real NFV environment for our experiments. Considering the ETSI NFV architecture, OpenStack plays the role of the VIM and manages the NFVI. Tacker is a software module that adds the VNFM and NFVO functionalities to the OpenStack controller. The OpenStack modules used in our implementation are:
• Keystone: the identification, authentication, and authorization manager.
• Neutron: the network manager, which creates and manages the virtual links.
• Glance: manages the store of VM images for the VNF components.
• Nova: manages the life cycle of the VM instances.
• Placement: provides a resource inventory and helps Nova with instance scheduling and resource optimization.
• Heat: enables the orchestration of composite cloud applications using declarative templates.
• Ceilometer: a monitoring service that collects, normalizes, and transforms event data produced by other OpenStack services.
• Aodh: triggers actions based on rules evaluated against the event data collected by Ceilometer.
• Gnocchi: provides a scalable means of storing the data collected by Ceilometer.
• Mistral: a workflow manager that defines and executes multi-step tasks.
• Barbican: an encryption key management service.
Fig. 11 shows the video streaming NS instantiated on our testbed created using OpenStack and Tacker. As shown in the figure, the NFVI is deployed on four physical servers (HW SRV), the MANO runs on an additional physical server, while the AM and the fault injector run on a personal computer (PC) as part of the OSS.
The Yoga version of Tacker implements the features of the NFVO and the VNFM according to the ETSI NFV Release 2 specifications, and it is suitable for testing our approaches.

4) Local Monitor and Log Collector:
To be able to evaluate the achieved availability and service continuity characteristics, local monitoring programs and log collectors are developed for the VNFs to collect data about the actual service outage and recovery times for failures, to be analyzed after each test period.
Each VNFC instance of each VNF instance is associated with a local monitoring program and a log collector. The local monitoring program notifies the log collector of the VNFC instance when the application starts and stops functioning (e.g., the FFmpeg application starts or stops decoding and streaming the video). The log collector writes these events into a local file. We can download the log file at any time and analyze the log records, as sketched below. The local monitoring program of the VNFC instances also notifies the local log collector when the operating system of the VNFC instance is rebooted. In addition, the traffic generator of the second use case logs the result of each request it sends (i.e., success or failure).
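A minimal sketch of the post-test analysis, assuming a hypothetical "timestamp,EVENT" log format; it sums the per-failure disruption intervals between STOP and the following START events.

    from datetime import datetime

    def total_sdt(log_path):
        # Sum the intervals during which the application was not functioning.
        sdt, stopped_at = 0.0, None
        with open(log_path) as f:
            for line in f:
                ts_str, event = line.strip().split(",", 1)
                ts = datetime.fromisoformat(ts_str)
                if event == "STOP":
                    stopped_at = ts
                elif event == "START" and stopped_at is not None:
                    sdt += (ts - stopped_at).total_seconds()
                    stopped_at = None
        return sdt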

VI. EXPERIMENTAL RESULTS
In this section, we introduce the objectives of our experiments, discuss their limitations, and present the results.

A. Objectives of the Experiments
The main objectives of our experiments are as follows:
• Demonstrating the feasibility of the design-time and runtime approaches.
• Evaluating whether the optimal configuration determined using the design-time approach for the NS indeed fulfills the service availability and continuity requirements when the availability characteristics of the resources at runtime do not deteriorate compared to the values approximated at design time.
• Evaluating whether the same service availability and continuity requirements are fulfilled at runtime when the availability characteristics of the resources change over time, and the runtime AM reconfigures the NS.

B. Limitations of Performed Experiments
Although the main ideas of the design-time and runtime approaches have been tested through different experiments, we faced some limitations in performing our tests. The first limitation was the capacity of the hardware resources we had for the experiments. The physical servers available for the experiments had four CPU cores each. Therefore, each server could host up to three VNFC instances, because to provide the minimum performance, each VNFC instance and the host operating system needed at least one CPU core. Also, four physical servers were available to host all the VNF instances. As a result, the maximum number of VNF instances with one VNFC instance each was twelve, for both VNFs of both NSs. Thus, we could not perform experiments for large NS instances with the existing infrastructure resources.
The limited capacity of the hosts also constrained the performance of the VNFs. For example, an active instance of the decoder VNF consumes almost 100% of one CPU core to decode a video. This VNF instance also needs to execute health-checking and checkpointing, which are CPU-intensive operations for very low HIs and CpIs. Therefore, we could not consider stringent values for the service availability and continuity requirements in our tests.
The second limitation we faced was related to Tacker and OpenStack. The latest version of Tacker and OpenStack (i.e., Yoga) was used to implement the MANO functional blocks. This version of Tacker implements the features of the NFVO and the VNFM according to the ETSI NFV Release 2 specifications. However, the Yoga version of Tacker does not support changing the NsDF at runtime for a running NS. As explained in Section III, the runtime AM may try to change the number of standby instances of the VNFs at runtime by scaling the NS, by changing the NsDF, or by using both operations. Tacker currently supports only NS scaling, but no NsDF change.

Despite the abovementioned limitations, we have performed a good number of diverse experiments to assess our approaches. As we show in the next sub-section, the results confirm the feasibility, applicability, and (to some extent) the validity of our approaches.

C. Experimental Results
We have performed three sets of experiments:

1) Unchanged AFR: The goal of the first set of experiments is to assess the design-time approach. Therefore, we keep the AFR of the VNFs at runtime the same as the values used at design time. Also, no runtime adaptation is applied during these experiments.

2) Increasing AFR
For the second set of experiments, we randomly increase the AFR of the VNFs at runtime and expect the runtime AM to adapt the NS accordingly. The goal is to evaluate whether the availability and continuity requirements are satisfied in such cases.

3) Decreasing AFR
For the last set of experiments, the AFR of the VNFs is decreased at runtime to check whether the runtime AM can optimize the resource usage.

Table XXXI shows, for each NS, the number of experiments performed for each set of tests, the requirement type, and the requirement values. Each individual test run lasted 24 hours.

1) Results of Assessing the Design-Time Approach: Table XXXII shows the results of the experiments conducted to evaluate the design-time approach, in which the AFR of the VNFs at runtime is equal to the values used in the calculations at design time. Therefore, in this set of experiments, the AM is disabled. We perform an experiment for each RA/ASDT value to be fulfilled in a single test run.
The AFR used at design time for all VNFs in all cases is approximately the same: three failures per 24 hours. This value is calculated based on the failure rates of the resources and the VNFC applications (as explained in Section V-B) and using the methods proposed in [30]. In all test runs, only one failure is injected at a time, i.e., simultaneous failures are not considered.
Table XXXII shows the Service disruption time for each VNF failure (Sdt) of the video streaming NS, and the Outage time for each VNF failure (Ot) of the Web service NS. In the case of the video streaming NS, the first VNF in the table is the decoder VNF, and the second is the multi-caster VNF. For the Web service NS, the first VNF is the database VNF and the second is the webserver VNF. Each numbered column of a VNF (e.g., #1) indicates a failure injected into the given VNF. Thus, the total number of VNF failures of the NS for each experiment is six (three failures per VNF). The average Sdt or Ot per failure is also shown for each experiment. In the last column, the values in green show the overall SDT or Availability (A) the NS has experienced during the test period. The table shows that the requirements are met in each experiment (i.e., SDT < ASDT or A >= RA).
The differences between the ASDT and the SDT, as well as between the RA and A, in Table XXXII are rather significant, and one might question whether such margins are necessary. The margins can be significant because the design-time and runtime approaches guarantee the fulfillment of the service availability and continuity requirements considering the worst-case scenario. However, since failures can happen at different points in time between consecutive health-checks or checkpoints, the Sdts and Ots in Table XXXII do not represent the worst-case Sdt or Ot, and we observe different service outage and disruption times for the different failures in our experiments.
If all failures in all these experiments happened right after a health-check message was sent and before a checkpoint was ready to be sent, we would have the maximum possible Sdt or Ot for all failures, as shown in Table XXXIII. The maximum possible Ot due to a failure is the sum of the current HI, the timeout, the takeover time, and the failover time. The maximum possible Sdt due to a failure is the sum of the current HI, the timeout, the takeover time, the failover time, and the CpI. As shown in Table XXXIII, the maximum possible Sdt/Ot is the same for each test run throughout the experiment, since the AM is disabled for this group of experiments and the HI and CpI are not adjusted at runtime.
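Stated compactly (the symbols $T_{to}$, $T_{take}$, and $T_{fo}$ for the timeout, takeover, and failover times are our shorthand, introduced here for readability):

$Ot_{max} = HI + T_{to} + T_{take} + T_{fo}$

$Sdt_{max} = HI + T_{to} + T_{take} + T_{fo} + CpI = Ot_{max} + CpI$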
Table XXXIII also shows that the requirements are satisfied for each experiment considering the worst-case scenario, and that the maximum overall SDT and the minimum availability values are close to the respective ASDT/RA values of the experiments. This is expected, as our approaches minimize the cost while guaranteeing the satisfaction of the requirements for the worst case, which is what we consider the optimal configuration.
2) Results of Assessing the Runtime Adaptation Approach for Increasing AFR: Table XXXIV shows the results of the second set of experiments, in which the AFR of the VNFs increases at runtime. These experiments are performed four times for each requirement value (i.e., four test runs per experiment). For each test run, a random number of failures is injected into each VNF at runtime. As shown in Table XXXIV, the ASDTs or RAs of all the respective test runs are satisfied. The table also shows that as the total number of failures increases for an NS with a given requirement, the average Sdt or Ot per failure decreases because of the NS adjustments performed by the runtime AM. For example, for the experiment with the video streaming NS with ASDT=120s (i.e., the first four rows in Table XXXIV), the lowest average Sdt belongs to the first test run, which has the highest total number of VNF failures (i.e., 10 failures) compared to the other test runs (i.e., the second, the third, and the fourth).
Table XXXIV also shows that the Sdt/Ot of each failure of a test run differs from the Sdt/Ot of the other failures of the same test run. This is because 1) a random failure can happen at any point between consecutive health-checks and checkpoints, and 2) when a failure happens, it is considered a change and the AM adapts the VNFs accordingly, meaning that the health-check and checkpointing intervals may change after a failure and hence have different values when the next failure happens.
In this set of experiments, scaling up of the NS is observed for all test runs except the second and the fifteenth. Since the AFR increased (compared to the design-time AFRs) in all the test runs, no scaling down occurred.
Table XXXV shows the maximum possible Sdt/Ot for this second set of experiments, and that in these worst cases the ASDT/RA would still be met. The maximum possible overall SDT/minimum possible overall availability of each test run is close to the value of the respective requirement. Thus, the optimal configurations are determined and applied by the runtime AM.
Table XXXV also gives a better view of how the HI and CpI of the VNFs change with the VNF failures and the changes in the AFR of the VNFs at runtime in each test run, since the maximum possible Sdt/Ot is affected by the current HI and CpI of the VNFs when a failure happens.
3) Results of Assessing the Runtime Adaptation Approach for Decreasing AFR: The focus of the third set of experiments is to check the runtime adjustments performed to improve resource efficiency, particularly with respect to computing resources. As for the first set, we run an experiment for each ASDT/RA to be fulfilled in a single test run. For these test runs, we inject failures so that the actual failure rate of each VNF is less than the estimate used at design time (i.e., AFR < 3). The AM evaluates whether adjustments to the NS are needed after each change (i.e., failure) and also periodically, every hour. For each test run, we inject the first failure randomly at around the second hour of the test run, so that the predicted AFR increases drastically and the AM scales up the NS. Then, the rest of the failures are injected randomly in the remaining time of the test run. We expect to see at least one scale-up and one scale-down operation performed by the AM for each test run.
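For illustration, a simple extrapolation of the form below is consistent with the numbers reported for the first test run (1 failure by hour 4 gives a predicted AFR of 6 per 24 h; 2 failures by roughly hour 5.38 give 8.92); the exact prediction method of the AM may differ.

    def predicted_afr(failures_so_far, elapsed_hours, period_hours=24.0):
        # Project the failures observed so far over the whole test period.
        return failures_so_far * period_hours / max(elapsed_hours, 1e-6)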
Table XXXVI shows the details of this set of experiments, including the time of each VNF failure, the time of each NS scaling and the total number of VNF failures predicted at that time, the total number of failures, and the fulfillment of the requirements for each test run. As shown in Table XXXVI, for some test runs we have one scale-up and one scale-down operation, while for others we have two scale-up and two scale-down operations. This depends on whether there is a long enough time between failures so that the estimated AFR decreases enough to allow a scale-down, while failures that occur too close to each other may cause a scale-up. For example, in the first test run, the first failure happens at 2:17:35 and causes a scale-up. Then, at 4:00:00, at a periodical check by the AM, the predicted total AFR drops to 6, resulting in a scale-down. Another failure happens at 5:22:39, which increases the predicted total AFR to 8.92, therefore causing an NS scale-up. Finally, the predicted total AFR drops to 6 at 8:00:00; therefore, the AM requests another scale-down. In this set of experiments, only single-level scale-up and scale-down operations are observed.

VII. RELATED WORK
An essential prerequisite that enables runtime adaptation of NSs for the purpose of maintaining required availability characteristics is monitoring the changes in the resources assigned to the NS constituents.
Some related works have proposed frameworks/solutions to monitor VNFs/NSs, in the context of NFV, for availability and/or performance purposes. However, all of them have shortcomings: some do not monitor all types of resources and changes, while others are not (fully) compliant with the ETSI NFV reference architecture.
A fault management architecture and the related procedures have been proposed for NFV-based mobile networks in [31]. This specification proposes a procedure to notify the OSS about virtual resource failures through the VIM, the VNFM, and the EM. The only changes that are monitored are resource failures that affect VNFs. Other types of changes, including changes that affect VLs, are not considered in this specification. In [32], the authors propose a framework to monitor resources, aggregate events, notify the NFVO and the OSS about alerts, and provide a visual dashboard for 5G networks. They propose to insert a monitoring layer between the infrastructure and orchestration layers. Their definition of the orchestration layer combines both the NFVO and the OSS. The proposed monitoring framework also provides some advanced functionalities like visualization and aggregation. However, it is not clear where the entities of this monitoring layer are placed in the NFV reference architecture and how they interact with the other NFV entities. In [33], the authors propose a monitoring framework which adds features like anomaly detection and aggregation optimization to the existing monitoring solutions of the VIM. Their goal is to reduce the number of notifications sent to the upper level, especially in large systems. Although the detailed capabilities of this solution are not discussed in [33], it is a potential candidate for adding availability-constraint monitoring capabilities to the VIM, once it is considered and implemented for the NFV reference architecture.
Reference [34] introduces the need for end-to-end QoS monitoring and proposes a top-down approach with agents added to the OSS and the functional blocks of the MANO. Using the top-down approach, the OSS monitors the end-to-end QoS of the NS and sends alerts to the agent in the NFVO if a failure happens. Then, the agent in the NFVO alerts the agent in the VNFM, which in turn alerts the agent in the VIM. The VIM determines the root cause and takes the healing action in collaboration with the VNFM and the NFVO.
The solution in [34] only considers healing actions at the resource level performed by the VIM (e.g., restart recovery) and does not include VNF application-level reconfigurations. Therefore, this solution cannot be used for the runtime adaptation we are aiming at in this paper. In addition, unlike the solution in [34], our runtime adaptation framework uses both bottom-up and top-down approaches. On the one hand, it monitors any change (e.g., NS scaling or resource upgrades), not only failures. On the other hand, it re-evaluates the availability and service disruption characteristics of a network service on a regular basis. Our proposed solution performs adjustments as soon as it deems them necessary, that is, it does not wait until an expectation/requirement is not met. More specifically, changes like NS scaling are not captured if only failures are monitored, yet for different NS scaling levels we may need a different number of standby VNF instances to fulfill the same availability/service disruption requirement.
Another set of related works addresses the runtime adaptation of NSs in case of performance degradation. The MANO itself supports VNF- and NS-level scaling at runtime to adjust the number of instances according to the current workload or on request [3], [6]. The goal of [35] is to evaluate the end-to-end network delay at runtime using an analytical model and to predict the optimal required resources using a reinforcement learning model to reduce the delay to an acceptable threshold. The work in [36] proposes an architecture for the dynamic provisioning of QoS-oriented VNFs for IoT systems. This architecture proposes an entity to manage the NFV, the software-defined networks, and the IoT middleware together. The proposed entity monitors the resource consumption and the servers' performance and orders the NFVO and/or the SDN controller to take the necessary adaptation actions to keep the QoS of the IoT services as expected.
The authors of [37] address the problem of auto-scaling at runtime to meet the required performance of the NS in a resource-efficient manner. They propose a DL solution to create a classifier that predicts the appropriate scaling level at runtime when there is a change in the traffic load. Each class of the classifier represents a valid scaling level. To train the DNN, they create a set of training data by generating labels for a random set of input data. In [38], an ML approach is proposed to determine the optimal number of VNF instances at runtime to handle the current workload. This work benefits from supervised ML and proposes a method to generate a dataset to train the machine. In addition, the approach in [38] determines the optimal placement of the VNF instances to meet the required performance in a cost-efficient manner. The authors of [39] propose an architecture that uses an ML approach for VNF lifecycle management. This architecture enables proactive decision-making for VNF resource prediction at runtime to meet the performance requirements efficiently. The architecture in [39] is implemented and tested on top of OpenStack. In [40], the authors address IoT services created using VNFs and their deployment in the dynamic NFV environment. Reference [40] proposes a routing and placement solution for IoT service chains that can dynamically scale at runtime to meet the workload requirements. The work in [40] benefits from DL for runtime adaptations and minimizes resource and routing costs. All these adaptations are specific to performance degradation compensation or performance guarantees at runtime and do not handle the violation of availability constraints, which is the goal of the solution proposed in this paper.
On the other hand, the design-time configuration of NSs for availability has also been addressed in the literature. To determine the required adjustments at runtime, design-time solutions can also be used if the solution determines all the configuration parameters (i.e., the solution is comprehensive) and its execution time is tolerable at runtime (i.e., the solution is lightweight), ideally with constant time complexity independent of the size of the NS. The approach in [5] is a comprehensive solution to determine all the required configuration parameters. However, as discussed earlier, it is not lightweight enough for runtime. The related works in [41], [42], [43], [44], [45], [46], [47] mainly focus on VNF redundancy and/or placement; hence, they are not comprehensive solutions for the problem we tackle in this paper. VNF redundancy calculations and placement decisions contribute to the availability and continuity of NSs, but other configuration parameters, such as the HI and CpI, may need to be adjusted at runtime to maintain the availability and continuity fulfillment of the NS. The work in [41] proposes a solution for the placement of primary and backup VNFs. This work takes into account the time slots of the availability of the (physical) nodes in the infrastructure to place the primary and/or backup VNF instances in order to avoid service discontinuity. The work in [41] assumes that the resources are limited and multiple NSs use the same infrastructure. The goal in [41] is to maximize the number of NSs with minimum service discontinuity caused by node unavailability. The authors of [41] show that the execution time of the problem is not tolerable for large NSs, and they propose a heuristic algorithm with a slight penalty in performance. In [42], the authors tackle the same problem as in [41]. The problem in [42] is formulated as an integer linear programming (ILP) problem, which is solved without performance penalties in a timely manner. The authors of [43] take into account the availability of both VNFs and VLs and divide an NS into smaller sub-NSs. In [43], the availability of each sub-NS is protected through redundancy independently from the other sub-NSs while the resource cost is minimized. Reference [43] proves that the optimization problem is NP-hard, and a game modeling approach is proposed to tackle the complexity of the problem. The work in [44] evaluates the reliability and availability of NSs for three different placement strategies without considering VNF redundancy (i.e., there is one instance of each VNF of the NS). Reference [44] shows that if all VNFs of the NS use the same physical node, the NS is more reliable compared with the cases in which not all the VNFs share the same host or each VNF is placed on a different node. Reference [45] combines VNF redundancy with a path backup strategy to guarantee the availability of the NS. It determines the required redundancy of the VNFs, the VNF placement, and the network path backups while minimizing the resource cost. The work in [46] takes into account the multi-tenancy of the NFV environment, which enables VNFs of different NSs to share the infrastructure. Reference [46] assumes that the same VNF types of different NSs can share the same standby instances of the VNF (although this assumption contradicts the ETSI NFV specifications). The goal of [46] is to provide a solution to determine the required redundancy of VNFs and the placement of VNF instances to guarantee the availability of the different NSs (each with a different availability requirement) and minimize the resource costs. The authors of [47] propose a solution for VNF placement to guarantee NS availability and minimize resource costs. They also propose a backup placement model (called the sideway cross backup model) in which an active instance of a VNF shares the underlying physical host with a standby instance of a different VNF. Reference [47] shows through simulations that the sideway cross backup model is more resource-efficient compared with traditional backup placement models.

VIII. CONCLUSION
Because of the dynamicity of NFV systems and their infrastructure resources, NSs designed to fulfill certain service availability and/or continuity requirements may need to be adapted to changes in the characteristics of their resources during their lifetime. In this paper, we proposed a framework for the runtime adaptation of NSs so that they keep fulfilling such requirements. This framework defines a runtime adaptation procedure, supported by change notification and adjustment flows, which are implemented by an Adaptation Module that is in charge of carrying out the runtime adaptations of an NS. The proposed framework is compliant with the ETSI NFV specifications.
The proposed runtime Adaptation Module manages the fulfillment of the service availability and continuity requirements of an NS. To determine the required adjustments, the AM can use the analytical methods introduced for design time in [5]. Alternatively, in this paper we proposed a method for constructing machine learning models to speed up the calculation of runtime adaptations. The AM can use these DL models instead of the analytical methods to determine any required adjustments at runtime, which is especially important for large NSs.
We conducted several experiments with realistic deployments of NSs to demonstrate the feasibility of our proposed solutions and to show that they can maintain the configuration of an NS so that it fulfills its service availability and continuity requirements. We prototyped our approaches and developed different VNFs to create two NSs for these experiments. We deployed these NSs on a testbed, which included an NFV infrastructure managed by a MANO and a prototype of the proposed Adaptation Module. The results of the experiments confirm that our approaches are able to guarantee the fulfillment of the service availability and continuity requirements for the considered case studies. They also show, to some extent, the validity of our proposed solutions. However, experimenting with more and larger NSs in an industrial infrastructure remains as the next step to complete the validation of our approaches.

Fig. 2. Overall picture of the design-time approach.

Fig. 5.

The NFVO applies the requested scaling level and/or deployment flavor and reports the successful change to the OSS. Based on the new scaling level and/or NsDF, new VNF and/or VL instances may be instantiated, or existing ones terminated. If the newly instantiated VNFs and VLs should have a role assignment, the AM requests the EM of each VNF to assign the active/standby roles to the newly instantiated VNFs and/or their newly instantiated external VLs. Once the roles are assigned, the AM may request the EMs to reconfigure the health-check intervals (HI) and checkpointing intervals (CpI) of their managed VNFs. Fig. 6 shows the notification and adjustment paths in the NFV reference architecture. Green arrows indicate the path for change (including failure) notifications initiated by the VIM up to the AM. Red arrows show the cases where the OSS monitors the hardware resource failures and the EM monitors the VNF and VL failures. Arrows in blue show the communications of the AM for the adjustment of the configuration parameters.

Fig. 6. Notification and adjustment flows in the NFV reference architecture.

Fig. 7 shows a sample NS with three VNFs and four VLs [4]. It has three NFPs, which can be mapped to three NS functionalities. VNF1 and VNF3 are shared by all NFPs, while VNF2 serves only NFP1. All NFPs share VL1 and VL4. NFP1 and NFP2 share VL2, and VL3 is only used by NFP3. Fig. 8 shows the VNFCs and IntVLs of the sample NS of Fig. 7 [4]. VNF1 and VNF3 each consist of only one VNFC, while VNF2 has three VNFCs and one IntVL. We assume that the example NS needs to satisfy the requirements shown in Table IV for each of its functionalities. This information is an input for creating the DL-RAM, together with the details of the NS, including the NFPs, VNFs, VLs, the mapping of functionalities to NFPs, the NS scaling levels, and the maximum service data rate of each NFP. Some of this information is part of the initial NsDF designed only to meet the performance requirements. The NS scaling levels of the NsDF are shown in Table V. Information about each VNF is available in the form of the VnfDF and the characterization of the VNF application and its internal reliability features. As part of the VnfDFs, the VNFCs, IntVLs, and VNF scaling levels of the different VNFs are as follows:

TABLE I THE NUMBER OF VNF INSTANCES PER SCALING LEVEL OF AN NSDF THAT MEETS A CERTAIN PERFORMANCE REQUIREMENT

TABLE II THE UPDATED NUMBERS OF VNF INSTANCES FOR THE NSDF (FROM TABLE I) TO MEET AN AVAILABILITY REQUIREMENT

TABLE III THE NUMBERS OF VNF INSTANCES FOR A SECOND NSDF FOR RESOURCES WITH LOWER AVAILABILITY

... turn into an excess subsequently if no failure occurs for a prolonged time, which improves the actual (measured) availability of the resources. For example, assume that the NS of the previous example was switched to the NsDF of Table III to compensate for a change. Subsequent changes (or the lack of them) improve the NS availability so that switching back to the original NsDF of Table II fulfills the requirements again.
Request for/perform NS scaling and/or NsDF change operations as needed. Request for/perform health-check and/or checkpointing interval reconfiguration if needed. As mentioned earlier, the application-level configuration of the VNFs can be performed through their EMs. Configure the active and standby roles of the VNFs and VLs if needed. Like Activity 3, role assignments can be performed through the EMs. The first and the second activities are at the NS level.

TABLE VI VNF SCALING LEVELS, VNFCS, AND INTVLS OF THE VNFS OF THE SAMPLE NS

TABLE VII APPLICATION-LEVEL INFORMATION OF EACH VNF OF THE SAMPLE NS

TABLE VIII AVAILABILITY AND FAILURE RATE OF VNFC APPLICATIONS

TABLE IX HOSTING OPTIONS AVAILABLE FOR THE SAMPLE NS

... hosting options shown in Table IX and the various network options presented in Table X.

TABLE XIII OUTPUTS OF THE FIRST DL MODEL

TABLE XIV INPUT FEATURES AND OUTPUTS FOR THE SECOND DL MODEL TO DETERMINE SB VALUES

... the HI of the VNF is determined first. Then, the expected availability of the VNF is calculated using the VNF failure rate and the HI. Finally, the number of standby instances is determined using the expected availability of the VNF. In short, the number of standby instances of a VNF is determined based on the VNF failure rate and the HI. But in the label structure of Table XII, the HI and the SB are output parameters at the same time; therefore, their dependency might not be learned properly by a DL model. This logic of the analytical methods of [5] is better reflected by constructing two DL models. One model can determine the HI and CpI parameters (shown in Table XIII) using the input features of Table XI. Then, a second DL model can determine the SB for each VNF using the data structure of Table XIV, where the input features are the NS scaling level and the AFR of the VNFs, together with the HI values determined by the first DL model. At runtime, these two DL models are chained through the HI values produced by the first model, which are used as input by the second model.
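To illustrate the chaining, the following is a minimal runtime inference sketch, assuming Keras models saved at design time and hypothetical file names and feature layouts (one AFR, NL, and BW value per VNF, with the HIs first in the first model's output); the real prototype's interfaces may differ.

    import numpy as np
    from tensorflow import keras

    model1 = keras.models.load_model("hi_cpi_model.keras")  # hypothetical files
    model2 = keras.models.load_model("sb_model.keras")

    def adjust(scaling_level, afr, nl, bw):
        # First DL model: (scaling level, AFR, NL, BW) -> (HI, CpI) per VNF.
        x1 = np.concatenate(([scaling_level], afr, nl, bw)).reshape(1, -1)
        hi_cpi = model1.predict(x1, verbose=0)[0]
        hi = hi_cpi[: len(afr)]          # assumed layout: HIs come first
        # Second DL model: (scaling level, AFR, HI) -> SB per VNF.
        x2 = np.concatenate(([scaling_level], afr, hi)).reshape(1, -1)
        sb = model2.predict(x2, verbose=0)[0]
        return hi_cpi, np.rint(sb).astype(int)   # standby counts are integers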

TABLE XXII NS SCALING LEVELS OF THE NEW NSDF WHEN ASDT=120S

TABLE XXIII NS SCALING LEVELS OF THE NEW NSDF WHEN ASDT=180S

TABLE XXV APPLICATION-LEVEL INFORMATION OF THE VNFS OF THE WEB SERVICE NS

TABLE XXVI AVAILABILITY AND AFR OF VNFC APPLICATIONS FOR THE WEB SERVICE NS

TABLE XXVII OPTIMAL CONFIGURATION FOR RA=0.9995

TABLE XXVIII OPTIMAL CONFIGURATION FOR RA=0.999

TABLE XXIX NS SCALING LEVELS FOR THE NEW NSDF WHEN RA=0.9995

TABLE XXX NS SCALING LEVELS FOR THE NEW NSDF WHEN RA=0.999

TABLE XXXI EXPERIMENTS FOR EACH REQUIREMENT AND CASE STUDY

TABLE XXXII RESULTS OF THE FIRST SET OF EXPERIMENTS

TABLE XXXIII MAXIMUM POSSIBLE SDT/OT FOR THE FIRST SET OF EXPERIMENTS

TABLE XXXIV RESULTS OF THE SECOND SET OF EXPERIMENTS

TABLE XXXV MAXIMUM POSSIBLE SDT/OT FOR THE SECOND SET OF EXPERIMENTS

TABLE XXXVI RESULTS OF THE THIRD SET OF EXPERIMENTS