Video Streaming and Cloud Gaming Services Over 4G and 5G: A Complete Network and Service Metrics Dataset

The development of telecommunications, particularly mobile communications, has led to the emergence of numerous multimedia services. As these services become increasingly reliant on connectivity, the network has a major impact on the user experience. However, it is difficult to assess the End-to-End (E2E) performance of a service because access to application data is limited and the information exchanged over the network is encrypted. In this context, this work presents a dataset of metrics, namely Key Quality Indicators (KQIs), that represent the E2E conditions of different services. In particular, video streaming and Cloud Gaming (CG) services are considered in the dataset. Furthermore, a detailed description of the testbed where the dataset has been generated is given, discussing the rationale behind the choices made for its development.


I. INTRODUCTION
Over its different generations, cellular communications have often been seen as the key to the emergence of multiple services. With each generation offering more advanced features, cellular networks have become a mainstream alternative for delivering many services. In addition, the introduction of cloud-based services such as Cloud Gaming (CG), together with Extended Reality (XR) services, will increase the volume of data exchanged over these networks. This makes managing the network more complex, as each of the many services available has heterogeneous requirements. Moreover, service performance, and hence Quality of Experience (QoE), also depends on network conditions.
As a result, the research community has put a lot of effort into the design and implementation of algorithms that help with management tasks, such as traffic steering or End-to-End (E2E) Quality of Service (QoS) provisioning. Similarly, the Self-Organising Networks (SON) paradigm has emerged to make the network smarter by introducing Machine Learning (ML) into these algorithms.
Most algorithms proposed in the literature are tested in simulators (e.g. ns-3), where network scenarios are based on statistical models of the radio environment (e.g. propagation loss) present in real networks. Likewise, the advent of Software Defined Networks (SDN) and Network Function Virtualisation (NFV) has led to the emergence of other alternatives [1]-[3], providing some interesting features of 5G, such as network slicing.
There are previous works that focus on the evaluation of multimedia services whose performance depends on the network. The authors in [4] provide a QoE-related dataset of various YouTube videos over commercial 4G and 5G networks. However, it does not include data from a network perspective, and only the network configuration set by the operator in that area is taken into account. In [5], a framework for analysing Adaptive Bitrate Streaming (ABS) algorithms based on key QoS parameters from 5G traces is presented. Similarly, in [6], the authors propose a network emulator for the evaluation of video streaming QoE taking into account network characteristics, user preferences and contextual information.
Nevertheless, to our knowledge, no previous work has combined service E2E information with data and network configurations in real implementations. In this context, this work provides complete network and service metrics for some popular services. These include Video On Demand (VOD), Live Streaming (LS) and CG in a variety of network environments, including real-world cellular deployments. This dataset has some advantages over telcos' datasets. First, there is no proprietary information that prevents its free use by the research community. Secondly, the experimental environments used have allowed the collection of measurements with multiple network configurations. This is something that operators are typically reluctant to do in their network deployments. Finally, the methodology used to collect the data is described, facilitating the replication of the tests.
Compared to other works, the dataset provides E2E cross-layer metrics from multiple parts of the service delivery chain (i.e. network, user equipment and service) under different network technologies, configurations and radio scenarios. This supports a comprehensive view of the impact of network configurations and radio scenarios on services with heterogeneous requirements, providing a strong basis for the application of ML techniques.
The rest of the paper is organised as follows. Section II introduces the general considerations of the multimedia services that are taken into account in this work. In addition, metrics for measuring E2E performance are highlighted for each service. Based on these aspects, Section III discusses the design of the testbed. Then, Section IV describes the practical implementation of the testbed, which has supported various previous works [7]-[10]. In Section V, a detailed description of the dataset [11] is provided. Subsequently, Section VI offers different ML applications for the presented dataset. Finally, Section VII summarizes the conclusions and future work.
This article has been accepted for inclusion in a future issue of this magazine.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

II. BACKGROUND

A. Video Streaming
Streaming refers to the distribution of multimedia content (e.g. music, video) over a network, allowing the user to consume the content as it is being downloaded. This differs from the traditional way of delivering content, which has to be downloaded in full before it can be used. To make this possible, the downloaded stream is stored in a buffer located in the user's client, from which the data is taken for playback.
This approach to accessing multimedia content is having a major impact not only on the way it is consumed, but also on the way in which content is created and distributed. This has resulted in two main types of streaming: VOD and LS. In the case of VOD streaming, the content is generated and stored on a server, from where clients can play the content at any time and as often as they want. With LS, the user plays content that is generated at that moment, strongly resembling traditional TV services.
However, streaming services require a certain data rate for their delivery. This means that the user's perception of the service (i.e. QoE) is highly dependent on the network performance. In this context, services are often evaluated using a set of E2E metrics, namely Key Quality Indicators (KQIs), which differ depending on the type of service [12].
The nature of streaming services puts the focus on freezing or stalling, that is, the temporary stopping of the service (e.g. the video image freezes). These events are essentially caused by a lack of data in the user's buffer, making it impossible to display new content. Therefore, several KQIs related to freezes, such as their number, frequency or duration, are taken into account to qualify the smoothness of the service.
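As an illustration of how the freeze-related KQIs mentioned above could be derived, the following minimal sketch computes the number, total duration and frequency of stalls from periodic player samples. The sample format (a timestamp in seconds plus a `stalled` flag) and the names `StallKqis` and `stall_kqis` are assumptions made for this example; they are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class StallKqis:
    count: int             # number of distinct stall events
    total_duration: float  # seconds spent stalled
    frequency: float       # stall events per minute of session


def stall_kqis(samples):
    """Compute stall KQIs from (timestamp_s, stalled) samples sorted by time.

    A stall event is a maximal run of consecutive samples flagged as stalled.
    Its duration is approximated by the time until the next non-stalled
    sample (or until the last sample if the session ends while stalled).
    """
    count = 0
    total = 0.0
    stall_start = None
    for t, stalled in samples:
        if stalled and stall_start is None:
            stall_start = t          # a new stall event begins
            count += 1
        elif not stalled and stall_start is not None:
            total += t - stall_start  # the stall event ends
            stall_start = None
    if stall_start is not None:       # session ended while stalled
        total += samples[-1][0] - stall_start
    session = samples[-1][0] - samples[0][0] if len(samples) > 1 else 0.0
    freq = count / (session / 60.0) if session > 0 else 0.0
    return StallKqis(count, total, freq)
```

The accuracy of the duration estimate is bounded by the sampling period, so a finer sampling granularity gives tighter freeze durations.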
Along these lines, some protocols such as HTTP Live Streaming (HLS) or Dynamic Adaptive Streaming over HTTP (DASH) provide services with Adaptive Bitrate Streaming (ABS) algorithms. These algorithms aim to optimise the quality of the stream by adapting the bitrate required to fill the buffer. To this end, the multimedia content is encoded on the server in segments with different levels of quality (i.e. resolution). This allows users to request at any time the segment that best suits the conditions of the network connection, thus avoiding freezing.
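The segment selection idea can be sketched with a simple throughput-based rule: pick the highest bitrate that fits within a safety margin of the measured throughput. This is only one of several possible ABS strategies and is not claimed to be the algorithm used by any particular player; the bitrate ladder, the safety factor and the function names are hypothetical.

```python
def estimate_throughput_kbps(segment_bytes, download_s):
    """Throughput observed while downloading the last segment, in kbit/s."""
    return segment_bytes * 8 / 1000.0 / download_s


def select_rendition(ladder_kbps, throughput_kbps, safety=0.8):
    """Return the highest bitrate in the ladder not exceeding the budget.

    The budget is a fraction (`safety`) of the estimated throughput, which
    leaves headroom for throughput fluctuations. If no rendition fits,
    the lowest one is returned so that playback can continue.
    """
    budget = throughput_kbps * safety
    affordable = [b for b in sorted(ladder_kbps) if b <= budget]
    return affordable[-1] if affordable else min(ladder_kbps)
```

For example, with a hypothetical ladder of 1000, 2500, 5000, 8000 and 16000 kbit/s and an estimated throughput of 7000 kbit/s, the budget is 5600 kbit/s and the 5000 kbit/s rendition is requested.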
In this context, the resolution at which the content is displayed and the number of resolution changes during playback are also key parameters to consider when assessing the perception of the service. In addition to these metrics, other KQIs considered for video streaming include initial playback time and video latency. The former looks at the time it takes to fill the buffer with the necessary data to start the video. The latter, which is more related to LS, refers to the time delay between the content producer and the viewer, which is fundamental for the interaction between them.

B. Cloud Gaming
Similarly, the gaming industry is focusing on another alternative for the distribution of its services. Often, the main bottleneck in the delivery of these services has been the user's equipment: the high computing requirements for running some games make them inaccessible to users without powerful equipment, thus reducing their market reach. To overcome this problem, the industry has found a new lease of life in the CG concept.
The CG paradigm takes advantage of the cloud computing approach to offer a gaming experience on devices with limited computing capabilities. This means that most of the tasks involved in running the game are now located on a remote server. In this way, CG can be seen as a streaming service: the server hosting and launching the game streams the rendered content to the user's device. The latter, therefore, is only responsible for the collection of the user's actions and the display of the multimedia content.
As a result, some video streaming KQIs are also used to assess the E2E performance of CG. These include resolution and freezing. Higher resolutions provide finer image granularity, which brings players closer to increasingly realistic virtual scenarios, improving the QoE of the service. Conversely, frozen scenes hinder interaction with the game, resulting in a serious degradation of service quality.
In connection with this interactivity, new KQIs have been considered for the evaluation of the service. On the one hand, the number of frames per second used to represent game scenes (i.e. frame rate) has a strong influence on the perception of the service, with smoother movements associated with higher frame rates. On the other hand, the time that elapses between the user's input and its representation on the screen, commonly known as input lag or E2E latency, is the primary factor in CG services. Thus, high values of input lag tend to degrade the QoE of the service.
However, the impact of CG on the network is different. The amount of data exchanged in this type of service is much higher than in video streaming. The main reason is the interactivity of the service, whose latency sensitivity complicates video compression. For example, the enhanced H.264 codec is used in both video streaming and CG services. However, the latter avoids bidirectional encoding modes to reduce delay. This also significantly increases the network requirements for delivering CG services. The final bitstream rate for a 4K 60 fps CG session is 90 Mbps [13], well above the 30 Mbps required by on-demand video for the same quality. This means that freezing and high input lag are likely to occur on networks that do not meet the ideal requirements, where both the resolution and frame rate of the streamed content play an important role.

III. TESTBED DESIGN
The main goal of the testbed is to collect metrics from different parts of the architecture. For this purpose, the testbed is divided into three main blocks: multimedia services, network adapters and network infrastructure. Figure 1 shows the distribution of these three modules, as well as the elements, protocols or connectivity that make them up.

A. Multimedia Services
In this first block, four different elements can be seen, distributed between video streaming and CG services. For video streaming, three platforms are considered: Dash-Industry Forum, YouTube and Twitch. The first one is an open source streaming client. It allows some player parameters to be customised and any kind of DASH content located on private or public servers to be played. The latter two are among the most popular platforms that offer VOD and LS content free of charge. In the case of YouTube, the platform uses the DASH and HLS protocols to deliver VOD and LS respectively, enabling content playback in 4K resolution. For its part, Twitch relies exclusively on HLS, which allows it to stream video up to 1080p. All of these streaming alternatives are accessible via web clients. This makes them easier to use and opens up different ways of extracting data.
In terms of CG services, the Moonlight platform is the one implemented in the testbed. This is an open-source implementation of Nvidia's low-latency streaming protocol. It is specifically designed for streaming games using the Real-Time Transport Protocol (RTP). In contrast to the most widely available platforms, such as Project xCloud or Amazon Luna, Moonlight allows complete control of the service. In this sense, Moonlight enables the deployment of a private CG architecture, as well as its integration within the Mobile Edge Computing (MEC) paradigm, a key aspect for CG delivery over mobile networks. To stream content to any type of device (i.e. thin clients), Moonlight requires the server to have an Nvidia Graphics Processing Unit (GPU). The thin clients only need to have the Moonlight client installed, which is available on all platforms (e.g. Windows, Android). The client also enables CG session configuration, i.e. setting streaming parameters such as resolution or frame rate. All these features, along with the high level of platform control and the myriad streaming configurations it offers, make Moonlight a good alternative for developing KQI-oriented measurement frameworks.

B. Network adapters
This module is responsible for providing different types of connections transparently to the top module. Based on their specifications, the testbed primarily considers the Huawei LTE Modem E3772 and the Huawei CPE PRO 2. The former is a dongle device that enables LTE connectivity via a USB port. The latter is an advanced router with Ethernet ports and WiFi 6 connectivity. However, its most interesting feature is its backhaul connectivity, which includes wired, wireless and cellular (i.e. LTE and 5G) communications.
Furthermore, both are equipped with an Application Programming Interface (API) that facilitates the extraction of data from the devices. This helps to get information about the connection from the User Equipment (UE). Nevertheless, the upper layer is not limited to these alternatives and can be used with other adapters.

C. Network infrastructure
This framework has been developed on top of Amarisoft's solutions for the virtual deployment of cellular networks, offering a cheaper alternative to the traditional network infrastructure by taking advantage of Software Defined Radio (SDR). The aim of this block is to simplify the implementation of network actions, such as setting up different network configurations or collecting information. In this way, it is possible to achieve full control over the deployment of the service. Like the previous block, this module should not interfere with the functionality of the upper layers, although its wide range of configuration options makes it useful for accessing various network metrics.

IV. TESTBED IMPLEMENTATION

A. Service's KQI extraction
For the collection of the KQI metrics of each service, two different approaches have been followed depending on the type of service.
1) Video streaming: Most video streaming platforms provide web clients to consume their content. Therefore, the implementation of the video streaming block relies on the web scraping methodology. This technique facilitates the extraction of information from web pages by allowing the inspection of, and interaction with, multiple web elements. The block thus consists of several Python scripts that base their actions on the Selenium WebDriver, which provides an object-oriented API for simulating user interactions in the browser. This eases the automation of the actions required to initialise the service (e.g. searching for a video, selecting a certain resolution) and the extraction of data from the player related to video performance, such as the resolution displayed. It also enables the calculation of other KQIs, such as initial time, which is calculated as the time between clicking on the desired video and the player displaying it. Table I shows the main metrics obtained from the services, though it should be noted that more player-specific and platform-specific metrics are obtained.
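The scraping approach described above can be sketched as follows. This is a minimal illustration, not the scripts used in the testbed: it assumes the page exposes a standard HTML5 `video` element and reads only standard `HTMLVideoElement` properties through the WebDriver `execute_script` call. The helper names `read_player_state` and `measure_initial_time` are hypothetical, and platform-specific players may expose richer statistics through other means.

```python
import time

# JavaScript evaluated in the page. It relies only on standard HTML5
# HTMLVideoElement properties, not on any platform-specific player API.
PLAYER_STATE_JS = """
const v = document.querySelector('video');
if (!v) { return null; }
const q = v.getVideoPlaybackQuality ? v.getVideoPlaybackQuality() : null;
return {
  width: v.videoWidth,
  height: v.videoHeight,
  currentTime: v.currentTime,
  readyState: v.readyState,
  droppedFrames: q ? q.droppedVideoFrames : null,
  totalFrames: q ? q.totalVideoFrames : null
};
"""


def read_player_state(driver):
    """Return a dict with the current player state, or None if no video."""
    return driver.execute_script(PLAYER_STATE_JS)


def measure_initial_time(driver, start_playback, timeout_s=30.0, poll_s=0.05):
    """Time from triggering playback until the video position advances.

    `start_playback` is a callable that performs the user action
    (e.g. clicking on the desired video). Returns None on timeout.
    """
    t0 = time.monotonic()
    start_playback()
    while time.monotonic() - t0 < timeout_s:
        state = read_player_state(driver)
        if state and state["currentTime"] > 0:
            return time.monotonic() - t0
        time.sleep(poll_s)
    return None
```

With Selenium installed, `driver` would typically be a `webdriver.Chrome()` or similar instance pointed at the video page with `driver.get(url)`. The polling period limits the resolution of the measured initial time, so it should be small compared with the values of interest.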
2) CG: The CG platform considered (i.e. Moonlight) requires the installation of lightweight software, which precludes adopting the web scraping technique. Moreover, this service introduces complexity in the collection process as it relies heavily on user interaction. To solve this, the Python scripts that make up the module for this service follow three main steps: emulating user actions, tracking perception and calculating metrics. The first step interacts with the client to configure the game, and replicates some pre-recorded user actions during the game. It is also responsible for collecting the timestamp of each action. This is considered a key factor in calculating the interactivity of the service. In terms of perception tracking, the scripts are responsible for detecting the result of the user's action and finding the exact time at which it has been perceived by the user. For further calculation of the service's KQIs, they also record the session at a high frame rate (i.e. 144 FPS). Once all these steps have been completed, the KQIs of the service are calculated by correlating the data obtained in the previous steps. For instance, the timestamps of the automated actions are mapped to the time at which the scene has changed, thus obtaining the E2E latency of the system. In addition, other operations necessary for the accurate calculation of some metrics, such as the freeze events or the Effective Frame Rate (EFPS) perceived by the user, are carried out in this step. For these metrics in particular, frame decimation (i.e. matching the frame rate of the recording to the one configured in the CG server for streaming) helps to eliminate false positive freeze events and to avoid miscalculating the EFPS value.
Finally, this information is complemented with some data from the platform, obtained by inspecting and parsing some of the available logs. Table I lists some of the parameters of interest obtained from the services.
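The metric calculation step for CG can be sketched as follows. This is a simplified illustration under stated assumptions rather than the testbed code: action and scene-change times are given as sorted lists of timestamps in seconds, recorded frames are represented by comparable identifiers (e.g. a hash of each frame), and the freeze threshold `min_freeze_frames` is a hypothetical parameter. Decimation matters because a 144 FPS recording of, say, a 60 FPS stream naturally contains repeated frames, which would otherwise be counted as freezes.

```python
from bisect import bisect_left


def match_latencies(action_ts, change_ts):
    """Map each action to the first scene change at or after it.

    A change only counts if it occurs before the next action; otherwise
    the latency for that action is reported as None.
    """
    latencies = []
    for i, t_action in enumerate(action_ts):
        limit = action_ts[i + 1] if i + 1 < len(action_ts) else float("inf")
        j = bisect_left(change_ts, t_action)
        if j < len(change_ts) and change_ts[j] < limit:
            latencies.append(change_ts[j] - t_action)
        else:
            latencies.append(None)
    return latencies


def decimate(frames, recorded_fps, stream_fps):
    """Keep one recorded frame per streamed-frame period."""
    n_out = int(len(frames) * stream_fps / recorded_fps)
    return [frames[int(k * recorded_fps / stream_fps)] for k in range(n_out)]


def efps_and_freezes(frames, stream_fps, min_freeze_frames=3):
    """Return (effective_fps, freeze_count) for already decimated frames.

    A frame is 'new' if it differs from the previous one. A freeze is a
    run of at least `min_freeze_frames` consecutive repeated frames.
    """
    if not frames:
        return 0.0, 0
    new_frames, repeats, freezes = 1, 0, 0
    for prev, cur in zip(frames, frames[1:]):
        if cur == prev:
            repeats += 1
            if repeats == min_freeze_frames:
                freezes += 1  # count each freeze run once
        else:
            new_frames += 1
            repeats = 0
    duration_s = len(frames) / stream_fps
    return new_frames / duration_s, freezes
```

The latency resolution of this approach is bounded by the recording frame period, which is why a high recording frame rate is useful.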

B. Network connectivity adapters
Even though any network adapter can be used to enable the services (see Figure 2), the API available in the Huawei devices makes them worth considering in the testbed. In this way, the Python functions that make up this block are in charge of wrapping the various HTTP requests needed to interact with the API. This allows the collection of information related to the radio link, such as Reference Signal Received Power (RSRP) or Reference Signal Received Quality (RSRQ), and values from the data exchanged through the network adapter, such as data rates or data volumes. Table I includes a summary of some of the metrics collected by the network adapter.
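A wrapper of this kind could look like the sketch below. The text does not specify the endpoints or payload format of the device API, so everything device-specific here is an assumption: the URL is left as a parameter, and the response is assumed to be a flat XML document whose elements carry values with unit suffixes (e.g. `-95dBm`). Authentication, which such devices may require, is omitted.

```python
import re
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_radio_kpis(xml_text, fields=("rsrp", "rsrq")):
    """Extract numeric radio KPIs from an XML payload (assumed shape).

    Unit suffixes such as 'dBm' or 'dB' are stripped. Fields that are
    missing or non-numeric are reported as None.
    """
    root = ET.fromstring(xml_text)
    kpis = {}
    for name in fields:
        node = root.find(name)
        match = None
        if node is not None:
            match = re.search(r"-?\d+(?:\.\d+)?", node.text or "")
        kpis[name] = float(match.group()) if match else None
    return kpis


def fetch_radio_kpis(url, timeout_s=5.0):
    """Issue an HTTP GET to the adapter API and parse the response.

    `url` is a placeholder for whatever endpoint the device exposes.
    """
    with urlopen(url, timeout=timeout_s) as resp:
        return parse_radio_kpis(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP request makes the wrapper easy to test offline and to adapt to a different adapter.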

C. Amarisoft-based solutions framework
This framework, presented in [9], is built on top of Amarisoft's software solutions. Its main objective is to enable the testbed to interact with the network infrastructure. In this sense, the implementation of the framework is divided according to the objective of the interaction: configuring, monitoring and acting on the network entities.
In terms of configuration, a set of functions is responsible for reading and parsing the configuration files of the various elements that make up the network (e.g. eNB/gNB). This allows their parameters to be collected and manipulated using the JavaScript Object Notation (JSON) syntax.
In the case of monitoring, the module makes multiple calls to an API provided by the device itself, also wrapping both the request and the response in JSON so that they can be handled in a human-readable format. Table I shows the main metrics obtained from this framework, which can be useful for service evaluation.
Finally, all these blocks are orchestrated by a management element that provides the framework with a Representational State Transfer (REST) interface, thus facilitating interaction through HyperText Transfer Protocol (HTTP) requests.
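The configuration functions can be illustrated with a short sketch that loads a configuration, overrides one nested parameter and writes it back. It assumes the configuration is available as strict JSON (real configuration files may use extensions such as comments, which would need preprocessing), and the keys used in the usage note are hypothetical names chosen for illustration only.

```python
import copy
import json


def load_config(path):
    """Read a configuration file, assumed here to be strict JSON."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def set_param(config, path, value):
    """Return a copy of `config` with the nested entry at `path` set.

    `path` is a sequence of dict keys and/or list indices. The original
    configuration is left untouched so that it can be restored later.
    """
    updated = copy.deepcopy(config)
    node = updated
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return updated


def save_config(config, path):
    """Write the configuration back to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```

For instance, `set_param(cfg, ["cell_list", 0, "n_rb_dl"], 50)` would return a new configuration in which the (hypothetical) downlink PRB count of the first cell is 50, which is the kind of change needed to sweep network configurations between measurement runs.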

V. DATASET

A. Overview
The dataset available in [11] contains 22853 samples, corresponding to 8597 VOD, 5142 LS and 9114 CG samples. A sample represents a collection of session-based metrics from the different sources (see Table I) that are obtained from the execution of a service. Table II details the distribution of samples across services and technologies. These have been collected considering multiple network configurations (e.g. radio bandwidth) at the facilities of the University of Malaga [14], equipped with a physical LTE and 5G network infrastructure and Amarisoft-based solutions [9]. Besides, several radio scenarios have been considered by forcing the UE to connect to different base stations and by emulating radio scenarios through a channel simulator (in Amarisoft-based solutions).
Thus, around 20 runs of each service have been performed for each network scenario (i.e. combination of network configuration and radio conditions) and service configuration (e.g. broadcaster, resolution, frame rate). Each service execution has been carried out according to the following factors:
1) VOD: Real-time data is collected every two seconds from the playback of two different videos, freeing the cache memory between each playback for the correct gathering of some KQIs like the initial time. Besides, multiple resolutions are considered (Auto, 720p, 1080p, 1440p and 4K).
2) LS: The content from two always-on broadcasts is played for one minute, from which data is gathered with a granularity of two seconds. Likewise, several resolutions are considered (i.e. Auto, 360p, 480p, 720p and 1080p).
3) CG: Multiple League of Legends sessions are launched considering 4 resolutions (720p, 1080p, 1440p and 4K) and 3 frame rates (30, 60 and 120 fps). In each session, data is collected at 144 FPS (i.e. approximately every 7 ms) and five identical mouse actions are performed to keep the service consistent. Besides, there is always a 5-second gap between consecutive mouse actions to simplify the latency gathering process. This fixed frequency of the actions has no effect on the KQIs obtained for the service, as these are objective values of the service's E2E performance.
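The CG measurement plan above can be enumerated with a few lines of code. The resolutions, frame rates, number of actions, gap and recording rate come from the text; the figure of 20 runs is the approximate value quoted ("around 20"), and the function names are illustrative.

```python
from itertools import product

CG_RESOLUTIONS = ("720p", "1080p", "1440p", "4K")
CG_FRAME_RATES = (30, 60, 120)
RUNS_PER_SCENARIO = 20  # "around 20 runs" in the text; the exact count varies


def cg_service_configs():
    """All resolution/frame-rate combinations considered for CG."""
    return [{"resolution": r, "fps": f}
            for r, f in product(CG_RESOLUTIONS, CG_FRAME_RATES)]


def planned_runs(n_network_scenarios, runs=RUNS_PER_SCENARIO):
    """Approximate number of CG sessions for a given number of scenarios."""
    return n_network_scenarios * len(cg_service_configs()) * runs


def action_schedule(n_actions=5, gap_s=5.0):
    """Offsets (s) of the mouse actions within a session."""
    return [i * gap_s for i in range(n_actions)]


def recording_interval_ms(fps=144):
    """Time between consecutive recorded frames, in milliseconds."""
    return 1000.0 / fps
```

This gives 12 service configurations per network scenario, and a recording interval of about 6.9 ms at 144 FPS.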
For each service, session-based data is also derived from the real-time measurements, and both are included in the dataset. These are complemented by metrics and network configurations from each session. Here, the amount of data varies according to the service and the technology (see Table II), owing to their different levels of configurability. The resulting data imbalance can reduce the accuracy of ML models on minority classes, since such models generally aim to minimise the average loss function over the entire dataset. However, this problem can be mitigated by applying data balancing techniques such as oversampling or undersampling.
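As a minimal sketch of one such balancing technique, the function below performs random oversampling: samples from minority classes are duplicated at random until every class matches the size of the largest one. The sample representation and the `label_of` accessor are assumptions for illustration; dedicated libraries offer more elaborate strategies.

```python
import random
from collections import defaultdict


def random_oversample(samples, label_of, seed=0):
    """Balance classes by duplicating random minority-class samples.

    `label_of` maps a sample to its class label (e.g. service or
    technology). A fixed seed keeps the result reproducible.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[label_of(s)].append(s)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Oversampling should be applied to the training split only, after the train/test partition, so that duplicated samples do not leak into the evaluation set.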

B. Format and organization
All information comprising the dataset is stored in multiple files corresponding to measurement campaigns. These files follow the JSON syntax, a format compatible with several data science libraries. This format also allows the storage of nested data variables, so that the session-based metrics can be stored together with the real-time measurements of the service (stored in lists).
Regarding the organization of these files, they are composed of four main fields: pingTest, service, networkAdapter and network. Here, each label corresponds to the source or kind of data that can be found in each field.
1) pingTest: This field contains the results of multiple ping tests performed during the campaign, considering different objects based on the destination of the ping. Each of these objects is a list with the results of the ping for each global experiment. Besides, a special notation is used to indicate the condition of the test: a " " suffix is added to the name of each object (e.g. ping dns ) if the results were obtained without running other tests. Conversely, the absence of this suffix (e.g. ping dns) indicates that the tests were carried out while the service was running.
2) service: This field is mainly composed of one object labelled metrics, which contains all the information gathered from the service in a list. Besides, these metrics are complemented by the type of service and the platform used. It is worth mentioning that this field stores more data about the player and platform than is shown in Table I.
3) networkAdapter: In this case, the information is distributed as shown in Table I, that is, in two objects containing the radio metrics (labelled radioKPI) and the exchanged data information (labelled stats).
4) network: In this field, the information is categorised following the distribution seen in Table I. Nevertheless, radio statistics are divided into cell and UE stats, according to the level at which each value is reported. Configuration parameters can be found as an object or as a list of objects, depending on whether several network configurations have been used during the campaign.
Note that both the networkAdapter and network fields can be empty if a different adapter or network solution/infrastructure has been used.
Finally, the files are distributed in different folders with names corresponding to the service, technology and infrastructure used to generate the data. This information is also part of the file name, together with some information about the measurement campaign (e.g. number of PRBs, noise level).
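A campaign file organised as described above could be inspected with a sketch like the following. It assumes that each file holds a single JSON object with the four top-level fields named in the text and that `service` contains a `metrics` list; the exact nesting below that level is not assumed, and the function names are illustrative.

```python
import json

TOP_LEVEL_FIELDS = ("pingTest", "service", "networkAdapter", "network")


def load_campaign(path):
    """Load one measurement campaign file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarise_campaign(campaign):
    """Report which top-level fields carry data and how many samples exist.

    Fields such as networkAdapter and network may be empty when another
    adapter or infrastructure was used, so missing or empty fields are
    reported as False instead of raising an error.
    """
    summary = {field: bool(campaign.get(field)) for field in TOP_LEVEL_FIELDS}
    service = campaign.get("service") or {}
    metrics = service.get("metrics") or []
    summary["n_service_samples"] = len(metrics)
    return summary
```

Checking for empty fields before any analysis avoids silently mixing campaigns with and without adapter or network-side metrics.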

VI. ML APPLICATIONS
The availability of cross-layer data provides a wide view of the impact of the network (e.g. configuration, radio conditions) on service provision. For example, Figure 3a shows the influence of network bandwidth on the EFPS for 4K CG sessions over LTE networks, where a degradation in the EFPS is observed for network configurations of 25 and 50 Physical Resource Blocks (PRBs) when high frame rates (i.e. 120 FPS) are targeted. Here, among the main ML fields of application of the dataset, the training of classification models is envisaged to avoid service quality degradation based on available network metrics/configurations.
On the other hand, clustering models based on E2E metrics can be used for network management applications. For example, Figure 3b shows the clustering of 4K 120 FPS data in which different network conditions (e.g. service degradation due to low SNR or insufficient PRBs) have been detected based on the CG metrics shown in Table I. This can support network anomaly detection and diagnosis algorithms.
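To illustrate the clustering idea, the sketch below implements a plain k-means over per-session feature vectors (for instance, EFPS and E2E latency). It is a generic example and not the method used to produce Figure 3b; the feature choice, the number of clusters and the function name are assumptions. In practice, features with different units should be normalised before clustering.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster `points` (tuples of floats) into `k` groups.

    Returns (labels, centroids). Centroids are initialised from randomly
    chosen points, and iteration stops once the centroids no longer move.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep empty clusters
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Sessions falling into a cluster with low EFPS and high latency could then be inspected against their network-side metrics to diagnose the cause of the degradation.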
Similarly, service metrics and network metrics can be mapped by using regression techniques, as shown in Figure 3c. This example takes the cellular radio metrics described in Table I to estimate the latency of the CG service [10]. This will enable the estimation of these metrics from network parameters as proposed in [8] and [15]. This approach can also boost the development of ML techniques for network management optimization purposes, such as policy definition and network configuration setting to offer services with specific requirements under constrained resources, like those preliminarily envisaged in [7], [9] and [10].
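The regression idea can be sketched with an ordinary least-squares fit of a service KQI against a single radio metric. This is a deliberately simple stand-in for the models referenced above, which are not described here; the choice of RSRP as the predictor and the function names are assumptions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


def r_squared(model, x, y):
    """Coefficient of determination of the fit on (x, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot if ss_tot else 1.0
```

A single linear predictor is unlikely to capture the full relationship between radio conditions and latency, so this should be read as a baseline against which multivariate or non-linear models can be compared.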

VII. CONCLUSION
This paper has presented a dataset of E2E metrics from some popular and widespread services, such as video streaming and CG, over different network technologies such as Ethernet, WiFi, 4G and 5G. The main characteristics of these services have been highlighted, providing a baseline for understanding the dataset. Furthermore, the testbed used has been described, with particular attention to its design and implementation. In this sense, this work aims to support the research community in the field of cellular networks in two aspects. First, it saves the difficult and time-consuming task of data acquisition, hence boosting the design and implementation of new algorithms for network management. Second, the technical considerations described in the paper can inspire the development of new testbeds that consider new implementations for data mining of cutting-edge services such as VR.

Fig. 1: Modular distribution of the testbed

Fig. 3: Sample applications

TABLE I: Testbed parameters overview

TABLE II: Distribution of dataset samples