Multi-path Coverage of all Final States for Model-Based Testing Theory using Spark In-memory Design

This paper presents an efficient and robust distributed framework for finite state machine coverage in the field of model-based testing theory. Covering all final states of a large-scale automaton is inherently compute-intensive and memory-exhausting, with impractical time complexity caused by the explosion of the number of states. It is therefore important to propose a faster solution that reduces the time complexity by exploiting big data concepts based on Spark RDD computation. To this end, we propose a parallel and distributed approach based on the Spark in-memory design, which exploits the A* algorithm for optimal coverage. Experiments performed on a multi-node cluster show that the proposed framework achieves a significant gain in computation time.


Introduction
Since we moved into the era of new technologies such as IoT, cloud services, artificial intelligence, big data and 5G, system testing has attracted growing interest from the industrial and scientific communities. The reliability testing of systems is based on a large number of conformance actions. The best-known testing technique, in which the behavior of the system under test is checked against a prediction model, is model-based testing [14,12].
The basic idea of this technique is to create a set of functional tests based on some requirements, input sequences and actions. Once created, this model can be applied for testing both software and hardware. In this context, several research projects have been conducted in the field of system testing [17,1,15,7].
Moez Krichen et al. [16] proposed a conformance testing framework for real-time systems, designed for non-deterministic timed automata. It allows the user to define essential features and assumptions based on variants of the tested model. They also proposed a dynamic testing approach for real-time applications [7]; it is dynamic in the sense that it automatically generates observers from timed automata.
The checking of advanced models arising from concurrent, dynamic or real-time systems is challenging and has been the focus of several decades of research. Indeed, the main problem affecting the checking of sophisticated software or hardware is the state explosion issue [18], caused by the large amount of information stored and shared between the many components of the system under test. The most efficient techniques to cope with this problem are based on symbolic bounded model checking, partial order reduction and symbolic models [11]. These breakthrough approaches have enabled the testing of complex systems with large numbers of states. Nevertheless, these techniques present some drawbacks. In fact, symbolic techniques prevent the computation of certain state-specific quantities, such as the state probabilities of Petri nets [6].
Another idea consists of using a computer cluster to verify a system represented by a very complex model. The computers are connected under a master-slave architecture, and each checks a part of the model in parallel. The analysis of very complex systems is thus performed quickly, with a quasi-optimal verification of the system.
Big data computation techniques have so far been poorly explored for verifying software systems, yet we believe that the main issues tackled in conformance testing can benefit greatly from big data concepts. In this paper, we are interested in the problem of multi-path coverage of all final states in a large-scale automaton subject to state explosion. This task is very expensive, with impracticable computation time.
The major contributions of the paper are summarized below.
- Introduction of big data (big-graph) concepts in the area of model-based testing.
- An optimized coverage technique for generating test cases based on an extended version of MapReduce-A* [4,3].
- A framework designed for large-scale finite automaton coverage.
The remainder of this paper is organized as follows. Section 2 gives background on model-based testing and big data frameworks. Section 3 introduces the problem formulation. Section 4 presents the design of the proposed framework. We then report our experimental results in Section 5 and conclude in Section 6.

Model-based testing
Model-based testing (MBT) is a family of testing techniques based on explicit behavior models, describing the expected behaviors of the system under test (SUT), or the behavior of its environment, constructed from functional requirements. MBT is an evolving approach that aims to automatically generate, from one of these models, test cases to execute on the SUT. The interest of adopting model-based testing in projects is to improve the detection of SUT bugs and to improve software quality. Moreover, model-based testing can reduce the time and effort spent on testing, since the test scripts are derived automatically from the model of the SUT.

Apache Spark
Hadoop is an Apache open source framework for parallel and distributed processing of large amounts of data. It is based on the Google File System and works with multiple nodes connected under a master-slaves architecture. Hadoop consists of two main components: HDFS [9] and MapReduce [8,19]. HDFS, the Hadoop Distributed File System, is designed to store very large datasets. A dataset is split into many chunk files and replicated across the cluster nodes in order to provide fault tolerance. The MapReduce component is designed for intensive computation over large data files in a parallel and distributed way. The MapReduce programming model consists of two main functions: map() and reduce().
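The map/shuffle/reduce pattern described above can be sketched in plain Python (a toy word count; `map_fn`, `shuffle` and `reduce_fn` stand in for the framework's map(), shuffle and reduce() phases):

```python
from collections import defaultdict
from functools import reduce

def map_fn(line):
    # Emit a (word, 1) key-value pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between the two stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Sum the counts emitted for one key.
    return key, reduce(lambda a, b: a + b, values)

lines = ["spark spark hadoop", "hadoop spark"]
mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
# counts == {"spark": 3, "hadoop": 2}
```

In a real Hadoop or Spark job the shuffle is performed by the framework across the network; the sketch only shows the dataflow.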
Spark [20] is an Apache open source framework for parallel and distributed processing of large amounts of data. Spark is designed to address some limitations of Hadoop [19], such as iterative processing, in-memory computing and near-real-time processing. Spark works on a multi-node cluster connected under a master-slaves architecture and provides Resilient Distributed Datasets (RDDs), which perform computing operations in memory. An RDD is an immutable object that supports transformation and action operations for parallel processing on a multi-node cluster.

Problem of coverage
Let A = (Q, Σ, s_0, F, δ) be a deterministic finite automaton (DFA), where Q is the set of states, Σ the input alphabet, s_0 ∈ Q the initial state, F ⊆ Q the set of final states and δ : Q × Σ → Q the transition function. In MBT, reliability testing generally proceeds through the coverage of all states reachable by the system. This task consists of starting from the initial state of the system and finding all the optimal or quasi-optimal paths that reach the set of final states. However, in an automaton with state explosion, the search for all paths covering the final states of the system is very expensive.
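For concreteness, a DFA with this structure can be encoded directly (a minimal sketch; the state and symbol names are illustrative, not taken from the evaluated models):

```python
# A DFA A = (Q, Sigma, s0, F, delta) as a plain Python structure.
Q = {"s0", "s1", "s2"}
Sigma = {"a", "b"}
s0 = "s0"
F = {"s2"}
delta = {
    ("s0", "a"): "s1",
    ("s1", "a"): "s1",
    ("s1", "b"): "s2",
}

def run(word):
    """Run the DFA on a word; return True iff it ends in a final state."""
    state = s0
    for symbol in word:
        if (state, symbol) not in delta:
            return False  # undefined transition: the word is rejected
        state = delta[(state, symbol)]
    return state in F
```

Here `run("ab")` follows s0 → s1 → s2 and accepts, while `run("a")` stops in the non-final state s1 and rejects.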
Current coverage techniques [17,1,16,7] are not suitable for large-scale automata. One technique that has attracted attention is divide-and-conquer, which consists of partitioning the automaton into sub-automata and then computing intermediate coverages. However, the quality of the result of this technique depends on the partitioning strategy adopted. Some traversal algorithms allow quick coverage, while others are better suited for optimal coverage [2,5].

Proposed Spark coverage technique
The framework we propose is an adapted version of the MRA* framework [4,3]. The cluster is set up so that RDDs are used to partition and distribute the automaton across the cluster. Then, we process the large automaton data using Spark in-memory computation. As shown in Figure 1, the conceptual model of our framework works under a master-slaves architecture and consists of three main stages: an input stage that partitions the automaton, a map stage that computes the intermediate coverages, and a reduce stage that merges them. First of all, the coverage job submitted from the master is split into two sets of mapper and reducer tasks distributed across the Spark cluster. The coverage job consists of finding all paths reaching the set of final states s ∈ F. The master manages the coverage job and assigns each map and reduce task to the workers. The job is synchronized so that the output of one stage is taken as the input of the next. It is important to note that the input stage is a pre-processing stage, and that the intermediate coverages are computed in the map and reduce stages. Thus the total time t_σ required to find all paths that cover the set of final states s_i ∈ F is calculated as follows:

t_σ = t_map + t_red    (1)

where t_map and t_red are respectively the times spent in the map and reduce stages.

Input stage: automaton partition
First, we need to load the automaton graph A as a parallelized RDD object. Then, A is partitioned into k sub-automata. A good partitioning strategy allows all tasks to be scheduled so that the load among the nodes of the cluster is balanced. Spark provides different partitioning schemes; in our work, we have used the "edge-partition" technique [10], which distributes the automaton equally among all the workers in the cluster. As shown in Figure 2, the automaton partition is defined as subsets of transitions such that each transition edge belongs to exactly one partition. The states that appear in two sub-automata A_1 and A_2 are used as communication channels [10].
From the initial state s_0 ∈ Q, we create the first sub-automaton; the intermediate states s_i located on the boundaries of each sub-automaton A_i ⊆ A are used as initial and final states. The initial state s_i of the i-th sub-automaton A_i is the final state of the (i−1)-th sub-automaton A_{i−1}, and so on until the last, k-th sub-automaton A_k. To avoid overloading the cluster and to maintain load balancing, we set the number k of partitions to:

k = N_node × N_core    (2)

where N_node is the number of worker nodes in the cluster and N_core the number of processor cores used by each worker node. Finally, once the k-partitioning of the automaton is done, we create an RDD object RDD(A_i) for each sub-automaton A_i. This means that the input dataset remains in memory, but it can also be spilled to local disk if there is not enough RAM. Creating an RDD object per sub-automaton is beneficial since, at any time, we can re-balance the number of states or transitions in each sub-automaton.
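The partitioning step can be sketched as follows. This is a simplified round-robin edge partition, not the GraphX strategy of [10]; it only illustrates the invariant that every transition lands in exactly one of the k = N_node × N_core parts:

```python
def edge_partition(transitions, n_node, n_core):
    """Split a list of transitions (src, symbol, dst) into k = n_node * n_core
    subsets so that each transition belongs to exactly one partition; states
    appearing in several partitions act as communication channels."""
    k = n_node * n_core                    # equation (2)
    partitions = [[] for _ in range(k)]
    for i, edge in enumerate(transitions):
        partitions[i % k].append(edge)     # round-robin keeps the load balanced
    return partitions

transitions = [("s0", "a", "s1"), ("s1", "b", "s2"),
               ("s2", "a", "s3"), ("s3", "b", "s4")]
parts = edge_partition(transitions, n_node=2, n_core=1)
# Two partitions of two transitions each; every edge is in exactly one part.
```

In the actual framework each of the k parts would then be wrapped as an RDD(A_i) and shipped to a worker core.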

Map stage: intermediate states coverage
The map stage consists of computing, for a given sub-automaton object RDD(A_i), all cover paths that reach the set of final states s_i ∈ F_i. For the cover computation, we used an adapted version of the A* mapper algorithm proposed in [4,3].
This procedure takes as input an RDD object of a sub-automaton A_i and maps it into key-value pairs of state RDDs and transition RDDs. It then initializes the initial state s_0^i and the two main queues: the open list O, which contains all candidate states, and the closed list S, which contains the promising states. The next step consists of exploring in depth and selecting the most promising states until the final state is found. Finally, the last step extracts the coverage paths from the closed list S. The result is then submitted to the reduce stage.
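The open-list/closed-list search described above can be sketched as follows. Since the paper does not fix a heuristic, the sketch assumes unit edge costs and a zero heuristic, under which A* degenerates to uniform-cost search; the transition data is illustrative:

```python
import heapq

def cover_final_state(transitions, start, finals):
    """Best-first search over a sub-automaton, in the spirit of the A* mapper:
    an open list of candidate states and a closed list of expanded states."""
    succ = {}
    for src, _symbol, dst in transitions:
        succ.setdefault(src, []).append(dst)
    open_list = [(0, start, [start])]    # entries: (g + h, state, path so far)
    closed = set()
    while open_list:
        cost, state, path = heapq.heappop(open_list)
        if state in finals:
            return path                  # first goal popped is a shortest path
        if state in closed:
            continue
        closed.add(state)
        for nxt in succ.get(state, []):
            if nxt not in closed:
                heapq.heappush(open_list, (cost + 1, nxt, path + [nxt]))
    return None                          # no final state reachable

transitions = [("s0", "a", "s1"), ("s0", "b", "s2"),
               ("s1", "a", "s3"), ("s2", "b", "s3")]
path = cover_final_state(transitions, "s0", {"s3"})
```

With an admissible non-zero heuristic, the priority would become g + h and the same loop would expand fewer states.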
The time complexity of the intermediate states coverage depends on the total number of worker nodes N_node in the cluster and the number of processor cores N_core per node. To cover an intermediate final state s_i in the sub-automaton A_i, the time m_{i,j,k} taken by the k-th mapper assigned to the j-th core of the i-th worker is about O((|Q_i| + |T_i|) log |Q_i|), where T_i is the set of transitions of A_i. Since the mappers run in parallel and the stage is synchronized, the time Δm taken to reach all final states during the map stage is calculated as follows:

Δm = max_{i,j,k} m_{i,j,k}    (3)

Reduce stage: merging all states coverage
In the final stage, the intermediate coverages produced by the map stage are aggregated and merged on the final states that share the same key. This is performed by an adapted version of the A* reducer procedure presented in [4,3].
It takes as input the coverage path of each state, then groups and merges all paths to build the final coverage path. The coverage σ_{i,j} is extended with the coverage σ_{j,k} if the intermediate state j is both the final state reached by σ_{i,j} and the start state of σ_{j,k}, and so on until the last coverage.
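The merging step can be sketched as path stitching (a minimal sketch: the grouping by key that Spark performs across the cluster is elided, and the segment contents are illustrative):

```python
def merge_coverages(segments):
    """Chain intermediate coverage paths: a segment sigma_{j,k} is appended
    when its start state equals the final state of the coverage built so far."""
    remaining = list(segments)
    coverage = remaining.pop(0)
    while remaining:
        for seg in remaining:
            if seg[0] == coverage[-1]:   # sigma_{i,j} extended with sigma_{j,k}
                coverage = coverage + seg[1:]
                remaining.remove(seg)
                break
        else:
            break  # no segment chains on: return the partial coverage
    return coverage

# Intermediate coverages produced by three mappers, in arbitrary arrival order:
segments = [["s0", "s1", "s2"], ["s4", "s5"], ["s2", "s3", "s4"]]
full = merge_coverages(segments)
# full == ["s0", "s1", "s2", "s3", "s4", "s5"]
```

The intermediate state shared by two segments (e.g. s2) appears only once in the merged path, matching the role of boundary states as communication channels.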
When a worker node completes its tasks, it waits until the other reducers finish before sending the result to the master. The time Δr taken to merge all intermediate coverages from state s_0 to all final states s_i ∈ F of the original automaton A is therefore bounded by the slowest reducer:

Δr = max_{i,j,k} r_{i,j,k}    (4)

where r_{i,j,k} is the merge time of the k-th reducer assigned to the j-th core of the i-th worker.

Experiment results
The experiments were carried out on a Spark cluster composed of one master node and three worker nodes; the cluster configuration is given below. We used a dataset generated from the mutants-equiv-eval dataset, which corresponds to finite behavioural models of the equivalent mutant problem. Table 5 shows the number of states and transitions in the different finite behavioural models; each automaton contains at least 5% of final states to cover. Table 3 compares the execution times of the sequential and parallel approaches on a single Spark node, with the computation parallelized on a 4-core processor.
First, we note that the sequential approach takes an impracticable amount of time to achieve the coverage of all final states.
By contrast, our approach reduces the execution time significantly. For example, finding all coverage paths of the claroline automaton takes on average 6 hours with the sequential approach, which is computationally expensive, whereas our approach finds all coverage paths in 25 minutes with on-disk computing, and in 3 minutes with in-memory computing. Figure 3 shows the impact of the number of workers on the computation time. In this case, the computation is distributed across the cluster and also parallelized on the 4 cores of each worker. We observe that the runtime improves strongly as the cluster is extended from 1 to 4 workers. Adding a new Spark worker reduces the computation time; however, when moving from a very large dataset to a small automaton, the computation time is no longer affected by the addition of a new Spark node. This leads us to consider that there exists an optimal number of nodes for the full coverage of a given finite automaton.

Conclusion and perspectives
In this paper, we have proposed a parallel and distributed framework for large-scale automaton coverage. The time complexity decreases from exponential to linear time. The experimental results show that our approach is faster and works well with very large automata. However, our approach has some limitations: 1) the path optimality depends on the partitioning strategy and the number of sub-automata, see [4] for more details; 2) the computation is often memory-expensive. For future work, we plan to study the impact of automaton partitioning on the time complexity, and to propose an extended version of the framework for the coverage of timed automata and distributed systems [13].