A novel approach for fast and effective influence spread maximization in social networks

Abstract: The main characteristic of social networks is their ability to quickly spread information among a large group of people. This phenomenon is generated by the social influence that individuals exert on each other. The widespread use of online social networks (e.g., Facebook) increases researchers' interest in how influence propagates through these networks. One of the most important research issues in this field is the so-called influence maximization problem, which essentially consists in selecting the most influential users (i.e., those who are able to maximize the spread of influence through the social network). Due to its practical importance in various applications (e.g., viral marketing, targeted advertising, personalized recommendation), this problem has been studied in several variants, and different solution methodologies have been proposed. Nevertheless, the current open challenge in the resolution of the influence maximization problem still concerns achieving a good trade-off between accuracy and computational time. In this context, based on the well-known independent cascade and linear threshold models of social networks, we propose a novel low-complexity and highly accurate algorithm for selecting an initial group of nodes to maximize the spread of influence in large-scale networks. In particular, the key idea consists in iteratively removing the overlap of influence spread induced by different seed nodes. Several numerical experiments based on real datasets show that the proposed algorithm effectively finds practical near-optimal solutions of the addressed influence maximization problem in a computationally efficient fashion. Finally, a comparison with the best performing state-of-the-art algorithms demonstrates that, in large-scale scenarios, the proposed approach shows higher performance in terms of influence spread and running time.


I. INTRODUCTION
A social network is a structure composed of a group of interconnected independent agents. These networks are essential for the spread of information, opinion, and innovation. Several studies have been made in social science to correctly evaluate the spread of influence, especially related to the "word-of-mouth" process [1], [2]. The core idea of this process is the influence between people in making actions, i.e., a person may hear about a specific innovation and decide to implement it too. This diffusion process is essential for viral marketing, personalized recommendation, and target advertisement. Therefore, attention is increasing on how influence propagates in online social networks [3], [4]. For instance, in viral marketing, the best way to spread information about a product is to select a group of users that have a strong influence on the network and convince them to promote it. Due to their high influence, many of their neighbors will share information, causing a cascade process. Naturally, this process is fruitful if, with a low number of initial agents, it is possible to significantly influence the network.
The diffusion process in a network can be studied by employing different models; in the related literature, two models are particularly examined for their effectiveness: the linear threshold (LT) model and the independent cascade (IC) model [5]. Both models comprise active and inactive nodes: the former have already been influenced by the information spreading in the network and consequently attempt to spread it to their neighbors in a cascade process. In the LT model, an inactive node receives the information when a sufficiently large share of its neighbors is active. More formally, each inactive node has a threshold in the interval [0, 1], which expresses the fraction of neighbors that must be active for the node to receive the information [6]. Conversely, in the IC model each active node tries to directly influence its inactive neighbors. The success of this activation attempt is governed by a probability associated with the edge; however, each node has only one chance to influence each of its neighbors [7].
Several works analyze the probability related to the diffusion of information in social networks [8]-[10]. In particular, considerable attention is focused on determining the set of active nodes at the end of the diffusion process. Both the IC and the LT models allow estimating the influence spread by summing up the activation probabilities of all the nodes, where each activation probability is the probability that a single node gets influenced. Nevertheless, computing the exact influence spread under the IC and LT models is #P-hard [11], [12]. Hence, Monte Carlo simulation is applied in many works, even though it is time-consuming [11], [13], [14]. To cope with this issue, the authors in [15] propose the approximate SteadyStateSpread algorithm to estimate the influence spread; however, depending on the network structure, the value calculated by the algorithm may differ considerably from the exact solution of the problem.
An additional problem in these networks is selecting a group of initial nodes that maximizes the spread of information. This second problem was first addressed in [16], [17]. In [18], the authors define a greedy algorithm able to solve the influence maximization problem. The influence maximization problem is computationally hard. In fact, given the significant computational effort required, many works aim at decreasing the running time [18]. Most of these works use heuristic approaches, for instance by assuming that all nodes have relatively low influence or that the input graph is clustered [11], [19], [20]. Other works define the influence spread as a submodular process [14], [21]. Furthermore, some works attempt to solve the influence maximization problem by considering the concept of influence overlap. The influential nodes in social networks are usually concentrated [22], so that their resulting influences overlap, generating redundancy [23]. Recent studies have paid attention to reducing influence overlap by keeping nodes disconnected [24]. The problem of influence overlap is effectively handled by the greedy algorithm proposed in [18], which guarantees a near-optimal solution; however, this method is highly time-consuming and is not feasible for large-scale networks.
Summing up, the previous literature review shows that the current open challenge in the study of the influence diffusion process in social networks still concerns achieving a good trade-off between accuracy and computational time.
To fill this gap, we propose a novel low-complexity and highly accurate algorithm for influence spread computation and maximization in large-scale networks. In particular, the proposed approach is based on the reduction of influence overlap and leverages the joint probability of activations related to different nodes occurring together. At each iteration, the algorithm chooses a node to include in the seed set and then attempts to estimate and remove the influence overlap with the other candidate nodes. The approach is valid for both the IC and the LT diffusion models.
More in detail, the specific contributions of this work are summarized in the following aspects.
• For the sake of exactly computing the influence spread in a social network based on the LT diffusion model, we extend to the LT case the Path Method proposed by [25] for the IC diffusion model only.
• Leveraging the SteadyStateSpread approach proposed by [15] to iteratively calculate the influence spread in IC networks, we present its counterpart for LT models, proving its convergence to a fixed point.
• We propose a novel approach for influence maximization based on the reduction of influence overlap and on the use of the joint probability of activations to efficiently estimate the marginal gain in the influence spread computation without recalculating it at each iteration.
• We show how the level of spread probability and the topology of network graphs influence the accuracy of results in terms of both influence spread computation and influence maximization.
• We examine different combinations of algorithms for the activation probability computation with the proposed algorithm to select the seed set and solve the influence maximization problem.
• We show through several numerical experiments based on real datasets that the proposed algorithm effectively finds practical near-optimal solutions of the addressed influence maximization problem in a computationally efficient fashion. Finally, the comparison with other state-of-the-art mechanisms (including the Greedy algorithm) demonstrates that the proposed approach shows higher performance in terms of influence spread and running time in large-scale scenarios.
The rest of this work is structured as follows. In Section II we present the social network model and some preliminary definitions. In Section III we present two influence diffusion models and show the procedures to compute each node's activation probability. Section IV comprises the formulation of the influence maximization problem and the definition of the best performing algorithms that are available in the related literature to solve it; moreover, we propose a novel algorithm to solve such a problem. In Section V we show the numerical results of the conducted experiments. Lastly, in Section VI we provide the conclusions of this work.

II. PRELIMINARIES
Graphs are widely used to represent networked systems [26], [27]. Thus, let us define a social network by a directed graph $G = (V, E)$, where $V$ is the set of nodes with cardinality $N = |V|$ representing people, and $E \subseteq V \times V$ is the set of edges describing the social connections between pairs of people. In the sequel, we refer to nodes and edges, putting aside the words people and social connections.
When node $j$ is influenced by node $i$, an edge denoted by $(i, j) \in E$ exists. Finally, let us define by $N^{in}_j = \{i \in V \mid (i, j) \in E\}$ the in-neighbors of node $j$ and by $N^{out}_j = \{i \in V \mid (j, i) \in E\}$ the out-neighbors of node $j$.
Let us now recall from [28] some definitions used in the rest of this paper.
Definition 1: Given a social network model with directed graph $G = (V, E)$, we describe the diffusion model (DM) as the stochastic process that captures the spread of information in the network starting from a seed set $\phi_0 \subseteq V$ that contains the initially activated nodes.
Definition 2: For a given diffusion model $G_{DM} = (V, E, DM)$ and a seed set $\phi_0 \subseteq V$, the probability that a node $j \in V$ is activated during the diffusion process is defined as its activation probability $\pi_j(\phi_0)$.
Definition 3: For a given diffusion model $G_{DM} = (V, E, DM)$ and a seed set $\phi_0 \subseteq V$, the influence spread across the network $\sigma(\phi_0)$ is defined as follows:
$$\sigma(\phi_0) = \sum_{j \in V} \pi_j(\phi_0).$$
Definition 4: Assuming that $\sigma(\cdot)$ is a non-negative set function, we have that for every possible seed set [...]
Theorem 1 ([29]): Assuming that $\sigma(\cdot)$ is a non-negative set function, we have that [...]
Theorem 2 ([29]): Assuming that $\sigma(\cdot)$ is a non-negative set function, we have that [...]

III. INFLUENCE DIFFUSION MODELS
Several models are available in the literature to represent the spread of information in a network. These models rely on different mechanisms to capture the switching mechanism from inactive to active node status. This section focuses on the two most common diffusion models: the Independent Cascade (IC) model and the Linear Threshold (LT) model.

A. Independent Cascade Model
The IC model is one of the most studied probabilistic information diffusion models and is based on the assumption that information flows over the network through cascades. This means that a node can be triggered by a specific piece of information only if one of its in-neighbors was triggered before. In the IC model, nodes can have two states: active (i.e., already influenced by the information) or inactive (i.e., not influenced). The process runs over discrete steps. Initially, i.e., at time step $k = 0$, most nodes are inactive; only a few nodes are active and are known as seed nodes. Let us denote as $\phi_0 \subseteq V$ the seed set containing all the initially active nodes. If node $i$ becomes active at step $k$, it has one and only one chance to activate each of its inactive out-neighbors $j \in N^{out}_i$, with a propagation probability $p_{i,j}$. The propagation probabilities $p_{i,j}$, $\forall (i, j) \in E$, measure the influence strength of the edges; they are collected in the matrix $P$. If node $i$ successfully activates one of its out-neighbors $j \in N^{out}_i$ at time $k$, then $j$ becomes an active node at the next time step $k + 1$. Note that a node can change its state from inactive to active but not from active to inactive. This cascading process continues until no more activations occur in the network [30]. More formally, we can denote the considered IC model by $G_{IC} = (V, E, P)$.
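A minimal Python sketch of a single IC cascade may help fix ideas. The function name `simulate_ic`, the dictionary-based graph encoding, and the three-node chain are illustrative assumptions and do not come from the original work.

```python
import random

def simulate_ic(out_neighbors, p, seeds, rng=None):
    """Run one IC cascade and return the set of nodes active at the end.

    out_neighbors: dict node -> list of out-neighbors (hypothetical encoding)
    p: dict (i, j) -> propagation probability of edge (i, j)
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(active)  # nodes that became active at the current step k
    while frontier:
        next_frontier = []
        for i in frontier:
            for j in out_neighbors.get(i, ()):
                # i has a single chance to activate each inactive out-neighbor j
                if j not in active and rng.random() < p[(i, j)]:
                    active.add(j)
                    next_frontier.append(j)
        frontier = next_frontier  # these nodes act at step k + 1
    return active

# Hypothetical three-node chain 1 -> 2 -> 3
out_neighbors = {1: [2], 2: [3]}
p = {(1, 2): 0.4, (2, 3): 0.5}
print(simulate_ic(out_neighbors, p, seeds={1}, rng=random.Random(0)))
```

Because only the nodes activated at the previous step are kept in the frontier, each edge is tried at most once, which mirrors the one-chance rule of the model.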

B. The Linear Threshold Model
Together with the IC model, the LT model is one of the most studied diffusion models. It is used in several variants; the most common is the so-called submodular LT model. In such a model the basic idea is that a user can switch its status from inactive to active if a sufficient number of its incoming neighbors are active. Similarly to the IC model, nodes can have two states: active (i.e., already influenced by the information) or inactive (i.e., not influenced). The process runs over discrete steps. Just like in the IC model, initially, i.e., at time $k = 0$, all nodes are inactive, except for a few nodes which are contained in the seed set $\phi_0 \subseteq V$. A weight $b_{i,j}$, which measures the influence strength of the edge, is associated with each edge $(i, j)$; all the weights are collected in the matrix $B$ and satisfy
$$\sum_{i \in N^{in}_j} b_{i,j} \le 1 \quad \forall j \in V,$$
i.e., for each node the sum of the weights of all its in-neighbors must not exceed one.
Considering an instance of the diffusion process, the LT model first randomly samples for each node $j$ a threshold value $\theta_j \in [0, 1]$. Then, at step $k$, all nodes that were active in step $k - 1$ remain active, and any node $j$ that was inactive in step $k - 1$ switches to active if the total weight of its in-neighbors activated at step $k - 1$ is at least $\theta_j$. More in detail, the activation succeeds if the sum of the weights of the incoming active neighbors is greater than or equal to the node's threshold, i.e.,
$$\sum_{i \in N^{in}_j(k-1)} b_{i,j} \ge \theta_j,$$
where $N^{in}_j(k-1)$ denotes the set of in-neighbors of $j$ activated at step $k - 1$. This process continues until no more activations are possible. More formally, we can denote the considered LT model by $G_{LT} = (V, E, B)$.
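The following Python sketch illustrates one run of an LT process. It assumes the usual LT convention in which a node compares its threshold with the cumulative weight of all its currently active in-neighbors, and it draws thresholds uniformly from [0, 1); the function name `simulate_lt` and the data layout are hypothetical.

```python
import random

def simulate_lt(nodes, in_neighbors, b, seeds, rng=None):
    """Run one LT diffusion and return the set of nodes active at the end.

    in_neighbors: dict node -> list of in-neighbors (hypothetical encoding)
    b: dict (i, j) -> weight of edge (i, j); in-weights of each node sum to <= 1
    """
    rng = rng or random.Random()
    # Assumption: thresholds are uniform on [0, 1) and sampled once per run
    theta = {j: rng.random() for j in nodes}
    active = set(seeds)
    while True:
        newly_active = set()
        for j in nodes:
            if j in active:
                continue
            # Cumulative weight of the in-neighbors that are active so far
            weight = sum(b[(i, j)] for i in in_neighbors.get(j, ()) if i in active)
            if weight >= theta[j]:
                newly_active.add(j)
        if not newly_active:  # no more activations are possible
            return active
        active |= newly_active

# Hypothetical example: node 3 is influenced by nodes 1 and 2
nodes = [1, 2, 3]
in_neighbors = {3: [1, 2]}
b = {(1, 3): 0.5, (2, 3): 0.4}
print(simulate_lt(nodes, in_neighbors, b, seeds={1, 2}, rng=random.Random(0)))
```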

C. Exact Activation Probability Computation
In [25] the authors propose the Path Method, an algorithm to exactly compute the value of $\pi_j(\phi_0)$ for IC networks. This algorithm takes into account all the possible evolutions of the information spread in a social network. In the first part, an evolution graph for the model is created; this is made of cells as shown in Fig. 1(a). Each cell is divided into three parts: the set of past active nodes $A^p_k$, containing all the nodes activated before the current step $k$; the set of active nodes $A^c_k$ at time $k$; and the cell probability $P_k$ (i.e., the probability of the evolution contained in $A^p_k$ and $A^c_k$). The cells that have an empty current active set are the terminal cells. The exact activation probability of node $j$ can be computed by summing up the cell probabilities of all the terminal cells whose set of past active nodes contains node $j$. For instance, we show such a procedure with reference to the simple IC model in Fig. 1(b), where the corresponding directed graph is composed of three nodes. In particular, Fig. 2 shows the resolution steps of the Path Method for the aforementioned example when the seed set includes only node 1, i.e., $\phi_0 = \{1\}$. In this example, we show the graph's evolution through discrete time steps, and we indicate the terminal cells in red. For instance, if we consider node 3, its activation probability $\pi_3(\{1\})$ is the sum of the probabilities $P_k$ of all the terminal cells whose set $A^p_k$ contains node 3; namely, $\pi_3(\{1\}) = 0.06 + 0.07 + 0.216 + 0.024 = 0.37$.
The Path Method was explicitly studied for the IC diffusion model; however, it can be easily modified to take into account the LT model, as shown in Fig. 3 for the aforementioned example.
The computational time of the Path Method is $O(6^N)$ [25]. This exponential complexity makes the approach infeasible for large-scale networks.
Fig. 3. Evolution graph related to the network shown in Fig. 1(b) with the LT diffusion model.
As mentioned above, the Path Method is able to calculate the exact activation probability of each node; however, it may be infeasible even for small networks. An alternative approach to estimate the exact activation probability is the use of the well-known Monte Carlo technique. In fact, the IC and the LT models are stochastic diffusion processes that lead to different outcomes at each run. However, on average, the final results represent the exact activation probabilities. More in detail, for each run $r$ of the stochastic diffusion model, let us define by $\pi^r_j(\phi_0)$ the corresponding result; then, by performing $R$ runs, we can estimate the activation probabilities by:
$$\pi_j(\phi_0) \approx \frac{1}{R} \sum_{r=1}^{R} \pi^r_j(\phi_0).$$
The reliability of this approach increases with the number of runs; however, the required computational time makes it practically infeasible for large-scale networks.
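As an illustration of the Monte Carlo estimate, the sketch below averages the outcome of repeated IC cascades: each run contributes 1 for a node that ends up active and 0 otherwise. The helper names (`ic_cascade`, `monte_carlo_activation`), the fixed random seed, and the one-edge example are assumptions made only for this example.

```python
import random
from collections import Counter

def ic_cascade(out_neighbors, p, seeds, rng):
    """One IC run: each newly active node gets one attempt per out-neighbor."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for i in frontier:
            for j in out_neighbors.get(i, ()):
                if j not in active and rng.random() < p[(i, j)]:
                    active.add(j)
                    nxt.append(j)
        frontier = nxt
    return active

def monte_carlo_activation(out_neighbors, p, seeds, runs=10000, seed=0):
    """Estimate pi_j(phi_0) as the fraction of runs in which j ends up active."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(runs):
        hits.update(ic_cascade(out_neighbors, p, seeds, rng))
    return {j: hits[j] / runs for j in hits}

# Hypothetical single edge 1 -> 2 with propagation probability 0.3
estimate = monte_carlo_activation({1: [2]}, {(1, 2): 0.3}, seeds={1})
spread = sum(estimate.values())  # estimated influence spread sigma({1})
print(estimate, spread)
```

The estimate for node 2 should be close to 0.3, with a statistical error that shrinks as the number of runs grows.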

D. Approximate Activation Probability Computation
As mentioned, computing the activation probability is a hard problem because it cannot be easily expressed in closed form. In [15] the authors propose the SteadyStateSpread (SSS) algorithm, which estimates the activation probability of each node. In the given network, node $j$ may receive the influence from any of its in-neighbors, each with a specific activation probability. Therefore, a node does not assimilate the information only when none of its in-neighbors transmits it. Assuming that the activations of the in-neighbors are independent events and that they do not depend on the activation of node $j$, we calculate the probability that none of the in-neighbors of node $j$ transmits the information by:
$$\prod_{i \in N^{in}_j} \left(1 - p_{i,j}\, \pi_i(\phi_0)\right).$$
Thus, we have:
$$\pi_j(\phi_0) = 1 - \prod_{i \in N^{in}_j} \left(1 - p_{i,j}\, \pi_i(\phi_0)\right). \tag{5}$$
As proposed in [15], the activation probability of all the nodes in a network can be computed iteratively based on (5) by setting:
$$\pi^{+}_j(\phi_0) = 1 - \prod_{i \in N^{in}_j} \left(1 - p_{i,j}\, \pi_i(\phi_0)\right),$$
where the plus symbol (+) marks the updated value of $\pi_j(\cdot)$ at each iteration. However, the SSS approach is able to compute the activation probability of all the nodes in a network only when the diffusion model is the IC. In fact, this approach does not apply to the LT models, where different assumptions hold. To cope with this issue, let us propose a novel approach, inspired by the classic SSS, named Linear Threshold SteadyStateSpread (LT-SSS). Similarly to the classic SSS, this approach iteratively calculates the activation probability of all the nodes in a network: [...] (7)
The approaches described above do not lead to the computation of the exact activation probability, due to the assumption made on the independence of the in-neighbor activation events [13]. The time complexity of this approach is $O(N^2)$ [15].
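The iterative scheme for the IC case can be sketched in Python as a plain fixed-point iteration, in which each non-seed node is updated to one minus the probability that none of its in-neighbors transmits the information. Keeping seed nodes fixed at probability 1, the stopping tolerance, and the function name `steady_state_spread_ic` are assumptions of this sketch rather than details taken from [15].

```python
def steady_state_spread_ic(nodes, in_neighbors, p, seeds, tol=1e-9, max_iter=1000):
    """Approximate IC activation probabilities by iterating the SSS update.

    in_neighbors: dict node -> list of in-neighbors (hypothetical encoding)
    p: dict (i, j) -> propagation probability of edge (i, j)
    """
    # Assumption: seeds start (and stay) at 1, all other nodes start at 0
    pi = {j: (1.0 if j in seeds else 0.0) for j in nodes}
    for _ in range(max_iter):
        new_pi = {}
        for j in nodes:
            if j in seeds:
                new_pi[j] = 1.0
                continue
            # Probability that no in-neighbor transmits, assuming independence
            none_transmits = 1.0
            for i in in_neighbors.get(j, ()):
                none_transmits *= 1.0 - p[(i, j)] * pi[i]
            new_pi[j] = 1.0 - none_transmits
        converged = max(abs(new_pi[j] - pi[j]) for j in nodes) < tol
        pi = new_pi
        if converged:
            break
    return pi

# Hypothetical chain 1 -> 2 -> 3: the estimate is exact here (no shared paths)
pi = steady_state_spread_ic([1, 2, 3], {2: [1], 3: [2]},
                            {(1, 2): 0.5, (2, 3): 0.4}, seeds={1})
print(pi)  # node 2 -> 0.5, node 3 -> 0.5 * 0.4 = 0.2
```

On graphs where several paths share nodes, the independence assumption no longer holds and the result is only an approximation.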
Theorem 3: If $a_n$ is a monotone sequence of real numbers, then $a_n$ is convergent if and only if $a_n$ is bounded.
Proposition 1: The sequence generated by the LT-SSS algorithm, i.e., the system (7), converges to a unique fixed point.
Proof: Let us assume that the activation probabilities $\pi_i(\phi_0)$ $\forall i \in V$ are always positive and bounded in the interval [0, 1]. As we defined the weights such that $\sum_{i \in N^{in}_j} b_{i,j} \le 1$, the activation probability of every node at any subsequent step, $\pi^{+}_j(\phi_0)$, is also bounded in [0, 1], and the following relation holds: $\pi^{+}_j(\phi_0) \ge \pi_j(\phi_0)$. Therefore, it is just necessary to show that the value of $\pi_i(\phi_0)$ $\forall i \in V$ at the first iteration is bounded in [0, 1], as it is in the seed set definitions: [...]

IV. FORMULATION AND RESOLUTION OF THE INFLUENCE MAXIMIZATION PROBLEM
Let us now formulate the influence maximization problem, given a diffusion model, the definition of information spread, and an algorithm to compute an estimate of the activation probabilities. The problem is to find $S$ initial nodes that maximize the influence spread through the network. More formally, we can define the maximization problem as:
$$\max_{\phi_0 \subseteq V,\ |\phi_0| = S} \sigma(\phi_0),$$
where $S$ is the chosen cardinality of the seed set. Several algorithms are available in the related literature to approximately solve this problem [31]. In this section we first recall two well-known approaches (namely the SelectTopS and the Greedy algorithms); subsequently, we present a novel algorithm to effectively find near-optimal solutions to the influence maximization problem in a computationally efficient fashion.

A. SelectTopS Algorithm
The easiest approach to solve the maximization problem is to consider each node $i$ as an initial seed set $\phi_0 = \{i\}$. Then it is possible to compute the activation probability $\pi_j(\{i\})$ of each node $j \in V$, and select the $S$ nodes that yield the highest influence spread $\sigma(\{i\}) = \sum_{j \in V} \pi_j(\{i\})$. The formal definition of the SelectTopS algorithm is reported in Algorithm 1 [28].
Let us denote by $T$ the time complexity needed to calculate the activation probability across a given network of $N$ nodes. Since the influence spread must be calculated once for every node taken as a singleton seed set, the time complexity of the SelectTopS algorithm is $O(NT)$.
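A compact Python sketch of the selection step described above is given below. The callable `activation_prob`, which stands in for any routine returning the activation probabilities for a given seed set, and the toy probability table are hypothetical and used only for illustration.

```python
import heapq

def select_top_s(nodes, activation_prob, S):
    """Rank nodes by singleton influence spread and return the best S.

    activation_prob: callable mapping a seed set to a dict node -> probability
    (placeholder for Monte Carlo, SSS, or any other estimator).
    """
    # One activation-probability computation per node: N calls in total
    spread = {i: sum(activation_prob({i}).values()) for i in nodes}
    return heapq.nlargest(S, nodes, key=spread.get)

# Toy table of singleton activation probabilities (made-up values)
TABLE = {
    1: {1: 1.0, 2: 0.5, 3: 0.2},
    2: {2: 1.0, 3: 0.4},
    3: {3: 1.0},
}

def toy_activation_prob(seed_set):
    (i,) = seed_set  # singleton seed sets only in this toy example
    return TABLE[i]

print(select_top_s([1, 2, 3], toy_activation_prob, S=2))  # [1, 2]
```

Since the ranking ignores how much the spreads of the selected nodes overlap, the chosen seeds may largely influence the same nodes.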

B. Greedy Algorithm
Let us now recall the Greedy algorithm presented in [18]. The definition of Greedy algorithm is reported in Algorithm 2 and is summarized in the sequel. The algorithm starts with the empty seed set φ 0 , and then iteratively selects a node which is not currently in φ 0 , whose inclusion implies the highest marginal increment in σ(·). The process of recalculating the activation probability for the entire network for each node in the final seed set causes the algorithm's time inefficiency. Recalling that T is the time complexity required to compute the activation probability across the given network with N nodes, the time complexity for the Greedy algorithm is O(SN T ) [18].
This algorithm guarantees, under the conditions described in the following theorem, that the obtained influence spread approximates the optimal value by a factor (1 − 1/e), where e denotes the base of the natural logarithm; however, the algorithm presents a significant computational time.
Theorem 4 ([18]): For a non-negative, monotone, and submodular influence function $\sigma(\cdot)$, let $\phi_0$ be a set of size $S$ obtained by the greedy algorithm. Then $\sigma(\phi_0) \ge (1 - 1/e)\, \sigma(\phi^*_0)$, where $\phi^*_0$ denotes an optimal seed set of size $S$.

Corollary 1 ([18]): For an arbitrary instance of the IC model or the submodular LT model, the resulting influence function $\sigma(\cdot)$ is monotone and submodular.
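The greedy selection recalled in this subsection can be sketched in Python as follows. The `spread` callable is a placeholder for any influence spread estimator, and the coverage-style toy function (which is monotone and submodular) is an assumption used only to make the example self-contained.

```python
def greedy_seed_selection(nodes, spread, S):
    """Pick S seeds, each time adding the node with the largest marginal gain.

    spread: callable mapping a seed set to its (estimated) influence spread;
    it is re-evaluated for every candidate at every iteration, which is the
    source of the O(S N T) cost.
    """
    seeds = []
    for _ in range(min(S, len(nodes))):
        candidates = [i for i in nodes if i not in seeds]
        current = set(seeds)
        best = max(candidates, key=lambda i: spread(current | {i}))
        seeds.append(best)
    return seeds

# Toy coverage function: each node "influences" a fixed set of items
COVER = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}

def toy_spread(seed_set):
    covered = set()
    for i in seed_set:
        covered |= COVER[i]
    return len(covered)

print(greedy_seed_selection(["a", "b", "c"], toy_spread, S=2))  # ['a', 'c']
```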

C. The Novel Proposed Algorithm for Influence Maximization
As described above, the Greedy algorithm requires a high computational time, which increases linearly over the size of the final seed set S. In this paper, we propose a novel algorithm based on approximately maximizing the submodular function through an iterative selection of nodes for the final seed set. The proposed approach, denoted as Reduced Influence Overlap (RIO) algorithm, is able to solve the influence maximization problem within a shorter computational time, while providing a good approximation of the optimal solution.
First, we describe the core idea of the proposed algorithm with reference to the simple graph shown in Fig. 4 under the IC model, where node $c$ may be influenced by node $a$ and node $b$. It is apparent that the activation probability of node $c$ is $\pi_c(\{a\})$ and $\pi_c(\{b\})$ when $a$ and $b$ are the initial nodes, respectively.
Assuming that $a$ is the only node in the seed set, i.e., $\phi_0 = \{a\}$ and $\hat{\pi}_c(\phi_0) = \pi_c(\{a\})$, we want to evaluate the new activation probability when $b$ is added to the seed set. In this simple example, such a value can be calculated as:
$$\hat{\pi}_c(\phi_0 \cup \{b\}) = 1 - \left(1 - \hat{\pi}_c(\phi_0)\right)\left(1 - \pi_c(\{b\})\right). \tag{11}$$
However, we highlight that the proposed algorithm is based on the assumption that the activations of the in-neighbors of node $j$ are independent events. Hence, equation (11) is not accurate in social networks due to the non-independence between nodes. Nevertheless, (11) gives a good basis to develop an estimate of how the activation probabilities change when a node is included in the seed set.
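Under the independence assumption discussed above, the probability that node $c$ is activated by at least one of two sources is the complement of the probability that neither of them activates it. The short Python sketch below evaluates this combination; the function name `merged_activation` and the numerical values are illustrative assumptions.

```python
def merged_activation(pi_current, pi_candidate):
    """Activation probability after adding a seed, assuming independent events.

    pi_current:   estimated activation probability under the current seed set
    pi_candidate: activation probability when the candidate alone is the seed
    """
    return 1.0 - (1.0 - pi_current) * (1.0 - pi_candidate)

# Made-up values: node c is reached with probability 0.3 from a and 0.5 from b
print(merged_activation(0.3, 0.5))  # 1 - 0.7 * 0.5 = 0.65
```

The result is smaller than the plain sum 0.3 + 0.5, and the difference (the product 0.15) is the overlap of influence that the two seeds share on node $c$.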
In the example mentioned above, by means of (11), we can exactly compute the new activation probability without a second run of the activation spread algorithm (e.g., the Path Method, the SteadyStateSpread, or the Monte Carlo approach), because the two activation events are independent. However, when analyzing the network in Fig. 1 with the IC model, the assumption on the independence of the activation events does not hold. The node with the highest influence spread is clearly node 1, for which $\pi_2(\{1\}) = 0.224$ and $\pi_3(\{1\}) = 0.370$. Consequently, this is the first node to be added to the seed set. Next, we aim to evaluate the activation probability of node 3 when node 2 is included in the seed set, i.e., $\pi_3(\{1\} \cup \{2\})$, without running the influence spread algorithm another time (as is done in the Greedy algorithm). By employing (11) [...]
The definition of the RIO algorithm is reported in Algorithm 3 and is summarized in the sequel. In the first phase (Algorithm 3, lines 1-5), following the SelectTopS algorithm, we calculate the activation probability $\pi_j(\{i\})$ of each node $j \in V$ by considering each node $i$ as a seed set. Moreover, we initialize the seed set $\phi_0$ to the empty set and $\hat{\sigma}(\phi_0)$ and $\hat{\pi}_j(\phi_0)$ to zero (Algorithm 3, lines 6-8). In the second phase (Algorithm 3, lines 9-16), we estimate by means of (11) the new activation probability of all the nodes in the network when an additional node is included in the seed set, and we compute the new value of the influence spread function. Next, we select the seed node $n$ with the highest expected gain of information spread, and we include it in the final seed set $\phi_0$. In the last phase (Algorithm 3, lines 17-18), we update the activation probability of all the nodes $j \in V$ when the node $i \in V \setminus \phi_0$ is the seed node, the influence spread function, and the activation probability.
Finally, we provide the complexity analysis for the proposed algorithm. It is apparent that the complexity of the RIO algorithm is dominated by the first phase (where the activation probabilities are determined), whilst the second and third phases contain negligible operations. Therefore, recalling that $T$ denotes the time complexity for calculating the activation probability across the given network of $N$ nodes, the time complexity of the RIO algorithm is $O(NT)$.

Algorithm 3 RIO algorithm
Input: $G_{IC}$, $S \in \mathbb{N}^+$
1: for $i \in V$ do
2:   for $j \in V$ do
3:     Compute $\pi_j(\{i\})$
4:   end for
5: end for
6: Initialize $\phi_0 \leftarrow \emptyset$
7: Initialize $\hat{\sigma}(\phi_0) \leftarrow 0$
8: Initialize $\hat{\pi}_j(\phi_0) \leftarrow 0$ $\forall j$
9: for $s = 1$ to $S$ do
10:   for $i \in V \setminus \phi_0$ do
11: [...]
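To make the flow of the phases more concrete, the following Python sketch implements one possible reading of the textual description: singleton activation probabilities are computed once, candidates are scored by merging them with the current estimate under the independence assumption, and the best candidate is added to the seed set. It is not the authors' implementation; in particular, any additional overlap-removal update of the candidates' own probabilities in the last phase is omitted, and the names `rio_sketch` and `singleton_prob` as well as the toy values are hypothetical.

```python
def rio_sketch(nodes, singleton_prob, S):
    """Seed selection that reuses singleton probabilities (simplified reading).

    singleton_prob: callable i -> dict j -> pi_j({i}); called once per node,
    so the expensive estimator runs only N times overall.
    """
    # Phase 1: activation probabilities for every singleton seed set
    pi = {i: singleton_prob(i) for i in nodes}
    seeds = []
    current = {j: 0.0 for j in nodes}  # estimated pi_j(phi_0), phi_0 empty
    for _ in range(min(S, len(nodes))):
        best, best_spread, best_estimate = None, -1.0, None
        for i in nodes:
            if i in seeds:
                continue
            # Phase 2: merge candidate i with the current estimate,
            # assuming independent activation events
            estimate = {j: 1.0 - (1.0 - current[j]) * (1.0 - pi[i].get(j, 0.0))
                        for j in nodes}
            spread = sum(estimate.values())
            if spread > best_spread:
                best, best_spread, best_estimate = i, spread, estimate
        # Phase 3 (simplified): keep the merged estimate for the next round
        seeds.append(best)
        current = best_estimate
    return seeds, sum(current.values())

# Made-up singleton probabilities: nodes 1 and 2 largely overlap on node 3
TOY = {
    1: {1: 1.0, 2: 0.9, 3: 0.9},
    2: {2: 1.0, 3: 0.9},
    3: {3: 1.0},
    4: {4: 1.0},
}
print(rio_sketch([1, 2, 3, 4], TOY.get, S=2))
```

In this toy case a ranking by singleton spread would pick nodes 1 and 2, whereas the overlap-aware score prefers node 4 as the second seed, because node 2 adds little beyond what node 1 already reaches.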

V. NUMERICAL EXPERIMENTS
This section on numerical experiments is divided into three parts. In the first part, we present the datasets. Then, we analyze the activation probability computation: in particular, we compare the influence spread value obtained with the SSS and the LT-SSS approaches with respect to the Monte Carlo simulation results. Lastly, we evaluate the proposed RIO algorithm in terms of the seed set selection and computational time.

A. Datasets Description
In our experiments, we adopt two sets of regular networks and two public real network datasets [34], [35].
The first set of networks is a series of rings composed of 8 nodes connected in a circle; each node is connected symmetrically with its neighbors. Let us define the network parameter $c$ that indicates the number of connected neighbors; for instance, we show a ring with $c = 4$ in Fig. 5(a). We assume $p_{i,j} = b_{i,j} = 0.1$ for all edges. It should be noted that, since $c$ is at most 6, we have $\sum_{i \in N^{in}_j} b_{i,j} = 0.6 \le 1$, which still preserves the submodularity of the LT model. The second synthetic network is a bidirectional grid graph of 16 nodes. For this network we define an additional parameter $d \in [0, 1]$ to describe the level of influence spread probability in the overall network. More in detail, we calculate the probability for the IC model and the weights for the LT model as [...], where $|N^{in}_j|$ is the number of in-neighbors of node $j$.
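For illustration, a ring network of this kind could be generated as in the Python sketch below. It assumes that the $c$ neighbors of a node are its $c/2$ nearest nodes on each side of the circle (with $c$ even), which is one natural reading of the symmetric connection; the function name `ring_network` and the edge-dictionary encoding are hypothetical.

```python
def ring_network(n=8, c=4, weight=0.1):
    """Build a directed ring where each node is linked to c nearest neighbors.

    Assumption: c is even and neighbors are the c/2 closest nodes on each side.
    Returns a dict (i, j) -> weight, usable as p (IC) or b (LT).
    """
    edges = {}
    for i in range(n):
        for step in range(1, c // 2 + 1):
            for j in ((i + step) % n, (i - step) % n):
                edges[(i, j)] = weight  # symmetric: (j, i) is added from node j
    return edges

ring = ring_network(n=8, c=6)
# Total incoming LT weight of node 0: 6 in-neighbors with weight 0.1 each
in_weight = sum(w for (i, j), w in ring.items() if j == 0)
print(len(ring), round(in_weight, 10))  # 48 edges, in-weight 0.6
```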
The first real dataset is called Hamsterster (1858 nodes, 12534 edges) and is related to the homonymous website created in 2003. The website worked as a social network for pets, where owners create friendships between their animals. In this dataset, pets are the nodes, while their friendships are the edges. We adopt the IC and the LT models to simulate the information diffusion process; however, since we do not have the data to infer the weights of the edges, we use the Trivalency approach in [11] to randomly assign the probability for the IC model by selecting $p_{i,j}$ from the discrete set {0.01, 0.1, 0.5}. Moreover, to ensure the submodularity of the LT model, we calculate the weights as $b_{i,j} = \frac{p_{i,j}}{\sum_{j \in V} p_{i,j}}\, r$, where $r \in [0, 1]$ is a random value.
Second, we use the ego-Facebook real dataset (4039 nodes and 88234 edges). This dataset consists of anonymized friends lists from Facebook. The dataset includes nodes (profiles), circles, and ego networks. Data are anonymous. Moreover, feature vectors from this dataset have been provided, even though the interpretation of those features has been obscured. For instance, if the original dataset contained a feature "political=Democratic Party", the new data would contain "political=feature 1". Thus, using the anonymized data it is possible to determine whether two users have the same political affiliations. The dataset does not contain any direct information related to the strength of the social connection between two different users; however, we estimate this value by employing the anonymized features, i.e., the more features two nodes share, the stronger their connection.

B. Activation Probability Computation
In the first part of the simulations, we analyze the reliability of the approximate approaches (i.e., the SSS and the LT-SSS) in computing the activation probability. More in detail, we evaluate the impact of the influence spread level and of the network topology by changing the parameters $c$ and $d$ of the two regular networks. In order to correctly assess the performance of the proposed approximate approach, we compare the corresponding results with those achieved by the Monte Carlo method over 10000 runs. Adding one node at a time to the seed set, we evaluate the value of the influence spread calculated by the SSS and the Monte Carlo method. However, the relative errors between these two techniques are highly dependent on the order used to include the nodes in the seed set. Therefore, we perform a simulation for every combination and calculate the confidence interval of the resulting relative error. In Fig. 6 we show the results corresponding to the ring network in Fig. 5(a) with $c \in \{2, 4, 6\}$, using the IC and the LT models. It is apparent that the relative error of the SSS and the LT-SSS approaches with respect to the Monte Carlo method increases when additional connections are included in the network. In fact, as the number of connections increases, the hypothesis on the independence of the activation events becomes weaker.
Moreover, Fig. 7 shows the results corresponding to the grid network in Fig. 5(a) when $d \in \{0.3, 0.5, 0.7\}$. In this case, increasing the level of probability spread in the network negatively impacts the accuracy of the approximate approaches.
It should be noted that the LT-SSS shows a higher relative error than its IC counterpart in both the ring and the grid network simulations. These experiments indeed employ networks that are considerably different from real social networks. In fact, social networks are characterized by a large number of nodes and a relatively small propagation probability [13]. Moreover, the influence spread is clustered; therefore, a local error would not profoundly affect the influence spread function. As evidence of this, in Fig. 8 we show the influence spread for the two real networks computed by the SteadyStateSpread and the Monte Carlo approaches. It is evident that the approximate approaches are highly accurate in both the IC and the LT models.

C. Influence Maximization and Nodes Selection
In the second part of the simulations, we examine the influence maximization problem, assessing the performance of the RIO algorithm. In particular, we show the influence spread given by the nodes selected by different state-of-the-art algorithms. The experiment is divided into two parts. In the first part, we select the seed nodes by employing one of the algorithms mentioned above, which may rely on an approximation of the influence spread function (e.g., the RIO algorithm). In the second part, we evaluate the real influence spread of each seed set selected by the different algorithms by performing a Monte Carlo simulation with 10000 runs.
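The two-part protocol described above (select, then validate) can be sketched as follows for the LT model. This is only an illustrative sketch under stated assumptions: the highest-degree selection is a simple stand-in and is not the RIO, SelectTopS, or Greedy algorithm, and the uniform influence weights 1/deg(v), the uniformly drawn thresholds, and all function names are hypothetical choices made for the example.

```python
import random


def top_degree_seeds(graph, k):
    """Stand-in selection step: pick the k highest-degree nodes (not the RIO algorithm)."""
    return sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k]


def simulate_lt(graph, seeds, rng):
    """One linear threshold run; returns the number of activated nodes.

    Assumptions: each neighbour u of v has weight 1/deg(v), and each node
    draws a threshold uniformly from (0, 1].
    """
    thresholds = {v: 1.0 - rng.random() for v in graph}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in graph:
            if v in active or not graph[v]:
                continue
            # Total incoming weight from currently active neighbours.
            weight = sum(1 for u in graph[v] if u in active) / len(graph[v])
            if weight >= thresholds[v]:
                active.add(v)
                changed = True
    return len(active)


def validate_lt(graph, seeds, runs=10000, seed=0):
    """Validation step: Monte Carlo estimate of the real influence spread of a seed set."""
    rng = random.Random(seed)
    return sum(simulate_lt(graph, seeds, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    # Small undirected example graph given as an adjacency dictionary.
    graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
    seeds = top_degree_seeds(graph, 2)        # part one: selection
    print(seeds, validate_lt(graph, seeds))   # part two: validation
```

Keeping the validation step independent of the selection step is what makes the seed sets of different algorithms comparable, since each algorithm may internally use a different approximation of the spread.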
In Fig. 9 we show the influence spread for the ring network with c ∈ {2, 4, 6}, under the IC and the LT models. In this case, increasing c reduces the difference between the SelectTopS and the RIO. However, the differences between the RIO and the Greedy are always small, even when the hypothesis on the independence of the activation events is not met.
Moreover, Fig. 10 shows the results for the grid network when d ∈ {0.3, 0.5, 0.7}. In this case, a higher level of probability spread in the network makes the results of the RIO algorithm poorer than those of the Greedy.
The first simulations, which use the artificial datasets, cannot fully show the advantage of the proposed approach due to the small number of nodes. Therefore, let us show the results obtained with the real datasets. As before, the simulations are made in two parts. In the first part, a combination of algorithms for the influence spread computation and the influence maximization is used to select the most influential users. In the second part, the obtained seed sets are validated with a Monte Carlo simulation. Let us remark that the validation phase is needed because the influence spread computed with the SteadyStateSpread only approximates the real influence spread function, so the values cannot be compared directly. The first test is conducted on the Hamstested dataset, for both the IC and the LT models. The simulation shows the performance of our algorithm in terms of influence spread and computational time. In particular, we show in Fig. 11 that the results obtained by the RIO algorithm are comparable to those obtained by the Greedy algorithm, while requiring approximately the computational time of the SelectTopS. Figures 11(a) and 11(c) show that the proposed algorithm works for all the tested seed set cardinalities, providing high values of the influence spread while requiring low computational time. Note that the time values in Figs. 11(b) and 11(d) are presented on a logarithmic scale; therefore, the gap between the processing times required by the three analyzed algorithms is significant.
Lastly, in Figs. 12(a)-12(d) we report the simulation results for the ego-Facebook dataset under the IC and the LT models. The corresponding results confirm the findings achieved in the previous simulations.

VI. CONCLUSIONS
This paper proposes a novel algorithm for efficiently solving the influence maximization problem in large social networks under the well-known independent cascade (IC) and linear threshold (LT) models. On the one hand, the paper fills a gap in the existing literature, which lacks investigations on low-complexity methods for the computation of the influence spread in the LT model. On the other hand, the work advances the knowledge on approaches that select an initial group of nodes maximizing the spread of influence in large-scale social networks. The complexity analysis and the application to numerical experiments based on real datasets show that the proposed algorithm finds near-optimal solutions of the influence maximization problem in a computationally efficient way, for both the IC and the LT models, with higher performance in running time and influence spread than recent state-of-the-art approaches.
Future research will address providing an approximation bound on the influence spread, assessing the scalability of the proposed algorithm, and modeling the uncertainties that may affect the decision parameters.