Searching for a new probability distribution for modeling non-scale-free heavy-tailed real-world networks

Perhaps the most controversial topic in network science research is whether real-world complex networks are scale-free. Recently, Broido and Clauset [A.D. Broido, A. Clauset, Nature Communications, 10, 1017 (2019)] claimed that the degree distributions of real-world networks are rarely power-law under statistical tests. Complex networks, including social, biological, information, temporal, and brain networks, are often heavy-tailed, and the assumption that such heavy-tailed networks are scale-free becomes insignificant as the complex system evolves over time. The failure of the power-law distribution to fit degree distribution data is mainly due to an identifiable non-linearity in the entire degree distribution of a heavy-tailed complex network on a log-log scale. Here, we attempt to address this issue by proposing a new class of heavy-tailed probability distributions for modeling the entire degree distributions of complex networks and capturing the non-linearity of these heavy-tailed networks. The generalized Lomax model is introduced to fit the degree distributions of non-scale-free real-world networks and is applied to a wide variety of large-scale real-world complex networks. Several statistical properties of the proposed model, such as extreme value and inferential properties, are derived. Rigorous experimental analysis on over fifty real-world data sets shows that the proposed family of distributions fits heavy-tailed real-world complex networks well in comparison with the state of the art.

characteristics of large-scale real-world networks is the degree distribution, which characterizes a network system [7-9]. Empirical studies over the last two decades have assumed that real-world complex networks, such as collaboration, communication, social, biological, and temporal networks, follow a power-law distribution [8,10-13]. Mathematically, a random variable $X$ is said to follow the power-law model if its probability distribution is of the form $P(x) \propto x^{-\alpha}$, where $\alpha$ is a positive constant known as the tail-index or shape parameter of the distribution. It is therefore common to claim that real-world networks are scale-free, meaning that the degree distributions of such networks follow a single power-law.

In Figure 1, a closer look at the plots of the entire degree distributions of the Twitter and Livejournal networks (in log-log scale) offers clear evidence of an identifiable non-linearity (bend). This suggests that a single power-law is inappropriate for fitting the entire degree distribution data. Consequently, researchers have also reported that a baseline power-law is insufficient to properly fit the empirical data over its whole range unless some of the lower-degree nodes are left out during model fitting [11,14-17]. Recently, Broido & Clauset (2019) [18] reported that (strict) scale-free networks are rare, which relates to the claim of Stumpf et al. (2005) [19] that all the derived sub-networks of a scale-free network fail to meet the scale-free property. Apart from the power-law, researchers have also proposed various heavy-tailed distributions, such as lognormal, Pareto lognormal, and double Pareto lognormal, for modeling the degree distribution of real-world complex networks [20,21].
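The practice of fitting a power-law only above a lower cutoff can be sketched as follows. This is a minimal illustration and not the procedure used in this paper: it assumes the continuous power-law density $p(x) \propto x^{-\alpha}$ for $x \ge x_{\min}$ and the closed-form maximum likelihood estimate $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$. The function names and the synthetic sample are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(degrees, x_min):
    """Continuous power-law MLE of alpha, using only observations >= x_min."""
    tail = [x for x in degrees if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail observations equal x_min")
    return 1.0 + len(tail) / log_sum, len(tail)


def sample_powerlaw(n, alpha, x_min, rng):
    """Inverse-transform sampling from p(x) ~ x^(-alpha) on x >= x_min."""
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


rng = random.Random(0)
synthetic = sample_powerlaw(20000, alpha=2.5, x_min=1.0, rng=rng)  # illustrative data
alpha_hat, n_tail = powerlaw_alpha_mle(synthetic, x_min=1.0)
print(f"alpha_hat = {alpha_hat:.3f} from {n_tail} tail observations")
```

Raising `x_min` discards more low-degree observations, which is exactly the loss of information that motivates models fitted to the whole range.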
Despite the alternatives, the long-tailed and heavy-tailed behavior of the entire degree distribution of complex [...] (scale-free nature), which relates to the earlier claim of Stumpf, Wiuf & May (2005) [19].

The log-log plot of the degree distributions of the Livejournal and Twitter networks in Figure 1 shows the unique degree values ($x$) on the horizontal axis and the corresponding frequencies on the vertical axis. A node in these networks represents a single user, and a follower of that user is represented by an edge. Figure 1 shows that the straight-line representation given by a single power-law fails to fit the degree distribution data in the log-log scale. Generally, a single power-law distribution is fitted to the degree distribution data only for degree values higher than $x_{\min}$ (the minimum degree). The power-law exponent $\alpha$ is then estimated from the data by maximum likelihood estimation (MLE), given $x_{\min}$. This in turn suggests that the power-law distribution tends to provide a better fit only when some lower-degree nodes are left out. The insufficiency of fitting with a single power-law is due to the presence of a non-linearity in the log-log scale of the degree distribution of a complex network [18,19], as depicted in Figure 1. This drawback has inspired many researchers to use other heavy-tailed probability models with various exponents to better fit the entire degree distribution of real-world complex networks. This article proposes a new class of generalized Lomax models for modeling these heavy-tailed [...] analysis [27-30]. The Lomax distribution closely resembles a Pareto Type II distribution with support beginning at zero [27], and it can be motivated in several ways from both the theoretical and the application points of view.
For example, the Lomax distribution can be derived as a special case of a particular compound gamma distribution [31]. The Lomax distribution also arises as the limiting distribution of the residual lifetime at a great age. The record values of the Lomax distribution, as well as the amalgamation of the Lomax distribution with the Poisson distribution, were studied in [32,33]. Researchers have also studied possible extensions and modifications of the Lomax distribution for modeling real-life problems [34]. Recent studies [35] suggest that the Lomax distribution is suitable for the analysis of heavy-tailed data and can serve as an alternative to the gamma, Exponential, and Weibull distributions. The corresponding probability density function (PDF) and cumulative distribution function (CDF) of the Lomax model are defined as follows:

Definition 1. A random variable $X$ follows the Lomax distribution with parameters $\alpha$ and $\gamma$, denoted by $LM(\alpha, \gamma)$, if its CDF is of the following form:

$$F(x) = 1 - \left(1 + \frac{x}{\gamma}\right)^{-\alpha}, \quad x > 0,$$

with PDF

$$f(x) = \frac{\alpha}{\gamma}\left(1 + \frac{x}{\gamma}\right)^{-(\alpha + 1)}, \quad x > 0,$$

where $\alpha$ (> 0) is a real shape parameter and $\gamma$ (> 0) is a real scale parameter.

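The survival function is given by

$$\bar{F}(x) = 1 - F(x) = \left(1 + \frac{x}{\gamma}\right)^{-\alpha}.$$

The hazard function is given by

$$h(x) = \frac{f(x)}{\bar{F}(x)} = \frac{\alpha}{\gamma + x},$$

which is a decreasing function of $x$.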
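The Lomax quantities above can be written down directly. The sketch below assumes the standard parameterization $F(x) = 1 - (1 + x/\gamma)^{-\alpha}$ with shape $\alpha$ and scale $\gamma$; the function names and the parameter values are illustrative only.

```python
import math
import random


def lomax_cdf(x, alpha, gamma):
    return 1.0 - (1.0 + x / gamma) ** (-alpha)


def lomax_pdf(x, alpha, gamma):
    return (alpha / gamma) * (1.0 + x / gamma) ** (-(alpha + 1.0))


def lomax_survival(x, alpha, gamma):
    return (1.0 + x / gamma) ** (-alpha)


def lomax_hazard(x, alpha, gamma):
    # Ratio pdf / survival, which simplifies to alpha / (gamma + x).
    return alpha / (gamma + x)


def lomax_sample(n, alpha, gamma, rng):
    # Inverse transform: solve U = survival(x) for x, with U in (0, 1].
    return [gamma * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0) for _ in range(n)]


alpha, gamma = 2.0, 3.0  # illustrative values
grid = (0.0, 1.0, 10.0, 100.0)
hazards = [lomax_hazard(x, alpha, gamma) for x in grid]

rng = random.Random(1)
draws = sorted(lomax_sample(20000, alpha, gamma, rng))
sample_median = draws[len(draws) // 2]
theory_median = gamma * (2.0 ** (1.0 / alpha) - 1.0)
print("hazard on grid:", hazards)
print("sample median:", round(sample_median, 3), "theory:", round(theory_median, 3))
```

The hazard values shrink along the grid, in line with the statement that the hazard function is decreasing in $x$.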

It is interesting to see that the Exponential distribution arises as a limit of the Lomax distribution when the shape parameter or tail-index ($\alpha$) increases. This can be seen through a re-parameterization of the scale parameter,

$$\frac{1}{\gamma} = \frac{\lambda}{\alpha}\bigl(1 + o_\alpha(1)\bigr),$$

where $o_\alpha(1) \to 0$ as $\alpha \to \infty$ and $\lambda > 0$. Thus, we have

$$\lim_{\alpha \to \infty} \bar{F}(x) = \lim_{\alpha \to \infty} \left(1 + \frac{x}{\gamma}\right)^{-\alpha} = e^{-\lambda x}.$$

We can conclude that the limiting distribution of the Lomax distribution as $\alpha \to \infty$ is the Exponential distribution [32]. However, the Lomax distribution does not provide enough flexibility to model heavy-tailed data sets over their whole range. In this paper, we introduce a new method of generalizing the Lomax distribution by modeling the tail index of the Lomax model, in order to fit the entire degree distributions of real-world heavy-tailed networks.
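The limiting behaviour described above can be checked numerically. The sketch below assumes the Lomax survival function $(1 + x/\gamma)^{-\alpha}$ and the simplest re-parameterization consistent with the text, $\gamma = \alpha/\lambda$ (that is, the $o_\alpha(1)$ term is taken to be zero); the values of $\lambda$ and $x$ are arbitrary.

```python
import math


def lomax_survival(x, alpha, gamma):
    return (1.0 + x / gamma) ** (-alpha)


lam = 0.5   # rate of the limiting Exponential distribution (illustrative)
x = 2.0     # point at which the two survival functions are compared
target = math.exp(-lam * x)

gaps = []
for alpha in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    gamma = alpha / lam  # assumed re-parameterization of the scale
    gap = abs(lomax_survival(x, alpha, gamma) - target)
    gaps.append(gap)
    print(f"alpha = {alpha:>8.0f}  |Lomax - Exponential| = {gap:.2e}")
```

The gap shrinks steadily as $\alpha$ grows, which is the numerical counterpart of the limit statement.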

An essential structural characteristic in the study of heavy-tailed real-world complex networks is their degree distribution. Empirical observations of the patterns of real-world complex networks have led to the claim that their degree distributions follow, in general, a single power-law. However, closer inspection during fitting suggests that the single power-law distribution is inappropriate for modeling the network data over its whole range. We first introduce a new class of generalized Lomax models (GLM) whose members belong to the maximum domain of attraction of the Fréchet distribution and are right-tail equivalent to the Pareto distribution. These newly introduced heavy-tailed families are intended to meet the need for new probability distributions for modeling network data.

Family of Generalized Lomax Models (GLM)

We consider a real, continuous, and positive function $g : (0, \infty) \to \mathbb{R}^+$ which is differentiable on $(0, \infty)$. We also assume that $g$ satisfies the conditions stated below.

It is noted that condition 3 is equivalent to [...] Then, for any function $g$ with a strictly positive and finite limit at infinity and satisfying the above three conditions, we define the new class of GLM distributions as follows: [...] where $\gamma$ (> 0) is a scale parameter.

It is easy to verify that: [...] Thus, $F(x)$ is a standard CDF, and any continuous random variable $X$ with a CDF satisfying the above-mentioned conditions is said to belong to the class of GLM distributions. The CDF of standard GLM distributions can also be expressed as follows: [...] The survival function is given by [...] For the GLM family of distributions, the survival function decreases as a power function. The probability density function (PDF) of the GLM is [...] The hazard function is given by [...]

Table 1 shows some examples of the GLM family of distributions satisfying limiting [...] Table 1. Of these, GLM Type-IV is very similar to the modified Lomax model (MLM) proposed in [16,17]. Although the two models (GLM Type-IV and MLM distributions) are derived from completely different phenomena, they achieve the same goal of building a heavy-tailed Lomax model that explains the nonlinearity in the degree distributions of real-world complex network data sets. The other GLM type models are completely new, and the GLM family of distributions covers several very popular probability distributions as particular cases, as indicated in Table 2. These [...] Table 2, which are closely related to the life distributions. As discussed above, the Exponential distribution arises as a limiting distribution of the Lomax distribution when the shape parameter $\alpha$ approaches infinity. However, these models in Table 2 do not satisfy the restriction on $g$ to be strictly positive with a finite limit at infinity, as in condition (1) [...]

Statistical properties of the GLM family

We study several extreme value properties of the new family of standard GLM distributions from the perspective of extreme value and risk theory [37]. We also discuss parameter estimation and goodness of fit for the proposed generalized Lomax models. [...] Thus, for the GLM family of distributions, we observe that [...] where $G(x)$ is the CDF of the Pareto Type-II distribution.
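As an illustration of how a function $g$ can generate a distribution of this kind, and of what right-tail equivalence with the Pareto Type-II distribution means numerically, the sketch below rests on an assumed form: survival function $\exp\{-z\,g(z)\}$ with $z = \ln(1 + x/\gamma)$, which reduces to the Lomax model when $g$ is the constant $\alpha$. Both this form and the example $g(z) = \alpha + \beta/(1 + z)$ are assumptions made for illustration; the example is hypothetical and is not claimed to be one of the types in Table 1.

```python
import math


def glm_survival(x, g, gamma):
    """Survival function under the assumed form exp(-z * g(z)), z = ln(1 + x/gamma)."""
    z = math.log1p(x / gamma)
    return math.exp(-z * g(z))


def glm_cdf(x, g, gamma):
    return 1.0 - glm_survival(x, g, gamma)


def lomax_survival(x, alpha, gamma):
    return (1.0 + x / gamma) ** (-alpha)


alpha, beta, gamma = 1.5, 0.8, 2.0  # illustrative values


def g_const(z):
    # Constant g: the assumed form collapses to the Lomax (Pareto Type-II) model.
    return alpha


def g_example(z):
    # Hypothetical g: positive, differentiable, and tending to alpha as z grows.
    return alpha + beta / (1.0 + z)


cdf_values = [glm_cdf(x, g_example, gamma) for x in (0.0, 0.5, 1.0, 5.0, 50.0)]

# Ratio of the two survival functions far out in the tail; for this g it equals
# exp(-beta * z / (1 + z)) and therefore approaches exp(-beta) from above.
xs = [1e2, 1e4, 1e8, 1e16]
ratios = [glm_survival(x, g_example, gamma) / lomax_survival(x, alpha, gamma) for x in xs]
for x, r in zip(xs, ratios):
    print(f"x = {x:.0e}  survival ratio = {r:.4f}")
print("limit exp(-beta) =", round(math.exp(-beta), 4))
```

A ratio that settles at a finite positive constant is what tail equivalence requires; the slow, logarithmic approach is a feature of this particular hypothetical $g$.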

(c) A probability distribution function $F$ is said to belong to the class of dominated-variation distributions if $\limsup$ [...]

(d) A probability distribution function $F$ is called long-tailed if $F$ has (right) unbounded support and, for any fixed $k > 0$, $\lim_{x \to \infty} \bar{F}(x + k) / \bar{F}(x) = 1$. Thus, $\lim$ [...] where $\gamma > 0$.
(e) It should be noted that any function $g$ as defined above satisfying $\lim_{z \to \infty} g(z) = \alpha > 0$ is slowly varying at infinity: $\lim_{z \to \infty} g(tz) / g(z) = 1$ for every $t > 0$. Thus, the GLM family of distributions comprises continuously changing distributions at infinity and belongs to the maximum domain of attraction (MDA) of the Fréchet distribution. [...] Type distributions.
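As a worked check of the long-tail property in (d), consider the simplest member of the family, the Lomax distribution with survival function $\bar{F}(x) = (1 + x/\gamma)^{-\alpha}$. Taking $g$ constant is an assumption made here only to keep the algebra short. For any fixed $k > 0$ and any fixed $t > 0$:

```latex
\frac{\bar{F}(x+k)}{\bar{F}(x)}
  = \left(\frac{\gamma + x + k}{\gamma + x}\right)^{-\alpha}
  = \left(1 + \frac{k}{\gamma + x}\right)^{-\alpha}
  \longrightarrow 1
  \quad \text{as } x \to \infty ,
\qquad
\frac{\bar{F}(tx)}{\bar{F}(x)}
  = \left(\frac{\gamma + t x}{\gamma + x}\right)^{-\alpha}
  \longrightarrow t^{-\alpha}
  \quad \text{as } x \to \infty .
```

The first limit is the long-tail condition. The second shows that the survival function is regularly varying with index $-\alpha$, which is the standard criterion for membership in the MDA of the Fréchet distribution.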

The normal equations can be obtained by taking the partial derivatives of Eqn. (4) with respect to $\alpha$, $\beta$, and $\gamma$ and equating them to zero as follows.
The MLEs of the three parameters $\alpha$, $\beta$, and $\gamma$ of the GLM Type-IV distribution can be obtained by setting the above partial derivatives to zero and solving the resulting equations simultaneously. Eqns. (5), (6) and (7) [...] Chi-square statistic test, which will evaluate the goodness-of-fit for the GLM Type-I, Type-II, Type-III, and Type-IV distributions. We obtain the p-values using the bootstrap resampling technique as follows. First, we decide the best fit among the proposed GLM family of distributions for the data after estimating the parameters using MLE, and then evaluate the goodness-of-fit of the best-fitted GLM model for the data through the Chi-square statistic test. Then we generate 50000 synthetic network data sets from the fitted GLM distribution and calculate the Chi-square statistic for each of the generated synthetic data sets. Finally, we obtain the p-value as the fraction of GLM synthetic data sets with a Chi-square value greater than the empirical one. Higher [...]

Description of data sets

We consider large-scale network data sets from different disciplines, namely social networks, collaboration networks, web graphs, citation networks, biological networks, product co-purchasing networks, temporal networks, communication networks, ground-truth networks, and brain networks. We study several individual data sets from each discipline. These data sets are publicly available at http://snap.stanford.edu/data/index.html. These are the most standard network data sets that exhibit heavy-tailed behavior and are used for modeling in the statistical paradigm [10,25,26]. Previous studies focused on using standard single statistical distributions, namely power-law, Lomax (Pareto Type-II), Exponential, and Log-normal, for modeling this wide variety of network data sets [1,11]. However, these models fail to capture the lower-degree nodes when modeling the degree distributions. To overcome this drawback, we consider the proposed GLM family of distributions.
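The estimation and bootstrap steps described earlier for the GLM models can be sketched on a simpler stand-in. The code below is not the authors' implementation: it uses the plain two-parameter Lomax distribution instead of a GLM type, a profile-likelihood search instead of solving the normal equations, equal-probability bins for the Chi-square statistic, continuous synthetic data, and 200 bootstrap replicates instead of 50000. All function names and settings are hypothetical.

```python
import math
import random


def lomax_sample(n, alpha, gamma, rng):
    """Inverse-transform sampling from the Lomax distribution."""
    return [gamma * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0) for _ in range(n)]


def log_likelihood(data, alpha, gamma):
    s = sum(math.log1p(x / gamma) for x in data)
    return len(data) * math.log(alpha / gamma) - (alpha + 1.0) * s


def profile_alpha(data, gamma):
    """For a fixed gamma the likelihood is maximised at n / sum(log(1 + x/gamma))."""
    return len(data) / sum(math.log1p(x / gamma) for x in data)


def fit_lomax(data, lo=1e-3, hi=1e3, iters=60):
    """Golden-section search over log(gamma) on the negative profile log-likelihood."""
    def neg_profile(log_g):
        g = math.exp(log_g)
        return -log_likelihood(data, profile_alpha(data, g), g)

    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc, fd = neg_profile(c), neg_profile(d)
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = neg_profile(c)
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = neg_profile(d)
    gamma = math.exp((a + b) / 2.0)
    return profile_alpha(data, gamma), gamma


def chi_square(data, alpha, gamma, bins=10):
    """Chi-square statistic over bins that are equiprobable under the fitted model."""
    counts = [0] * bins
    for x in data:
        u = 1.0 - (1.0 + x / gamma) ** (-alpha)  # fitted CDF value in [0, 1)
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(data) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


def bootstrap_p_value(data, alpha, gamma, replicates, rng):
    """Fraction of synthetic data sets whose statistic exceeds the empirical one."""
    observed = chi_square(data, alpha, gamma)
    exceed = sum(
        chi_square(lomax_sample(len(data), alpha, gamma, rng), alpha, gamma) > observed
        for _ in range(replicates)
    )
    return exceed / replicates


rng = random.Random(42)
true_alpha, true_gamma = 2.0, 2.0  # illustrative ground truth
data = lomax_sample(2000, true_alpha, true_gamma, rng)
alpha_hat, gamma_hat = fit_lomax(data)
p_value = bootstrap_p_value(data, alpha_hat, gamma_hat, replicates=200, rng=rng)
print(f"alpha_hat = {alpha_hat:.3f}, gamma_hat = {gamma_hat:.3f}, p-value = {p_value:.3f}")
```

The synthetic data sets are scored against the fitted parameters without refitting, which follows the wording of the procedure literally; refitting each replicate would be a stricter variant.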

Note that the proposed family of heavy-tailed GLM distributions can model these large-scale network data sets over their whole range. An overview of these publicly available network data sets is presented in Table 3. Some statistical measures of the degree distribution of each network data set, namely the mean, standard deviation (s.d.), and coefficient of variation (CV), are also given in Table 3. It is important to note that the empirical CV for all the data sets is greater than one, as reported in Table 3. [...] Table 6. Finally, we test the adequacy of the GLM family of distributions compared to these [...]

We fitted the GLM Type-I, GLM Type-II, GLM Type-III, and GLM Type-IV models to the entire degree distribution of the complex network data sets, applying MLE to estimate the parameters as discussed above. We measure the goodness-of-fit of the proposed GLM distributions through the bootstrap resampling Chi-square test. Table 4 presents the estimated values of the parameters for the four newly introduced [...] Table 5. We use some popular statistical measures, viz., mean absolute error (MAE), root mean squared error (RMSE), and Kullback-Leibler divergence (KLD), to evaluate and compare the performance of the proposed standard GLM distributions with others, as shown in Table 5.
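The three comparison measures can be sketched as follows. How exactly they are computed for Table 5 is not stated here, so this sketch assumes that both the observed degree frequencies and the fitted model are given as normalized probability vectors over the same degree values; the two small vectors are hypothetical.

```python
import math


def mae(observed, fitted):
    """Mean absolute error between two equally long vectors."""
    return sum(abs(o - f) for o, f in zip(observed, fitted)) / len(observed)


def rmse(observed, fitted):
    """Root mean squared error between two equally long vectors."""
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed))


def kld(p, q):
    """Kullback-Leibler divergence sum(p * log(p / q)); terms with p == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical normalized frequencies for four degree values.
observed = [0.50, 0.25, 0.15, 0.10]
fitted = [0.40, 0.30, 0.20, 0.10]

mae_value = mae(observed, fitted)
rmse_value = rmse(observed, fitted)
kld_value = kld(observed, fitted)
print(f"MAE = {mae_value:.4f}, RMSE = {rmse_value:.4f}, KLD = {kld_value:.4f}")
```

Lower values of all three measures indicate a closer fit; KLD is zero only when the two vectors coincide.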

The overall performance of the proposed GLM models is better in terms of RMSE, MAE, and KLD values for most of the networks compared to the other competing distributions, which suggests that the proposed GLM family plausibly fits the observed node distribution. Empirical data analysis suggests that parameter estimation yields the values specifying the best fit of the proposed models. However, the estimated values alone do not give any effective information regarding the validity or the goodness-of-fit of the underlying models. From Tables 5 and 6, it can be concluded that the GLM Type-I model outperforms all the competing models for four data sets, the GLM Type-III model outperforms the others for eight network data sets from various domains, and the GLM Type-II model is superior to the others for thirteen data sets. The power-law model with exponential cutoff [11] performed better than the GLM models for four out of fifty real-world complex heavy-tailed network data sets. Overall, the performance of the GLM Type-IV model is very consistent across all the data sets, and it gives the 'best' fit for 22 out of 50 data sets considered in this study. [...]

However, the putative scale-free nature of real-world networks has generated a lot of interest in the past two decades. Therefore, most recent studies have emphasized testing the degree distribution of networks for power-law tails [18,22,38-40]. As the question "Are real-world networks scale-free" has important philosophical and conceptual consequences, we have considered this question methodologically. This paper explores a journey towards non-scale-freeness and proposes a new family of generalized Lomax models with nonlinear exponents in the shape parameter for efficient modeling of the heavy-tailed behavior of complex networks. The proposed GLM models provide a better fit to the whole degree distributions of real-world complex networks than other well-known heavy-tailed distributions, which could be thought of as a constructive answer to the overarching question raised in the literature: "Are complex networks scale-free? If not then what?" [18,22,40]. With this paper, we hope to have contributed to this recent methodological progress.

The generalization of the Lomax model provides greater insight into probability distribution theory as well. Interestingly, several well-known probability distributions such as Lomax, Exponential, Rayleigh, Weibull, Gompertz and Benini and GLM [...] The proposed GLM models satisfy several extreme value properties and desired inferential characteristics. The closeness between this heavy-tailed GLM family and life distributions is also discussed. Thus, the current paper may fulfill the search for a new probability distribution for modelling heavy-tailed real-world networks, which are rarely scale-free, as discussed in [11]. The proposed idea of the 'shape parameter-based' generalization of the Lomax distribution presented in this paper can also be applied to generalize other similar income and size distributions, namely the Dagum, Burr, Beta-prime, and Log-logistic distributions. It is important to note from the experimental results and evaluations that the four GLM Type models mostly outperform all the well-known degree distribution models for a wide variety of real-world network data sets. Though the