Preserving Minority Structures in Graph Sampling

Sampling is a widely used graph reduction technique to accelerate graph computations and simplify graph visualizations. By comprehensively analyzing the literature on graph sampling, we assume that existing algorithms cannot effectively preserve minority structures that are rare and small in a graph but are very important in graph analysis. In this work, we initially conduct a pilot user study to investigate representative minority structures that are most appealing to human viewers. We then perform an experimental study to evaluate the performance of existing graph sampling algorithms regarding minority structure preservation. Results confirm our assumption and suggest key points for designing a new graph sampling approach named mino-centric graph sampling (MCGS). In this approach, a triangle-based algorithm and a cut-point-based algorithm are proposed to efficiently identify minority structures. A set of importance assessment criteria are designed to guide the preservation of important minority structures. Three optimization objectives are introduced into a greedy strategy to balance the preservation between minority and majority structures and suppress the generation of new minority structures. A series of experiments and case studies are conducted to evaluate the effectiveness of the proposed MCGS.


INTRODUCTION
Graphs contain plentiful structures that can be categorized from diverse perspectives, such as normal or abnormal [55] and central or periphery [8].
In this work, we categorize graph structures into majority and minority depending on their occurrence frequencies and sizes in a graph. Majority structures refer to those frequently occurring (e.g., frequent subgraphs) or large (e.g., communities). Minority structures are those rarely occurring and containing only a few nodes (e.g., extremely high degree nodes and bridges between communities). Both categories are important in graph analysis and are of great concern in various fields, such as community detection [79], frequent subgraph mining [89], spammers identification [55], and bridge vulnerability estimation [68].
Sampling is an efficient graph reduction technique [34,82,85,88]. Many graph sampling algorithms have been proposed to reduce graph sizes while preserving structures [27,46,72]. They are particularly useful in accelerating graph computations [28] and simplifying graph visualizations [58]. However, we assume that the existing algorithms tend to preserve majority structures but overlook minority structures, because the influence of majority structures on the representativeness of the original graph is considerable for measurable metrics and visual perception, whereas that of minority ones is negligible. For example, the algorithms tend to select the nodes with common degrees far more than the nodes with rare degrees to maintain the power law of degree distribution [66]. Human viewers prefer to observe large structures in advance but may ignore small structures when judging whether a sample is visually similar to the original graph [43,75].
To verify this hypothesis, we conducted a pilot user study and an experimental study. In the user study (Section 3), we recruited 20 participants and asked them to freely select structures of interest (SOIs) in 34 real-world graph data sets. The results showed that four representative types of minority structures, namely, super pivot, huge star, rim, and tie, elicited strong interest from the participants. Super pivots and huge stars are a small proportion of nodes with extremely high degrees. Rims are parachute- or chain-like structures attached to community margins. Ties are sparsely distributed bridges at community boundaries. Figure 1(a) shows a toy case graph containing three super pivots, one huge star, three rims, and one tie.
In the experimental study (Section 4), we selected 12 real-world graph data sets and 20 reference graph sampling algorithms and designed three new quantitative indicators to evaluate their performance on preserving the four types of minority structures. The results showed that most of the algorithms cannot effectively preserve minority structures and may generate new minority structures that did not exist in the original graphs, especially for huge stars, rims, and ties. Figure 1(b−g) show the samples obtained by popular graph sampling algorithms. Most of the samples fail to preserve the huge stars, rims, and ties in Figure 1(a). New huge stars or new rims are found in Figure 1(c, d, e, and g). This situation is detrimental to graph analyses that focus on minority structures.
A new graph sampling algorithm oriented to minority structure analysis is needed, but its design is difficult. On the basis of the results and experience of the experimental study, five key points should be carefully considered in the design: (1) identifying minority structures quickly and accurately; (2) avoiding the loss of minority structures in samples caused by the undersampling of their neighbors [75], such as the loss of the parachute-like rim in Figure 1(b, d, and e); (3) minimizing the inconsistency of preserved minority structures in random sampling; (4) balancing the preservation between minority and majority structures; and (5) suppressing the generation of new minority structures.
We propose a new graph sampling method called mino-centric graph sampling (MCGS) by considering the above points. This work stipulates graphs are simple, unattributed, undirected, and connected to simplify representations. First, we design a fast triangle-based algorithm to identify super pivots and huge stars and a fast cut-point-based algorithm to identify rims and ties. Second, we propose a set of importance assessment criteria to guide the preservation of minority structures and their neighbors and minimize the influence of random sampling. Finally, we introduce three optimization objectives into a greedy strategy to strike a balance between the preservation of minority and majority structures and suppress the generation of new minority structures. A series of experiments and case studies are conducted to evaluate the effectiveness of the proposed MCGS. The results reveal that MCGS performs the best among the 20 reference algorithms on the preservation of minority structures and suppression of new minority structure generation. MCGS also achieves highly satisfactory performance on the majority structure preservation.
In summary, this work presents the first attempt to investigate minority structure preservation in graph sampling. This work contributes: (1) four representative types of minority structures that are summarized through a controlled user study, (2) an experimental study that evaluates the performance of existing graph sampling algorithms on minority structure preservation, and (3) a new graph sampling algorithm that is oriented to minority structure preservation.

[Figure 1 (caption, beginning truncated): ... RW (f), SST (g), and our proposed MCGS (h), respectively, with a sampling rate of 50%. The graph is slightly modified from the character relationship network of the novel Les Misérables [31]. The relative locations of nodes in a sample are consistent with those of the corresponding nodes in the original graph. Solid-color dots represent original minority structures. Hollow-color dots represent newly generated minority structures.]

Graph Structure Analysis
Graph structures have been categorized and defined from diverse analytical perspectives. In this work, we categorize graph structures into minority and majority and define four types of minority structures (i.e., super pivot, huge star, rim, and tie). Our categorization and definition have certain connections with those in graph anomaly detection, community detection [37], structural role discovery [32], and frequent subgraph mining [89].
Graph structures are categorized into normal and abnormal in graph anomaly detection. Anomalies are nodes, edges, or subgraphs that differ from most ones [21,55]. In general, node and edge anomalies are minority structures, such as spammers in email networks with extremely high degrees, because of low occurring probability and small size. However, subgraph anomalies with relatively large sizes are not minority structures.
Graph structures can be categorized into communities/clusters, hubs, and outliers to facilitate community detection [63]. Communities are majority structures [76]. Hubs correspond to ties in minority structures. Outliers refer to small structures attaching at margins of communities, such as secret leaders controlling a criminal gang through intermediaries [78], corresponding to chain-like rims in minority structures.
Structural role discovery assigns behavior roles, such as clique or periphery members, to structures [32,33]. Cliques are not minority structures because they contain many densely connected nodes, but extremely high degree nodes in cliques are minority structures. Some periphery members at clique marginal areas, such as parachute- and chain-like nodes, are rims in minority structures.
In the fields of frequent subgraph mining [89], motif discovery [18], and graphlet-based characterization [42,69], graph structures are categorized depending on occurrence frequency. Most frequent subgraphs, motifs, and graphlets are majority structures. However, their variants could be minority structures. For example, large star-shaped motifs can be regarded as huge stars in minority structures, but only the central high degree nodes of such motifs are included in our definition of huge stars. Moreover, many studies of attributed data analysis are devoted to rare category detection [56] and classify data entities into majority and minority classes [86], which supports this work.

Graph Sampling Algorithms
Existing graph sampling algorithms can be classified into three groups, namely, node-, edge-, and traversal-based [34,75]. Random Node (RN), Random PageRank Node (RPN), and Random Degree Node (RDN) [46] are typical node-based algorithms. Random Edge (RE) [46] and Random Node Edge (RNE) [40] are classic edge-based algorithms. These algorithms are lightweight and provide theoretical references and building blocks for advanced algorithms. However, they share a common defect: randomly selected nodes are uncorrelated, which leads to unsatisfactory preservation of graph connectivity [46]. Totally-Induced Edge Sampling (TIES) [3] is an edge-based approach that introduces a graph induction step into sampling. Additional edges in this step are retained to restore graph connectivity. We learn from this idea to reduce the generation of new minority structures.
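To make the induction idea concrete, the following is a minimal sketch of an edge sampling pass followed by a graph induction step, in the spirit of TIES. The function name `ties_sample`, the edge-list input, and the stopping rule (stop once enough nodes are collected) are assumptions made for illustration and are not taken from the cited implementation.

```python
import random


def ties_sample(edges, target_nodes, seed=0):
    """Edge sampling followed by graph induction (illustrative sketch).

    edges: iterable of (u, v) pairs of an undirected simple graph.
    target_nodes: number of nodes to collect before stopping (assumed rule).
    Returns (picked_nodes, induced_edges).
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    all_nodes = {v for edge in edge_list for v in edge}
    target = min(target_nodes, len(all_nodes))

    picked = set()
    # Step 1: draw random edges and keep both endpoints.
    # The node set may overshoot the target by one node.
    while len(picked) < target:
        u, v = rng.choice(edge_list)
        picked.update((u, v))

    # Step 2 (induction): keep every original edge whose endpoints were both
    # picked, including edges that were never drawn in step 1.
    induced = [(u, v) for u, v in edge_list if u in picked and v in picked]
    return picked, induced


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    nodes, kept = ties_sample(toy_edges, target_nodes=3)
    print(sorted(nodes), kept)
```

The induction step is what restores edges between already selected nodes, which is the property the text above borrows to limit newly generated minority structures.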
The traversal-based group includes many algorithms. Their common merit is that connected graphs remain connected after sampling [15]. Breadth-First (BF) and Depth-First (DF) samplings are two basic approaches that select nodes in a breadth-first and depth-first traversal order, respectively [14,15]. Snowball (SB) and Forest Fire (FF) are variants of BF [42] that expand a limited number of neighbors instead of expanding exhaustively, which ensures that deep nodes are picked [34,46]. Random Walk (RW) and Random Jump (RJ) are variants of DF [30,48] that allow walking to neighbors or random nodes during the traversal to preserve the global topology. Variants of RW include Multi-Dimensional Random Walk (MDRW) [61], Metropolis-Hastings Random Walk (MHRW) [67], Rejection-Controlled Metropolis-Hastings (RCMH) [26], and Generalized Maximum-Degree Random Walk (GMD) [49]. In our approach, a depth-first traversal and random-pick processing are adopted.
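As an illustration of the walk-or-jump behavior described above, the sketch below collects nodes with a random walk that occasionally jumps to a uniformly random node. The function name, the adjacency-list input, and the default `jump_prob` value are hypothetical choices for this example, not parameters reported for the cited algorithms.

```python
import random


def random_walk_sample(adj, target_nodes, jump_prob=0.15, seed=0):
    """Collect nodes by walking to neighbors, with occasional random jumps.

    adj: dict mapping each node to a list of its neighbors (non-empty graph).
    jump_prob: probability of jumping to a random node (assumed value);
               setting it to 0 gives a plain random walk on a connected graph.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    target = min(target_nodes, len(nodes))

    current = rng.choice(nodes)
    visited = {current}
    while len(visited) < target:
        neighbors = adj[current]
        if not neighbors or rng.random() < jump_prob:
            current = rng.choice(nodes)      # random jump
        else:
            current = rng.choice(neighbors)  # ordinary walk step
        visited.add(current)
    return visited


if __name__ == "__main__":
    path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    print(sorted(random_walk_sample(path, target_nodes=3)))
```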
Some graph sampling approaches cannot be directly sorted into the above three groups. These approaches provide alternative solutions for specific goals. Random Areas Selection Sampling (RAS) selects an area of nodes each time to fully preserve their neighborhood structures [89]. We use this idea to preserve neighbors of minority structures. Distributed Learning Automata Sampling (DLAS) uses multiple automata for sampling [60]. Sampling with Shortest Paths (SSP) and Sampling with Spanning Trees (SST) identify important edges to guide sampling [36,59]. Multiple Snowball with Cohen (RMSC) combines the advantages of RN and SB. Sampling based on Graph Partition (SGP) and Sampling based on Densification Power Law (DPL) partition a graph before sampling [16,81]. We adopt this partition idea to deal with irregularly shaped graphs. Moreover, all the aforementioned algorithms perform graph sampling in the data space rather than the visual space to simultaneously support graph computation acceleration and graph visualization simplification. Our MCGS is also a data space approach.

Evaluation of Graph Sampling
Graph sampling algorithms have been thoroughly evaluated by two groups of metrics. The first group quantifies how well structural properties of the original graph are preserved [46,61,72]. Popular metrics are the Degree Distribution (DD) and Clustering Coefficient Distribution (CCD) [7,45].
The second group measures the similarity between the original and sampled graphs. Two popular metrics in this group are the Jaccard Index (JI) that measures the similarity by the size of intersections [20] and the number of connected components (NCC) that measures the similarity of graph connectivity [64]. Recently, visual perception factors are considered in graph sampling evaluation. Wu et al. [75] found that three factors, namely, cluster quality, high degree nodes, and coverage area, influence the visual perception of sampled node-link diagrams. Quan et al. [54] studied proxy graphs to measure the shape-based faithfulness of sampled graphs.
At present, no metrics are tailored for evaluating the preservation of majority and minority structures. In this work, we use traditional metrics, including DD, JI, and NCC, to evaluate majority structure preservation because inherent connections exist between these metrics and structural properties or overall shapes of majority structures. For example, DD can measure the structural properties of frequent structures. JI can compare the overall shapes formed by large structures. Moreover, we design three new indicators to evaluate minority structure preservation. We also consider perception factors in our evaluation experiments [73,77,84,87].
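For reference, the two similarity metrics mentioned above can be computed in a few lines. This is a minimal sketch: `jaccard_index` works on two arbitrary sets (which sets are compared between the original and sampled graphs is not specified in this section, so that choice is left to the caller), and `count_components` counts connected components with a depth-first search. Both function names are illustrative.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    # Two empty sets are treated as identical (a convention assumed here).
    return len(a & b) / len(union) if union else 1.0


def count_components(nodes, edges):
    """Number of connected components (NCC) of an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)

    seen = set()
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


if __name__ == "__main__":
    print(jaccard_index({1, 2, 3}, {2, 3, 4}))
    print(count_components([1, 2, 3, 4], [(1, 2), (3, 4)]))
```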

PILOT USER STUDY
We assumed that existing graph sampling algorithms cannot effectively preserve minority structures. Before verifying this hypothesis, we conducted a pilot user study to answer two basic guiding questions: whether and which minority structures are important in graph analysis.

User Study Design
We recruited 20 participants (8 females and 12 males, all graduate students aged 20-26 years) and selected 34 real-world graph data sets. The task was to select any SOIs in the graphs. The graphs were visualized in node-link diagrams as plain graphs. Graph layouts used the ForceAtlas2 algorithm [35]. The participants were asked to perform the task on the 34 graphs in a random order. For each graph, 60 s was allotted, and an interval of 2 s was set before proceeding to the next graph. After completing all graphs, the participants reviewed their selections and stated their thoughts. The study was conducted on a desktop with a 23.8-inch 1920 × 1080 LCD display, a standard keyboard, and a mouse. Descriptions of data sets and the experimental interface are provided in the supplementary material.

Result Analysis
The selected SOIs and selection sequence of each participant on each graph were recorded as the results. We manually categorized all SOIs into eight types and counted the entries for each type. If multiple SOIs of the same type were sequentially selected by a participant in a graph, then the type was counted only once to avoid overcounting. We also counted the entries for each type in orders of 1st, 2nd, 3rd, and others in all sequences. Table 1 shows the statistical results of the eight SOI types. The results showed that four SOI types, namely, global high degree structure (HD-global), margin structure (MS), boundary structure (BS), and community structure (CS), had more entries than the average of 198 for all types. The remaining four SOI types, namely, small cliques far away from graph main bodies (FC), community overlapping structure (CS-overlapping), local high degree structure (HD-local), and isolated structure (IS), had far fewer entries than the average. Therefore, HD-global, MS, BS, and CS were considered the most appealing structures in interactive graph explorations. Among them, HD-global, MS, and BS belonged to minority structures, whereas CS belonged to majority structures.
HD-global, MS, and BS must be further discussed. (1) HD-global ranked first. Most of the participants confirmed that the nodes with extremely high degrees had a strong visual saliency, which was in line with the previous research that stated that visually salient high degree nodes should not be lower than the global top 10% [57]. In general, high degree nodes have two subtypes [71], namely, pivot and star. A pivot is a high degree node whose neighbors have at least one interconnection. A star is a high degree node whose neighbors do not interconnect. We stipulated that the pivots of concern had degrees within the global top 5% and named them super pivots; the stars of concern had degrees above the global mean [62], and we named them huge stars. We used a strict threshold for pivots but a lax one for stars because stars were fewer than pivots. (2) MS structures ranked second. They were appendage nodes that occasionally occurred in the marginal areas of communities and formed specific visual shapes, such as a parachute, chain, balloon, or tree. A large proportion of the participants reported that parachute- and chain-like structures at community rims were especially eye-catching. Thus, we used rims as a concrete representative of MS structures with the two visual shapes. (3) BS structures ranked third and were sequences of nodes bridging any two communities. Many of the participants commented that the structures that tied communities were sometimes more attractive than communities. Thus, we used ties as a concrete representative of BS structures in this work.
As a result, we obtained four representative types of minority structures (i.e., super pivot, huge star, rim, and tie). These structures were visually salient in node-link diagrams and elicited strong interest from the participants in the pilot user study. A literature review (Section 2.1) confirmed that these structures were also important in various research and application branches.

EXPERIMENTAL STUDY
We conducted a controlled experiment to examine the performance of existing graph sampling algorithms in preserving the four representative types of minority structures.

Hypotheses
To guide the experiment, we formulated three specific hypotheses:
H1: Existing algorithms pursue graph similarity and have a low ability to preserve minority structures. At present, graph similarity is largely measured by the structural properties and overall shapes of majority structures (Section 2.3). Thus, we assume that existing algorithms naturally have a relatively low ability for minority structure preservation.
H2: Existing algorithms may produce new minority structures. Many nodes in samples inevitably lose some of their original neighbors, so some structures may degenerate into minority structures that do not exist in the original graph.
H3: Existing algorithms cannot guarantee the preservation of important minority structures. Only a part of minority structures in a graph are crucially important according to certain criteria (Section 5.4.2). The randomness of graph sampling has a fatal influence on minority structure preservation. Slight differences in selecting nodes may lead to the disappearance of minority structures. Thus, we assume that important minority structures will disappear or be no longer important in samples.

Experimental Study Design
Data preparation. We selected 12 real-world graph data sets as the experiment data, two of which were used in pilot tests to determine the design of the formal experiment. The graphs were mainly social, web, and communication networks popular in graph studies and included seven small-scale, three medium-scale, and two large-scale graphs. Data processing was conducted to detect and label the type and importance of each minority structure in each graph by using the methods introduced in Sections 5.4.1 and 5.4.2.
Reference algorithms and parameter settings. We selected 20 graph sampling algorithms as references. These algorithms had two common parameters, namely, initial seed and sampling rate. To reduce the influence of parameter settings, we prepared four types of seeds [62] (i.e., random, high degree, high betweenness, and peripheral nodes) and four sampling rates (i.e., 10%, 20%, 30%, and 40%). Other distinctive parameters were set with defaults. Detailed data descriptions and parameter setting considerations are provided in the supplementary material.
Experimental Procedure. Each algorithm ran 800 trials (10 graphs × 4 types of seeds × 4 sampling rates × 5 runs). With 20 algorithms, 16,000 samples in total were obtained as the raw results. The experiment was conducted on a desktop with a 3.4 GHz Intel i7 CPU and 16 GB of RAM.

Indicator Design
Given the lack of established indicators, we designed three quantitative indicators to verify the three hypotheses (Section 4.1) and measure the performance of minority structure preservation.
Minority structure preservation rate (MSPR). This indicator is the ratio of the preservation rate of minority structures to the sampling rate (H1), in which the preservation rate refers to the proportion of original minority structures preserved in a sample. For example, given a sample with a sampling rate of 30%, the MSPR is 1 when 3 out of 10 minority structures in the original graph are preserved. Generally, MSPR is related to a certain type of minority structure and is defined as
$$\mathrm{MSPR}_x = \frac{|MS^S_x| \,/\, |MS_x|}{\Phi},$$
where $MS_x$ represents the set of minority structures of type $x$ in a graph $G$; $MS^S_x$ represents the set of minority structures of type $x$ that are preserved in a sample $G_s$; $\Phi$ is the sampling rate; and $|MS_x|$ and $|MS^S_x|$ are the cardinalities of $MS_x$ and $MS^S_x$, respectively. An MSPR approaching, equal to, or even greater than 1 means a perfect result.
New minority structure generation rate (MSGR). This indicator represents the probability that new minority structures of a certain type occur in a sample (H2). An MSGR approaching or equaling 0 is a perfect result. MSGR is formulated as [equation not recovered from the source].
Mean importance precision (MIP). This indicator evaluates the mean preservation precision of the top K minority structures of a certain type before and after sampling (H3). MIP is a variant of average precision in recommendation system ranking [74]. We define MIP as [equation and symbol definitions not recovered from the source].

Result Analysis
Result processing. The result processing consisted of three parts. (1) We calculated the values of the three indicators for the four types of minority structures on each sample. (2) We calculated the medians and standard deviations of each indicator on each minority structure type and algorithm.
(3) We selected empirical thresholds based on two criteria: indicating good sampling results and differentiating the performance of the reference algorithms. We regarded MSPR ≥ 0.9 as a good result, indicating that the preservation rate of minority structures was approximately equal to the sampling rate. Empirically, this threshold represents an ideal balance for the preservation of minority and majority structures. MSGR ≤ 0.5 denoted a good result, meaning that new minority structures were no more numerous than the preserved original ones. Thus, original minority structures can still be dominant in the sample and analyzed without distinct interference. MIP ≥ 0.5 was regarded as good, implying that more than half of the top K important minority structures were preserved. We verified the hypotheses based on the processed results (Table 2). Additional results are provided in the supplementary material.
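The following sketch shows how MSPR and MSGR could be computed for one structure type and checked against the thresholds above. MSPR follows the verbal definition and the worked example (3 of 10 structures preserved at a 30% sampling rate gives 1). The MSGR formula used here, the share of new structures among all minority structures found in the sample, is an assumption chosen to agree with the reading that MSGR ≤ 0.5 means new structures do not outnumber preserved ones; it is not quoted from the paper. Function names and the set-based inputs are illustrative.

```python
def mspr(original, in_sample, rate):
    """Preservation rate of original minority structures divided by the sampling rate."""
    original, in_sample = set(original), set(in_sample)
    if not original or rate <= 0:
        raise ValueError("need at least one original structure and a positive rate")
    preserved = original & in_sample
    return (len(preserved) / len(original)) / rate


def msgr(original, in_sample):
    """Share of new structures among those found in the sample (ASSUMED formula)."""
    in_sample = set(in_sample)
    if not in_sample:
        return 0.0  # nothing was found, so nothing new was generated
    new = in_sample - set(original)
    return len(new) / len(in_sample)


if __name__ == "__main__":
    original_rims = set(range(10))   # ten rims in the original graph
    sample_rims = {0, 1, 2, 10}      # three preserved, one newly generated
    p = mspr(original_rims, sample_rims, rate=0.3)
    g = msgr(original_rims, sample_rims)
    print(p, g, p >= 0.9, g <= 0.5)
```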
Hypothesis verification. H1 was partially confirmed. The MSPR results reflected that these algorithms generally had low ability in preserving minority structures but a few algorithms can effectively preserve a certain minority structure type. For the four types of minority structures, the MSPR results of super pivots (µ = 0.6733) and huge stars (µ = 0.4387) were better than those of rims (µ = 0.2835) and ties (µ = 0.1199). The reason was that the former two were commonly embedded in communities with relatively stable neighborhood structures, whereas the latter two were located at the margin or boundary areas of communities with sparse and unstable neighborhoods. From a single-algorithm perspective, RDN, RPN, DPL, and RAS performed well (MSPR ≥ 0.9) in super pivots because they were in favor of high degree nodes [26,46,59]. TIES also performed well in super pivots because of its graph induction step [3]. SST performed well in rims because the use of spanning trees maintained peripheral nodes [36].
H2 was partially confirmed. The MSGR results reflected that these algorithms can effectively suppress the generation of new super pivots (µ = 0.3212) but hardly inhibited the generation of new huge stars (µ = 0.6304), rims (µ = 0.8231), and ties (µ = 0.6947). Many algorithms performed well (MSGR ≤ 0.5) in super pivots because other structures rarely degenerated to pivots after sampling. However, pivots may degenerate to stars when all edges between neighbors were lost. Thus, only seven algorithms performed well in huge stars. Furthermore, peripheral nodes and CS-overlapping structures had chances to degenerate to rims and ties, respectively. RMSC was the only algorithm that effectively suppressed the generation of new ties. Its breadth-first and multi-snowball strategy effectively preserved the connections between communities [26].
H3 was fully confirmed. Most of the MIP medians in Table 2 were poor (lower than 0.5), indicating that no algorithms can guarantee the preservation of important minority structures for two reasons. First, originally important minority structures were not well preserved. Second, newly generated minority structures became important in samples.
Other findings. The graph data sets, sampling rates, and initial seeds affected minority structure preservation to some extent. (1) Data sets. A graph with a single cluster or multiple clusters in approximate sizes is called a balanced graph. A graph with multiple clusters that present a wide difference in size is called an unbalanced graph. We found that the results on unbalanced graphs were worse than those on balanced graphs, because small clusters that contained important ties or chain-like rims in unbalanced graphs were not effectively maintained. (2) Sampling rates. The scores of the three indicators generally improved when the sampling rate increased because high sampling rates resulted in highly completed structures. (3) Initial seeds. The high-degree type of initial seeds generally performed better than the other three types of initial seeds because the former provided numerous available paths for sampling.

NEW ALGORITHM PROPOSAL
The results of the experimental study confirm that the existing algorithms can hardly preserve minority structures. In this section, we introduce a new algorithm called MCGS.

Definitions and Notations
A graph is denoted by $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents nodes, $E = \{e_1, e_2, \ldots, e_m\}$ represents edges, and an edge $e = (v_i, v_j)$ connects nodes $v_i$ and $v_j$. This work focuses on scale-free graphs [2] and stipulates that graphs are simple, unattributed, undirected, and connected to simplify the representations.
A graph sample is denoted by $G_s = (V_s, E_s)$, where $V_s$ is a subset of nodes ($V_s \subset V$) and $E_s = (V_s \times V_s) \cap E$. Considering a node-based sampling strategy, a sampling rate is defined as
$$\Phi = \frac{|V_s|}{|V|}.$$
We use $\Omega = \{P, S, R, T\}$ to represent the four types of minority structures in $G$, where $P$ represents the set of all super pivots, notated by $P = \{p_1, \ldots, p_l\}$, whereas $S$, $R$, and $T$ represent the sets of all huge stars, rims, and ties, respectively. In the following, $\Gamma(v)$ denotes the set of one-step neighbors of a node $v$. The four types are defined as follows:
A super pivot is a node whose one-step neighbor nodes have at least one interconnection, with its degree within the global top 5% as represented by $\mu$. Super pivot is notated as $\forall p_i \in P$: $|\Gamma(p_i)| \geq \mu \wedge \exists\, v_j, v_k \in \Gamma(p_i): (v_j, v_k) \in E$.
A huge star is a node whose one-step neighbor nodes are not connected to one another, with its degree above the global mean as represented by $\varepsilon$. Huge star is notated as $\forall s_i \in S$: $|\Gamma(s_i)| > \varepsilon \wedge \forall\, v_j, v_k \in \Gamma(s_i): (v_j, v_k) \notin E$.
A rim is a node appearing at community margins and forming a parachute-like visual shape with its one-step neighbors or a sequence of nodes forming a chain-like visual shape, notated as $\forall r_i \in R$: (1) $V_{r_i} \subset V_{c_i}$, where $C = \{c_1, \ldots, c_n\}$ is the set of all disjoint communities created from $G$, and $V_{c_i}$ is the node set of $c_i$; (2) $\exists v \in V_{r_i}: v \in V_{cut} \wedge |\Gamma(v)| \geq 2$, where $V_{cut}$ is the set of cut points of $G$; and (3) [condition not recovered from the source].
A tie is a sequence of nodes that bridge any two communities and form a chain-like visual shape, notated as $\forall t_i \in T$: [conditions not recovered from the source].

[Table 2. Results of the experiments in Sections 4, 6.1.1, and 6.1.2 in terms of the medians of indicators (columns) and algorithms (rows). P, S, R, and T represent super pivot, huge star, rim, and tie, respectively. Blue indicates the winners in significance tests among algorithms. Bold indicates that the value is better than the empirical good indicator threshold (only for MSPR, MSGR, and MIP). Using the first column as an example, significant differences are found among algorithms on MSPR and super pivot. RDN, RPN, TIES, and MCGS are the winners. Six algorithms obtain good results of MSPR on super pivot. Column groups: Indicators for Minority Structure Preservation (MSPR, MSGR, and MIP, each reported per structure type) and Indicators for Majority Structure Preservation (KSD, SDD, RCC, and JI). Table body not recovered.]

Design Considerations
On the basis of the experience and results of the experimental study, we formulate six key points to be considered in the algorithm design.
C1: Identifying minority structures. Effectively preserving minority structures through global random sampling is difficult because minority structures only involve a small proportion of nodes. A feasible way is to identify and maintain them in advance.
C2: Preserving important minority structures. Prioritizing important minority structures is necessary for two reasons, that is, reserving space in a sample for subsequent majority structure sampling and minimizing the influence of random sampling on minority structure preservation.
C3: Preserving neighbors of minority structures. Simply selecting the nodes of minority structures themselves is insufficient because they are no longer minority structures once they lose their neighbors.
C4: Balancing minority and majority structures. The absence of majority structures invalidates the sample because minority structures coexist with majority structures.
C5: Suppressing the generation of new minority structures. Existing algorithms produce new minority structures, possibly leading to a misjudgment on the graph by sample analysis.
C6: Improving robustness and scalability. For robustness, the influence of parameter settings and graph data sets on sampling should be minimized. For scalability, sampling should be completed within an acceptable time on large-scale graphs.

Algorithm Pipeline
The proposed MCGS algorithm consists of four steps, as shown in Figure 2.
STEP1. Minority structure identification. Given a graph G, a sample $G_s$, and a sampling rate Φ, we identify minority structures in G by using two newly designed algorithms (Section 5.4.1). This step outputs the four sets of minority structures {P, S, R, T} that contain all super pivots, huge stars, rims, and ties in G, respectively.
STEP2. Minority structure ranking. We initially rank the four sets of minority structures separately in descending order of importance by using our proposed importance assessment criteria (Section 5.4.2) and quick sort. Then, we select the most important ones in each of the four sets based on Φ/α, where α is a constant that controls the quantity of preserved minority structures and is set to 1 by default. Specifically, a smaller α indicates more minority structures to be preserved. For example, given a graph with 3 ordered rims and Φ = 50%, we pick the top two important ones because 3 × (0.5/1) = 1.5, which is rounded up to 2. This step outputs $\{P_{im}, S_{im}, R_{im}, T_{im}\}$.
STEP3. Minority structure sampling. We directly put all nodes in $\{P_{im}, S_{im}, R_{im}, T_{im}\}$ into $G_s$ and then randomly select a proportion of their one-step neighbor nodes into $G_s$ by using an improved RAS sampling (Section 5.4.3). This step outputs an incomplete $G_s$ that contains the nodes of all important minority structures and parts of their neighbors.
STEP4. Majority structure sampling. We propose a greedy strategy that selects nodes from G into $G_s$ to maximize the similarity between $G_s$ and G. After reaching the sampling rate, we preserve all edges of the subgraph of G induced by the nodes in $G_s$ to suppress the generation of new minority structures (see Section 5.4.4). This step outputs the completed $G_s$.
We also provide an optional additional step, that is, unbalanced graph processing. If G is identified as an unbalanced graph, then G will be divided into several subgraphs, and the above sampling process will be conducted on each of them. This step can reduce the influence of unbalanced graphs on the minority structure preservation (C6). This step is optional because the majority of graphs are balanced. We use a method based on the gradient boosting decision tree [24,50] and a multilevel partitioning method [6] in this step. Supporting information for this step is provided in the supplementary material.

Minority Structure Identification
STEP 1 is to identify minority structures in G (C1). We propose two fast identification algorithms because straightforward methods are generally time-consuming (C6). A straightforward method of identifying pivots and stars is to seek out high degree nodes and, for each of them, check whether edges exist between its neighbors. If no edge exists, then the node is a star; otherwise, it is a pivot. This method is time-consuming with a time complexity of O(n^2), where n is the number of nodes in G. We design a triangle-based algorithm inspired by two ideas: a high degree node is either a pivot or a star, which allows pivots and stars to be detected simultaneously; and a node that forms a triangle with any two of its neighbors cannot be a star, which accelerates the process.
This algorithm consists of five steps. (1) Given a graph, a depth-first (DF) traversal starts from any node. (2) For a visiting node, we check whether it forms a triangle with its predecessor and the node preceding the predecessor. If so, then we mark all three nodes, such as the nodes marked with hollow dots in Figure 3(a-1) and 3(a-2). (3) After the traversal, unmarked high degree nodes are identified as stars, such as the nodes e and f in Figure 3(a-3). (4) We identify the high degree nodes in G that are not in the set of unmarked nodes (i.e., the marked ones) as pivots. (5) We extract pivots with degrees within the global top 5% as super pivots and stars with degrees above the global mean as huge stars, such as the huge star e in Figure 3(a-4). The time consumption of this algorithm mainly arises from the DF traversal with a complexity of O(n + m), where n and m are the numbers of nodes and edges in G, respectively.
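Steps (1) to (4) of the triangle-based algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_high_degree_nodes` and the `degree_threshold` parameter are hypothetical (the text does not say how "high degree" is decided), neighbors are visited in sorted order only to make the traversal reproducible, and step (5) is omitted.

```python
from collections import defaultdict


def classify_high_degree_nodes(edges, degree_threshold):
    """Label each high degree node as 'pivot' or 'star' via DFS triangle marking.

    degree_threshold is an assumed cutoff for "high degree".
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    parent = {}     # DFS predecessor of every visited node
    marked = set()  # nodes found on at least one triangle
    for root in list(adj):  # restart the traversal in every component
        if root in parent:
            continue
        parent[root] = None
        stack = [(root, iter(sorted(adj[root])))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt in parent:
                    continue
                parent[nxt] = node
                grand = parent[node]
                # Step (2): does the newly visited node close a triangle with
                # its predecessor and the predecessor's predecessor?
                if grand is not None and grand in adj[nxt]:
                    marked.update((nxt, node, grand))
                stack.append((nxt, iter(sorted(adj[nxt]))))
                break
            else:
                stack.pop()  # every neighbor of this node has been explored

    # Steps (3)-(4): unmarked high degree nodes are stars, marked ones pivots.
    labels = {}
    for node, nbrs in adj.items():
        if len(nbrs) >= degree_threshold:
            labels[node] = "pivot" if node in marked else "star"
    return labels
```

Because only the predecessor and its predecessor are checked, the traversal touches each edge a constant number of times, which matches the O(n + m) cost stated above. The check is a heuristic: a triangle whose nodes are not consecutive on the traversal path is not marked.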
A straightforward method to identify rims and ties is to use community information. However, community detection methods are time-consuming and complicated in parameter tuning. For example, it is difficult to determine the number of communities, which directly influences the identification of rims and ties (C6). We propose to use cut-point information because both a rim and tie have at least one cut point. A cut point is a node whose removal will cause the relevant connected subgraph to be disconnected. Figure 3(b-1) highlights all cut points in a graph.
The cut-point-based algorithm has four steps. (1) Given a graph, we obtain all cut points by DF traversal [23] to generate an induced subgraph denoted as G cut = (V cut , E cut ). (2) We merge each connected component of G cut into a hyper node, as shown in Figure 3(b-2) and 3(b-3). (3) We identify a hyper node that contains only one original node as a parachute-like rim. (4) A hyper node that contains multiple original nodes is a chain structure. If any end node of the chain has one and only one neighbor with degree 1 in G, then all nodes in the chain together with the neighbor are identified as a chain-like rim; otherwise, all nodes in the chain are regarded as a tie. The time complexity of this algorithm is O(n + m). In addition, large parachute-like rims may be identified as super pivots or huge stars. This case rarely appears because the degree of a rim is generally not high. In our work, such large rims are counted as not only rims but also super pivots or huge stars.
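The four steps above can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `articulation_points` and `find_rims_and_ties` are hypothetical, cut points are found with the standard low-link depth-first search (recursive here, so suitable only for small graphs), and an "end node" of a chain is assumed to be a node with exactly one neighbor inside its hyper node.

```python
from collections import defaultdict


def articulation_points(adj):
    """Return all cut points of an undirected simple graph (recursive DFS)."""
    disc, low, cut = {}, {}, set()

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                       # back edge
                low[u] = min(low[u], disc[v])
            else:                               # tree edge
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
        if parent is None and children > 1:
            cut.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return cut


def find_rims_and_ties(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cut = articulation_points(adj)              # step (1)

    result = {"parachute_rims": [], "chain_rims": [], "ties": []}
    seen = set()
    for start in cut:
        if start in seen:
            continue
        # Step (2): one connected component of G_cut becomes one hyper node.
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u] & cut:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(comp) == 1:                      # step (3)
            result["parachute_rims"].append(comp)
            continue
        # Step (4): assumed definition of chain ends (one neighbor in the chain).
        ends = [u for u in comp if len(adj[u] & comp) == 1]
        leaf = None
        for u in ends:
            leaves = [v for v in adj[u] if len(adj[v]) == 1]
            if len(leaves) == 1:                # one and only one degree-1 neighbor
                leaf = leaves[0]
                break
        if leaf is not None:
            result["chain_rims"].append(comp | {leaf})
        else:
            result["ties"].append(comp)
    return result
```

No community detection is involved, so there is no number-of-communities parameter to tune; the only inputs are the edges of the graph.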

Minority Structure Ranking
The essence of STEP 2 is to determine importance assessment criteria for each minority structure type (C2). The results of the pilot user study reflect that the degree or size of a minority structure is directly related to its visual importance. We adopt this empirical result, which is simple and efficient. For a super pivot or a huge star, we stipulate that its importance is proportional to its degree. The importance of a parachute-like rim is proportional to the number of its neighbors with a degree of 1. For chain-like rims, a longer chain is more important. For a tie, we consider two factors, namely, the chain length and the number of neighbors connecting to both ends of the chain.
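The ranking and selection in STEP 2 can be sketched as below. The helper name `select_important` and the `importance` callable are hypothetical, and rounding the quota up is an assumption inferred from the pipeline example, where 3 rims at Φ = 50% and α = 1 yield two selected rims.

```python
import math


def select_important(structures, importance, phi, alpha=1.0):
    """Rank structures by importance and keep the top ceil(n * phi / alpha).

    importance: callable mapping a structure to a score, e.g. its degree for
    a super pivot or its number of degree-1 neighbors for a parachute-like rim.
    """
    ranked = sorted(structures, key=importance, reverse=True)
    # round() guards against floating-point noise before taking the ceiling.
    k = math.ceil(round(len(ranked) * phi / alpha, 9))
    return ranked[:k]
```

For instance, with three rims whose degree-1 neighbor counts are 5, 9, and 2, Φ = 0.5 and α = 1 keep the two rims with counts 9 and 5, whereas α = 0.5 keeps all three.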

Minority Structure Sampling
STEP 3 is to preserve the neighborhood structures of important minority structures (C3). RAS provides an efficient and effective way that directly maintains all one-step neighbors [89]. However, this way may include too many nodes in a sample, thereby reaching the sampling rate too early (C4). We propose to add a specific condition to RAS sampling. The number of preserved neighbors for a minority structure should satisfy the condition $|\Gamma_s(ms_i)| = |\Gamma(ms_i)| \cdot \Phi / \beta$, where $ms_i$ denotes the key nodes of the minority structure i, $\Gamma(ms_i)$ is the set of neighbors of $ms_i$, and $\Gamma_s(ms_i)$ is the set of preserved neighbors of $ms_i$. For a super pivot, huge star, or parachute-like rim, $ms_i$ includes only one node, whereas for a chain-like rim or tie, $ms_i$ includes the end nodes of the chain. β is a constant that controls the quantity of preserved neighbors; it is set to 2 by default to reserve half of the sample space for majority structure preservation (STEP4). Using α and β together can tune the ratio of preserved minority structures to majority structures.
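The neighbor quota can be illustrated with a short sketch. The function name `sample_neighbors` is hypothetical, and rounding the quota down to an integer is an assumption, since the text does not state how a fractional quota is handled.

```python
import math
import random


def sample_neighbors(neighbors, phi, beta=2.0, rng=None):
    """Randomly keep |neighbors| * phi / beta one-step neighbors."""
    rng = rng or random.Random()
    pool = sorted(neighbors)  # fixed order so a seeded rng is reproducible
    # Assumed rounding rule: round down after removing floating-point noise.
    quota = math.floor(round(len(pool) * phi / beta, 9))
    quota = min(quota, len(pool))
    return set(rng.sample(pool, quota))
```

With 8 neighbors, Φ = 0.5, and the default β = 2, two neighbors are kept; lowering β to 1 keeps four.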

Majority Structure Sampling
An incomplete sample G s that contains all important minority structures and part of their neighbor nodes is obtained after STEP 3. STEP 4 aims to further add nodes and edges from G to G s , making the completed sample G s as similar to G as possible (C4). This task can be described as an optimization problem. Given a graph G = (V, E), an incomplete sample G s = (V s , E s ), and a sampling rate Φ, an optimal node set V op and an edge set E op are added to G s to ensure that the completed sample G s can effectively represent G, where $V_{op} \subset V - V_s$, $E_{op} \subset E - E_s$, and $|V_{op}| = |V| \cdot \Phi - |V_s|$. The objective is to minimize the deviance between G and the completed G s . This problem is NP-hard. Classical optimization algorithms, such as genetic algorithm [53] and simulated annealing [39], are candidates for solving the problem but commonly have high computational costs and complicated parameter settings (C6). We adopt a greedy strategy that has no parameters, a high speed, and desired effects.
The strategy consists of four steps. (1) For each node in V − V s , we tentatively add it to V s and then calculate the deviance between G and G s (the subgraph of G induced by V s ), called the loss. (2) We find the minimum loss obtained in (1) and formally add the corresponding node to V s . (3) We repeat (1) and (2) until | V s | reaches | V | × Φ. (4) We output the subgraph of G induced by V s as the completed sample G s . Because each round evaluates every remaining candidate node, the strategy performs on the order of Φ | V |^2 loss evaluations in total.
The induction step is important. It can repair the neighborhood structures of sampled nodes and make the incomplete structures close to those of the original graph [3]. As a result, the neighborhoods of potential new stars, rims, and ties can be repaired, thereby effectively suppressing the generation of new minority structures (C5). This design is inspired by RDN, RPN, and TIES, which achieved distinguished performance on the MSGR indicator in the experimental study because they adopt an induction step.
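The greedy strategy and the final induction step can be sketched as follows. The names `greedy_sample` and `induced_edges` are hypothetical, the loss is passed in as a callable so that any of the objectives can be plugged in, and rounding the target size | V | × Φ down is an assumption.

```python
import math


def induced_edges(edges, sample):
    """Edges of the subgraph of G induced by the sampled node set."""
    return [(u, v) for u, v in edges if u in sample and v in sample]


def greedy_sample(nodes, initial_sample, phi, loss):
    """Grow the sample one node at a time, always taking the minimum-loss node.

    loss: callable taking a candidate node set and returning a number
    (lower means the induced sample is closer to the original graph).
    """
    nodes = list(nodes)
    sample = set(initial_sample)
    target = math.floor(round(len(nodes) * phi, 9))  # assumed rounding rule
    candidates = set(nodes) - sample
    while len(sample) < target and candidates:
        # Steps (1)-(2): tentatively add each candidate, keep the best one.
        # sorted() makes ties deterministic (lowest node label is chosen).
        best = min(sorted(candidates), key=lambda v: loss(sample | {v}))
        sample.add(best)
        candidates.remove(best)
    return sample  # step (4) is induced_edges(edges, sample)
```

As a toy usage, on the path graph 0-1-2-3-4-5 with initial sample {0}, Φ = 0.5, and an illustrative loss equal to the negated number of induced edges, the strategy adds node 1 and then node 2. Restricting each round to a random subset of candidates, as suggested later in the time performance discussion, would change only the `min` call.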
The definition of the loss function is critical in the greedy strategy. The ultimate goal of majority structure sampling is to make G s as similar to G as possible. Such similarities are commonly measured by the metrics mentioned in Section 2.3. We reference three popular metrics, namely, DD [34], NCC [51], and JI [20], to propose three objectives as follows.
(1) We use the mean square error (MSE) of degrees between G s and G to depict the similarity of DD.
(2) We use NCC(G s ) to represent the number of connected components of G s , which can measure the similarity of connectivity between G s and G because G is assumed to be connected in this work and thus NCC(G) = 1. For a graph with Laplacian matrix $L \in \mathbb{R}^{n \times n}$ and singular values $\sigma_i(L)$, the number of connected components equals the number of zero singular values of L.
(3) We use the Jaccard Index (JI) to measure the structural similarity between G s and G.
As a result, our loss function is defined as:
$loss = \omega_1 \cdot Scaler(MSE) + \omega_2 \cdot Scaler(NCC(G_s)) + \omega_3 \cdot Scaler(JI(G, G_s))$,
where $\omega_i$ is a weight coefficient, $\omega_i \in [0, 1]$, $\sum_{i=1}^{3} \omega_i = 1$; and Scaler is a normalization process that reduces the influence of the magnitude difference among the three objective functions.
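The weighted combination can be illustrated with a small sketch. The text does not specify Scaler, so min-max normalization across the candidate nodes of one greedy round is used here purely as an assumption; the names `min_max_scale` and `combined_loss` are hypothetical, and the three input lists are taken as already computed objective values, one entry per candidate.

```python
def min_max_scale(values):
    """Assumed Scaler: map the values of one objective onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_loss(mse, ncc, ji, weights=(1.0, 0.0, 0.0)):
    """Weighted sum of the three scaled objectives, one loss per candidate."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    columns = [min_max_scale(mse), min_max_scale(ncc), min_max_scale(ji)]
    return [
        sum(w * column[i] for w, column in zip(weights, columns))
        for i in range(len(mse))
    ]
```

With weights 1:0:0 only the degree objective contributes, so the other two objectives need not be computed at all in that configuration.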
The computation of the loss function can be accelerated (C6). When calculating MSE, we need only consider the incremental information of adding a new node into G s each time. We can use the union-find algorithm [25] to immediately obtain the number of disjoint sets of a graph for NCC. We can adjust the three weight coefficients on demand to involve only one or two objectives in the computation. We set them to 1:0:0 by default. Empirically, no single sampling method can simultaneously achieve optimal results on the three objectives [13,17].
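The union-find acceleration for NCC can be sketched as follows: adding a node creates one new set, and each edge to an already sampled neighbor merges at most two sets, so the component count is updated without re-traversing the sample. The class and function names are illustrative and not taken from the paper.

```python
class UnionFind:
    """Disjoint sets with a running count of components."""

    def __init__(self):
        self.parent = {}
        self.components = 0

    def add(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.components += 1

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b
            self.components -= 1


def add_node(uf, adj, sample, node):
    """Add node to the sample and return the updated NCC of the induced sample."""
    uf.add(node)
    for nbr in adj.get(node, ()):
        if nbr in sample:
            uf.union(node, nbr)
    sample.add(node)
    return uf.components
```

For the path 0-1-2, adding node 0 and then node 2 yields two components, and adding node 1 afterwards merges them into one.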

EVALUATION
We evaluated the proposed MCGS algorithm through an objective performance analysis, a subjective assessment, and case studies.

Objective Performance Analysis
The performance analysis had three experiments with different indicators. The data, reference algorithms, execution conditions, result processing, and apparatus of the experiments were consistent with those of the experiment introduced in Section 4.

Minority Structure Preservation Performance
Indicators in this experiment were MSPR, MSGR, and MIP, which can evaluate the performance of minority structure preservation (Section 4.3). The results of MCGS are shown in the last row of Table 2. We conducted 12 groups of significance tests (3 indicators × 4 minority structure types) for MCGS and the 20 reference algorithms. We initially used Shapiro-Wilk tests for each group to examine the normality of the experimental results of each algorithm on the 10 graphs and four sampling rates. The results did not follow a normal distribution. Then, we used a non-parametric Friedman test for each group. Significant differences (p < 0.05) were found in all the 12 groups. Finally, we used a Dunn-Bonferroni test to identify the winners in each group.
The results show that MCGS won 12 times in the significance tests and performed best 11 times in terms of indicator medians. MCGS underperformed RMSC on the MSGR medians of ties because of the superiority of RMSC in suppressing new ties. MCGS did not obtain a good MIP median on rims because MCGS could not completely avoid the generation of new minority structures, and new rims were most apt to be produced among the four types. In summary, MCGS performed best overall compared with the 20 reference algorithms. It could effectively preserve the four types of minority structures, suppress the generation of new minority structures, and prevent the loss of important minority structures.

Majority Structure Preservation Performance
Indicators in this experiment were Kolmogorov-Smirnov distance (KSD) [52], skew divergence distance (SDD) [44], reciprocal of NCC (RCC) [64], and JI [20]. They are commonly used in graph sampling evaluations (Section 2.3). KSD and SDD measure the difference of degree distributions between a graph and a sample, RCC measures the connectivity differences before and after sampling, and JI measures the similarity of graphs. The four indicators range from 0 to 1. Small values for KSD and SDD and large values for RCC and JI are good. The experimental results are shown in the four rightmost columns of Table 2. Notably, we tested significant differences but did not provide empirically good thresholds in the results.
Significant differences were found in the four indicators among the 21 algorithms. MCGS was the winner on KSD, RCC, and JI, indicating that MCGS did not pursue minority structure preservation at the expense of majority structure preservation. For SDD, MCGS obtained a satisfying median. Thus, the performance of MCGS in preserving majority structures was fairly satisfactory.
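As a small illustration of the KSD indicator, the sketch below computes the standard two-sample Kolmogorov-Smirnov distance, that is, the largest gap between the empirical cumulative distributions of two degree sequences. The function name `ks_distance` is illustrative; the exact evaluation code used in the experiments is not given in the text.

```python
from bisect import bisect_right


def ks_distance(degrees_a, degrees_b):
    """Maximum absolute difference between two empirical degree CDFs."""
    a, b = sorted(degrees_a), sorted(degrees_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )
```

The value is 0 for identical degree distributions and approaches 1 as the distributions separate, which is why small KSD values are considered good.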

Time Performance
The indicator in this experiment was the mean time consumption of 20 samplings (4 seed types × 5 runs) performed by an algorithm on a graph under a sampling rate. We tested 21 algorithms, 10 graphs, and four sampling rates. Due to the page limit, we only show the results of MCGS and seven reference algorithms under a sampling rate of 30% on four graphs (Table 3). More results are provided in the supplementary material. We found that the eight algorithms can be divided into two groups. The first group included TIES, RMSC, FF, and RDN. Their time consumptions were lower than those of algorithms in the second group, which included DPL, MCGS, DLAS, and SST. The main reason was that the algorithms of the second group commonly had additional computation steps during random sampling. For example, DPL detected communities and SST generated spanning trees. Such steps in MCGS (i.e., minority structure identification, importance ranking, and loss function computation) were not very time-consuming. Therefore, MCGS presented relatively good time consumptions in the second group. Moreover, MCGS was very fast on the Facebook1912 and Facebook107 graphs due to its insensitivity to the scale of edges. In summary, the time performance of MCGS was at the low-medium level in the experiment. The time performance can be further improved by adopting parallel computations or by simplifying the greedy strategy so that the optimal node is selected from a random subset of nodes rather than from all remaining nodes.

Subjective Assessment
We recruited the 20 participants in the pilot user study again to conduct a subjective assessment experiment. They were asked to assess similarities between a graph and samples by perceiving node-link diagrams [80] and rating on six metrics. Three of the metrics were similarities of the overall shape, community, and connectivity for assessing majority structure preservation. The other three metrics were similarities of high degree, margin, and boundary structures related to the preservation of minority structures. We selected eight popular graphs from the 34 graphs used in the user study. We selected the proposed MCGS and five reference algorithms, namely, RDN, TIES, FF, RW, and SST, most of which performed relatively well in the previous experiments. We used the same high degree nodes as initial seeds. The sampling rate was set at 30%, which is empirically suitable for visual perception [40,41]. A graph and six randomly arranged samples were presented at a time (Figure 4). The participants rated each sample on the six metrics with a five-point Likert scale ranging from 1 (the lowest similarity) to 5 (the highest similarity). [...], and SST (µ = 2.9). MCGS obtained the highest average rating six times, followed by TIES and RDN with relatively high average ratings. The results reflected that the minority structure preservation affected perceived similarities to a certain extent and the MCGS samples achieved considerable perceived similarities. In the interview, we focused on similarity judgment principles. Most of the participants stated that they initially observed the overall shape and connectivity and then considered visually prominent minority structures.

Case Studies
We used three popular graph data sets to demonstrate the features of MCGS. The reference algorithms, initial seeds, sampling rate, and layout method were consistent with the subjective assessment experiment. In node-link diagrams, the relative locations of nodes in a sample are consistent with those of the corresponding nodes in the original graph. Additional cases are provided in the supplementary material.

AS-733 Graph Data Set
The AS-733 graph data set [48] is an autonomous systems network on the Internet with 6,474 nodes and 13,895 edges. The original graph and samples obtained by RDN, SST, and MCGS are shown in Figure 5.
We marked four SOIs popular in the pilot user study, as shown in Figure 5(a). SOI-1 and SOI-2 were the first and second largest communities in the graph, respectively. SOI-3 was a visually prominent super pivot in a relatively sparse area. SOI-4 was a huge star far away from the two communities. For the RDN sample in Figure 5(b), the overall shape and density distribution of the original graph were considerably preserved. The five super pivots in SOI-1 were maintained, but only the top two super pivots kept the same order as in the original graph. The super pivot in SOI-3 was well preserved. The huge star in SOI-4 was retained but lost many neighbors.
The SST sample in Figure 5(c) was of low similarity with the original graph in the overall shape and density distribution. The second largest community in SOI-2 and the huge star in SOI-4 disappeared. The five super pivots in SOI-1 were maintained. [...] were preserved in order; however, most of their preserved neighbors were nodes with a degree of 1, and many inter-connections in the community were lost. This situation was also reflected in SOI-3.
For the MCGS sample in Figure 5(d), the overall shape and the density distribution were well preserved. Among the three algorithms, RDN performed best on majority structure preservation, followed by MCGS. MCGS performed best on preserving super pivots and huge stars, especially for the maintenance of the importance order of super pivots. SST only performed well in preserving one-degree neighbors of super pivots.

Cpan Graph Data Set
The Cpan data set is a collaboration network with 839 nodes and 2,127 edges [1]. It depicts the relationships between the developers using the same Perl modules. The original graph and samples obtained by FF, TIES, and MCGS are shown in Figure 6. This case focused on the preservation of parachute-like rims at marginal areas. The top four important rims in the original graph were marked as a > b > c > d in descending order in Figure 6(a). The FF sample in Figure 6(b) presented an overall shape dissimilar to the original graph, and only one of the four important rims and another rim were maintained. The TIES sample in Figure 6(c) presented an overall shape highly similar to the original graph, and the four important rims were preserved with a changed importance order (a > d > b = c) and unclear parachute shapes. For the MCGS sample in Figure 6(d), the preservation of the overall shape was slightly worse than that of the TIES sample. The four rims were preserved with clear parachute shapes, and their importance order was completely maintained (a > b > c > d).

Facebook1684 Graph Data Set
The Facebook1684 graph data set is an online social network with 775 nodes and 14,006 edges. The original graph and six samples are shown in Figure 4. This data set is an unbalanced graph in which two large communities include 705 nodes and two small communities/cliques contain only 70 nodes. This case focused on the preservation of ties between communities in an unbalanced graph. The SST (a), TIES (e), and RDN (f) samples maintained a few nodes in the two small communities and lost the connections between small and large communities. The FF (b) and RW (d) samples lost the two small communities and generated new chain-like rims at the margins of the largest communities. The MCGS sample (c) was the only one that effectively preserved the two small communities and the connections between small and large communities.

DISCUSSION
In this section, we discuss the limitations of this work and suggest directions for further work.

Limitations
We mainly used scale-free graphs in this work, such as social, communication, and web networks. Whether MCGS is applicable to other types of graphs, such as biological networks [4], bipartite graphs [5], and signed networks [47], remains to be investigated. Our definitions of the four minority structures could be different in other graph types.
The experimental study demonstrated that the existing algorithms had a low ability to preserve minority structures. However, two points should be noted. (1) Existing algorithms did not perform well simply because they were not originally designed for minority structure preservation. They still have distinguished competences in diverse application scenarios [29], such as RW in large graph estimation and RE in computational cost reduction.
(2) Our MCGS is mainly suitable for graph analyses oriented to minority structures, such as graph visualization [12,65] and anomaly detection. Its ability for other scenarios remains unknown.
In the pilot user study, we ranked the eight SOI types only depending on the number of entries. In practice, the importance of SOI types should be considered. For example, the BS type should rank first in network vulnerability analysis. Likewise, the definitions of minority structure types could be adjusted on demand [70,83]. For example, HD-global structures need not be subdivided into super pivots and huge stars in some cases. A tie was strictly defined as a single chain bridging two communities in this work, but communities may be connected by multiple chains.
In the evaluation, the experiment of majority structure preservation was not fully comprehensive. Some common metrics, such as clustering coefficient distribution and connected component size distribution [46,64], were not included. In the subjective assessment experiment, we presented six samples at a time to facilitate a convenient comparative perception, but this manner may cause differentiated ratings. An alternative is to show one sample at a time. Moreover, we tested the proposed MCGS on large-scale graphs (see the supplementary material). The experimental results showed that the graph visualizations after sampling still presented severe visual clutter, even when the sampling rates were very low, which reflects that sampling in the data space may not be adequately suitable for visualizing large-scale graphs. It is worth exploring whether sampling in the visual space can solve this challenge.

Further Work
In the pilot user study, advanced techniques could be adopted in the future, such as using crowdsourcing approaches [9] to involve more participants and using eye tracking [10] to improve our manual SOI classification.
In the experimental study, a ranking preservation indicator can be designed to evaluate the matching accuracy of the rankings of important minority structures before and after sampling, referring to the normalized discounted cumulative gain in the recommendation system community [11]. Sampling rates with a short interval and a wide range should be examined to find an appropriate sampling rate for minority structure preservation.
In the algorithm study, we plan to properly modify MCGS to extend its applied scope from undirected graphs to directed graphs or from scale-free graphs to other types of graphs [38]. We plan to add new types of minority structures and new objectives of loss function into the sampling process. Moreover, comprehensive methods for unbalanced graph identification and partition should be further studied. A layout that can present minority structures distinctly is worth further exploration [19,22].

CONCLUSION
This work investigated the preservation of minority structures in graph sampling. We conducted a pilot user study and identified four representative types of minority structures. We conducted an experimental study and found that existing algorithms cannot effectively preserve the four types of minority structures. We designed a new graph sampling algorithm named MCGS that achieved strong performance in minority structure preservation in a series of experiments. This work is the first investigation of minority structure preservation in graph sampling. We hope this work will be conducive to the research and application of graph analyses oriented to minority structures. We also expect that this work will inspire other researchers to further study minority structure classification, identification, sampling, and visualization.