key: cord-0299805-tdmkhm82
authors: Ferreyra, Nicolas E. Diaz; Hecking, Tobias; Aimeur, Esma; Heisel, Maritta; Hoppe, H. Ulrich
title: Community Detection for Access-Control Decisions: Analysing the Role of Homophily and Information Diffusion in Online Social Networks
date: 2021-04-19
journal: nan
DOI: 10.1016/j.osnem.2022.100203
sha: 8da6f4d44c5c0e5b305e652788b3d4dc144174b0
doc_id: 299805
cord_uid: tdmkhm82

Access-Control Lists (ACLs) (a.k.a. friend lists) are one of the most important privacy features of Online Social Networks (OSNs) as they allow users to restrict the audience of their publications. Nevertheless, creating and maintaining custom ACLs can introduce a high cognitive burden on average OSN users since it normally requires assessing the trustworthiness of a large number of contacts. In principle, community detection algorithms can be leveraged to support the generation of ACLs by mapping a set of examples (i.e. contacts labelled as untrusted) to the emerging communities inside the user's ego-network. However, unlike users' access-control preferences, traditional community-detection algorithms do not take the homophily characteristics of such communities into account (i.e. attributes shared among members). Consequently, this strategy may lead to inaccurate ACL configurations and privacy breaches under certain homophily scenarios. This work investigates the use of community-detection algorithms for the automatic generation of ACLs in OSNs. Particularly, it analyses the performance of the aforementioned approach under different homophily conditions through a simulation model. Furthermore, since private information may reach untrusted recipients through the re-sharing affordances of OSNs, information diffusion processes are also modelled and taken explicitly into account. In addition, the removal of gatekeeper nodes is explored as a strategy to counteract unwanted data dissemination.

Online Social Networks (OSNs) like Twitter or Facebook are virtual spaces that allow people to connect with like-minded others by exchanging different types of media content including posts, links, and pictures [45]. To a large extent, social interaction inside these platforms resembles communication aspects of everyday life. Particularly, the exchange of private information, either inside or outside OSNs, is fundamental for creating and maintaining social relationships [59]. Hence, it is not surprising that people often disclose personal information on social media platforms in order to strengthen their bonds and maximize their social capital [30]. Nevertheless, keeping private information away from untrusted recipients becomes a challenging task for average users since OSNs place the members of disjoint social circles (e.g., family and work colleagues) under the same communication channel [30, 57]. Consequently, this often leads to unintentional privacy breaches due to misalignments between the intended and the actual audience of online publications [50].

Access-Control Lists (ACLs) or just "friend lists" are one of the most salient privacy mechanisms of OSNs since they allow users to constrain the audience of the content they publish online [22, 44] . Basically, ACLs are collections of contacts that are deemed untrusted by the user regarding the access to certain pieces of personal information. In principle, ACLs are an effective way to keep the information disclosed inside posts away from unintended recipients. For example, an ACL composed of work colleagues could be applied to restrict the visibility of a post with a negative comment about one's employer. However, creating custom ACLs can introduce a high cognitive burden on average users since it demands assessing the trustworthiness of a large number of contacts [41, 1] . Furthermore, different ACLs must be created for different types of personal information since individuals' trustworthiness may vary from content to content [17] . Last but not least, keeping ACLs' internal consistency can demand a great effort given that users are likely to add or remove contacts from their network over time [40] .

Access-Control Predictive Models (ACPMs) aim to reduce the burden of manual configurations through the automatic generation of custom ACLs [1] . Particularly, ACPMs leverage a set of classification features (e.g. personal attributes or network structure) to elaborate and recommend ACLs aligned with a set of access-control preferences provided by the user [27, 44] . One approach consists of applying community detection algorithms for identifying clusters of untrusted members inside the user's ego-network (i.e. the network of connections between her friends) [39, 27] . Under this approach, an ACL is created out of the members of the cluster that best fits the user's access-control preferences. In this case, preferences are specified through a set of contacts marked as "untrusted" by the user (Fig. 1) . Such a community-based ACPM is suitable particularly in cases where accessing the personal attributes of network members (e.g. age, gender, workplace) is limited or not possible at all [38, 39] . This is because traditional community-detection algorithms (such as Leading Eigenvector [43] or Multilevel Community [6] ) can identify clusters without requiring information on the attribute values of its nodes (if any).

Although community-based ACPMs only require structural network information for their application, they can nonetheless lead to inaccurate ACL configurations. Particularly, this is due to the influence of homophily on the formation of communities inside a network as well as on people's access-control preferences [55, 12, 19]. Basically, homophily refers to an organization principle in which the members of a network with similar characteristics tend to create connections with each other [35]. On the one hand, such a principle has a large impact on a network's structure and consequently on the formation of its communities [12]. Particularly, it is to be expected that the members of an emerging cluster would hold a certain similarity with regard to a set of attribute values (e.g. location, gender or workplace) [28]. On the other hand, users' access-control preferences are also largely influenced by some of the attribute values shared among network members. For instance, in order to generate an ACL of work colleagues, a user would label as "untrusted" some of the contacts who share her same "workplace" attribute value. However, unlike the user's ACL preferences, traditional community-detection algorithms are not driven by attribute similarity, which causes flaws in the identification of untrusted communities [15, 39, 18]. In particular, they may select best-fit clusters whose members do not portray similarities on the attribute values deemed relevant for the generation of the corresponding ACL [15]. Consequently, such a community-based ACPM may generate inaccurate ACL configurations and privacy vulnerabilities under certain homophily scenarios.

Another factor that can affect the efficacy of ACLs is information diffusion in OSNs. Particularly, OSNs are endowed with affordances that allow their users to re-share content from others, making it available to an audience beyond the one defined by the content owner [23]. This can result in privacy breaches especially when a trusted member makes the user's post available to an untrusted one after resharing it [23, 61]. Nevertheless, despite its importance in terms of privacy and access-control, the dynamics of information diffusion in OSNs are often neglected by ACPMs including community-based ones [2]. Consequently, privacy violations can occur even when ACLs manage to properly identify all the untrusted recipients of a particular piece of private information. This calls for the elaboration of more robust solutions that are not only capable of satisfying users' access-control preferences, but also manage to prevent the propagation of personal information throughout the untrusted segments of a network.

Figure 1: Alice's ego-network. Untrusted friends John, Bill, and Bob are grouped under cluster 1. An ACL containing all the nodes in cluster 1 is generated and recommended to Alice [15].

This work elaborates on the findings reported in [15], further investigating the impact of homophily and information diffusion in community-based ACPMs. Particularly, the performance of this approach is evaluated against different network topologies and information diffusion settings. For this, two simulation experiments are outlined, executed, and interpreted. In the first one, topologies are generated according to particular homophily conditions describing the attachment probability between nodes of similar characteristics. These network configurations are used thereafter to determine the precision, recall, and F1 score of community-detection algorithms when predicting ACLs. Unlike in [15], Links in Context [21], an attribute-based clustering approach, is also considered for performance evaluation along with traditional structure-based methods. In the second experiment, the dynamics of information diffusion (not considered in [15]) are introduced in the generated networks to elaborate countermeasures against unwanted data dissemination. In this case, the removal of gatekeeper nodes (i.e. trusted members linked to untrusted clusters) is investigated through a model of social influence as an approach for reducing the amount of personal information spread across untrusted segments of a network.

The rest of this paper is organized as follows. In the next section, related work on access-control prediction and information diffusion in OSNs is discussed and analysed. Subsequently, Section 3 introduces the theoretical foundations of this paper. In particular, homophily-driven preferential attachment and social influence models are presented and described for later application. Sections 4 and 5 describe the methodology and results of simulation experiments 1 and 2, respectively. Such results together with their limitations are further discussed in Section 6. Finally, in Section 7, we outline the conclusions of this paper and introduce directions for future work.

Controlling the access to personal information remains one of the main privacy challenges of OSNs. Particularly, such a challenge has motivated several contributions dealing with the automatic configuration and recommendation of ACLs. This section discusses related work elaborating on different ACPMs within the current literature. Likewise, advances on information diffusion and social influence are also analysed under the lens of access-control decisions.

Privacy and security scholars have developed several approaches to facilitate the definition of access-control policies in OSNs. Particularly, ACPMs seek to automate the generation of ACLs through machine learning [44, 17, 37, 14, 52] , formal logic [48, 56] , and network analysis [18, 38, 15] among other methods. On a large scale, ACPMs can be classified into community-based [18, 38, 15] or attribute-based [44, 17, 14, 54] , depending on whether they leverage communities or personal attributes for the automatic generation of access-control policies. For instance, Díaz Ferreyra et al. [14] introduced an attribute-based solution in which decision trees are generated to recommend adequate post audiences in OSNs. Under this approach, friends are classified into trusted or untrusted by applying a number of conditional tests over a set of profile attributes such as age, gender, interest and education. A similar strategy is followed by Dong et al. [17] who elaborated on a classifier in which the sharing tendency between users together with the sensitivity of the content being disclosed are employed as audience predictors. Similar attributes along with demographic and location-related information were used by Ni et al. [44] in a machine learning solution that recommends personalized privacy policies for user-generated content in OSNs.

Despite their levels of accuracy, attribute-based solutions display some limitations related to the information on which access-control predictions are made. Particularly, deciding which attributes should be deemed as predictors is not a trivial decision. Furthermore, some attributes may not be available across different OSNs, making it difficult to engineer multi-platform solutions [38]. Conversely, community-based ACPMs do not suffer from these limitations and may be considered less privacy invasive since their predictions are grounded on clustering algorithms that use solely the network structure as input. For example, Misra et al. [38] unveiled untrusted social circles in OSNs using the Clique Percolation Method, an approach which builds up communities out of fully-connected network sub-graphs. Hybrid solutions have also been introduced and discussed along with community and attribute-based ACPMs. Such is the case of Fangfang et al. [52], who elaborated a model for the automatic trust assessment of network members based on both the connections and the attribute similarities among them. Nevertheless, although hybrid solutions can show a high accuracy, their performance tends to decrease as the size of the network becomes larger [16]. Furthermore, a performance comparable to that of hybrid ACPMs can also be achieved by community-based solutions within a shorter time frame [38, 18].

A large body of research has been dedicated to the study of diffusion processes in complex networks [13, 46, 31, 24, 2, 20] . Particularly, a significant number of works have explored approaches for the identification of the most influential users in a network with the aim of maximizing the spread of information [34] . For instance, Dhamal [13] elaborated on influence optimization through an adaptive seed strategy consisting of multiple phases of information diffusion. Under this method, influence maximization is achieved by selecting seed nodes at different stages according to the observed propagation of information over time.

In line with this, Radion Purba et al. [46] introduced an information diffusion model that incorporates the engagement and activeness levels of Instagram users as indicators of their influence susceptibility and influence degree, respectively. Work concerning the identification of diffusion participants can also be found within the current literature. Such is the case of Li et al. [31] who developed a method for spotting members that are likely to forward viral information in Twitter. Particularly, they elaborated a ranking of potential hashtag adopters by analysing and forecasting the use of hashtags across multiple chains of followers.

All in all, influence maximization research is of great value for many real-world applications including online marketing [62], trending topic detection [36], and information summarization [53]. Nevertheless, information diffusion has also been analysed from an influence minimization perspective and applied in areas such as public health policies [63], misinformation [60], and cybersecurity [24, 2, 20]. For instance, Jia et al. [24] proposed limiting the propagation of adversarial content through the deletion of critical nodes and edges in a network. Likewise, Yan et al. [60] employed an edge-removal strategy to counteract the diffusion of rumours across OSNs. Overall, these applications share the common goal of stopping the spread of an agent (viruses, information, etc.) across the members of a network. However, access-control decisions in OSNs introduce the additional challenge of optimizing the utility of information disclosure. That is, maximizing the number of trusted recipients while reducing the number of untrusted ones. Prior work has introduced formal frameworks for analysing this problem and algorithms for approximating its solution (cf. [2, 20]). Still, to the best of our knowledge, such a challenge has not been tackled from a community-detection perspective. Furthermore, it has not yet been analysed under the lens of homophily.

As discussed in Section 2.2, ACPMs often neglect the role of information diffusion when generating custom privacy policies. Moreover, homophily can also affect the performance of community-based approaches since it has a direct impact in the formation of communities inside OSNs.

This work aims to investigate these aspects through simulation models of information diffusion and attributed scale-free networks. Such models and their main characteristics are introduced and discussed in the following subsections.

Large complex networks like the Internet or OSNs have caught the attention of researchers across different fields [3]. Empirical studies have shown that these networks share an important property: only a small number of nodes hold a large number of connections to other nodes, whereas most nodes have just a few [4]. Networks containing such important nodes, or hubs, tend to be scale-free in the sense that the degree of these hubs (i.e. number of connections to other nodes) widely exceeds the average [25, 4]. Up to now, scholars have proposed several evolution models for constructing scale-free networks [26, 28, 10, 5]. Among those models, one of the most prominent is the one introduced by Barabasi and Albert [5]. This model indicates that two simple mechanisms, growth and preferential attachment, are responsible for the emergence of scale-free networks [4]. On the one hand, growth refers to the process in which at each time step a node with $m$ ($\leq m_0$) links is added to the network and connected to pre-existing nodes. Preferential attachment, on the other hand, describes the process by which new nodes prefer to link to the more connected nodes in the network (i.e. the hubs) [4].
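The two mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the generator used in this work: the function name `barabasi_albert`, the choice of a fully connected initial core of `m0` nodes, and the adjacency-dictionary representation are all hypothetical. Sampling uniformly from a list in which each node appears once per incident edge yields degree-proportional (i.e. preferential) attachment.

```python
import random


def barabasi_albert(n, m, m0, seed=None):
    """Grow an undirected graph of n nodes by growth + preferential attachment.

    Assumption: the initial core is a complete graph on m0 nodes, and each
    new node attaches to m distinct existing nodes (m <= m0).
    Returns a dict mapping each node to the set of its neighbours.
    """
    if not (1 <= m <= m0 and 2 <= m0 <= n):
        raise ValueError("need 1 <= m <= m0, 2 <= m0 <= n")
    rng = random.Random(seed)

    adj = {v: set() for v in range(m0)}
    stubs = []  # each node appears once per incident edge
    for u in range(m0):
        for v in range(u + 1, m0):
            adj[u].add(v)
            adj[v].add(u)
            stubs += [u, v]

    # Growth: add one node per time step.
    for new in range(m0, n):
        targets = set()
        # Preferential attachment: picking from `stubs` favours hubs.
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


if __name__ == "__main__":
    g = barabasi_albert(500, 2, 3, seed=42)
    degrees = sorted((len(nbrs) for nbrs in g.values()), reverse=True)
    print("largest degrees:", degrees[:5])
    print("median degree:", degrees[len(degrees) // 2])
```

Running the script typically shows a handful of hubs whose degree far exceeds the median, which is the qualitative signature of a scale-free topology.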

In principle, the preferential attachment mechanism proposed by Barabasi and Albert does not take attribute similarity into consideration. Nevertheless, this approach has been enriched with homophily characteristics by other researchers, resulting in more specific network evolution models [26, 28, 15]. For example, Kim et al. [28] introduced a group-openness mechanism for modelling the homophily and attachment probability between two nodes in a network. Under this approach, a node characterised with a given attribute value is considered a member of the group defined by that value. Then, the attachment probability between two nodes is computed as a function of the openness factor Λ between their respective groups. Such a factor can adopt values between 0 and 1 and indicates how closed (or open) a node in one group is to creating links with nodes in the other [29]. Diaz Ferreyra et al. [15] adopted and extended this mechanism to situations in which nodes are characterized through multiple attribute values and therefore deemed members of more than one group. Particularly, homophily across different groups is expressed through an openness matrix composed of the openness factors of all pairs of attribute values available in the network (see Appendix A for a detailed description). The corresponding preferential attachment model is adopted in this work for the generation of network topologies aligned with specific homophily conditions.

Many efforts have been made to understand and recreate information diffusion processes in OSNs [33, 7, 32]. To a wide extent, such efforts have their origins in studies seeking to simulate the dynamics of epidemic diseases in biological networks. Overall, this is because information (just like a virus) begins to spread from a set of seed nodes to the rest of the network at a certain diffusion rate [7, 32]. Different information diffusion models within the current literature have sought to recreate such a process making particular assumptions about its dynamics [7]. For example, progressive models like the Linear Threshold (LT) [51] assume that, once infected, nodes cannot switch back to their original non-infected state. Conversely, non-progressive models like the Susceptible Infected Susceptible (SIS) [42] consider the scenario in which an infected node can return to its initial condition and therefore be infected many times. Particularly, a progressive variation of this last one, the Susceptible Infected Recovered (SIR), has been applied in the context of the COVID-19 pandemic to simulate chains of contagion across communities [8].

Another approach widely applied to describe information diffusion processes in OSNs is the Independent Cascade (IC) [32, 51]. Let us consider an ego-network representation $G = (V, E)$ where the node set $V$ corresponds to the user's befriended contacts and $E$ to the connections existing between them. The IC is a non-progressive model in which information spreads from an initial set of infected nodes $S_0 \subseteq V$ to the rest of the network members like a domino [7]. For the case of sharing a post in an OSN, $S_0$ could be defined as a sub-group of friends who notice the user's post once published and are likely to re-share it. Hence, at each time step $t \geq 0$ an infected node $u \in V$ can pass the information to an inactive (i.e. non-infected) neighbour $v : (u, v) \in E$ with probability $p_{u,v}$. If successful, $v$ becomes infected at time step $t+1$; otherwise $v$ stays inactive and $u$ has no chance to infect it again. However, if $v$ has more than one infected neighbour, it will be approached by each of them independently. Such a contagion process continues to unfold until there are no more infection attempts to be triggered. In Section 5.1, the IC model is adopted and instantiated to i) evaluate the effectiveness of the predicted ACLs under information diffusion conditions, and ii) elaborate countermeasures to prevent unwanted data dissemination in OSNs.
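The contagion process described above can be sketched as follows. This is an illustrative reading of the IC dynamics rather than the implementation used in the experiments: the function name `independent_cascade`, the adjacency-dictionary input, and the `prob` callable (returning the infection probability for a given pair of nodes) are assumptions made for this example.

```python
import random


def independent_cascade(adj, seeds, prob, rng=None):
    """Run one Independent Cascade simulation.

    adj   : dict mapping each node to the set of its neighbours
    seeds : iterable with the initially infected nodes
    prob  : callable (u, v) -> probability that u infects v (assumed helper)
    Returns the set of nodes infected when the process stops.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    frontier = list(infected)  # nodes infected in the previous time step

    while frontier:
        newly_infected = []
        for u in frontier:
            # u gets exactly one attempt per inactive neighbour, because it
            # is part of the frontier in a single time step only.
            for v in adj[u]:
                if v not in infected and rng.random() < prob(u, v):
                    infected.add(v)
                    newly_infected.append(v)
        frontier = newly_infected  # stops when no attempts remain
    return infected


if __name__ == "__main__":
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    reached = independent_cascade(path, [0], lambda u, v: 0.5, random.Random(7))
    print(sorted(reached))
```

Because each run is stochastic, the quantity of interest (e.g. the share of infected nodes in a given group) is usually averaged over many repetitions.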

All in all, homophily is a process that may impact a network's structure and consequently impair the outcome of community-based ACPMs. This experiment aims to analyse the effect of homophily when generating ACLs from the emerging communities inside ego-networks. For this, such a community-based approach was put into practice in simulated network topologies under different homophily conditions. The following subsections describe the methodology employed in the experiment and the results obtained from its execution.

In this experiment, the simulation approach introduced in [15] is followed. As shown in Fig. 3, a community-based ACPM requires (i) the user's ego-network and (ii) her privacy preferences in order to generate a personalized ACL. Particularly, such preferences can be described through a small sample of untrusted friends. That is, some of those contacts who should be excluded from the audience of a certain piece of personal information. Therefore, this stage consists of a method for simulating ego-networks and a criterion for the selection of untrusted network members:

• Simulation of ego-networks: As discussed in Section 3.1, networks are generated through a homophily-driven preferential attachment model in which nodes are characterized with the attributes gender, workplace, and location. Particularly, nodes in the generated networks can adopt one of two values for the gender attribute, one of three values for workplace, and one of three values for location. Under this model, the attachment probability between two nodes is computed through an openness matrix describing the homophily conditions of the network. Basically, the values inside this matrix can range from 0 to 1 and represent the strength of attribute similarity in the linking process (e.g. a value Λ closer to zero describes a setting in which users located in York are less likely to connect with users living in Leeds). An extended description of this mechanism can be found in the Appendix.

• Selection of untrusted nodes: The selection of untrusted network members is guided by a hypothetical self-disclosure scenario in which a user (i.e. the ego of the simulated network) wishes to create an ACL to exclude her work colleagues from the audience of her publications. Hence, nodes sharing the user's workplace attribute value are selected from the generated ego-networks to represent the user's access-control preferences. Particularly, nodes with the highest degree are selected as these are often the most influential ones in the network. The corresponding ACL is then built out of the community that brings together the largest number of these untrusted nodes. For this, a community-detection algorithm (CDA) is employed to identify clusters of nodes inside the network under analysis.
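The selection and best-fit steps listed above can be sketched as follows. The community partition is taken as given (any CDA could produce it); the function names `sample_untrusted` and `recommend_acl`, the workplace labels, and the tie-breaking behaviour (first best community wins) are assumptions made for illustration only.

```python
def sample_untrusted(adj, workplace, target, k):
    """Pick the k highest-degree nodes whose workplace equals `target`.

    adj       : dict mapping each node to the set of its neighbours
    workplace : dict mapping each node to its workplace attribute value
    """
    candidates = [v for v in adj if workplace[v] == target]
    return sorted(candidates, key=lambda v: len(adj[v]), reverse=True)[:k]


def recommend_acl(communities, untrusted_sample):
    """Return the community containing the most sampled untrusted nodes.

    communities : list of sets of nodes (output of some CDA, assumed given)
    Ties are resolved in favour of the first such community (assumption).
    """
    sample = set(untrusted_sample)
    return max(communities, key=lambda c: len(sample & set(c)))


if __name__ == "__main__":
    # Toy ego-network with hypothetical workplace labels "A" and "B".
    adj = {1: {2, 3}, 2: {1}, 3: {1}, 4: {5}, 5: {4}}
    workplace = {1: "A", 2: "A", 3: "B", 4: "A", 5: "B"}
    sample = sample_untrusted(adj, workplace, "A", k=2)
    acl = recommend_acl([{1, 2, 3}, {4, 5}], sample)
    print(sample, sorted(acl))
```

Note that the recommended ACL may include nodes that do not share the targeted attribute value (node 3 in the toy example), which is precisely the source of false positives analysed in this experiment.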

The goal of this simulation experiment is to evaluate the performance of community-based ACLs under different homophily conditions. For this, a set of network topologies was generated out of different openness values and used thereafter to produce the corresponding ACLs. Since workplace is the attribute that drives the selection of untrusted nodes, three configurations were defined to simulate networks with a higher/lower degree of homophily.

These configurations differ only in the values assigned to three of the group-openness factors Λ, while the rest of the group-openness factors were set to one. These values were adopted from [15] as they lead to topologies with significant differences in structure and workplace homophily.

The simulation model described in the previous subsection was implemented using iGraph [9], a library for network analysis and visualization for R. A total of three ego-networks of 500 nodes each were simulated, each aligned with one of the homophily conditions 1, 2 and 3. In all cases, the attribute values of the nodes were assigned following the probability tree of Fig. 4. Likewise, 10 untrusted nodes were selected from each of the networks for the generation of the corresponding ACLs. Particularly, three different CDAs were applied to unveil clusters inside the simulated networks: Leading Eigenvector (LE) [43], Multilevel Community (MC) [6], and Links in Context (LiC) [21]. Both LE and MC were initially assessed in a prior study (cf. [15]). These methods rely solely on structural information to split the nodes of the network into a hierarchy of nested communities. Conversely, LiC also leverages the network's attributes to identify densely-connected groups of nodes [21]. Basically, LiC defines the context of each link in the network as the subset of attributes that are common to its endpoints. Thereby, it identifies communities of nodes through agglomerations of links sharing the same context. To the best of our knowledge, LiC has not yet been evaluated for the automatic generation of ACLs.

[...] and Λ), the nodes of the network tend to agglomerate around three big communities. Hence, the precision of the predicted ACLs (i.e. the percentage of untrusted nodes they contain) improves considerably from condition 1 to 2 and reaches its maximum in the homophily condition 3 (Table 1). Particularly, the highest precision in conditions 1 and 2 is achieved through the LE algorithm with 48.98% and 75.00%, respectively. In the case of condition 3, the best precision score corresponds to the MC algorithm with a value of 99.29%. On the other hand, the method's recall (i.e. the percentage of untrusted nodes in the network included in the generated ACL) also reaches its maximum value in condition 3 for all the clustering methods. In terms of execution time, MC and LE were able to identify clusters within seconds whereas LiC took more than 3 minutes in the best case (condition 3). With respect to the number of generated clusters, LiC unveiled more communities than LE and MC in all configurations. Particularly, LiC identified the largest number of clusters in condition 2 (530 clusters) whereas the smallest number corresponds to MC and LE in condition 3 (6 clusters each). As for the ACL size, the average number of clustered members was 124.67 ± 127.61 in condition 1, 53.33 ± 13.50 in condition 2, and 162.67 ± 18.82 for the homophily condition 3.
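For reference, the three metrics reported in this experiment can be computed from a predicted ACL and the full set of untrusted nodes as sketched below. The function name `acl_scores` and the convention of returning 0.0 for empty inputs are assumptions of this example; the definitions of precision and recall follow the ones given in the text, and F1 is their harmonic mean.

```python
def acl_scores(acl, untrusted):
    """Return (precision, recall, f1) of a predicted ACL.

    acl       : nodes placed in the predicted ACL
    untrusted : all nodes in the network that are actually untrusted
    """
    acl, untrusted = set(acl), set(untrusted)
    true_positives = len(acl & untrusted)

    # Precision: share of ACL members that are untrusted.
    precision = true_positives / len(acl) if acl else 0.0
    # Recall: share of untrusted nodes captured by the ACL.
    recall = true_positives / len(untrusted) if untrusted else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f1 = acl_scores(acl={1, 2, 3, 4}, untrusted={3, 4, 5})
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A low precision corresponds to many trusted contacts being blocked (false positives), whereas a low recall corresponds to untrusted contacts being left out of the ACL (false negatives).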

As discussed in Section 2.2, the re-sharing affordances of OSNs can impair the effectiveness of ACLs and violate, in turn, the privacy preferences of content-owners. This section introduces a simulation experiment for analysing the performance of community-based ACLs under different information diffusion settings. For this, the network topologies generated in Section 4 are infected using the IC model introduced in Section 3.2. Particularly, an ANOVA test is conducted to determine the effects of the network's homophily and the number of seed nodes on the percentage of infected nodes in the corresponding ACL. Furthermore, the effects of removing gatekeeper nodes are also evaluated and proposed as a means for counteracting unwanted data dissemination.

Fig. 5 illustrates the main building blocks of the proposed simulation approach. First, a number of the gatekeeper nodes with the highest degree are removed from the ego-network under analysis. Such nodes are trusted members linked to one or more nodes inside the corresponding ACL and may therefore forward the information to untrusted network segments. Next, a group of seed nodes is selected from outside the ACL and infected according to the premises of the IC model. Basically, this step recreates the situation in which the user's post reaches members of the trusted audience that are likely to re-share it. In principle, the IC model assumes that all infected nodes in the network can pass the information to their non-infected neighbours with the same probability. However, to a certain extent, the dynamics of information diffusion are related to homophily characteristics of OSNs [11]. Particularly, the flow of user-generated content across OSNs can be strongly influenced by attribute similarities among its users [47]. Therefore, the IC model was adjusted so that the probability with which an infected node spreads information to a non-infected neighbour is computed as:

where #attr corresponds to the number of attributes characterizing the network (3 in this case) and #shared to the number of attribute values shared between the infected node and its neighbour. The remaining parameter represents the maximum infection probability among all network members and can adopt values ranging between 0 and 1.
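As an illustration of how such a homophily-adjusted probability could be evaluated, the sketch below assumes, purely for the sake of example, that the maximum infection probability is scaled linearly by the fraction of attribute values that two nodes share. This linear form, the function name `infection_probability`, the parameter name `p_max`, and the attribute values used in the example are all assumptions and should not be read as the exact expression used in this work.

```python
def infection_probability(attrs_u, attrs_v, p_max):
    """Homophily-adjusted infection probability between two nodes.

    ASSUMPTION: probability = p_max * (#shared values / #attributes).
    attrs_u, attrs_v : dicts mapping attribute name -> value (same keys)
    p_max            : maximum infection probability, between 0 and 1
    """
    if not 0.0 <= p_max <= 1.0:
        raise ValueError("p_max must lie between 0 and 1")
    n_attributes = len(attrs_u)
    n_shared = sum(1 for name in attrs_u if attrs_u[name] == attrs_v.get(name))
    return p_max * n_shared / n_attributes


if __name__ == "__main__":
    # Hypothetical attribute values, chosen only for the example.
    alice = {"gender": "f", "workplace": "w1", "location": "York"}
    bob = {"gender": "m", "workplace": "w1", "location": "York"}
    print(infection_probability(alice, bob, p_max=0.6))
```

Under this assumed form, two nodes sharing all three attribute values reach the maximum probability, whereas nodes with nothing in common never pass information to each other.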

The effects of homophily along with the number of seed nodes and removed gatekeeper nodes were analysed in a 3x2x3 factorial experimental design. Particularly, 0, 1/3 or 2/3 of the gatekeeper nodes were removed from the networks corresponding to the homophily configurations 1, 2 and 3. In addition, these networks were infected with 75 or 150 seed nodes as described in Section 5.1. In all cases, the maximum infection probability was set to 0.6 to represent a tendency of the network members towards information re-sharing. Likewise, all ACLs were generated using the LE algorithm since it was the one with the best overall precision in Simulation 1. The IC model was executed 100 times for each of the 3x2x3=18 experimental conditions, resulting in 1800 simulation runs.
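The gatekeeper-removal step of this design can be sketched as follows. Gatekeepers are taken here to be nodes outside the ACL with at least one neighbour inside it, and the highest-degree ones are removed first, as described in the text. The function names, the adjacency-dictionary representation, and the use of `round` to turn a fraction into a node count are assumptions of this example.

```python
def find_gatekeepers(adj, acl):
    """Trusted nodes (outside the ACL) linked to at least one ACL member."""
    acl = set(acl)
    return [v for v in adj if v not in acl and adj[v] & acl]


def remove_top_gatekeepers(adj, acl, fraction):
    """Remove the given fraction of gatekeepers, highest degree first.

    Returns (pruned adjacency dict, set of removed nodes).
    ASSUMPTION: the number of removed nodes is round(fraction * #gatekeepers).
    """
    ranked = sorted(find_gatekeepers(adj, acl),
                    key=lambda v: len(adj[v]), reverse=True)
    removed = set(ranked[:round(fraction * len(ranked))])
    pruned = {v: nbrs - removed for v, nbrs in adj.items() if v not in removed}
    return pruned, removed


if __name__ == "__main__":
    # Nodes 0 and 1 form the ACL; nodes 2 and 3 are gatekeepers.
    adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3, 4}, 3: {1, 2}, 4: {2}}
    pruned, removed = remove_top_gatekeepers(adj, acl={0, 1}, fraction=0.5)
    print("removed:", removed)
    print("pruned:", pruned)
```

The pruned network can then be handed to a diffusion simulation to measure how many ACL members are still reached from seeds located outside the ACL.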

An ANOVA test was conducted on the simulation results to determine the effects of i) homophily, ii) diffusion seeds, and iii) removed gatekeepers on the percentage of infected ACL members. At first, a Levene's test determined that the dataset resulting from the simulation outputs did not meet ANOVA's assumption of homogeneity of variance. However, this condition can be relaxed for large samples (N ≥ 30) and when the number of observations is equally distributed (or nearly so) across the different factors and levels [49]. Since these conditions are met by the generated data set (i.e. N=1800 and 100 observations per level), the ANOVA analysis was carried out accordingly. As summarized in Table 2, all of the main factors were significant for the percentage of infected ACL members after the execution of the IC model ($p < 0.001$). Particularly, homophily yielded an effect size of $\eta^2 = 0.080$, indicating that around 8% of the variance in the percentage of infected ACL nodes was explained by this factor ($F_{2,1782} = 77.622$, $p = 0.000$). Likewise, 3.9% of such a variance was explained by the diffusion seeds ($F_{1,1782} = 72.731$, $p = 0.000$) and 77.8% by the removed gatekeepers ($F_{2,1782} = 3123.266$, $p = 0.000$). A significant interaction effect of 0.008 was observed between homophily and diffusion seeds ($F_{2,1782} = 7.111$, $p = 0.001$). Likewise, a significant effect of 0.184 resulted from the interaction between homophily and removed gatekeepers ($F_{4,1782} = 100.131$, $p = 0.000$). However, no significant effects were observed from the interaction of all three factors altogether ($F_{4,1782} = 2.014$, $p = 0.090$).
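The reported effect sizes can be cross-checked against the F statistics and degrees of freedom given above. The sketch below assumes that the effect sizes are partial eta squared values, an inference that is not stated explicitly in the text but which matches the reported figures to three decimals; the function name and the dictionary of effects are introduced only for this example.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic:
    eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# 18 experimental cells x 100 runs leave 1800 - 18 error degrees of freedom.
DF_ERROR = 3 * 2 * 3 * 100 - 3 * 2 * 3

# (F value, effect degrees of freedom) as reported in the text.
REPORTED = {
    "homophily": (77.622, 2),
    "diffusion seeds": (72.731, 1),
    "removed gatekeepers": (3123.266, 2),
    "homophily x seeds": (7.111, 2),
    "homophily x gatekeepers": (100.131, 4),
}

if __name__ == "__main__":
    for effect, (f_value, df_effect) in REPORTED.items():
        size = partial_eta_squared(f_value, df_effect, DF_ERROR)
        print(f"{effect}: {size:.3f}")
```

Since partial eta squared values are computed per effect, they are not expected to add up to one across factors.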

To further examine the significant main effects, a Tukey HSD post-hoc test was conducted (Table 3). In particular, an average drop of 30.95% in the infected ACL nodes was observed after removing 1/3 of the gatekeeper nodes and of 47.05% after removing 2/3 ($p < 0.001$). Likewise, the average difference of -16.10% between the 2/3 and 1/3 gatekeeper removal conditions was also found significant ($p < 0.001$). On the other hand, the percentage of infected ACL members increases by 6% on average from homophily condition 1 to 2 and decreases by about 6.96% from 2 to 3 ($p < 0.001$). However, no significant differences were observed between homophily conditions 3 and 1.

Several remarks can be drawn from the results of the simulation experiments. On the one hand, the performance of community-based ACPMs is closely related to the homophily of the network under analysis and to the attribute values that are critical for the ACL configuration. This can be observed in the results of Simulation 1, where the precision of the predicted ACL becomes higher as the homophily in workplace increases. In particular, LE was the method with the highest overall precision, making it more suitable than MC and LiC for cases in which the cost of false positives (i.e. classifying trusted members as untrusted) is deemed high. Conversely, the recall and F1 values obtained in this experiment suggest that LiC is better suited to cases where the penalty of false negatives (i.e. classifying untrusted members as trusted) is larger. In part, this is explained by the number and size of the communities generated by this method, which are much larger than those of MC and LE. Nonetheless, the recall and F1 scores of LE are close to those of LiC in both configurations 2 and 3, making it also a good approach in such a case, with the advantage of being computationally more efficient.
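The trade-off between precision and recall discussed above can be illustrated with a small example. This is a minimal sketch using the convention stated in the text (the positive class is "untrusted"); the function name and the toy node labels are hypothetical and do not come from the simulations.

```python
def acl_scores(predicted_untrusted: set, actual_untrusted: set) -> dict:
    """Precision, recall and F1 of a predicted ACL, with 'untrusted' as the positive class."""
    true_pos = len(predicted_untrusted & actual_untrusted)
    false_pos = len(predicted_untrusted - actual_untrusted)  # trusted flagged as untrusted
    false_neg = len(actual_untrusted - predicted_untrusted)  # untrusted left as trusted
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

actual = {"a", "b", "c", "d"}

# A small predicted community: no false positives, but half the untrusted nodes are missed.
small_community = acl_scores({"a", "b"}, actual)

# A large predicted community: every untrusted node is covered, plus two trusted ones.
large_community = acl_scores({"a", "b", "c", "d", "e", "f"}, actual)

print(small_community)
print(large_community)
```

In this toy case the smaller community maximizes precision while the larger one maximizes recall, which mirrors the qualitative pattern reported for LE and LiC.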

Overall, the F1 values of LiC show that its loss in terms of precision is not so large. However, this metric, along with the method's recall, did not improve from configuration 1 to 2 (something that did happen with MC and LE). Furthermore, LiC produced many more false positives than MC or LE, making it less suitable for cases in which Type I errors are considered critical. In part, this may be related to the approach followed by this method when partitioning the network space into overlapping communities. In particular, LiC determines the optimal cut level of the resulting dendrogram through a measure of its partition density, which tends to divide the network into many small communities. Moreover, such communities may in turn be included in larger ones, which is not the case for LE or MC. This has a direct impact on the detection of the community that best fits the user's privacy preferences in terms of size and composition. Hence, this aspect should be further investigated and adjusted so that the LiC algorithm becomes more suitable for access-control prediction.

On the other hand, the results of Simulation 2 suggest that homophily can significantly impact the performance of community-based ACLs under information diffusion conditions. Likewise, this performance is also affected by the number of seed nodes spreading the information across the network at $t = 0$. Nevertheless, the largest effect size and the highest drop in the percentage of infected ACL members are given by the number of gatekeeper nodes removed from the network. It is to be expected that as more gatekeeper nodes are removed, fewer infections will be observed inside the corresponding ACL. However, this also has a negative impact on the extent of the trusted audience since gatekeepers are, after all, trusted network members. Furthermore, as observed in Fig. 6, some trusted nodes become disconnected from the network as a consequence of applying such a removal strategy. Hence, this collateral damage should be minimized so that the influence range of the disclosed content is not significantly reduced. In particular, these side effects should be taken explicitly into consideration when elaborating an adequate gatekeeper removal criterion.
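The side effect described above can be made concrete on a toy graph. The sketch below rests on two assumptions that are not confirmed by this section: a gatekeeper is taken to be a trusted node with at least one untrusted neighbour, and structural reachability from the seed nodes stands in for the probabilistic IC diffusion. All names and the example graph are hypothetical.

```python
from collections import deque

def find_gatekeepers(graph: dict, untrusted: set) -> set:
    """Assumed definition: trusted nodes with at least one untrusted neighbour."""
    return {node for node, nbrs in graph.items() if node not in untrusted and nbrs & untrusted}

def remove_nodes(graph: dict, removed: set) -> dict:
    """Copy of the graph without the removed nodes and their incident edges."""
    return {node: nbrs - removed for node, nbrs in graph.items() if node not in removed}

def reachable(graph: dict, sources: set) -> set:
    """Nodes reachable from the sources via breadth-first search."""
    seen = {s for s in sources if s in graph}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

# Toy undirected graph: "g" bridges the trusted part to the untrusted node "u",
# but it is also the only path to the trusted node "b".
graph = {"s": {"a"}, "a": {"s", "g"}, "g": {"a", "b", "u"}, "b": {"g"}, "u": {"g"}}
untrusted, seeds = {"u"}, {"s"}

gatekeepers = find_gatekeepers(graph, untrusted)
pruned = remove_nodes(graph, gatekeepers)

exposed_before = reachable(graph, seeds) & untrusted
exposed_after = reachable(pruned, seeds) & untrusted
cut_off_trusted = (set(pruned) - untrusted) - reachable(pruned, seeds)

print(gatekeepers, exposed_before, exposed_after, cut_off_trusted)
```

Removing the gatekeeper stops the content from reaching the untrusted node, but it also cuts off a trusted one; a removal criterion would need to weigh both quantities.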

To a certain extent, the results yielded by the simulation experiments of this work are subject to limitations. In particular, such results may be affected by the set-up parameters of the simulation models as well as by the size and attribute characterization of the generated networks. On the one hand, the average size of ego-networks can differ largely from one OSN to another. For instance, it was estimated that around 38.35% of Facebook users in the United States had between 200 and 500 friends by 2016 (footnote 3). However, this number can easily reach the order of thousands in other OSNs like Instagram (footnote 4). On the other hand, the type of attributes proposed in Section 4.2 as well as the distribution of their values may not necessarily represent actual OSN conditions. Instead, it is to be expected that certain attribute values will prevail over others and may not be equally distributed across the network members [58]. Finally, the probability of influence among OSN users is subject to factors that go beyond attribute similarity. For instance, it can be affected by individual relations, time, and network centrality [34]. Hence, further experimental studies should closely consider these factors in order to characterize the IC model more adequately.

The experimental results of this work provide valuable insights into the role of homophily and information diffusion when applying community-based ACPMs. In particular, homophily was shown to be a critical aspect for the generation of adequate ACL configurations through community-detection methods. In this sense, it is of paramount importance to have a good understanding of the homophily conditions of the network under analysis before applying clustering methods for ACL prediction. On the other hand, avoiding unwanted data dissemination remains an open challenge for access-control policies in OSNs. In principle, this issue can be mitigated by excluding a percentage of gatekeeper nodes from the members of the trusted audience. Nevertheless, side effects on the utility of information disclosure should be taken into consideration when applying such a countermeasure. To this end, a removal criterion for gatekeeper nodes should seek to maximize the influence of the shared information while minimizing the chances of its propagation towards untrusted network segments.

Multiple questions and future research directions arise from the results of this work. One of them is to conduct a study of community-based ACPMs under actual network conditions. This includes an analysis of attribute similarity and influence processes in OSNs for an adequate characterization of the simulation models. Likewise, evaluating other community detection approaches for access-control prediction will be a matter of future investigation. On the other hand, we expect to explore the models and principles presented in this work in areas outside privacy and security. In particular, the removal of gatekeeper nodes under certain homophily conditions could be applied to the design of public health policies, for instance, in elaborating social distancing strategies to control the spread of the COVID-19 pandemic. In such a case, more specific models for the simulation of epidemic diseases will be investigated and evaluated along with the principles of gatekeeper removal discussed throughout this work.

As introduced in Section 3.1, Kim et al. [28] defined a group-openness mechanism to investigate the role of homophily in the evolution of scale-free networks. In particular, this approach considers that nodes sharing a particular attribute value are members of a common group $g$. Then, the group-openness factor $\Lambda_{g_i g_j}$ between two groups $g_i$ and $g_j$ is defined as:

where the homophily index $\Lambda_{g_i g_j}$ is a real number between 0 and 1 [29]. In particular, a value of $\Lambda_{g_i g_j} = 0$ describes the case in which nodes in $g_i$ are completely reluctant to create ties with members of group $g_j$. Conversely, for $\Lambda_{g_i g_j} = 1$, members of $g_i$ connect openly with nodes outside their group. Such $\Lambda$ values can be employed to define how strong (or weak) the role of homophily is between two nodes in terms of group membership. As mentioned in Section 4, the nodes of the networks analysed in this work are characterized by the attributes gender, workplace and location. In particular, a node can adopt the values male or female for the "gender" attribute, Starbucks, Google, or Ikea for "workplace", and York or Leeds for "location". Therefore, each network consists of 7 groups, making it possible to define up to $C_{7,2} = \frac{7!}{2!(7-2)!} = 21$ group-openness factors. For instance, one such factor can specify how open (or closed) people from one group are to connecting with people from another. Likewise, another factor can describe how likely workers from one group are to create links with members of another. In sum, the information concerning all homophily factors of a network can be expressed through a group-openness matrix as shown in Fig. 7. Consequently, the total homophily factor between a node $i$ and another node $j$ can be defined as:
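The group count and the number of group-openness factors can be verified directly. The following is a minimal sketch: the attribute values are those listed in the text, whereas the dictionary-based matrix and the placeholder openness value of 1.0 are illustrative assumptions and do not reproduce Fig. 7.

```python
from itertools import combinations
from math import comb

# Attribute values as listed in the text.
attributes = {
    "gender": ["male", "female"],
    "workplace": ["Starbucks", "Google", "Ikea"],
    "location": ["York", "Leeds"],
}

# Each attribute value defines one group.
groups = [value for values in attributes.values() for value in values]

# One openness factor per unordered pair of distinct groups.
group_pairs = list(combinations(groups, 2))

# Hypothetical symmetric matrix, keyed by unordered pair; 1.0 is a placeholder value.
openness = {frozenset(pair): 1.0 for pair in group_pairs}

print(len(groups), len(group_pairs), comb(7, 2))
```

Keying the matrix by `frozenset` makes lookups independent of the order in which the two groups are given.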

(2)

where $G_i$ and $G_j$ are the sets of groups to which nodes $i$ and $j$ belong, respectively.

In the Barabási-Albert model, the probability that a node receives a new connection depends exclusively on its degree. However, Eq. 2 can be used to introduce the role of attribute similarity in the estimation of the attachment probability between network members. Consequently, the probability $\Pi$ that a new node $i$ of group-set $G_i$ is linked to node $j$ of group-set $G_j$ is defined as:

(3)

where $k_j$ is the degree of node $j$ from group-set $G_j$, and $\mathcal{H}_{G_i G_j}$ represents the total homophily factor between the group-set $G_i$ of the new node and the group-set $G_j$ of node $j$. This attachment rule describes the case in which a node without any connections is incorporated into the network. Nevertheless, new links may also emerge between existing network members over time. In particular, the probability $\Pi$ that an existing node links to another existing one is defined as:

where $k_i$ and $k_j$ are the degrees of node $i$ and of node $j$ respectively, and $\mathcal{H}_{G_i G_j}$ is the total homophily factor between their corresponding group-sets $G_i$ and $G_j$. This attachment rule together with the one of Eq. 3 were used to generate the network topologies analysed in this paper.
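To illustrate how degree and homophily could be combined in an attachment rule for a new node, the sketch below makes two explicit assumptions that are not confirmed by the text: the total homophily factor is taken as the product of the group-openness factors over all pairs of distinct groups drawn from the two group-sets, and the attachment probability is taken as degree times total homophily, normalized over all candidate nodes. Function names, the default openness of 1.0, and the example values are hypothetical.

```python
from itertools import product

def total_homophily(groups_i: set, groups_j: set, openness: dict, default: float = 1.0) -> float:
    """ASSUMED form: product of openness factors over all cross pairs of distinct groups."""
    factor = 1.0
    for g, h in product(groups_i, groups_j):
        if g != h:  # members of the same group are treated as fully open to each other
            factor *= openness.get(frozenset((g, h)), default)
    return factor

def attachment_probabilities(new_groups: set, nodes: dict, openness: dict) -> dict:
    """ASSUMED form: probability proportional to degree times total homophily."""
    weights = {
        name: degree * total_homophily(new_groups, groups, openness)
        for name, (degree, groups) in nodes.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no candidate node has a positive attachment weight")
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical example: two existing nodes of equal degree in different locations.
openness = {frozenset(("York", "Leeds")): 0.5}
nodes = {"a": (2, {"York"}), "b": (2, {"Leeds"})}

probabilities = attachment_probabilities({"York"}, nodes, openness)
print(probabilities)
```

With equal degrees, the new node from York is twice as likely to attach to the York node as to the Leeds node under these assumptions; with all openness factors set to 1, the rule reduces to pure degree-based preferential attachment.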

Information and friend segregation for online social networks: A user study

Reconciling privacy and utility in continuous-time diffusion networks

Scale-free networks: A decade and beyond

Network Science

Emergence of Scaling in Random Networks

Fast unfolding of communities in large networks

Study on information diffusion analysis in social networks and its applications

A SIR model assumption for the spread of COVID-19 in different communities

The igraph software package for complex network research. InterJournal Complex Systems

Generation models for scale-free networks

Analyzing the Dynamics of Communication in Online Social Networks

Discovering homophily in online social networks

Effectiveness of diffusing information through a social network in multiple phases

At Your Own Risk: Shaping Privacy Heuristics for Online Self-disclosure

Access-Control Prediction in Social Network Sites: Examining the Role of Homophily

A novel trust model based overlapping community detection algorithm for social networks

Ppm: A privacy prediction model for online social networks

Circles, posts and privacy in egocentric social networks: An exploratory visualization approach

Friends and circles-a design study for contact management in egocentric online social networks

Enhanced models for privacy and utility in continuous-time diffusion networks

Links in context: Detecting and describing the nested structure of communities in node-attributed networks

Analyzing and optimizing access control choice architectures in online social networks

A survey on interdependent privacy

Blocking Adversarial Influence in Social Networks

The structure of communities in scale-free networks

Online Social Networks Evolution Model Based on Homophily and Preferential Attachment

Detecting privacy preferences from online social footprints: A literature review

Effect of homophily on network formation

The Impact of the Subgroup Structure on the Evolution of Networks: An Economic Model of Network Evolution

Mastering the challenge of balancing self-disclosure and privacy in social media

Forecasting participants of information diffusion on social networks with its applications

Social influence analysis: Models, methods, and evaluation

A survey on information diffusion in online social networks: Models and methods

Influence maximization on social graphs: A survey

Birds of a Feather: Homophily in Social Networks

Cost-effective online trending topic detection and popularity prediction in microblogging

PACMAN: Personal Agent for Access Control in Social Media

React: Recommending access control decisions to social media users

Non-sharing communities? An empirical study of community detection for access control decisions

Moving beyond set-it-and-forget-it privacy settings on social media

The Potential for User-Tailored Privacy on Facebook

The structure and function of complex networks

Finding community structure in networks using the eigenvectors of matrices

An empirical study on user access control in online social networks

The Future of Online Social Networks (OSN): A Measurement Analysis Using Social Media Tools and Application

Influence maximization diffusion models based on engagement and activeness on instagram

Homophily-driven evolution increases the diffusion accuracy in social networks

Learning to share: Engineering adaptive decision-support for online social networks

Analysis of variance: The fundamental concepts

Selectivity in posting on social networks: The role of privacy concerns, social capital, and technical literacy

The independent cascade and linear threshold models, in: Diffusion in Social Networks

A smart access control method for online social networks based on support vector machine

Vegas: Visual influence graph summarization on citation networks

Identifying hidden social circles for advanced privacy configuration

Trust decision-making in online social communities: A network-based model

A novel trust-based access control for social networks using fuzzy systems

The Impact of Context Collapse and Privacy on Social Network Site Disclosures

The Length of Bridge Ties: Structural and Geographic Properties of Online Social Interactions

Modeling Self-Disclosure in Social Networking Sites

Rumor blocking through online link deletion on social networks

My friend leaks my privacy: Modeling and analyzing privacy in social networks

Business location selection based on geo-social networks

Data-driven efficient network and surveillance-based immunization

This work was partially supported by the H2020 European Project No. 787034 "PDP4E: Privacy and Data Protection Methods for Engineering" and Canada's Natural Sciences and Engineering Research Council (NSERC).