
Influence Maximization in Network by Genetic Algorithm on Linear Threshold Model

Arthur Rodrigues da Silva1,2, Rodrigo Ferreira Rodrigues1,2, Vinícius da Fonseca Vieira1,2, and Carolina Ribeiro Xavier1,2(B)

1 Department of Computer Science, Universidade Federal de Sao Joao Del Rei - UFSJ, Sao Joao Del Rei, Brazil
[email protected]

2 Graduate Program in Computer Science, Universidade Federal de Sao Joao Del Rei - UFSJ, Sao Joao Del Rei, Brazil

Abstract. The problem of maximum influence on a network consists in the search for a subset of k vertices, called seeds, which, when activated, are able to influence as many elements as possible, considering a model to simulate the propagation of influence in a network. This paper proposes a Genetic Algorithm to optimize the selection of seeds for the Linear Threshold Model (LTM), a widely adopted simulation model for influence propagation, by investigating different strategies for initial population configurations based on high centrality nodes. The results obtained by the application of the proposed methodology to the Linear Threshold Model considering real world networks show significant improvements on the convergence of the algorithm.

Keywords: Maximum influence · Genetic algorithm · Linear Threshold Model · Centrality

1 Introduction

According to [5], the maximum influence problem is the optimization problem that consists in finding a group of k vertices on a network such that, when these vertices are activated and a propagation model is applied, the global influence is maximized. The most common application of propagation models is in viral marketing on on-line social networks, where the goal is to find a small subset of influential people who are able to attract the attention of the other individuals to posts and publications about new products, services and ideas.

A common approach to study the influence maximization problem is to model the subject of study as a graph, considering each person as a vertex and their personal relationships as the edges of this vertex. Propagation models, like the Linear Threshold Model, allow us to study the propagation of influence in a controlled environment. Thus, considering a propagation model, an analyst is able to investigate the level to which a set of vertices is able to influence or activate the other vertices of the network.

© Springer International Publishing AG, part of Springer Nature 2018. O. Gervasi et al. (Eds.): ICCSA 2018, LNCS 10960, pp. 96–109, 2018. https://doi.org/10.1007/978-3-319-95162-1_7

According to [9], the definition of the optimal subset of k individuals for the influence optimization problem on a graph is NP-hard. Thus, many heuristics to search for a good set of vertices have been proposed and considered in the literature. Among these strategies, this work highlights the effectiveness of genetic algorithms and proposes a genetic algorithm for the influence maximization problem.

In order to maximize the influence using a genetic algorithm as a method of seed selection, this work proposes the perturbation of the initial population through a single individual containing the k vertices of greatest centrality within the network, rather than considering purely random initial populations. Thus, the analysis of how much these vertices, considering different strategies, can impact the final solution of the GA, and the investigation of the structural properties of the vertices that correspond to the solution, are important steps of the methodology of this work.

Kempe et al. [8] propose a greedy algorithm in order to outperform heuristics based on seed selection by centrality. With their strategy they obtain a solution that is provably within 63% of optimal for several classes of models. However, this algorithm depends on the calculation of the propagation of influence.

Chen et al. [5] demonstrate that calculating the influence in directed acyclic graphs (DAGs) can be performed in time linear in the size of the graphs. Based on this, they proposed a scalable algorithm (LDAG) for the Linear Threshold Model and applied it to networks with millions of edges and vertices, where it is faster than other algorithms in the literature.

Bucur and Iacca [4] demonstrate in their work that the use of genetic algorithms on the influence maximization problem can offer solutions with a high level of influence, comparable to those found by other heuristics and viable in terms of runtime, without the need for any assumptions about the network. They use the independent cascade model as a fitness function.

The remainder of the work is organized as follows. First, the centrality measures used in this work (Sect. 2) are presented and the LTM influence model (Sect. 3) is detailed, followed by GA fundamentals (Sect. 4). Section 5 describes the proposed methodology for the selection of optimal seeds and the propagation model considered in this work. Finally, in Sects. 6 and 7, the results, discussions and conclusions of the work are presented.

2 Background

According to Barabasi [1], a network is a catalog of a system's components, often called nodes or vertices, and the direct interactions between them, called edges. Newman [12] explains that networks (also called graphs in the mathematical literature) abound in the world. Examples include the Internet, the World Wide Web, social networks of acquaintance or other connections between individuals, organizational networks and networks of business relations between companies, neural networks, metabolic networks, food webs, distribution networks such as blood vessels, and others.


A network contains two basic characteristics:

– Nodes: each node represents an entity of what the network represents, and the total number of nodes defines the size of the network.

– Links: represent the connections between the nodes.

Links on a network can be directed or undirected. In the case of networks that model the Internet, considering that each vertex is a website and each edge is a link between these sites, a site A may contain a link to a site B while B does not necessarily have a link to A, so it becomes more appropriate to represent it as a directed network. Other systems have undirected links, like transmission lines on the power grid, on which the electric current can flow in both directions [1].

2.1 Communities

The nodes of a network can be organized into communities. A community can be defined as a group of nodes that are more likely to connect to one another than to nodes in other communities. In other words, communities are groups of nodes that have more connections to each other than to nodes that do not belong to the group. Some examples are workplaces, circles of friends, a group of individuals who pursue the same hobby together, or individuals living in the same neighborhood [1].

Among the many algorithms proposed in the literature for community detection, the Multilevel algorithm [3] stands out due to its capability of finding good community structures in reasonable time, running quickly on large networks (millions of vertices). Its operation is divided into two phases that are repeated iteratively. First, each vertex of the network is considered separately as a member of its own community, so that there are as many communities as there are vertices. Then, for each vertex, the algorithm investigates whether its insertion in the community of one of its neighbor vertices increases a measure called modularity, frequently used for the assessment of the quality of community structures. The vertex is then transferred to the neighbor community that maximizes the modularity, or it stays in its current community. The process continues until modularity stops improving. According to [3], the order in which the vertices are analyzed has no significant impact on the obtained modularity.
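As an illustration of the quantity that the vertex-moving phase tries to increase, the sketch below computes the modularity of a given partition of an undirected, unweighted graph. It is not the Multilevel algorithm itself; the function name `modularity`, the edge-list representation and the toy graph are assumptions made only for this example.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to a community label.
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of the degrees of the community's vertices
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Q = sum over communities of (fraction of internal edges)
    #     minus (expected fraction under random rewiring).
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
singletons = {v: v for v in range(6)}  # starting point: one community per vertex
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

q_start = modularity(edges, singletons)
q_grouped = modularity(edges, two_groups)
```

On this toy graph, grouping each triangle into one community yields a higher modularity than the all-singletons starting partition, which is the kind of improvement the first phase searches for one vertex at a time.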

2.2 Centrality

Centrality can reflect the importance of a node in the network. Some important metrics used in this study are:

– Degree: When a node has a large number of edges connected to it, it is said to have a high degree. This degree can be measured in different ways depending on the network. If the network is undirected, the degree of a node is given by the number of edges connected to it. If the network is directed, the degree is measured separately. Each vertex has an output degree (outdegree), given by the number of edges starting from this vertex, and an input degree (indegree), given by the number of edges that reach it, or even the total degree, which takes into account the input and output edges. In this work the output degree will be used.

– Closeness: It is based on the average distance between a vertex and all the others in the graph, using shortest paths. The closer a vertex is to the others, the more central it is [2].

– Betweenness: This centrality measures how many times a vertex is part of the shortest path between two vertices of the graph. So, if a vertex has a high probability of being part of a shortest path between two other vertices chosen at random, it has a high betweenness value [6].

– PageRank: In this measure each vertex has a value, calculated according to the number of input edges that it has; the larger the value, the more important the vertex is. When a vertex has an output edge to another vertex, it adds value to this other vertex in proportion to the value that it has divided by its number of output edges. Thus, vertices that have many input edges tend to have greater importance values, as do vertices that receive edges from other important vertices [13].
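Two of the measures above can be sketched in a few lines of plain Python. The following is a minimal illustration, assuming the graph is stored as a dictionary `succ` from each vertex to the list of its successors; the damping factor of 0.85, the fixed number of iterations and the uniform treatment of vertices without output edges are assumptions of this example, not parameters reported in this work. Closeness and betweenness require shortest-path computations and are left out.

```python
def out_degree(succ):
    """Output degree of every vertex (number of edges leaving it)."""
    return {v: len(targets) for v, targets in succ.items()}

def pagerank(succ, damping=0.85, iterations=100):
    """Power-iteration PageRank; damping and iterations are assumed values."""
    n = len(succ)
    rank = {v: 1.0 / n for v in succ}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in succ}
        for v, targets in succ.items():
            if targets:
                # A vertex passes its value on, divided by its number of output edges.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a vertex without output edges spreads its value uniformly.
                share = damping * rank[v] / n
                for t in succ:
                    new_rank[t] += share
        rank = new_rank
    return rank

def top_k(scores, k):
    """The k best-ranked vertices, ties broken by vertex label."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]

# Hypothetical toy graph.
succ = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(succ)
best_by_degree = top_k(out_degree(succ), 1)
best_by_pagerank = top_k(ranks, 1)
```

In this toy graph the two measures disagree: "a" has the largest output degree, while "c" receives the most incoming value and therefore the highest PageRank.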

3 Linear Threshold Model (LTM)

According to [15], the Linear Threshold Model is an influence model in which each edge e ∈ E has a non-negative weight and, for each vertex v ∈ V, the sum of the weights of its input edges must be less than or equal to one. Therefore, writing w(e) for the weight of an edge e incident to v and summing over all the edges that reach the vertex v, we have:

$$\sum_{e \in v} w(e) \leq 1. \tag{1}$$

There is also a threshold θ associated with each vertex v, which must be defined. This threshold is drawn uniformly at random, with a value between zero and one for each vertex. Given a graph with these characteristics and an initial set of active vertices, at each iteration t a previously inactive vertex can be activated at iteration t + 1 if the sum of the weights of its incoming edges starting from already active vertices is greater than or equal to the threshold of the vertex.

In this way, the set of vertices activated, or influenced, by the initial set is defined when, after an undetermined number of iterations, the total number of active vertices no longer increases. Figure 1 illustrates this process for three iterations.

In Fig. 1, t = 0 shows the initial set of active vertices (in gray). In t = 1, the vertices connected to the initial set become active if the influence exerted by the set of active vertices is sufficient. Finally, t = 2 shows the activation of the vertices influenced by the set of vertices active at t = 1.


Fig. 1. Representation of the operation of the LTM at three instants of time, represented by t.
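The propagation rule described in this section can be sketched as a synchronous simulation. This is a minimal illustration under stated assumptions: the graph is given as a dictionary `in_neighbors` of incoming neighbors, edge weights are stored in a dictionary keyed by `(source, target)`, and all function names and the toy graph are hypothetical. The example uses fixed thresholds so that the outcome is reproducible; `random_thresholds` shows how they would be drawn at random.

```python
import random

def random_thresholds(vertices, rng=random):
    """Draw one threshold per vertex uniformly at random from [0, 1)."""
    return {v: rng.random() for v in vertices}

def simulate_ltm(in_neighbors, weight, thresholds, seeds):
    """Run the Linear Threshold Model until no new vertex is activated."""
    active = set(seeds)
    while True:
        newly_active = set()
        for v, neighbors in in_neighbors.items():
            if v in active:
                continue
            # Sum of the weights of incoming edges that start at active vertices.
            influence = sum(weight[(u, v)] for u in neighbors if u in active)
            if influence >= thresholds[v]:
                newly_active.add(v)
        if not newly_active:
            return active
        active |= newly_active

# Hypothetical toy graph; the incoming weights of each vertex sum to at most one.
in_neighbors = {1: [], 2: [1], 3: [1, 2], 4: [3], 5: [4]}
weight = {(1, 2): 1.0, (1, 3): 0.5, (2, 3): 0.5, (3, 4): 1.0, (4, 5): 0.4}
thresholds = {1: 0.5, 2: 0.9, 3: 0.8, 4: 0.3, 5: 0.7}

spread = simulate_ltm(in_neighbors, weight, thresholds, seeds={1})
```

Starting from vertex 1, vertex 2 is activated first, then vertex 3 (once both of its neighbors are active), then vertex 4; vertex 5 stays inactive because its incoming weight of 0.4 is below its threshold of 0.7.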

4 Genetic Algorithm

According to [7], Evolutionary Algorithms are computer programs that solve complex problems by imitating the process of evolution described by Darwin. In this way, the individuals of a population compete against each other. Thus, it is expected that individuals evolve and find better solutions than those found previously, until the goal or some stop criterion is reached.

In this way, each individual of the population represents a candidate solution for the problem, and each individual must be evaluated by a fitness function, which defines the quality of the solution that the individual represents.

According to [11], a Genetic Algorithm (GA) is a kind of Evolutionary Algorithm, and even though there are several variations of GAs, four characteristics are common to all of them: chromosome populations, fitness selection, crossover and mutation.

4.1 Chromosome Populations

A chromosome population is the name given to the set of candidate solutions. Each candidate solution (chromosome) is a numerical sequence (genes) [11].

4.2 Fitness Selection

Fitness is the name given to the value of the evaluation function, which is also used to select the population members for reproduction. This function makes it possible to check how well a candidate solution fits the problem. Through it, the best members of each population can be identified [11].

4.3 Crossover

A crossover is an operation in which, given two individuals selected as parents, a criterion is established so that the characteristics of these two individuals are present in the individuals generated by the operation [11].


For the crossover of individuals represented by numerical or binary chains, n-point crossover can be used. In this type of crossover, one or more random points are chosen in the chain for the generation of subsequences. These subsequences are linked, containing parts of the selected individuals, forming two new individuals called sons. Figure 2 illustrates one-point crossover.

Fig. 2. One-point crossover, where the parents (on the left side) generate two sons (on the right side) containing parts of their genes.

Crossover is responsible for search intensification, because it explores solutions close to existing solutions.
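One-point crossover on list-encoded individuals can be sketched as follows. The function name and the choice of a cut point that leaves at least one gene on each side are assumptions of this example. Note that when genes are vertex identifiers, a son may end up containing the same vertex twice; how such repeats are handled is not specified here, so the sketch leaves them untouched.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parents at the same random point and swap the tails."""
    # The cut point is chosen so that each side keeps at least one gene.
    point = rng.randint(1, len(parent_a) - 1)
    son_a = parent_a[:point] + parent_b[point:]
    son_b = parent_b[:point] + parent_a[point:]
    return son_a, son_b

# Hypothetical parents with five genes each.
parent_a = [1, 2, 3, 4, 5]
parent_b = [6, 7, 8, 9, 10]
son_a, son_b = one_point_crossover(parent_a, parent_b, random.Random(0))
```

Together the two sons carry exactly the genes of the two parents, each son starting with the head of one parent and ending with the tail of the other.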

4.4 Mutation

A mutation occurs when an individual has one of its genes changed. This gene must be randomly chosen and the probability of change must be small. Mutation is responsible for diversifying the population, because it changes the individual randomly without analyzing information about previous generations. This strategy is used to escape from local optima [11].
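A mutation operator for individuals whose genes are vertex identifiers can be sketched as below. This is an assumption-laden illustration: the rate is applied once per individual (the text does not say whether it is per individual or per gene), the default of 0.10 mirrors the mutation rate reported in the methodology, and the replacement vertex is drawn from those not already in the individual so that no seed is repeated.

```python
import random

def mutate(individual, vertices, rate=0.10, rng=random):
    """With probability `rate`, replace one random gene by a new vertex."""
    mutant = list(individual)
    if rng.random() < rate:
        position = rng.randrange(len(mutant))
        # Assumption: the new vertex must not already be part of the individual.
        candidates = [v for v in vertices if v not in mutant]
        if candidates:
            mutant[position] = rng.choice(candidates)
    return mutant

# Hypothetical individual over a network with vertices 0..9.
individual = [1, 2, 3]
always = mutate(individual, range(10), rate=1.0, rng=random.Random(3))
never = mutate(individual, range(10), rate=0.0, rng=random.Random(3))
```

With `rate=1.0` exactly one gene changes, and with `rate=0.0` the individual is returned unchanged, which makes the operator easy to check in isolation.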

5 Methodology

For the implementation of the evaluation function, an adaptation of the previously presented LTM was performed. Each edge of the network has a relative value for each vertex, so that the sum of the values of the input edges of each vertex is always one. So, if a given vertex has x input edges, then the value of each edge for this vertex will be 1/x. This decision was taken so that each vertex exerts an influence inversely proportional to the number of neighbors of the influenced vertex. In this way it was also possible to use the evaluation function for undirected networks. The influence is divided evenly among all the edges of each vertex in these networks.
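The weighting scheme described above can be sketched as follows, assuming the graph is stored as a dictionary of incoming neighbors. The helper names are hypothetical, and an undirected edge is treated here as a pair of opposite directed edges, which is one way to read the statement that the same evaluation function applies to undirected networks.

```python
def from_undirected(edges):
    """Treat every undirected edge as two opposite directed edges."""
    in_neighbors = {}
    for u, v in edges:
        in_neighbors.setdefault(u, []).append(v)
        in_neighbors.setdefault(v, []).append(u)
    return in_neighbors

def uniform_in_weights(in_neighbors):
    """Give each of the x input edges of a vertex the weight 1/x."""
    return {
        (u, v): 1.0 / len(neighbors)
        for v, neighbors in in_neighbors.items()
        for u in neighbors
    }

# Hypothetical undirected toy graph.
in_neighbors = from_undirected([(1, 2), (1, 3), (2, 3), (3, 4)])
weights = uniform_in_weights(in_neighbors)
```

The same undirected edge then carries two different weights: vertex 3 contributes 1.0 to vertex 4, which has a single neighbor, while vertex 4 contributes only 1/3 to vertex 3, which has three.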

Before running the GA, an influence test with the most central nodes of the network was performed. The 50 best-ranked nodes in each centrality measure were chosen as seeds for the influence model simulation. Later, these results will be compared with the GA results.


For the genetic algorithm, a population of 50 individuals was defined, each containing 50 genes, and each gene representing one vertex. The selection of parents for the next generation was done by tournament, with crossover at a single point and a mutation rate of 10%. In tournament selection, four individuals are selected randomly, two at a time, and the two best ones are chosen as parents. Each generation replaces the previous individuals, except for the individual with the best fitness value of the previous generation, which is inserted in the next generation. That is, at each iteration, generation n + 1, generated from generation n, replaces n and is in turn replaced by generation n + 2, and so on.
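The tournament described above can be read as two binary tournaments, one per parent. The sketch below follows that reading, which is an interpretation rather than a confirmed detail of the implementation; the function name, the parallel `fitness` list and the independent drawing of the two pairs are assumptions.

```python
import random

def select_parents(population, fitness, rng=random):
    """Pick two parents, each the winner of a two-individual tournament."""
    parents = []
    for _ in range(2):
        # Draw two distinct competitors; the one with the higher fitness wins.
        first, second = rng.sample(range(len(population)), 2)
        winner = first if fitness[first] >= fitness[second] else second
        parents.append(population[winner])
    return parents

# Hypothetical population of two individuals: the fitter one always wins.
population = [["a"], ["b"]]
fitness = [1.0, 5.0]
parents = select_parents(population, fitness, random.Random(0))
```

Because each tournament compares only two individuals, selection pressure stays mild: a weak individual can still become a parent whenever it happens to be paired with an even weaker one.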

In order to check the impact on the quality of the solution, some disturbances of the initial population were performed. A single individual containing the fifty vertices of greatest centrality in the network was added. This centrality is defined by one of the four measures previously presented: degree, closeness, betweenness and PageRank. Thus, for each network to be analyzed, the four measures of centrality are applied, and the individual of each measure is inserted in the initial population, one at a time.
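The disturbance of the initial population can be sketched as below. The sketch assumes that the centrality-based individual takes one of the 50 slots (the text does not say whether the population grows to 51) and that the remaining individuals are random sets of distinct vertices; the function name and arguments are hypothetical.

```python
import random

def initial_population(vertices, central_seeds, size=50, rng=random):
    """One individual built from the most central vertices, the rest random."""
    k = len(central_seeds)
    population = [list(central_seeds)]
    while len(population) < size:
        # Random individual: k distinct vertices drawn from the network.
        population.append(rng.sample(vertices, k))
    return population

# Hypothetical network with 200 vertices; pretend vertices 0..49 are the most central.
vertices = list(range(200))
central_seeds = list(range(50))
population = initial_population(vertices, central_seeds, rng=random.Random(0))
```

Running the GA once per centrality measure then only requires changing `central_seeds`, while passing a purely random individual in its place gives the undisturbed baseline.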

Having defined the operation of the influence model (LTM) and of the genetic algorithm, the proposed method follows these steps:

1. Calculate the vertex centrality.
2. Generate an individual with the fifty most central vertices of the chosen measure and insert it into the initial population that will be formed.
3. Create the first population.
4. Evaluate this population.
5. Crossover.
6. Mutation.
7. Evaluate this population.
8. Elitism (save the best individual of the previous generation).
9. Return to step 5.

The algorithm finishes executing after a predefined number of iterations (for this study we considered 100 generations) or when the best fitness value stops growing for 10 consecutive generations.
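The generational loop with elitism and the two stopping criteria above can be sketched independently of the genetic operators. In the sketch below, `breed` is a hypothetical callable that stands for steps 5 and 6 (it receives the current population and its scores and returns a new population), `fitness_fn` stands for the LTM-based evaluation and is assumed to be deterministic, and placing the elite in the first slot of the new generation is also an assumption.

```python
import random

def run_ga(population, fitness_fn, breed, max_generations=100, patience=10):
    """Generational GA loop with elitism and early stopping on stagnation."""
    scores = [fitness_fn(ind) for ind in population]
    best_score = max(scores)
    best = population[scores.index(best_score)]
    history = [best_score]
    stalled = 0
    for _ in range(max_generations):
        offspring = breed(population, scores)
        # Elitism: the best individual found so far survives unchanged.
        offspring[0] = best
        population = offspring
        scores = [fitness_fn(ind) for ind in population]
        generation_best = max(scores)
        if generation_best > best_score:
            best_score = generation_best
            best = population[scores.index(generation_best)]
            stalled = 0
        else:
            stalled += 1
        history.append(best_score)
        if stalled >= patience:
            break  # no improvement for `patience` consecutive generations
    return best, best_score, history

# Toy problem (not the LTM): maximize the sum of five digits.
rng = random.Random(1)

def toy_fitness(individual):
    return sum(individual)

def toy_breed(population, scores):
    # Stand-in for crossover and mutation: nudge every gene by -1, 0 or +1.
    return [[min(9, max(0, g + rng.choice((-1, 0, 1)))) for g in ind]
            for ind in population]

start = [[rng.randrange(10) for _ in range(5)] for _ in range(8)]
best, best_score, history = run_ga(start, toy_fitness, toy_breed)
```

Because the elite is always carried over, the best fitness per generation recorded in `history` never decreases, which is the behavior the convergence charts in the results section rely on.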

5.1 Data

In order to test the proposed methodology, three networks were selected, and the influence results were compared by applying the seeds selected by the centrality measures directly on the LTM and by inserting them together with the rest of the initial GA population.

CA-GrQc Network. CA-GrQc [10] is a network that represents the collaboration of authors on the same work in the context of General Relativity and Quantum Cosmology. This network is undirected and contains 5242 nodes and 14496 edges, such that a node represents a researcher and an edge represents the collaboration of two authors in a publication.


CA-HepPh Network. CA-HepPh [10] is a collaboration network in High Energy Physics - Phenomenology. It is also undirected and contains 12008 vertices and 118521 edges, such that a node represents a researcher and an edge represents the collaboration of two authors in a publication.

Epinions Network. Epinions [14] is an on-line social network based on the trust of individuals in service reviews. It contains 75879 vertices and 508837 edges, such that a node represents an individual and an edge represents the trust between two individuals.

6 Results and Discussions

This section presents the main results of this work.

For each network, we first present the chart of the reach of influence per number of seeds (1 up to 50) for the different centrality methods used to select seeds. Then, a chart with the results of the GA by generation, with and without the initial population disturbance, shows the different effects of the centralities on the GA results. Finally, a table with an analysis of the best-fit individuals and their rank for each centrality measure is discussed.

Figures 3(a), 4(a), and 5(a) represent the diffusion of influence in these networks without considering the GA. The y axis shows the percentage of vertices influenced and the x axis shows the number of seeds used.

When the GA is used, the x axis represents the number of generations elapsed. It is important to mention that Figs. 3(b), 4(b) and 5(b) represent the performance of the GA with and without the proposed disturbance.

Tables 1, 2 and 3 present the percentage of vertices influenced by the GA in the last generation for the worst, the average, and the best result of the same tests shown in the charts (Figs. 3(b), 4(b), and 5(b)).

In Fig. 3 it can be seen that the GA improves the results by around 50% when compared with the best set of seeds without the GA, and different behavior can be observed for the different centrality measures. This shows that even the simplest GA can give great results in the influence maximization problem.

As in the previous network, the results in Fig. 4 show a great improvement of the GA when compared with the results using only the centrality measures.

Tables 1, 2, and 3 show that the variation around the average value stays below 3% in most cases. This occurs because the GA discards all results worse than the one already discovered. However, this result also demonstrates how difficult it is to improve a far above average result. This indicates that even if a good result is found, it will not be surpassed very easily.

In Fig. 3 it can be noticed that the seeds selected by betweenness centrality reached greater influence, followed by the seeds selected by degree. Also, when this same group of seeds was used in the GA, not only was there an improvement in the influence but also a close approximation between the results achieved by the measures. The same behavior can be observed in the network of Fig. 4, but in this case, when the GA was used, the seeds selected by betweenness generated a significantly better result.

Fig. 3. Difference of CA-GrQc network influence with and without GA: (a) without GA, (b) with GA.

Analyzing Fig. 5, it is possible to observe a behavior similar to that of the other networks, in that the seeds defined by degree and betweenness had better results. However, in this network there is a great difference between the best and the worst results when the GA is used. To understand this difference, we analyzed the results of the 10 executions for each measure.

For this analysis we made the intersection of the 10 response sets of each mode, so that when a vertex appeared in 8 or more groups it was used in the analysis. For each selected vertex, it was verified whether it was in the ranking of the first 50 vertices of each one of the four measures of centrality shown. Comparing the membership of these vertices to the group of the first 50 is justified by the fact that the initial populations received the 50 highest ranked vertices in the graph according to some measure of centrality. In addition, it was checked whether these vertices made up a community calculated by the Multilevel algorithm. Tables 4, 5, 6, and 7 show the results of the analysis regarding the position of the vertices in each measure for each response group.

Table 1. CA-GrQc group

Measure      Worst    Average  Best
Degree       14.7081  18.4872  20.8890
PageRank     14.3838  16.1599  18.0465
Closeness    14.6509  16.1560  19.1721
Betweenness  15.4140  18.6227  20.2213
GA-pure      14.1740  16.0588  17.8558

Fig. 4. Difference of CA-HepPh network influence with and without GA: (a) without GA, (b) with GA.
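The recurrence analysis described above (keeping the vertices that appear in at least 8 of the 10 response sets and looking up their position in each centrality ranking) can be sketched as follows. The function names, the toy solutions and the toy scores are hypothetical and do not reproduce the data of this work.

```python
from collections import Counter

def recurrent_vertices(solutions, min_runs=8):
    """Vertices that appear in at least `min_runs` of the given solutions."""
    counts = Counter(v for solution in solutions for v in set(solution))
    return sorted(v for v, c in counts.items() if c >= min_runs)

def rank_positions(vertices, scores):
    """Position (1 = most central) of each vertex in a centrality ranking."""
    ranking = sorted(scores, key=lambda v: (-scores[v], v))
    position = {v: i + 1 for i, v in enumerate(ranking)}
    return {v: position[v] for v in vertices}

# Ten hypothetical GA results: vertex 1 appears 10 times, 2 appears 8 times, 3 only 7.
solutions = [[1, 2, 3]] * 7 + [[1, 2, 4]] + [[1, 5, 6]] * 2
kept = recurrent_vertices(solutions)

# Hypothetical centrality scores for the same vertices.
scores = {1: 0.5, 2: 0.9, 3: 0.1, 4: 0.2, 5: 0.3, 6: 0.4}
positions = rank_positions(kept, scores)
```

A vertex whose position exceeds 50 in every ranking would be one that the GA found on its own, since it could not have come from any of the centrality-based individuals inserted in the initial population.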


Table 2. CA-HepPh group

Measure      Worst    Average  Best
Degree       37.6553  39.2080  40.6806
PageRank     33.4467  35.3464  37.5685
Closeness    36.1048  38.3628  40.4193
Betweenness  41.0287  43.3561  45.1084
GA-pure      34.0023  35.8228  37.8997

Fig. 5. Difference of Epinions network influence with and without GA: (a) without GA, (b) with GA.

It is possible to note in Tables 4, 5, 6, and 7 that not all of the analyzed vertices are among the top 50 of a measure, as is the case in Tables 6 and 7. Moreover, almost all the elements of Table 5 are contained in Table 4. Surprisingly, the closeness and PageRank groups also contained the same vertices. Observing Tables 4, 5, 6, and 7, it is possible to verify that PageRank is not a good measure to choose the seeds. Another verified characteristic is that no two vertices shown in the tables belong to the same community.

Table 3. Epinions group

Measure      Worst    Average  Best
Degree       35.3734  37.4674  38.7327
PageRank     17.0640  20.0053  22.4225
Closeness    19.6721  20.5236  21.6806
Betweenness  37.2382  38.0835  39.6460
GA-pure      10.0225  16.2087  28.9922

Table 4. Betweenness group

Vertex  Betweenness  Degree  Closeness  PageRank
363     4            1       2          307
697     3            32      164        76
1677    1            3       27         375
1867    2            2       7          257
2776    10           11      9          1335
8499    11           21      125        1322

Table 5. Degree group

Vertex  Betweenness  Degree  Closeness  PageRank
363     4            1       2          307
1677    1            3       27         375
1867    2            2       7          257
4971    22           18      29         1857

Table 6. Closeness group

Vertex  Betweenness  Degree  Closeness  PageRank
1138    529          481     613        942
2868    114          218     82         1701
8491    23           65      155        1136

Table 7. PageRank group

Vertex  Betweenness  Degree  Closeness  PageRank
1138    529          481     613        942
2868    114          218     82         1701
8491    23           65      155        1136

7 Conclusion

This work proposed a genetic algorithm for the influence maximization problem. The results showed that the use of a GA with the basic operations can significantly improve the influence results. Moreover, a very simple disturbance in the initial population was proposed, and the tests showed that the insertion of one individual with vertices of high centrality can improve these results even more.

It is possible to notice in this work that the insertion of a special individual in the initial population of the GA produced better results than without its insertion. Another visible feature is that the use of the Betweenness and Degree measures obtained better results than the other measures. Thus, considering the results obtained, for the maximization of influence in a network under the LTM it makes sense to consider high-centrality seeds that are not in the same community. It is also possible to realize that good seeds do not necessarily have high centrality. Another relevant factor is that, given the computational cost of calculating betweenness centrality for each vertex of the network, it is possible to obtain very close results using the degree of the vertex, which is computationally cheaper to calculate. This work demonstrates the quality of the results for influence maximization using a genetic algorithm and measures of centrality.

References

1. Barabasi, A.-L.: Network Science. Cambridge University Press, Cambridge (2016)

2. Bavelas, A.: Communication patterns in task-oriented groups. J. Acoust. Soc. Am. 22(6), 725–730 (1950)

3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of community hierarchies in large network. J. Stat. Mech. 1008 (2008)

4. Bucur, D., Iacca, G.: Influence maximization in social networks with genetic algorithms. In: Squillero, G., Burelli, P. (eds.) EvoApplications 2016. LNCS, vol. 9597, pp. 379–392. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31204-0_25

5. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under the linear threshold model. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 88–97. IEEE (2010)

6. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry 40, 35–41 (1977)

7. Jones, G.: Genetic and evolutionary algorithms. Encycl. Comput. Chem. 2, 1127–1136 (1998)

8. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)

9. Kempe, D., Kleinberg, J., Tardos, E.: Influential nodes in a diffusion model for social networks. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 1127–1138. Springer, Heidelberg (2005). https://doi.org/10.1007/11523468_91

10. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 2 (2007)

11. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1998)

12. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)

13. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report. Stanford InfoLab (1999)

14. Richardson, M., Agrawal, R., Domingos, P.: Trust management for the semantic web. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 351–368. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39718-2_23

15. Shakarian, P., Bhatnagar, A., Aleali, A., Shaabani, E., Guo, R.: Diffusion in Social Networks. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-319-23105-1
