


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS 1

Building Smaller Sized Surrogate Models of Complex Bipartite Networks Based on Degree Distributions

Qize Le and Jitesh H. Panchal

Abstract—This paper presents an approach for generating surrogate bipartite networks with varying sizes based on degree distributions of given bipartite networks. The resulting surrogate networks can be used for problems such as design of algorithms for similarity search, community detection and clustering, and recommender systems. The primary advantage of using smaller surrogate networks over original large-scale networks is the reduction in associated computational expense. Degree distribution is chosen because of its widespread acceptance, simplicity, and prior literature suggesting its ability to better capture large-scale network properties. The approach is illustrated using a bipartite network from an open-source software development repository. The network consists of nodes representing people and projects, and edges representing people working on different projects. A comparison between the surrogate networks and the original networks is presented. The results show that the resized networks obtained using the proposed approach can be used to match the original degree distribution. A comparison of seven other network characteristics is also provided.

Index Terms—Bipartite networks, degree distribution, open source, surrogate models.

I. INTRODUCTION

A. Surrogate Models of Networks and Their Applications

IN THIS paper, we present an approach for developing surrogate bipartite networks of different sizes based on the degree distributions of given bipartite networks. Surrogate models are compact and approximate representations of complex systems [1]. They are used instead of complex, higher fidelity models during system analysis and design to reduce the associated computational effort [2].

The applications of surrogate models for networks are in the design of processes and algorithms that work on existing networks. Such design problems are associated with a number of systems including the Internet, infrastructure networks (e.g., power grid and telecommunications networks), software packages and classes, electronic circuits, peer-to-peer networks, and distributed databases [3]. Examples of such problems include: 1) design of search algorithms for the world-wide web; 2) optimization of routing strategies on transportation networks; 3) development of internet protocols; 4) strategies for preventing the spread of epidemics within social networks; and 5) prevention of cascading failures in power networks [4], [5]. While the processes and algorithms developed for all these applications can work regardless of the topology of the underlying networks, it is well known that their performance can be significantly improved by accounting for the network topology [6]. For example, by accounting for the topology of social networks, targeted strategies can be developed to minimize the spread of epidemics [7].

Manuscript received September 3, 2010; revised January 27, 2011 and July 1, 2011; accepted September 4, 2011. This work was supported in part by National Science Foundation through the CAREER Grant # 0954447. This paper was recommended by Associate Editor W. Pedrycz.

The authors are with the School of Mechanical and Materials Engineering, Washington State University, Pullman, WA 99164 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCA.2012.2183589

The design of such processes and algorithms involves testing alternative designs on the network. This becomes challenging because the computational cost increases rapidly as the complexity of the networks increases. As a simple example, the time complexity of the insertion sort algorithm, which is commonly used in network analysis, is $O(n^2)$, where $n$ is the number of nodes in a network. If the time required for analyzing a network with 1000 nodes using the insertion sort algorithm is 0.07 s [8], the corresponding time grows to 2083 h for a network with 1 000 000 nodes [8]. Even with high-performance computers, the analysis of network structures and their evolution is computationally intensive. To address the challenge of computational effort, simplified surrogate models of networks can be used in place of the original networks.

The specific types of networks studied in this paper are bipartite networks (also referred to as two-mode networks or affiliation networks [9]). A bipartite network consists of two disjoint sets of nodes and a set of edges such that each edge connects nodes in the two node sets. Bipartite networks are used to model a number of real-world relationships between pairs of entities such as authors-publications [10], actors-movies, people-projects, content-tags [11], proteins-metabolic reactions, and products-customers. While there has been a significant amount of work on network-generation models for one-mode networks, there is a lack of methods for bipartite networks (see detailed literature review in Section II-A). The proposed approach for generating surrogate models of bipartite networks has various applications including design of algorithms for similarity search [12], community detection and clustering [13], and recommender systems [14]. Degree distributions of networks are used as the underlying characteristics for generating surrogate networks. The rationale for choosing degree distribution is discussed in the following section.

1083-4427/$31.00 © 2012 IEEE

B. Rationale for Using Degree Distributions for Developing Surrogate Models

The fundamental goal of network generators is to develop topologically similar networks that can be used for the types of design problems discussed in Section I-A. An ideal surrogate network is the one that replicates all the properties of the original network. Unfortunately, such a match can only be achieved by an isomorphic graph. This defeats the purpose of a surrogate network. Further, there is no generic set of metrics that can be matched to evaluate whether the networks are completely similar. Similarity between networks can be defined only in the context of a given set of characteristics, which depends on the application for which the surrogate model is developed.

Some of the common characteristics used for comparing surrogate networks with original networks include degree distribution, degree correlations, characteristic path length, centrality, clustering coefficient, and eigenvalue spectra [3]. Since each of these characteristics quantifies a different aspect of the network structure (and performance), different characteristics are important for different systems. For example, the characteristic path length quantifies how fast information spreads over a network, how fast a fault propagates on the network, and the number of legs of a journey for a traveler. Hence, characteristic path length is important for computer networks, power grids, and transportation networks. Similarly, centrality is an important measure of a network’s resilience to failure.

One of the network measures that influences a number of aspects of a network’s performance is the degree distribution. In a network, a node’s degree is defined as the number of other nodes linked to it [15]. The distribution of degrees of nodes in a network provides insights about a graph’s topology and its growth mechanisms. It is a simple yet key indicator of the topology of a network. For example, a random network has a binomial degree distribution, which takes the form of a Poisson distribution for large networks [16], [17]. Exponential graphs have an exponential degree distribution, and scale-free graphs follow a power-law distribution [18], [19]. Degree distribution can be used to infer a variety of network performance characteristics such as stability and resilience (error and attack tolerance).

Degree distribution has been used to generate models of various complex networks such as the world-wide web, the Internet, power-line networks, citation networks, ecological networks, and protein networks [20]. For example, various degree-based network generators have been developed to generate the topology of the Internet [21]–[24]. These network generators are based on replicating the power-law topology of the Internet [25]. Other types of network topology generators include structural generators (which attempt to reproduce the global hierarchical structure of the Internet) and random graph generators (which extend the classical random graph models) [26]. Tangmunarunkit and coauthors [26] compared various degree-based network generators [27] with random graph generators [28] and structural network generators [29], [30], and concluded that network generators based on degree distribution more accurately represent the topology of the Internet than the structural and random graph generators. Additionally, degree distribution is computationally simple to calculate. Due to its widespread acceptance, its simplicity, and the literature suggesting its ability to better capture large-scale network properties, we use degree distribution as a basis for developing surrogate networks.

The paper is organized as follows. In Section II, a review of existing literature and an overview of network analysis metrics and tools used in this paper are provided. In Section III, the proposed approach is presented. The approach is illustrated using a bipartite network of participants working on projects in an open-source software (OSS) development community. The results are presented in Section IV. Closing thoughts and future work are presented in Section V.

II. BACKGROUND

A. Related Literature on Network Generation Models

In this section, we review the core literature on network generation models for unimodal and bipartite networks.

1) Unimodal Network Construction Models: Unimodal network construction models include models for specific topologies such as random graphs and scale-free networks, and models for arbitrary degree distributions. A random graph is generated by starting with a given number of nodes, selecting each pair of nodes, and creating a link between them with a probability $p$. This model is called the Erdős-Rényi random graph model [16], [31]. The degree distribution of large random networks is a Poisson distribution [32]. In the small-world network model, introduced by Watts and Strogatz [33], an initial network is created as a ring lattice with $n$ vertices and $k$ edges per vertex. The initial network is then modified by rewiring the links with a probability $p$. For $0 < p < 1$, the network has small-world properties, i.e., the nodes in the network have many local connections and a few long-distance connections. As $p$ approaches 1, the network converges to a classical random graph. To generate scale-free topologies, Barabási and Albert [18] proposed a preferential attachment model in which a small number of nodes are initialized, and then, during each step, a new node with $m$ edges is added and connected to $m$ existing nodes based on their degree. The probability that a new node connects to an existing node is proportional to the degree of the existing node [23]. The networks generated using this model have a power-law degree distribution.
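The two classical generators described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited works: the function names, the star-shaped seed network, and the fixed random seeds are assumptions made here for the example.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Edge list of a G(n, p) random graph on nodes 0..n-1."""
    rng = random.Random(seed)
    # Every unordered pair is linked independently with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def barabasi_albert(n, m, seed=0):
    """Grow a graph in which each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed seed network: a star on m + 1 nodes, so every node has degree >= 1.
    edges = [(0, i) for i in range(1, m + 1)]
    # A node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


def degrees(edges, n):
    """Degree of every node 0..n-1 for an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

Keeping one list entry per link end is a common way to obtain degree-proportional sampling without recomputing probabilities at every step.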

Researchers have proposed various modifications to the original scale-free network model to represent unique features of specific networks. For example, Holme and Kim [34] extend the original scale-free network model to include a “triad formation step.” In this modified model, the clustering coefficient is tunable simply by changing a control parameter when growing the network. Mossa and coauthors [35] formulate a general model for the growth of scale-free networks under filtering information conditions. This modified model displays truncation of power-law behavior due to the subset of the network “accessible” to the newly introduced node. Barabási [36] and Ravasz et al. [37] propose hierarchical network models to generate networks with power-law degree distribution and high clustering independent of network size. Dangalchev [38] presents a two-level stochastic network generation model for scale-free networks with a preference function which is proportional to the sum of the node’s degree and the sum of degrees of all nodes connected to it.

Research efforts have also been devoted to general approaches for generating networks with arbitrary degree distributions. Krapivsky and Redner [39] discuss an approach to grow networks with arbitrary degree distributions based on an “attachment kernel,” which is nonlinear preferential attachment. In this method, the probability that a new node connects to a present node depends on $k^{\alpha}$, where $k$ is the degree of the node and $\alpha$ is an arbitrary parameter. Ghoshal and Newman [40] present an approach to grow distributed networks (such as peer-to-peer networks) with arbitrary degree distributions by appropriately manipulating the rules which govern how newly added nodes connect to the network. The authors illustrate the approach by generating networks with power-law degree distribution. Newman and coauthors [41] present mathematical approaches to calculate the statistical properties of graphs with given degree distributions in the limit of a large size. The authors illustrate the approach for special cases of degree distributions including Poisson, exponential, and power-law. Hintze and Adami [42] present a similar approach to grow networks with arbitrary degree distributions and modularity measures by appropriately setting six probability parameters that describe the growth of a network. Badham and Stocker [43] present an approach to construct unimodal networks with target characteristics including degree distribution, clustering coefficient, and degree assortativity.
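To make the attachment kernel concrete, the following sketch computes the probability of attaching to each existing node when the kernel is $k^{\alpha}$, normalized over all existing nodes. The function names are hypothetical, and the normalization by the sum of kernels is an assumption about how the kernel is turned into a probability.

```python
import random


def attachment_probabilities(degrees, alpha):
    """Probability of attaching to each existing node under the kernel k**alpha.

    Assumes all degrees are >= 1 when alpha is negative.
    """
    weights = [k ** alpha for k in degrees]
    total = sum(weights)
    return [w / total for w in weights]


def pick_target(degrees, alpha, rng=random):
    """Draw the index of one existing node according to the kernel."""
    probs = attachment_probabilities(degrees, alpha)
    return rng.choices(range(len(degrees)), weights=probs, k=1)[0]
```

With `alpha = 1` this reduces to linear preferential attachment, and with `alpha = 0` every existing node is equally likely to be chosen.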

2) Bipartite Network Construction Models: Some recent studies have also investigated models for bipartite networks. Peruani and coauthors [44] study a specific type of bipartite network in which the number of nodes in one partition is kept fixed while the other partition is allowed to grow. Skvoretz and Faust [45] develop logit models based on $p^*$ models and Markov graphs for bipartite networks. Wang et al. [46] propose exponential random graph models for bipartite networks and develop an approach based on the Mahalanobis distance to assess the goodness of fit of the model. Koskinen and Edling [47] introduce a class of models and a Bayesian inference scheme that extends prior models for one-mode networks. These approaches cater to most topologies of realistic bipartite networks and include measures such as goodness of fit to estimate the likelihood of models given data sets. Jones and Handcock [48] introduce preferential attachment as a mechanism for human sexual networks, which are bipartite networks, and adjust a key parameter $\rho$ between 2 and 3 to obtain desired degree distributions. Latapy et al. [49] evaluate the differences between real bipartite networks and random bipartite networks.

3) Research Gap: The existing models discussed above are limited for generating surrogate models of complex bipartite networks with arbitrary distributions for two reasons. First, existing methods are mainly focused on generating general topological classes of complex networks. However, for developing surrogate models there is a need for closely matching arbitrarily specified degree distributions. Second, the methods lack support for resizing bipartite networks. Developing smaller surrogate networks from large complex networks is a key requirement for the design problems discussed in Section I-A. Hence, there is a need for establishing a new approach for developing smaller (resized) surrogate models from complex bipartite networks that conserve degree distributions, thereby reducing the computational cost for design of network-based processes and algorithms. In this paper, a general approach for addressing this need is presented and illustrated using data from a real-world network.

B. Overview of Network Analysis Metrics and Tools

In this paper, both the bipartite networks and one-mode networks derived from them are analyzed. The following characteristics are used for bipartite networks.

a) Degree Distribution: There are two degree distributions, one for each node type [49]. For each type of nodes, the degree distribution is defined as the fraction of nodes with degree $k$. The joint degree distribution of a network, $P(k_1, k_2)$, represents the probability that a randomly selected edge is connected to nodes with degrees $k_1$ and $k_2$ [50], [51]. The joint degree distribution is different from the conditional probability $P(k_2 \mid k_1)$, which measures the probability that a given node of degree $k_1$ is connected to a node of degree $k_2$.

b) Degree Correlation refers to the correlation between the degrees of two types of nodes in a bipartite network. The degree correlation is quantified by mapping the degrees of nodes of one type with the average degree of its neighbors [49].

c) Density: The density of a bipartite network with $n_A$ nodes from node set A, $n_B$ nodes from node set B, and $m$ links is [9]:

$$\text{density} = \frac{m}{n_A n_B}.$$

d) Degree Centrality measures the degree of inequality or variance in the network as a percentage of that of a perfect star network of the same size. For a bipartite network $G = (V_A, V_B, E)$ with $n_A$ nodes from node set A and $n_B$ nodes from node set B, the normalized degree centrality $C_D(v)$ for a node $v$ from node set A is [9]:

$$C_D(v) = \frac{\operatorname{degree}(v)}{n_B}. \tag{1}$$

The degree centralization of the bipartite network with $v^*$ as the most central vertex is [9]:

$$C_D(G) = \frac{\sum_{i=1}^{n} \left[ C_D(v^*) - C_D(v_i) \right]}{(n_A + n_B)\,n_A - 2(n_A + n_B - 1)}. \tag{2}$$
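The bipartite characteristics a) to d) can be computed directly from an edge list. The sketch below is illustrative only: it assumes nodes of each type are labeled by consecutive integers starting at 0, the function names are hypothetical, and it covers the degree distributions, the density, the normalized degree centrality of Eq. (1), and the degree correlation as the mean neighbor degree per degree class. The centralization of Eq. (2) is left out because the range of the summation index is not specified here.

```python
from collections import Counter


def bipartite_metrics(edges, n_a, n_b):
    """edges: (a, b) pairs with a in 0..n_a-1 (set A) and b in 0..n_b-1 (set B)."""
    edges = set(edges)  # ignore duplicate links
    deg_a = Counter(a for a, _ in edges)
    deg_b = Counter(b for _, b in edges)
    # Degree distribution: fraction of nodes of each type that have degree k.
    dist_a = {k: c / n_a for k, c in Counter(deg_a[a] for a in range(n_a)).items()}
    dist_b = {k: c / n_b for k, c in Counter(deg_b[b] for b in range(n_b)).items()}
    density = len(edges) / (n_a * n_b)
    # Normalized degree centrality of set-A nodes, Eq. (1).
    centrality_a = {a: deg_a[a] / n_b for a in range(n_a)}
    return dist_a, dist_b, density, centrality_a


def avg_neighbor_degree_a(edges):
    """Map each set-A degree k to the mean degree of the B neighbors
    of the A nodes that have degree k."""
    edges = set(edges)
    deg_a = Counter(a for a, _ in edges)
    deg_b = Counter(b for _, b in edges)
    sums, counts = Counter(), Counter()
    for a, b in edges:
        sums[deg_a[a]] += deg_b[b]
        counts[deg_a[a]] += 1
    return {k: sums[k] / counts[k] for k in counts}
```

For example, with the links `[(0, 0), (0, 1), (1, 1), (2, 1)]`, three type A nodes, and two type B nodes, the density is 4/6 and node 0 of type A has a normalized degree centrality of 1.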

The following metrics are used to analyze the weighted one-mode networks [52]:

e) Clustering coefficient measures the degree to which nodes in a network tend to cluster together. The clustering coefficient reflects the “cliquishness” of the mean closest neighborhood of a vertex.

f) Diameter is the largest distance between any two nodes of a connected network [53]. The diameter of a network tells us how “big” the network is.

g) Shortest path is the shortest path of vertices and edges that links two given vertices [54]. The average shortest path can describe whether there exists the “small world phenomenon” [55] within the networks.

h) Density of a network is the average proportion of links incident with nodes in the network [56]. The density of a network goes from 0 (if there are no links present) to 1 (if all possible links are present).

Fig. 1. Density of points in degree distribution of Nodes A. (a) Degree distribution of type A nodes. (b) Density of the points in the degree distribution of type A nodes.
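Two of the one-mode metrics listed above, density and clustering coefficient, can be sketched as follows. The projection rule is an assumption: two type A nodes are linked when they share at least one type B node, and the link weight counts the shared nodes, which is one common convention and may differ from the weighting used in [52]. The clustering coefficient below ignores the weights and averages the local coefficient over all nodes.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_a(edges):
    """Weighted one-mode projection onto set A (assumed convention:
    weight = number of B nodes shared by the two A nodes)."""
    members = defaultdict(set)
    for a, b in edges:
        members[b].add(a)
    weights = defaultdict(int)
    for group in members.values():
        for u, v in combinations(sorted(group), 2):
            weights[(u, v)] += 1
    return dict(weights)


def density(weights, n):
    """Fraction of the n * (n - 1) / 2 possible links that are present."""
    return 2 * len(weights) / (n * (n - 1))


def average_clustering(weights, n):
    """Mean local clustering coefficient over nodes 0..n-1 (unweighted)."""
    nbrs = defaultdict(set)
    for u, v in weights:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0.0
    for node in range(n):
        k = len(nbrs[node])
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        linked = sum(1 for u, v in combinations(sorted(nbrs[node]), 2) if v in nbrs[u])
        total += 2 * linked / (k * (k - 1))
    return total / n
```

Because every type B node of degree $d$ produces a clique of $d$ type A nodes in the projection, projected networks tend to show high clustering, which is worth keeping in mind when comparing surrogate and original networks.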

Social network analysis tools are used for evaluating different metrics of a social network. The most widely used tools are Structure [57], Gradap [58], Ucinet [59], and Network Workbench [60]. Other social network analysis tools are discussed by Huisman et al. [61] and Freeman [62]. In this paper, we use Network Workbench [60].

III. PROPOSED APPROACH FOR DEVELOPING SURROGATE BIPARTITE NETWORKS WITH SPECIFIED DEGREE DISTRIBUTION

A. Overview of the Proposed Approach

The proposed approach supports the generation of smaller surrogate networks from complex bipartite networks with specified degree distributions. Assume a bipartite network with two types of nodes: A and B. The inputs to the process are: 1) number of nodes of types A and B, 2) number of links connecting nodes of types A with B, and 3) degree distributions of type A and type B nodes. Based on this information from the original networks, the following steps are followed: Step 1: degree distribution reconstruction; Step 2: establishment of links between nodes in the bipartite network; and Step 3: projection of the bipartite network to unimodal networks. The details of the steps are presented next.

B. Step 1. Degree Distribution Reconstruction

In this step, we start with a bipartite network and a target for the size of the surrogate network expressed as a percentage of the size of the original network. The outcome of this step is a set of nodes with link stubs that satisfy the degree distributions of the original bipartite network. In this step, first the target number of nodes and links for the resized network are determined. Then, the distribution of link stubs for each type is determined using the normalized degree distributions. The link stubs are assigned to the nodes.

1) Determination of the Target Numbers of Nodes and Links: Let $N_{A,o}$ and $N_{B,o}$ be the number of nodes of types A and B and $N_{L,o}$ be the number of links in the original bipartite network. The number of nodes and links are related to the degree distributions of both types of nodes. For example, a point $(x_i, y_i)$ in the degree distribution of type A nodes implies $y_i$ fraction of nodes of type A have a degree of $x_i$ (i.e., are connected to $x_i$ nodes of type B). Similarly, a point $(p_j, q_j)$ in the degree distribution of type B implies that $q_j$ fraction of nodes of type B have a degree of $p_j$ (i.e., are connected to $p_j$ nodes of type A). Based on the degree distribution

$$N_{A,o} = \sum_i y_i, \qquad N_{B,o} = \sum_j q_j \tag{3}$$

$$N_{L,o} = \sum_i x_i y_i = \sum_j p_j q_j. \tag{4}$$

In order to resize the network by a fraction $r$, the number of nodes and links is scaled by that fraction. These modified numbers of nodes and links are referred to as target values because these numbers are adjusted in the following steps to satisfy the degree distributions:

$$N_{A,t} = \frac{N_{A,o}}{r}, \qquad N_{B,t} = \frac{N_{B,o}}{r}, \qquad \text{and} \qquad N_{L,t} = \frac{N_{L,o}}{r}. \tag{5}$$
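As a worked illustration of Eqs. (3) to (5), the sketch below computes the node and link counts from degree distribution points and scales them. Several assumptions are made here: $y_i$ and $q_j$ are treated as node counts, as the sums in Eqs. (3) and (4) require; Eq. (5) is applied as printed, so $r$ acts as a divisor and values of $r$ greater than 1 shrink the network; and the targets are rounded to integers. The function names are hypothetical.

```python
def counts_from_points(points):
    """points: list of (degree, number_of_nodes). Returns (nodes, links), Eqs. (3)-(4)."""
    n_nodes = sum(y for _, y in points)
    n_links = sum(x * y for x, y in points)
    return n_nodes, n_links


def target_values(points_a, points_b, r):
    """Target numbers of type A nodes, type B nodes, and links, Eq. (5)."""
    n_a, links_a = counts_from_points(points_a)
    n_b, links_b = counts_from_points(points_b)
    if links_a != links_b:
        # Eq. (4): both node types must account for the same links.
        raise ValueError("link totals of the two node types differ")
    # Eq. (5) as printed divides by r; rounding to integers is an assumption.
    return round(n_a / r), round(n_b / r), round(links_a / r)
```

For instance, `target_values([(1, 6), (2, 2)], [(2, 2), (3, 2)], 2)` describes an original network with 8 type A nodes, 4 type B nodes, and 10 links, and yields targets of 4, 2, and 5.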

2) Assignment of Link Stubs to the Nodes: Having determined the target numbers of nodes and links, the next step is to determine the number of links that originate from each node. The knowledge of the degree distribution function is essential but not sufficient for determining the number of links originating from nodes. Different networks with similar functions can have different densities of points $(x_i, y_i)$ and $(p_j, q_j)$ for different degrees in the degree distribution plots. The knowledge of the density of points in different parts of the degree distribution plot is also important for determining the number of links originating from nodes.

To account for the variation in density of points on the degree distribution plot, the degree distribution is discretized into a set of intervals and the percentage of points within each interval is calculated. This is shown in Fig. 1, where the degrees are discretized into four intervals ([0 5], [5 10], [10 15], and [15 20]). In this example, 20% of the points on the degree distribution plot lie within the degree interval [0 5] and 10% lie within the interval [15 20]. The number of intervals can be increased to increase the resolution.
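The discretization can be sketched as a simple binning of the degree values $x_i$ of the points on the plot. Since adjacent intervals such as [0 5] and [5 10] share an endpoint, the sketch assumes half-open intervals that include the lower bound, with only the last interval also including its upper bound. The function name and the sample points are hypothetical, not data from Fig. 1.

```python
def point_density(points, boundaries):
    """Fraction of degree-distribution points (x_i, y_i) whose degree x_i
    falls in each interval defined by consecutive boundaries."""
    counts = [0] * (len(boundaries) - 1)
    for x, _ in points:
        for i in range(len(counts)):
            lo, hi = boundaries[i], boundaries[i + 1]
            is_last = i == len(counts) - 1
            # Assumed convention: [lo, hi), except the last interval is [lo, hi].
            if lo <= x < hi or (is_last and x == hi):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical points: (degree, number of nodes with that degree).
sample = [(1, 50), (2, 20), (4, 10), (7, 5), (12, 2), (20, 1)]
shares = point_density(sample, [0, 5, 10, 15, 20])
```

Here three of the six points have degrees below 5, so the first interval holds half of the points, and each of the remaining intervals holds one sixth.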

Based on the information about the number of nodes ($N_{A,t}$ and $N_{B,t}$), the number of links ($N_{L,t}$), and the density of points on degree distribution plots, the objective is to determine a set of points on the degree distribution plot such that the following conditions are met as closely as possible:

$$\sum_i y_{i,t} = N_{A,t}, \qquad \sum_j q_{j,t} = N_{B,t} \tag{6}$$

$$\sum_i x_{i,t}\, y_{i,t} = N_{L,t}, \qquad \sum_j p_{j,t}\, q_{j,t} = N_{L,t}. \tag{7}$$

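Determining the points $(x_{i,t}, y_{i,t})$ and $(p_{j,t}, q_{j,t})$ that satisfy these equations is a discrete variable constraint satisfaction problem (CSP). We convert this CSP into a single-objective optimization problem as follows:

Variables: $(x_{i,t}, y_{i,t})$ and $(p_{j,t}, q_{j,t})$

Subject to:

$$\varepsilon_1 = \sum_i y_{i,t} - N_{A,t}$$

$$\varepsilon_2 = \sum_j q_{j,t} - N_{B,t}$$

$$\varepsilon_3 = \sum_i x_{i,t}\, y_{i,t} - N_{L,t}$$

$$\varepsilon_4 = \sum_j p_{j,t}\, q_{j,t} - N_{L,t}.$$

Minimize:

$$z = |\varepsilon_1| + |\varepsilon_2| + |\varepsilon_3| + |\varepsilon_4|.$$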

We utilize the simulated annealing methodology [63] to solve the optimization problem. Simulated annealing is a flexible optimization method suited for large-scale combinatorial optimization problems [64]. The outcomes of the optimization are two series of points on the degree distribution plots of type A nodes and type B nodes of a resized network that closely represent the original bipartite network. Each point $(x_{i,t}, y_{i,t})$ represents $y_{i,t}$ nodes whose degree is $x_{i,t}$. Thus, in the network, $y_{i,t}$ nodes of type A are created and $x_{i,t}$ link stubs are added to each node. Similarly, $q_{j,t}$ nodes of type B are created and $p_{j,t}$ link stubs are added to each node. These nodes with link stubs are then connected in Step 2, discussed next.
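A heavily simplified sketch of such an annealing search is given below. It is not the authors' implementation. The assumptions are: the candidate degrees $x_{i,t}$ are fixed in advance and only the integer counts $y_{i,t}$ are varied; one node type is handled at a time, which is possible because $\varepsilon_1$ and $\varepsilon_3$ involve only type A points while $\varepsilon_2$ and $\varepsilon_4$ involve only type B points; the interval densities are not enforced; and the move set, the geometric cooling schedule, and all parameter values are illustrative choices.

```python
import math
import random


def anneal_counts(xs, n_target, l_target, steps=20000, t0=5.0, cooling=0.9995, seed=0):
    """Search for counts ys (one per degree in xs) that minimize
    |sum(ys) - n_target| + |sum(x * y) - l_target|."""
    rng = random.Random(seed)
    ys = [0] * len(xs)

    def cost(counts):
        eps_nodes = sum(counts) - n_target
        eps_links = sum(x * y for x, y in zip(xs, counts)) - l_target
        return abs(eps_nodes) + abs(eps_links)

    current = cost(ys)
    best, best_cost = ys[:], current
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        delta = rng.choice((-1, 1))
        if ys[i] + delta < 0:
            continue  # counts cannot be negative
        ys[i] += delta
        candidate = cost(ys)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temp):
            current = candidate
            if current < best_cost:
                best, best_cost = ys[:], current
        else:
            ys[i] -= delta  # reject the move
        temp *= cooling
    return best, best_cost
```

A call such as `anneal_counts([1, 2, 3, 4], n_target=10, l_target=20)` returns the best counts found together with the residual value of the objective; a residual of zero means both conditions are met exactly for that node type.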

C. Step 2. Establishment of Links Between Nodes in the Bipartite Network

In this step, we determine how to link the stubs of node type A with the stubs of node type B in order to generate a bipartite network. This linking can be carried out in three ways. The simplest (and the least accurate) approach is a node-based random attachment process. The second approach is a link stub-based random attachment process. Finally, the most accurate approach is based on the joint probability distribution.

1) Node-Based Random Attachment: In the node-based random attachment process, pairs of type A and type B nodes are selected randomly as in a random graph. A link stub from the selected type A node is connected with a link stub of the selected type B node if the two nodes are not already connected with each other and there are unconnected stubs available. The process is continued until all the link stubs are connected to form the complete resized bipartite network. Note that the random mapping process is a simple process and does not utilize any other information from the original network.
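The node-based process can be sketched in Python as follows. This is a hedged illustration rather than the code used in the study: the function name, the representation of stubs as per-node counters, and the `max_tries` cap (added because the last few stubs may be impossible to connect without creating a duplicate link) are assumptions.

```python
import random


def node_based_attachment(stubs_a, stubs_b, seed=0, max_tries=100_000):
    """Connect stubs by picking node pairs uniformly at random.

    stubs_a[i] and stubs_b[j] hold the number of link stubs on
    type A node i and type B node j. Returns a set of (i, j) links.
    """
    rng = random.Random(seed)
    free_a, free_b = list(stubs_a), list(stubs_b)
    remaining = min(sum(free_a), sum(free_b))
    edges = set()
    for _ in range(max_tries):
        if remaining == 0:
            break
        a = rng.randrange(len(free_a))   # every node is equally likely
        b = rng.randrange(len(free_b))
        if free_a[a] > 0 and free_b[b] > 0 and (a, b) not in edges:
            edges.add((a, b))
            free_a[a] -= 1
            free_b[b] -= 1
            remaining -= 1
    return edges
```

Because nodes are drawn uniformly, a node's chance of being picked does not depend on how many stubs it carries, which is the property that distinguishes this variant from the stub-based one.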

2) Link Stub-Based Random Attachment: A variant of the random attachment process is the link stub-based attachment. In this case, instead of picking nodes at random, the link stubs from nodes A and B are randomly chosen and a link is established between the corresponding nodes if they are not already connected. The probability of selecting a node of a specific type is equal to the degree of that node divided by the total number of links. For example, the probability of a specific type A node with degree $d_{i,A}$ being selected is $d_{i,A}/N_{L,o}$. The process induces an element of preferential attachment between nodes because the nodes with a greater number of stubs have a higher probability of being chosen. Hence, the process results in more connections among nodes with higher degrees.
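A minimal sketch of the stub-based variant, under the same assumptions as before (hypothetical function name, a `max_tries` cap to guard against stubs that cannot be paired without duplicating a link), draws from a pool that lists each node once per free stub, so that selection is proportional to the number of remaining stubs.

```python
import random


def stub_based_attachment(stubs_a, stubs_b, seed=0, max_tries=100_000):
    """Connect stubs by picking free stubs uniformly at random."""
    rng = random.Random(seed)
    # Each node appears once per stub, so high-degree nodes are drawn more often.
    pool_a = [i for i, d in enumerate(stubs_a) for _ in range(d)]
    pool_b = [j for j, d in enumerate(stubs_b) for _ in range(d)]
    edges = set()
    for _ in range(max_tries):
        if not pool_a or not pool_b:
            break
        ia = rng.randrange(len(pool_a))
        ib = rng.randrange(len(pool_b))
        a, b = pool_a[ia], pool_b[ib]
        if (a, b) in edges:
            continue                      # nodes already linked; redraw
        edges.add((a, b))
        # Swap-remove the two used stubs in O(1).
        pool_a[ia] = pool_a[-1]
        pool_a.pop()
        pool_b[ib] = pool_b[-1]
        pool_b.pop()
    return edges
```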

3) Joint Degree Distribution Approach: The two preceding approaches for linking nodes are based entirely on the degree distribution of networks. However, in many complex networks, degree distribution is inadequate to describe the network topology. It has been shown by Newman [65], [66] that many networks show assortative mixing on their degrees, i.e., high-degree nodes tend to link to other high-degree nodes and low-degree nodes tend to connect to low-degree nodes. Some networks also show a reverse trend, disassortative mixing, where high-degree nodes tend to attach to low-degree nodes. For example, social networks show significant assortative mixing whereas technological and biological networks display disassortative mixing [66].

In order to take the assortative and disassortative nature of the networks into account, the third approach for linking nodes is based on the joint degree distribution of the network. To simplify the computation of the joint degree distribution, we divide the degree ranges into intervals and determine $P([d_{i,A}\ d_{i+1,A}], [d_{j,B}\ d_{j+1,B}])$ as the ratio of the number of links between nodes of type A and B within a given degree range to the total number of links. Hence, the joint probability function is extracted and utilized in the proposed approach as follows:

a) the nodes of types A and B are sorted based on their degrees;

b) the range of degrees for each type of node is independently split into a set of degree ranges $[d_{i,A}\ d_{i+1,A}]$ for nodes A and $[d_{j,B}\ d_{j+1,B}]$ for nodes B;

c) for each combination of the degree ranges $[d_{i,A}\ d_{i+1,A}]$ and $[d_{j,B}\ d_{j+1,B}]$, the fraction of links, i.e., the ratio of the number of links between nodes within that degree range to the total number of links, is calculated from the original network;


Fig. 2. Normalized log-log plots of people and project degree distributions. (a) Log-log people degree distribution. (b) Log-log project degree distribution.

d) the number of links in the surrogate network is divided into each combination of degree ranges based on the fraction of links within the corresponding degree range in the original network;

e) within each combination of degree subsets, preferential attachment is utilized to link the nodes by connecting their link stubs.

The resolution of the approach can be increased by increasing the number of degree ranges for each node type.
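Steps c) and d) of the procedure above can be sketched as follows. The function names, the encoding of degree ranges by their sorted lower bounds, and the use of rounding to allocate links are assumptions of this illustration; step e), the stub linking within each cell, is not shown.

```python
from bisect import bisect_right
from collections import Counter


def joint_link_fractions(edges, deg_a, deg_b, bounds_a, bounds_b):
    """Fraction of links falling in each (range of A, range of B) cell.

    bounds_a and bounds_b are sorted lower bounds of the degree ranges;
    the first bound must not exceed the smallest degree.
    """
    counts = Counter()
    for a, b in edges:
        i = bisect_right(bounds_a, deg_a[a]) - 1   # range index of node a
        j = bisect_right(bounds_b, deg_b[b]) - 1   # range index of node b
        counts[(i, j)] += 1
    total = len(edges)
    return {cell: c / total for cell, c in counts.items()}


def allocate_links(fractions, n_links_target):
    """Split the surrogate's link budget across cells (step d)."""
    # Rounding may leave the total off by a few links; a real
    # implementation would need to redistribute the remainder.
    return {cell: round(f * n_links_target) for cell, f in fractions.items()}
```

Using more bounds per node type gives a finer grid of cells, which corresponds to the increase in resolution mentioned above.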

4) Step 3. Projecting the Bipartite Network to Unimodal Networks: Bipartite networks can be projected onto two weighted undirected networks consisting of uniform types of nodes (one-mode or unimodal networks). Two nodes in a projected network are connected if they share at least one neighbor in the bipartite network. The weights on the links are equal to the number of common neighbors. In engineering and social science applications, the projected unimodal networks derived from the bipartite networks are particularly important and convey meaningful information. For example, in the author-paper bipartite network the coauthorship relationship between different authors provides insights about the collaboration structure in the scientific community. Although the projection of bipartite networks results in some loss of information [9], in many scenarios the projected unimodal networks provide explicit relationships among one type of nodes. In OSS development, the bipartite network consisting of participants and projects can be used to derive the collaboration between different participants working on common projects. The projection from the bipartite network of people working on projects to the unimodal people network is meaningful from a social network analysis perspective. Hence, we also compare the unimodal networks generated from the surrogate bipartite networks with those generated from the original bipartite networks.
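The weighted projection described above can be sketched in a few lines of Python. The function name and the edge-list representation are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def project(edges, side=0):
    """Weighted one-mode projection of a bipartite edge list.

    edges holds (a, b) pairs; side=0 projects onto the first node type,
    side=1 onto the second. The weight of a projected link is the
    number of common neighbors in the bipartite network.
    """
    members = defaultdict(set)
    for e in edges:
        members[e[1 - side]].add(e[side])   # group nodes by shared neighbor
    weights = defaultdict(int)
    for group in members.values():
        for u, v in combinations(sorted(group), 2):
            weights[(u, v)] += 1
    return dict(weights)
```

For example, with people `p1`, `p2`, `p3` and projects `x`, `y`, the links `p1-x`, `p2-x`, `p1-y`, `p2-y`, `p3-y` give `p1-p2` a weight of 2 in the people network. Note that a project with many participants contributes a number of pairs that grows quadratically with its degree, so the projected network can be much denser than the bipartite one.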

The computational complexity of the proposed approach is dependent on the numbers of nodes and links, the number of intervals chosen in the degree distribution plot, the number of divisions of the adjacency matrix in the joint degree distribution approach, and the extent of resizing of the original bipartite network. There is a tradeoff between the resolution and the computational complexity. The resolution of the approach increases with the number of intervals in the degree distribution and the number of divisions in the adjacency matrix. The number of intervals determines the number of variables in the optimization problem. Similarly, the number of divisions of the adjacency matrix affects the joint degree distribution of the resulting network but increases the number of computations. The computational effort decreases as the percentage reduction of the size of the network increases.

In the following section, we illustrate the approach using a bipartite network from OSS development.

IV. CASE STUDY: OPEN-SOURCE SOFTWARE DEVELOPMENT COMMUNITY NETWORK

A. Open-Source Software Development Network: SourceForge.net

The data used in this paper are obtained from a large OSS community, SourceForge.net. SourceForge is the world’s largest OSS development platform, with the largest repository of open-source code and applications available on the Internet [67]. The Sourceforge Research Data Archive (SRDA), located at http://zerlot.cse.nd.edu, is a collection of OSS data and resources [68]. The repository is being used by various researchers from different fields for understanding the structure and dynamics of open source software communities [69]. In this paper, our aim is to illustrate the proposed approach. We obtain the original bipartite network, called the “people-project network,” by querying the SRDA. The two types of nodes in this network are people and projects.

B. Key Features From Original People-Project Network

After obtaining the original people-project network from the database, the following characteristics of the network are extracted: 1) total number of people: 112 969; 2) total number of projects: 63 695; 3) total number of links between projects and people: 154 887; 4) people and project degree distributions, shown in Fig. 2; 5) point densities in people distribution and project distribution, shown in Fig. 3; 6) total number of points in people degree distribution: 24; and 7) total number of points in project degree distribution: 86. All of the data are obtained from SRDA by querying the database in May 2009.

C. Building Surrogate Bipartite Networks of Equal Size: Analysis of Results

Having obtained the key characteristics from the original people-project bipartite network, the three steps in Section III are followed to build a surrogate bipartite network with the same size as the original bipartite network. The first step is the degree distribution reconstruction. The resulting degree distributions for the generated network are compared with the degree distributions of the original network in Fig. 4. As shown in the


Fig. 3. Density of points in people and project distributions. (a) Density of the points in people degree distribution. (b) Density of the points in project degree distribution.

Fig. 4. Comparison between reconstructed degree distributions and original ones. (a) Comparison of original and reconstructed 100% log-log people degree distribution. (b) Comparison of original and reconstructed 100% log-log project degree distribution.

figure, the people and project distributions of the surrogate network are very close to the distributions of the original network.

As mentioned in Section III-C, three different approaches are used to establish links between the nodes with link stubs: 1) node-based random attachment; 2) link stub-based random attachment; and 3) joint degree distribution approach. These approaches do not affect the resulting degree distributions. However, they influence other network characteristics listed in Section II-B. We analyze how these approaches affect other network parameters in both bipartite networks and projected unimodal networks after the degree distributions are explicitly matched.

Fig. 5 displays a comparison of degree correlation between the original bipartite network and surrogate bipartite networks with the same number of nodes. It is observed that in the original network, the average degree of neighbors of people increases with the increase in the degree of people. This implies that people with higher degrees normally link with projects with higher degrees. However, on the project side, there are no significant differences in the average degrees of neighbors among projects with different degrees. In the three surrogate networks, the average degrees of neighbors of both people and projects increase with the increase of their degrees. The trend can be attributed to the manner in which the link stubs are connected. For example, the increase in the average degrees of neighbors with increasing node degree is due to the fact that when nodes are chosen randomly, there is a higher chance of higher degree nodes getting selected earlier. Hence, the nodes with higher degrees are connected with each other first. This results in the positive correlation between the degrees of nodes and their neighbors. In the joint degree distribution approach, the trend is likely due to the imposition of preferential attachment within each subset. The degree correlation is closer to the actual trend for people nodes implying that preferential attachment is an


Fig. 5. Comparison of degree correlation between original and surrogate bipartite networks of identical size.

TABLE I COMPARISON OF KEY FEATURES OF THE NETWORKS OBTAINED BY DIFFERENT LINKING APPROACHES

underlying mechanism for people nodes, which is consistent with the observations in the literature [70], [71].
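The degree correlation discussed above, read here as the average degree of a node's neighbors as a function of the node's own degree, can be computed with a short Python sketch. The function name and the edge-list input are assumptions of this illustration, and the exact definition used for Fig. 5 is the one given in Section II-B.

```python
from collections import defaultdict


def avg_neighbor_degree_by_degree(edges, side=0):
    """Map each degree k to the mean neighbor degree of degree-k nodes.

    edges holds (a, b) pairs; side=0 reports on the first node type
    (e.g., people), side=1 on the second (e.g., projects).
    """
    deg = [defaultdict(int), defaultdict(int)]
    for a, b in edges:
        deg[0][a] += 1
        deg[1][b] += 1
    neighbor_sum = defaultdict(float)
    for e in edges:
        neighbor_sum[e[side]] += deg[1 - side][e[1 - side]]
    per_degree = defaultdict(list)
    for node, total in neighbor_sum.items():
        k = deg[side][node]
        per_degree[k].append(total / k)    # this node's mean neighbor degree
    return {k: sum(v) / len(v) for k, v in sorted(per_degree.items())}
```

An increasing curve indicates that high-degree nodes tend to attach to high-degree neighbors, which is the trend reported for people nodes in the original network.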

The degree centralization and density of the bipartite network and the characteristics of the one-mode people networks are compared in Table I. The parameters are obtained by applying the three approaches for the establishment of links. There is close agreement between the density of the bipartite and the surrogate networks. The degree centralization is of the same order of magnitude but the surrogate models have lower degree centralization than the original network. In the one-mode people network, the original network is a large-scale network with a high clustering coefficient (0.785) and low density (0.00021).

Using the node-based random mapping process, the differences between the original people-people network and the designed network are significant in terms of the diameter and the average shortest path. The difference in the clustering coefficient is small (0.785 in the original network versus 0.758 in the surrogate network). The diameter is 15, as compared to the original diameter of 21. Since the node-based random mapping process lacks link allocation information and therefore does not represent the allocation of links in the network well, the diameter is smaller than the diameter of the original network. In the node-based random attachment process, the allocation of links becomes homogeneous since each pair of nodes has an equal opportunity to get connected. However, in the original network, the allocation of links has significant heterogeneity.

Major improvements are observed in the diameter and the average shortest path by applying the link stub-based random mapping process. The diameter is 17 as compared to 15 in


the node-based random mapping process. The average shortest path is 5.69, which is closer to the original average shortest path as compared to the node-based random mapping process. These improvements demonstrate that the link stub-based random mapping better represents the original network than the node-based mapping.

Finally, the joint degree distribution approach is utilized to establish the links. The approach involves discretizing the degree ranges into subsets and applying link stub-based random mapping within corresponding subsets. Three scenarios are presented. In the first scenario, the degree ranges are divided into two subsets. The results from two subsets are close to those of the link stub-based random mapping process. Hence, in the second scenario, we divide the degree range into four subsets. The results are a slight improvement over the first scenario. In the third scenario, the degree ranges are divided into 16 subsets. In that scenario, the network characteristics are significantly closer to the original network. Further increase in the number of subsets within the degree range would increase the fidelity of the network but at the expense of additional computational cost of the linking process. Comparing the results of the derived one-mode networks with those of Guillaume and Latapy [72], [73], we find that the performance of the method is similar in terms of clustering coefficient and average distance.

The degree correlations of the original and surrogate bipartite networks are of the same order of magnitude, but do not closely match. The difference highlights the fact that matching the degree distributions does not necessarily imply that all other network characteristics also match [74]. The proposed method for generating surrogate networks is established with the goal of matching degree distributions only. We do not claim that the proposed approach will result in matching all other characteristics at the same time. The comparison is presented to determine which of the characteristics (degree correlations for the people nodes, clustering coefficient, diameter, average shortest path, and density) are closer to the original network, and which parameters show departure from the original (e.g., degree correlations for the project nodes, average degree). The result reiterates the assertions from Anderson et al. [75] and Li et al. [74] that the modeling of degree distributions may not relate to all network parameters and properties. Determining a comprehensive list of network parameters that are similar if the degree distributions of two networks are similar is an open research issue.

In Table I, the average degrees of the surrogate networks show some deviations from the original network. In order to determine the reason for this deviation, the network degree distributions for the people-people networks are plotted in Fig. 6. It is observed that in the original one-mode people network, the degree distribution is not a straight line but shows significant scatter. However, the degree distributions of reconstructed and rescaled networks are constructed to closely follow a trend. For the OSS development data, this trend is given by the power-law relations as shown in Fig. 4. The variations around these power-law relations are not taken into consideration in the proposed approach. The variations induce the scatter observed in the degree distribution of the people-people network. The degree distribution of the reconstructed network matches the

Fig. 6. People-people network degree distribution.

original network well for the number of people between [100 10 000] but shows variation in the range [1 10]. Since these variations are not taken into consideration, the degree distribution of the resized people-people network deviates from the original degree distribution of the people-people network. This is a limitation of the proposed approach.

D. Resizing of the Bipartite Network: Analysis of Results

For brevity, the comparison of the networks obtained for different resizing levels is presented using only the joint degree distribution approach. The results using node-based random mapping and link stub-based random mapping are not presented because they are known to be inferior to the joint degree distribution approach. Three surrogate networks are developed with a target reduction to 75%, 50%, and 25% of the size of the original network. The target numbers of nodes (people and projects) for the surrogate networks are shown in Table II. These target numbers of nodes and the degree distributions are used in the optimization problem to obtain the number of nodes and link stubs for each node.

The degree distributions for the networks resized to 75%, 50%, and 25% of the original size are shown in Figs. 7–9, respectively. Based on these figures, it is observed that the degree distributions of the resized networks faithfully replicate the degree distributions of the original network.

Since it was shown in Section IV-C that partitioning the degree range into 16 components resulted in the networks closest to the original network, the same number of partitions is used for network resizing. Fig. 10 presents a comparison of the degree correlation of the resized bipartite networks with 75%, 50%, and 25% sizes of the original bipartite network. Table III presents the key features of surrogate bipartite and one-mode people-people networks from 100% to 25% size using the joint degree distribution approach.

In comparison of bipartite networks, it is observed that the degree correlations of the surrogate networks are closer to the original network when the network size is reduced. The degree centralization shows a similar trend as observed in


TABLE II TARGET NUMBERS OF NODES AND LINKS IN THE RESIZED NETWORK

Fig. 7. Comparison between reconstructed 75% degree distributions and original ones. (a) Comparison of original and reconstructed 75% log-log people degree distribution; (b) comparison of original and reconstructed 75% log-log project degree distribution.

Fig. 8. Comparison between reconstructed 50% degree distributions and original ones. (a) Comparison of original and reconstructed 50% log-log people degree distribution; (b) comparison of original and reconstructed 50% log-log project degree distribution.

Fig. 9. Comparison between reconstructed 25% degree distributions and original ones. (a) Comparison of original and reconstructed 25% log-log people degree distribution; (b) comparison of original and reconstructed 25% log-log project degree distribution.


Fig. 10. Comparison of degree correlation of bipartite network with 75%, 50%, and 25% sizes of the original bipartite network using joint degree distribution approach.

TABLE III KEY FEATURES OF THE SMALLER SURROGATE NETWORKS (100%–25% SIZE) USING JOINT DEGREE DISTRIBUTION APPROACH

the previous section. The reduction of network size increases the degree centralization and brings it closer to the original one. The density of the surrogate networks increases (up to 9.16E-5) as the networks are reduced in size. On comparing the unimodal projected networks, it is observed that the clustering coefficient, diameter, and average shortest path do not change significantly and can be accurately modeled using the joint degree distribution approach. The density of the resized network increases as the size of the network is reduced. The density of the 75% resized network is 0.00022, which is close to the density of the original network (0.00021). As the size of the network is reduced to 50%, the density increases to 0.00029. For the network resized to 25% of the original size, the density increases to 0.00071. When the number of nodes and the number of links are simultaneously decreased by a factor of $k$, the density increases approximately by the same factor (for large networks). This is because in the proposed approach the target numbers of nodes and links are both reduced by the same factor. In order to maintain the same density as the network is resized, the target number of links must be reduced by a factor of $k^2$. While reducing the number of links by $k^2$ will help in bringing the density of the network closer to the original network, it will have an adverse effect on the clustering coefficient, which is based on the local topology of connections. This highlights the fact that when reducing the size of a network, it is not possible to preserve all the characteristics of a network. Hence, an appropriate reduction of the number of links has to be chosen depending on the characteristic (degree or clustering coefficient) that is of interest in a given network.
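The scaling argument above can be written out for the bipartite case. The derivation assumes that the bipartite density is defined as the number of links divided by the number of possible links, $\rho = N_L / (N_A N_B)$; the symbol $\rho$ and this definition are assumptions of this note rather than notation taken from the text.

$$\rho_t = \frac{N_{L,o}/k}{(N_{A,o}/k)\,(N_{B,o}/k)} = k\,\frac{N_{L,o}}{N_{A,o}\,N_{B,o}} = k\,\rho_o, \qquad \frac{N_{L,o}/k^2}{(N_{A,o}/k)\,(N_{B,o}/k)} = \rho_o .$$

With the values reported in Section IV-B, $\rho_o = 154\,887 / (112\,969 \times 63\,695) \approx 2.15 \times 10^{-5}$, so a reduction to 25% ($k = 4$) would give roughly $8.6 \times 10^{-5}$ under this definition, which is of the same order as the reported maximum of 9.16E-5.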

E. Validation of the Approach for Networks With Different Topologies

In this section, we validate the approach using networks with different topologies. Three bipartite networks with different topologies are generated for validation purposes. Networks I, II, and III have Poisson, exponential, and scale-free degree distributions, respectively. Table IV displays the basic parameters of corresponding


TABLE IV BASIC PARAMETERS OF THREE BIPARTITE NETWORKS

Fig. 11. Comparison between surrogate and original networks (three topologies).

networks including the number of type A nodes, the number of type B nodes, the number of links, and the mathematical expressions of corresponding degree distributions on both types of nodes.

Since this section is for validation purposes only, smaller sized networks are generated. Surrogate networks are generated for the three networks using the proposed approach. The degree distributions of the original networks and equal-sized surrogate networks with three different topologies are displayed in Fig. 11. As shown in the figure, the proposed approach can also be used to match the degree distributions of networks with different topologies.

V. CONCLUSION AND POTENTIAL FOR FUTURE WORK

In this paper, we present an approach for constructing surrogate smaller-sized networks to match the degree distribution of large-scale complex bipartite networks. The proposed approach consists of three steps: 1) degree distribution reconstruction; 2) establishment of links between nodes; and 3) deriving


networks with uniform node types. Different ways of linking nodes are presented. The approach is based on the premise that degree distribution is a key characteristic that describes the topology of a network and can be used to design networks with targeted topologies. Furthermore, the following metrics are used to compare the surrogate networks with the original network: 1) for the bipartite networks, the metrics are degree correlation, degree centralization, and density; 2) for unimodal projected networks, the metrics are clustering coefficient, diameter, average shortest path, average degree, and density. It is shown that the approach replicates some of these characteristics of complex networks but not all. The strengths of the proposed approach are that 1) it is general enough for application to a wide range of network topologies, and 2) it can be used to scale down complex bipartite networks while preserving degree distribution. The advantage of using the surrogate models developed through this approach is the reduction in computational cost associated with the analysis and design of processes and algorithms for networks, as discussed in Section I.

The proposed approach is based on the structures of bipartite networks rather than on the evolutionary dynamics. Hence, the approach is suitable for scenarios where the underlying mechanisms of network growth are not important and the focus is only on the structure of the network. This is the case in the problems related to processes and algorithms on existing networks. This is a fundamental difference between our model and other models of bipartite networks available in the literature that are based on the dynamics of network growth (e.g., [72], [73]). If correct mechanisms of network growth are known, existing models can be used to generate networks with given degree distributions. However, there are two limitations of growth-based models: 1) they are based on assumed growth mechanisms, which may not be valid for arbitrary networks, and 2) such models require iterative manual adjustments to find the parameters of the growth mechanisms until the networks are close to the given networks. In contrast, the proposed approach does not assume any growth mechanisms and can closely match the degree distributions of given bipartite networks.

While the focus of this paper is on matching degree distribution only, we also show the impact on seven other network measures. Further research is required to investigate the impact of resizing based on degree distribution on other metrics. This is a challenging task because a large number of diverse metrics can be used for evaluating whether the surrogate model is close to the original network. A detailed list of such metrics can be found in [52]. Additionally, the metrics to use for evaluating a surrogate model depend on the application. In the case of the surrogate models for the Internet, metrics such as expansion, resilience, and distortion are used [26]. As highlighted in Section I-B, the literature on Internet topology generators suggests that surrogate networks based on degree distribution are better at replicating these properties than other models [26]. The approach proposed in this paper is particularly suitable for such applications where the suitability of degree-based network generation is already established.

We envision that different characteristics can be categorized into classes of metrics that have similar requirements and can be handled together. Further research is needed to categorize all the network characteristics that remain invariant under different conditions. In addition to using the proposed approach for design of processes and algorithms, the approach can be used for embodying the topological characteristics of naturally occurring networks into engineered networks. For example, it has been shown [76]–[78] that by following the topology of social networks, distributed databases and peer-to-peer computer networks can be designed to perform fast search. This application needs further exploration.

ACKNOWLEDGMENT

The authors would like to thank Dr. G. Madey for providing access to SRDA.

REFERENCES

[1] N. V. Queipo, R. T. Haftka, W. Shyy, T. Goel, R. Vaidyanathan, andP. K. Tucker, “Surrogate-based analysis and optimization,” Progr. Aerosp.Sci., vol. 41, no. 1, pp. 1–28, 2005.

[2] Z. Qian, C. C. Seepersad, V. R. Joseph, J. K. Allen, and C. F. J. Wu,“Building surrogate models based on detailed and approximate simula-tions,” J. Mech. Des., vol. 128, no. 4, pp. 668–677, 2006.

[3] M. E. J. Newman, “The structure and function of complex networks,”SIAM Rev., vol. 45, no. 2, pp. 167–256, 2003.

[4] Z. Wang, R. J. Thomas, and A. Scaglione, “Generating randomtopology power grids,” in Proc. 41st Hawaii Int. Conf. Syst. Sci., 2008,p. 183.

[5] P. Hines, S. Blumsack, E. C. Sanchez, and C. Barrows, “The topologicaland electrical structure of power grids,” in Proc. 43rd Hawaii Int. Conf.Syst. Sci., Koloa, Kauai, HI, 2010, pp. 1–10.

[6] P. Radoslavov, H. Tangmunarunkit, H. Yu, R. Govindan, S. Shenker, andD. Estrin, “On characterizing network topologies and analyzing theirimpact on protocol design,” Dept. CS, Univ. Southern California, LosAngeles, CA, Rep. 00-731, 2000.

[7] M. E. J. Newman, “Spread of epidemic disease on networks,” Phys.Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 66, no. 1, p. 016128,Jul. 2002.

[8] V. Batagelj and A. Mrvar, “PAJEK—Program for large network analysis,”Connections, vol. 21, pp. 47–57, 1998.

[9] S. P. Borgatti and M. G. Everett, “Network analysis of 2-mode data,”Social Netw., vol. 19, no. 3, pp. 243–269, 1997.

[10] K. Börner, J. T. Maru, and R. L. Goldstone, “The simultaneous evolution of author and paper networks,” Proc. Nat. Acad. Sci., vol. 101, no. Supplement 1, pp. 5266–5273, Apr. 2004.

[11] E. Simpson, “Clustering tags in enterprise and web folksonomies,” in Proc. Int. Conf. Weblogs Social Media, Seattle, WA, 2008, pp. 1–9.

[12] V. Fionda, L. Palopoli, S. Panni, and S. E. Rombo, “GRAPPIN: Bipartite GRAph based protein-protein interaction network similarity search,” in Proc. IEEE Int. Conf. Bioinf. Biomed., 2007, pp. 355–361.

[13] P. G. Lind, M. C. González, and H. J. Herrmann, “Cycles and clustering in bipartite networks,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 72, no. 5, p. 056127, 2005.

[14] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich, Recommender Systems: An Introduction. New York: Cambridge Univ. Press, 2010.

[15] R. Diestel, Graph Theory, 3rd ed. New York: Springer-Verlag, 2005.

[16] P. Erdös and A. Rényi, “On the evolution of random graphs,” Publ. Math. Inst. Hungarian Acad. Sci., vol. 5, pp. 17–61, 1960.

[17] B. Bollobás, Random Graphs, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2001.

[18] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Rev. Modern Phys., vol. 74, no. 1, pp. 47–97, Jan.–Mar. 2002.

[19] S. N. Dorogovtsev and J. F. F. Mendes, “Evolution of networks,” Adv. Phys., vol. 51, no. 4, pp. 1079–1187, 2002.

[20] S. N. Dorogovtsev and J. F. F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW. New York: Oxford Univ. Press, 2003.

[21] A. Medina, A. Lakhina, I. Matta, and J. Byers, “BRITE: An approach to universal topology generation,” in Proc. MASCOTS, Cincinnati, OH, 2001, pp. 346–353.

[22] C. Jin, Q. Chen, and S. Jamin, “Inet: Internet Topology Generator,” EECS Dept., Univ. Michigan, Ann Arbor, MI, Rep. CSE-TR-433-00, 2000.

[23] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.

[24] T. Bu and D. Towsley, “On distinguishing between internet power-law generators,” in Proc. IEEE INFOCOM—21st Annu. Joint Conf. IEEE Comput. Commun. Soc., 2002, pp. 638–647.

[25] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the internet topology,” in Proc. SIGCOMM—Conf. Appl., Technol., Archit., Protocols Comput. Commun., Cambridge, MA, 1999, pp. 251–262.

[26] H. Tangmunarunkit, R. Govindan, S. Jamin, S. Shenker, and W. Willinger, “Network topology generators: Degree-based vs. structural,” in Proc. SIGCOMM—Conf. Appl., Technol., Archit., Protocols Comput. Commun., 2002, pp. 147–159.

[27] W. Aiello, F. Chung, and L. Lu, “A random graph model for massive graphs,” in Proc. 32nd Annu. Symp. Theory Comput., 2000, pp. 171–180.

[28] B. M. Waxman, “Routing of multipoint connections,” J. Sel. Areas Commun., vol. 6, no. 9, pp. 1617–1622, Aug. 2002.

[29] K. Calvert, M. Doar, and E. Zegura, “Modeling internet topology,” IEEE Commun. Mag., vol. 35, no. 6, pp. 160–163, Jun. 1997.

[30] M. Doar, “A better model for generating test networks,” in Proc. IEEE GLOBECOM, London, U.K., 1996, pp. 86–93.

[31] P. Erdös and A. Rényi, “On random graphs,” Publ. Math., vol. 6, pp. 290–297, 1959.

[32] R. Durrett, Random Graph Dynamics. New York: Cambridge Univ. Press, 2007.

[33] D. J. Watts and S. Strogatz, “Collective dynamics of “Small-World” networks,” Nature, vol. 393, no. 6684, pp. 440–442, Jun. 1998.

[34] P. Holme and B. J. Kim, “Growing scale-free networks with tunable clustering,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 65, pt. 2, no. 2, p. 026 107, Feb. 2002.

[35] S. Mossa, M. Barthélémy, H. E. Stanley, and L. A. N. Amaral, “Truncation of power law behavior in “Scale-Free” network models due to information filtering,” Phys. Rev. Lett., vol. 88, no. 13, p. 138 701, Apr. 2002.

[36] E. Ravasz and A.-L. Barabási, “Hierarchical organization in complex networks,” Phys. Rev. E, vol. 67, no. 2, pp. 026 112-1–026 112-7, 2003.

[37] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A.-L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, Aug. 2002.

[38] C. Dangalchev, “Generation models for scale-free networks,” Phys. A, Statist. Mech. Appl., vol. 338, no. 3/4, pp. 659–671, Jul. 2004.

[39] P. L. Krapivsky and S. Redner, “Organization of growing random networks,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 63, no. 6, p. 066123, Jun. 2001.

[40] G. Ghoshal and M. E. J. Newman, “Growing distributed networks with arbitrary degree distributions,” Eur. Phys. J. B—Condens. Matter Complex Syst., vol. 58, no. 2, pp. 175–184, 2007.

[41] M. E. J. Newman, S. H. Strogatz, and D. J. Watts, “Random graphs with arbitrary degree distributions and their applications,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 64, no. 2, pp. 026 118-1–026 118-17, Aug. 2001.

[42] A. Hintze and C. Adami, “Modularity and anti-modularity in networks with arbitrary degree distribution,” Biol. Direct, vol. 5, no. 32, pp. 1–25, 2010.

[43] J. Badham and R. Stocker, “A spatial approach to network generation for three properties: Degree distribution, clustering coefficient and degree assortativity,” J. Artif. Soc. Soc. Simul., vol. 13, no. 1, p. 11, 2010.

[44] F. Peruani, M. Choudhury, A. Mukherjee, and N. Ganguly, “Emergence of a non-scaling degree distribution in bipartite networks: A numerical and analytical study,” Euro. Phys. Lett., vol. 79, no. 2, pp. 28 001-1–28 001-6, Jul. 2007.

[45] J. Skvoretz and K. Faust, “Logit models for affiliation networks,” Soc. Methodology, vol. 29, no. 1, pp. 253–280, 1999.

[46] P. Wang, K. Sharpe, G. L. Robins, and P. E. Pattison, “Exponential random graph (p∗) models for affiliation networks,” Social Netw., vol. 31, no. 1, pp. 12–25, 2009.

[47] J. Koskinen and C. Edling, “Modelling the evolution of a bipartite network—Peer referral in interlocking directorates,” Social Netw., 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0378873310000122

[48] J. H. Jones and M. S. Handcock, “An assessment of preferential attachment as a mechanism for human sexual network formation,” Proc. R. Soc. Lond. B, vol. 270, no. 1520, pp. 1123–1128, Jun. 2003.

[49] M. Latapy, C. Magnien, and N. Vecchio, “Basic notions for the analysis of large two-mode networks,” Social Netw., vol. 30, no. 1, pp. 31–48, Jan. 2008.

[50] P. Mahadevan, D. Krioukov, M. Fomenkov, X. Dimitropoulos, K. C. Claffy, and A. Vahdat, “The internet AS-level topology: Three data sources and one definitive metric,” ACM SIGCOMM Comput. Commun. Rev., vol. 36, no. 1, pp. 17–26, Jan. 2006.

[51] S. Zhou and R. J. Mondragón, “Structural constraints in complex networks,” New J. Phys., vol. 9, no. 6, pp. 173-1–173-11, Jun. 2007.

[52] R. A. Hanneman and M. Riddle, Introduction to Social Network Methods. Riverside, CA: Univ. California Riverside, Jan. 12, 2005.

[53] G. Chartrand and P. Zhang, Introduction to Graph Theory. New York: McGraw Hill, 2005.

[54] M. E. J. Newman, “Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 64, no. 1, p. 016132, Jul. 2001.

[55] J. Kleinberg, “The small-world phenomenon: An algorithm perspective,” in Proc. 32nd Annu. ACM Symp. Theory Comput., Portland, OR, 2000, pp. 163–170.

[56] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge, U.K.: Cambridge Univ. Press, 1994.

[57] R. S. Burt, Toward a Structural Theory of Action: Network Models of Social Structure, Perception and Action. New York: Academic, 1982.

[58] K. J. A. Sprenger and F. N. Stokman, GRADAP: Graph Definition and Analysis Package. Groningen: ProGamma, 1995.

[59] S. P. Borgatti, M. G. Everett, and L. C. Freeman, UCINET for Windows: Software for Social Network Analysis. Harvard, MA: Analytic Technol., 2002.

[60] Network Workbench Tool, NWB Team, 2009, Indiana University, Northeastern University, and University of Michigan.

[61] M. Huisman and M. A. J. van Duijn, “Software for statistical analysis of social networks,” in Proc. 6th Int. Conf. Logic Methodology, Amsterdam, The Netherlands, 2005, pp. 1–21.

[62] L. C. Freeman, “Computer programs and social network analysis,” Connections, vol. 11, pp. 26–31, 1988.

[63] P. J. M. Van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and Applications. Norwell, MA: Kluwer, 1987.

[64] R. Davidson and D. Harel, “Drawing graphs nicely using simulated annealing,” ACM Trans. Graph. (TOG), vol. 15, no. 4, pp. 301–331, Oct. 1996.

[65] M. E. J. Newman, “Mixing patterns in networks,” Phys. Rev. E, vol. 67, no. 2, p. 026126, 2003.

[66] M. E. J. Newman, “Assortative mixing in networks,” Phys. Rev. Lett., vol. 89, no. 20, pp. 208 701-1–208 701-4, 2002.

[67] K. Healy and A. Schussman, “The ecology of open source software development,” presented at the Annu. Meeting American Sociological Association, Atlanta, GA, 2003.

[68] Y. Gao, M. V. Antwerp, S. Christley, and G. Madey, “A research collaboratory for open source software research,” in Proc. 29th Int. Conf. Softw. Eng. Workshops, 2007, p. 124, IEEE Computer Society.

[69] M. Van Antwerp and G. Madey, “Advances in the SourceForge research data archive (SRDA),” in Proc. 4th Int. Conf. WoPDaSD, Milan, Italy, 2008, pp. 1–6.

[70] Y. Gao and G. Madey, “Network analysis of the SourceForge.net community,” in Proc. 3rd Int. Conf. OSS, IFIP WG 2.13, Limerick, Ireland, 2007, pp. 187–200.

[71] J. Xu, S. Christley, and G. Madey, “The open source software community structure,” in Proc. NAACSOS, Notre Dame, IN, 2005, pp. 1–4.

[72] J.-L. Guillaume and M. Latapy, “Bipartite structure of all complex networks,” Inf. Process. Lett., vol. 90, no. 5, pp. 215–221, Jun. 2004.

[73] J.-L. Guillaume and M. Latapy, “Bipartite graphs as models of complex networks,” Phys. A, Statist. Mech. Appl., vol. 371, no. 2, pp. 795–813, Nov. 2006.

[74] L. Li, D. Alderson, R. Tanaka, J. C. Doyle, and W. Willinger, “Towards a theory of scale-free graphs: Definition, properties, and implications,” Internet Math., vol. 2, no. 4, pp. 431–523, 2005.

[75] B. Anderson, C. Butts, and K. Carley, “The interaction of size and density with graph-level indices,” Social Netw., vol. 21, no. 3, pp. 239–267, Jul. 1999.

[76] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, no. 6798, pp. 813–845, Aug. 2000.

[77] D. J. Watts, P. S. Dodds, and M. E. J. Newman, “Identity and search in social networks,” Science, vol. 296, no. 5571, pp. 1302–1305, May 2002.

[78] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman, “Search in power-law networks,” Phys. Rev. E, Statist., Nonlin., Soft Matter Phys., vol. 64, no. 4, pp. 046 135-1–046 135-8, Oct. 2001.

Qize Le received the B.Sc. degree in automation from Xiamen University, Xiamen, China, in 2007. He is currently working toward the Ph.D. degree in mechanical engineering at Washington State University, Pullman.

He is also a Research Assistant in the Collective Systems Laboratory, Washington State University, Pullman. His research interest focuses on the analysis and modeling of the product structure and community structure in open source software and hardware development.

Mr. Le is a Student Member of the American Society of Mechanical Engineers and a member of Tau Beta Phi Engineering Honor Society.

Jitesh H. Panchal received the B.Tech. degree in mechanical engineering from the Indian Institute of Technology at Guwahati, India, in 2000, and the M.S. and Ph.D. degrees in mechanical engineering from the Georgia Institute of Technology, Atlanta, in 2003 and 2005, respectively.

He is currently an Assistant Professor at the School of Mechanical and Materials Engineering at Washington State University, Pullman. He served as a Visiting Assistant Professor at Georgia Institute of Technology, Savannah from 2006 to 2008. He is a coauthor of the book titled Integrated Design of Multiscale, Multifunctional Materials and Products (Burlington, MA: Butterworth-Heinemann, 2009). His research interests are in the field of Engineering Systems Design. Specifically, his current research focus is on collective systems innovation and multilevel systems design.

Dr. Panchal is a member of the American Society of Mechanical Engineers (ASME) and the American Society of Engineering Education. He is a recipient of the National Science Foundation’s CAREER Award and the Young Engineer Award from the ASME Computers and Information in Engineering Division.