
Network graphs using Matlab

Bridie Edwards-Mowforth

The University of Edinburgh

Summer 2016

Page 2: Users/bridieem/Desktop/Summer project/Write up & final ... Projects/Bridie... · 2 Adjacencymatrix The next step was to create an adjacency matrix from this data, meaning looking

Contents

1 Download data
2 Adjacency matrix
3 Plot graph
4 Cut graph
5 Graph function
6 Nicer graph
7 Connected nodes
8 Nodes in loops
9 Count 3 and 4 length loops
10 Count loops of any length
11 Conclusion


1 Download data

The initial purpose of this project was to download data sets from various social media sites to investigate interactions and links between people and their communications online. A .csv file of interactions between YouTube users from December 2008 [1] provided a network large enough to use Matlab as a tool for plotting the network graph and investigating connections and loops, which became the focus of the project.

The specific file used is the contact network between 15,088 YouTube users. It has 3 columns: the first two give the two users in question and the third gives the intensity of their interaction. Disregarding weighted graphs, every entry in the third column can be assumed to be 1. The network is undirected and each interaction is recorded only once. For example, 58,3,1 will not show up if 3,58,1 is already recorded. This data then needed to be read into Matlab and assigned to a variable, A.

filename = '1-edges.csv';
A = csvread(filename);


2 Adjacency matrix

The next step was to create an adjacency matrix from this data: a symmetric square matrix whose rows and columns are indexed by the numbered nodes, with a logical 1 where two nodes are connected and a 0 (logical false) otherwise.

One way to start is to let nodes denote the number of nodes to be considered, then create a temporary square matrix B of size nodes, which is populated with 1's where required to show the connections. Run through the .csv file and, for each row with a 1 in the third column, take the nodes j,k in the first and second columns and enter a 1 in the place in the zeros matrix where they intersect: B(j,k) = 1. Since the .csv file doesn't show mirrored connections separately, this gives a triangular logical matrix which should be reflected and added to itself to create the symmetric adjacency matrix C we want.

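nodes = max(max(A(:,1:2)));     % number of nodes = largest node label (or set by hand)
B = zeros(nodes);
for i = 1:size(A,1)             % one pass per row of the .csv data
    if A(i,3) == 1
        j = A(i,1);
        k = A(i,2);
        B(j,k) = 1;
    end
end
C = double((B + B') > 0);       % reflect and add; unlike triu, keeps edges listed with j > k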
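The loop above can also be written without a for loop. The following is a minimal sketch, not the project's own code: it assumes a small made-up edge list in the same three column format and uses Matlab's sparse function, which is worth considering for the full data set because a full 15,088 by 15,088 matrix of doubles needs roughly 1.8 GB.

```matlab
% Hypothetical 4 node edge list in the same format as the .csv file
A = [1 2 1
     1 3 1
     2 3 1
     3 4 1];

nodes = max(max(A(:,1:2)));              % largest node label
edges = A(A(:,3) == 1, 1:2);             % keep rows with a 1 in column 3

% one entry per edge, then reflect so the matrix is symmetric
S = sparse(edges(:,1), edges(:,2), 1, nodes, nodes);
S = double((S + S') > 0);                % stays sparse; entries are 0 or 1

C = full(S);                             % convert only if a full matrix is needed
```

For the small example the result is the same symmetric 0/1 matrix that the loop produces.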

There is the option of replacing the main diagonal of our adjacency matrix with a vector of ones so that each node is recorded as connected to itself, which can be useful when plotting points with no connections to others on a graph of the data.

C1 = C + diag(ones(1,nodes));

It is now possible to find all the connections to any particular node by extracting the row or column for the relevant node from the matrix and finding the entries which are equal to 1. The degree of a node is found by taking the sum of the row or column for that node.

A much smaller example network, see Figure 1, makes it easier to see how the connections are contained in the adjacency matrix.

A = [1 2 1
     1 3 1
     1 7 1
     1 6 1
     2 3 1
     2 10 1
     5 6 1
     5 8 1
     6 7 1
     6 9 1
     8 9 1];


C =
     0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0

Figure 1: 10 node example network.
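As a short illustration of reading connections and degrees from the adjacency matrix, the sketch below applies find and sum to the 10 node example. The variable names neighbours6, maxDeg, p and isolated are chosen here for illustration only.

```matlab
% Adjacency matrix of the 10 node example network
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];

neighbours6 = find(C(6,:));     % nodes connected to node 6: 1, 5, 7 and 9
deg = sum(C,2);                 % degree of every node (row sums)
[maxDeg, p] = max(deg);         % highest degree and the first node attaining it
isolated = find(deg == 0);      % nodes with no connections at all
```

Here nodes 1 and 6 both have degree 4, and max returns the first of them; node 4 is the only isolated node.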


3 Plot graph

It is now possible to plot the data as a network graph using the gplot function in Matlab. This requires the adjacency matrix and a set of (for now arbitrary) coordinates assigned to the nodes.

A good starting point, just for the sake of visualising the data, was to arrange the nodes in a grid as close to square as possible; Youtube_1_graph.m generates this graph. The side of the square grid is taken to be the floor of the square root of the number of nodes, and the remaining nodes are then added on in an extra column. meshgrid is helpful for generating the list of coordinates coord of all the points in the grid.

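co = floor(sqrt(nodes));                        % side length of the square part of the grid
[Z,W] = meshgrid(linspace(0,co-1,co),linspace(0,co-1,co));
nRem = length(C) - co^2;                        % leftover nodes (renamed to avoid shadowing the built-in rem)
coord = [Z(:) W(:); ones(nRem,1)*co (0:nRem-1)'];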

This works well for checking that the data is being plotted correctly and gives some order, as seen in Figure 2, which plots the 10 nodes from above. It works reasonably well with small examples. However, for more nodes, see Figure 3, it would be much more helpful to be able to pull the coordinates of the nodes around so that features of the graph, such as loops and connected and separate sets of nodes, are more discernible.
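The grid construction places all leftover nodes in a single extra column, which can end up taller than the grid itself: for 15,088 nodes the square part has side 122 and 204 nodes are left over. One possible alternative, sketched here as an assumption rather than as the project's method, is to take the ceiling of the square root as the side and wrap the node index around it, so every node falls inside the square.

```matlab
nodes = 15088;
side = ceil(sqrt(nodes));            % smallest square grid that holds every node
idx = (0:nodes-1)';                  % zero-based node index

% column number is the integer part of idx/side, row number the remainder
coord = [floor(idx/side), mod(idx,side)];
```

The resulting coord has one row per node and can be passed to gplot in the same way as before.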

Figure 2: 10 nodes set to roughly a square grid of coordinates


Figure 3: 15088 nodes set to roughly a square grid of coordinates


4 Cut graph

It can be useful to look at a subset of the data, as seen in Figure 4. This shows the connections in a small group more clearly, which helps in working out how to arrange the nodes so that the graph has more intuitive visual meaning. This is the purpose of Youtube_1_graph_cut.m.

By setting maxNode to the greatest number of nodes we want to consider, say 100, we can look at the first 100 nodes and their connections but disregard all connections involving nodes which fall outside of the first 100.

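maxNode = 100;
mask = ( A(:,1) <= maxNode ) & ( A(:,2) <= maxNode );   % <= so that node maxNode itself is included
A = A(mask,:);                                          % keep only edges within the first maxNode nodes
nodes = maxNode;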
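If the full adjacency matrix has already been built, the same cut can be made by taking its leading block instead of masking the edge list. The sketch below checks on the 10 node example that the two routes agree; the names A_cut and C_cut and the choice maxNode = 5 are for illustration only.

```matlab
A = [1 2 1; 1 3 1; 1 7 1; 1 6 1; 2 3 1; 2 10 1;
     5 6 1; 5 8 1; 6 7 1; 6 9 1; 8 9 1];
nodes = 10;

% full symmetric adjacency matrix
C = zeros(nodes);
for i = 1:size(A,1)
    C(A(i,1),A(i,2)) = 1;
    C(A(i,2),A(i,1)) = 1;
end

maxNode = 5;

% route 1: mask the edge list
mask = (A(:,1) <= maxNode) & (A(:,2) <= maxNode);
A_cut = A(mask,:);

% route 2: take the leading maxNode-by-maxNode block of C
C_cut = C(1:maxNode,1:maxNode);
```

Both routes keep the three edges 1-2, 1-3 and 2-3 and drop every edge that reaches a node beyond the fifth.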

Figure 4: First 100 nodes set to grid coordinates.


5 Graph function

Matlab has a graph function which only requires an adjacency matrix as input. Plotting the resulting graph object positions the nodes automatically, with coordinates that give few crossings of lines and clearly show the features of the graph. Compare Figure 5, which uses plot(graph(C)), with Figure 4.
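As a brief sketch of what the graph object offers beyond plotting, the lines below build it for the 10 node example, request a force-directed layout explicitly, and read back the coordinates Matlab chose. This assumes a Matlab release that includes graph (R2015b or later); the name coordAuto is made up for this example.

```matlab
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];

G = graph(C);                             % undirected graph; C must be symmetric
h = plot(G,'Layout','force');             % force-directed node placement
coordAuto = [h.XData(:) h.YData(:)];      % coordinates chosen by Matlab, one row per node

deg = degree(G);                          % same values as sum(C,2)
bins = conncomp(G);                       % label of the connected set each node belongs to
```

The coordinates in coordAuto could be passed to gplot for comparison with the grid layout.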

Figure 5: First 100 nodes.


6 Nicer graph

With a smaller graph such as Figure 4, we can see where the appearance could be improved when grid coordinates are used with gplot. There appear to be unnecessary crossovers of connecting edges, and certain groups of heavily connected nodes could be arranged closer together to show more clearly their connections to each other and their separation from other nodes. The aim of Youtube_1_graph_cut_nice_trial.m is to improve on some of these issues.

One way to go about this is to think of unconnected nodes repelling each other while connected nodes are attracted to one another. This effectively uses spring systems and electrical forces to calculate how far, and in what direction, each node should be moved. The strategy explained in section 12.2 of [2] is a fairly straightforward way of approaching the task: the pseudocode at the end of that section was fitted to the code used so far with gplot to create a new set of coordinates newcoord that better suits the network connections. It certainly isn't a perfect solution and is a long way off the result of the graph function; compare Figure 6 to Figure 7. However, it produces more intuitive graphs for some maxNode than grid coordinates do, and the suggested constants c1,c2,c3,c4 can be altered to pull the nodes around further.


c1 = 2; c2 = 1; c3 = 1; c4 = 0.1;
% c4 determines space between nodes of degree 0 and nodes with at least one
% connection
newcoord = coord;
t = 600;                                   % how many times to run the loop
for l = 1:t
    for m = 1:nodes
        for n = 1:nodes
            if C(m,n) == 1
                % there is a connecting edge: vector from node n to node m
                datt = newcoord(m,:) - newcoord(n,:);
                d = norm(datt);            % distance between connected nodes (to be attracted)
                if d > 0                   % log(0) is undefined
                    Fatt = c1*log(d/c2)*datt;                   % attractive force
                    newcoord(n,:) = newcoord(n,:) + c4*Fatt;    % move n towards m when d > c2
                end
            elseif C(m,n) == 0
                % no connecting edge: vector from node n to node m
                drep = newcoord(m,:) - newcoord(n,:);
                d = norm(drep);            % distance between unconnected nodes (to be repelled)
                if d > 0
                    Frep = c3/(d.^2)*drep;                      % repulsive force
                    % drep points from n towards m, so subtract to push n away from m
                    newcoord(n,:) = newcoord(n,:) - c4*Frep;
                end
            end
        end
    end
end
gplot(C,newcoord,'-r*')


Figure 6: 100 nodes, zoomed on most connected set, graph function.


Figure 7: 100 nodes, zoomed on most connected set, gplot.


7 Connected nodes

When looking at the graph with manipulated coordinates, for some maxNode there are multiple sets of connected nodes which are entirely unconnected to the others, and it might be useful to isolate these. One option is to focus on the node with the highest degree, in the hope of looking at the set of nodes linked by the most connections. The following are two ways of doing this.

The first option, in Youtube_connected_nodes.m, is to identify this node and then consider all the nodes connected to it. Find the nodes connected to a specific node by extracting the associated row or column from the adjacency matrix and finding where the 1's are. Then go through all of those nodes and find their connections, repeating this step while at least one new connection is found. The method builds up a three column matrix in the same format as the original .csv file: at each stage the nodes from the second column become nodes in the first column, so that their connected nodes are considered in turn, but only if that node does not already appear in column 1. This matrix is then used to create a new adjacency matrix involving only the nodes connected to the node of greatest degree. The code can be altered to look at the nodes connected to any node by setting p (the node of maximum degree in the code) to the desired node.

The second approach considered was to choose a node p whose connections we want to look at (again the node of maximum degree is used, but this can easily be changed) and, starting with the original adjacency matrix, set any rows and columns without a 1 showing a connection to node p to vectors of zeros. Note that row and column p must be explicitly excluded from this, as the adjacency matrix we are working with doesn't record p as connected to itself. Zeroing is done first, rather than deleting rows and columns straight away, so that the adjacency matrix stays a symmetric square. All the rows and columns which have been set to zeros are then deleted to produce a reduced adjacency matrix, which we can plot to show only a specific collection of connected nodes. The code in Youtube_connected_nodes_simple.m only goes as far as this first iteration; ideally it would continue to delete rows and columns not connected to any of the nodes connected to p, and so on, while there are still rows and columns to delete.
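The repeated step described in both approaches, following connections outwards from p until nothing new is reached, can be sketched as a breadth-first search on the adjacency matrix. This is an illustrative sketch on the 10 node example and not the code in either project file; the names inSet, frontier, keep and Cp are assumptions made here.

```matlab
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];
nodes = size(C,1);

[~,p] = max(sum(C,2));                    % node of maximum degree (first one if tied)

inSet = false(1,nodes);                   % which nodes have been reached so far
inSet(p) = true;
frontier = p;                             % nodes reached in the latest step
while ~isempty(frontier)
    nbrs = find(any(C(frontier,:),1));    % every node connected to the frontier
    frontier = nbrs(~inSet(nbrs));        % keep only nodes not reached before
    inSet(frontier) = true;
end

keep = find(inSet);                       % original labels of the connected set
Cp = C(keep,keep);                        % reduced adjacency matrix for that set
```

For the example, the set connected to node 1 contains every node except the isolated node 4. Keeping keep alongside Cp preserves the original node labels after rows and columns are removed.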


8 Nodes in loops

As a way to approach finding the loops in the set of data, the adjacency matrix can first be reduced so that it only includes nodes which could be involved in loops. If a node is included in any loop then it must have degree at least 2, so while there exists any node of degree less than or equal to 1, we delete the corresponding rows and columns from the adjacency matrix. It is necessary to recalculate the degree of each node after each stage of reducing the matrix, since deleting a node from the matrix may have reduced the degree of another node to less than 2. Eventually the code leaves an adjacency matrix of nodes which all have degree greater than or equal to 2. Every node that lies on a loop is retained, although a node on a path joining two loops also survives without being in a loop itself.

deg = sum(C,2);
while any(deg <= 1)
    % delete rows and columns if the degree of that node <= 1
    C(:,deg <= 1) = [];
    C(deg <= 1,:) = [];
    deg = sum(C,2);          % recalculate the degrees for the reduced matrix
end

Now, to perform further tasks involving loops, it isn't necessary to consider the much larger initial adjacency matrix involving all the nodes (up to maxNode). The function file create_C_loops.m creates this reduced adjacency matrix of nodes in loops for the set of nodes up to the desired maxNode, which is the input for the function.
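Deleting rows and columns renumbers the remaining nodes, so it can help to carry a vector of the original labels through the reduction. The following sketch does this for the 10 node example; the vector labels is an addition made here and is not part of create_C_loops.m.

```matlab
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];

labels = 1:size(C,1);            % original node numbers
deg = sum(C,2);
while any(deg <= 1)
    keep = (deg > 1)';           % logical row vector of nodes to keep
    C = C(keep,keep);            % drop rows and columns together
    labels = labels(keep);       % keep the labels in step with the matrix
    deg = sum(C,2);              % degrees change after each deletion
end
```

In the example the isolated node 4 and the degree 1 node 10 are removed, leaving nodes 1, 2, 3, 5, 6, 7, 8 and 9.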


9 Count 3 and 4 length loops

To identify a loop of length 3, start with the adjacency matrix reduced to just the nodes in loops. One approach, seen in Youtube_loops.m, is to choose one node $A$ of degree $n$ and look at all the nodes connected to it, $A_1, A_2, \dots, A_n$. The next step is to choose one of these nodes $A_i$ and test whether it is connected to any $A_j$ with $i \neq j$. If there is a connection, then we have a loop of length 3 involving nodes $A$, $A_i$ and $A_j$.

The code runs through the reduced adjacency matrix, taking each node $A, B, \dots$ to be the starting node in turn. For node $A$ we find the vector of nodes connected to it (the positions of the ones in row $A$), $A_1, A_2, \dots, A_n$. This is then used to make a two column matrix of all possible pairs from the list of nodes connected to $A$ (using meshgrid, as when generating grid coordinates). To see whether we have a loop, we check whether the $(A_i, A_j)$ entry in the adjacency matrix is a 1; if we have this connection then the nodes $A, A_i, A_j$ form a loop. It is necessary to sort these three nodes into ascending order (or by some other consistent rule) for comparison with previously found loops. If the loop is unique, it is added to a list of loops of length 3. The length of this list of unique loops gives the number of loops of length 3 in the data we are considering.

% find loops of 3 nodes
three_loop = [0 0 0];                      % placeholder row to start the list
for i = 1:length(C)
    % choose a node
    connected_3 = find(C(i,:));
    % have to check all possible pairs within the vector connected_3
    [p3,q3] = meshgrid(connected_3, connected_3);
    pairs_3 = [p3(:) q3(:)];
    for i1 = 1:size(pairs_3,1)
        if C(pairs_3(i1,1),pairs_3(i1,2)) == 1
            % then i and this pair connected to i are a loop of 3 nodes
            new_loop = sort([i pairs_3(i1,1) pairs_3(i1,2)]);
            if ~ismember(new_loop,three_loop,'rows')
                % only add the loop if it has not been found before
                three_loop = [three_loop;new_loop];
            end
        end
    end
end
amount_of_three_loops = size(three_loop,1) - 1
% -1 because the first row was just set to zeros as a placeholder;
% size(...,1) counts rows, whereas length would return 3 for fewer than 3 rows

This idea can be extended to find loops of length 4 by adding an extra stage: find the nodes connected to each $A_i$, then compare each of these with the set of connections of the starting node $A$ to see whether they are connected and so form a loop.

Continuing with this strategy to find loops of length $n$ for some $n \in \mathbb{N}$ would require a lot of nested for loops, which is not ideal and quickly becomes very inefficient.
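For lengths 3 and 4 specifically, the counts can be cross-checked without any loops by using powers of the adjacency matrix, since entry (i,i) of C^k is the number of closed walks of length k from node i. This is a standard identity rather than the method used in the project files: each loop of length 3 gives 6 closed walks, and for length 4 the walks that retrace an edge, 2*sum(deg.^2) - sum(deg) of them, are subtracted before dividing by 8. The names n3 and n4 are chosen for this sketch.

```matlab
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];

deg = sum(C,2);

% loops of length 3: each is counted 3 starts x 2 directions = 6 times
n3 = trace(C^3)/6;

% loops of length 4: remove closed walks that go back along an edge,
% then divide by 4 starts x 2 directions = 8
n4 = (trace(C^4) - 2*sum(deg.^2) + sum(deg))/8;
```

In the 10 node example the loops of length 3 are 1-2-3 and 1-6-7, and the single loop of length 4 is 5-6-9-8.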


10 Count loops of any length

In Youtube_loops_fx.m the strategy is to take a point $v_n$, the last node in a path $v_1, v_2, \dots, v_n$, and find all the nodes connected to it, denoted $w_1, w_2, \dots, w_m$. These give rise to all the possible ways to extend the path, listed in a matrix $V$.

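$$
V = \begin{pmatrix}
v_1 & v_1 & v_1 & \cdots & v_1 \\
v_2 & v_2 & v_2 & \cdots & v_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_n & v_n & v_n & \cdots & v_n \\
w_1 & w_2 & w_3 & \cdots & w_m
\end{pmatrix}
$$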

The function add_connected.m creates this matrix from an input column vector of the current path, $(v_1, v_2, \dots, v_n)^T$.

The next stage considers all the possible new paths there would be if each column of $V$ were taken and treated as the input to add_connected.m. This is done in add_connected_all.m, which effectively runs add_connected.m repeatedly, using a for loop through the columns of $V$. However, this leads to multiple matrices. To use this to count loops of length n, it is necessary to be able to loop this code n times, so the input needs to be of the same format as the output. In order to have a single output when generating multiple matrices, they are joined together in a cell array. The output of add_connected.m is therefore made a cell array with only one element, $V$, so that only cell arrays go in and out. This means that it is possible to run the code n times, each time concatenating cell arrays, so the output quickly becomes larger, as would be expected for a tree that branches at each node, where all the connections at each node have to be considered.


The function add_connected.m:

function M = add_connected(v,maxNode)
% v is the current path as a column vector (a single value if the path is
% just the starting node v1)
% the function requires C, so define C first
C = create_C_loops(maxNode);
w = find(C(v(size(v,1)),:));               % nodes connected to the last node in the path v
v_matrix = [repmat(v,1,size(w,2));w];      % each column is a new path with length(v)+1 nodes
n = size(v_matrix,1);                      % input vector v is of length n-1, w is effectively row n
if n > 2
    nCols = size(v_matrix,2);
    toDelete = false(nCols,1);
    for iCol = 1:nCols
        % does the new node already appear in rows 2 to n-2 of this path?
        visited = any(v_matrix(n,iCol) == v_matrix(2:n-2,iCol));
        if visited
            toDelete(iCol) = true;
        end
    end
    v_matrix(:,toDelete) = [];
end
if isempty(v_matrix)
    M = {};
else
    M = {v_matrix};
end
end


and add_connected_all.m:

function M_new = add_connected_all(M,maxNode)
% input M is a cell array, as is output
if isempty(M)
    M_new = {};
else
    M_new = {[]};                           % just so we have something to start concatenating with
    for i1 = 1:size(M,2)                    % go through each matrix m in the cell array M
        m = M{i1};
        cell_array = {zeros(size(m,1) + 1)};    % placeholder to start concatenating with
        for i = 1:size(m,2)                 % go through each column in each matrix m
            cell_array = [cell_array,add_connected(m(:,i),maxNode)];    % concatenate cell arrays
        end
        cell_array(1) = [];                 % deletes the first all zero entry
        M_new = [M_new,cell_array];
    end
end
end

The result is a large array where every column is a path of the required length n, each of which needs to be tested to see whether it is a loop. This is done by checking whether the last element of the column, one of $w_1, \dots, w_m$, matches the first element $v_1$; if so, there is a loop. However, any 'loops' in which an element is repeated within the column vector, excluding the first and last elements, must be removed. It is also necessary to take into account that every different traversal of the same loop will be found in this way. A loop of n nodes can be rotated (started from any of its n nodes) and reflected (traversed in either direction) while preserving its connections; these symmetries form the dihedral group $D_n$, which has 2n elements. Hence each loop is found 2n times, and the number of loops found should be divided by 2n.
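The path extension idea and the division by 2n can be checked on the 10 node example with a compact, self-contained sketch. It is a simplified variant and not Youtube_loops_fx.m: paths are held in one plain matrix P rather than in cell arrays, and a node is never revisited while a path is extended, so no repeated nodes need to be filtered out afterwards. The names P, Pnew, lengths and counts are assumptions made for this example.

```matlab
C = [0 1 1 0 0 1 1 0 0 0
     1 0 1 0 0 0 0 0 0 1
     1 1 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 1 0 1 0 0
     1 0 0 0 1 0 1 0 1 0
     1 0 0 0 0 1 0 0 0 0
     0 0 0 0 1 0 0 0 1 0
     0 0 0 0 0 1 0 1 0 0
     0 1 0 0 0 0 0 0 0 0];
N = size(C,1);

lengths = 3:5;                            % loop lengths to count
counts = zeros(size(lengths));
for a = 1:numel(lengths)
    n = lengths(a);
    P = 1:N;                              % each column is a path; start from every node
    for step = 2:n
        Pnew = zeros(step,0);
        for k = 1:size(P,2)
            v = P(:,k);
            w = find(C(v(end),:));        % nodes connected to the last node of the path
            w = w(~ismember(w,v));        % never revisit a node already on the path
            Pnew = [Pnew, [repmat(v,1,numel(w)); w]];
        end
        P = Pnew;
    end
    closed = 0;
    for k = 1:size(P,2)
        closed = closed + C(P(n,k),P(1,k));   % 1 if the last node connects back to the first
    end
    counts(a) = closed/(2*n);             % each loop is found 2n times
end
```

For the example this gives two loops of length 3, one of length 4 and none of length 5.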

The function Youtube_loops_fx.m counts the number of loops of size n, one of the required inputs, by increasing the value of no_of_lps by 1 with each loop found. Setting maxNode to 60 gives an example where it is relatively simple to look at the graph in Figure 8 and check that the code returns the correct number of loops of various sizes, see below. The graph is drawn using the graph function built into Matlab, as this makes it easier to see the loops clearly.


>> Youtube_loops_fx(3,60)
ans =
For the first 60 nodes, the number of loops of length 3 is 3.

>> Youtube_loops_fx(4,60)
ans =
For the first 60 nodes, the number of loops of length 4 is 2.

>> Youtube_loops_fx(5,60)
ans =
For the first 60 nodes, the number of loops of length 5 is 2.

>> Youtube_loops_fx(6,60)
ans =
For the first 60 nodes, the number of loops of length 6 is 2.

>> Youtube_loops_fx(7,60)
ans =
For the first 60 nodes, the number of loops of length 7 is 2.

>> Youtube_loops_fx(8,60)
ans =
For the first 60 nodes, the number of loops of length 8 is 1.

>> Youtube_loops_fx(9,60)
ans =
For the first 60 nodes, the number of loops of length 9 is 0.


Figure 8: Loops in first 60 nodes of data.


11 Conclusion

The initial aim of this project was to look into some theories and concepts within the field of graph theory, using Matlab to illustrate them with some data sets. I quickly found that the Matlab side would be enough of a challenge on its own, and so this became my focus. Prior to undertaking this project I had seen a little programming in university courses: a basic introduction to Python, and a start on working with Matlab in Facets of Mathematics and Computing and Numerics. In this project, my attempts to understand the graph function and create my own version led me to go beyond the second year undergraduate curriculum, with more complex methods including recursive programming, where the solution to a problem depends on solutions to smaller instances of the same problem. I have enjoyed improving my capabilities with Matlab and it has been a valuable experience, giving me the confidence to feel comfortable tackling other languages, such as Maple and Mosel, in subsequent courses.


References

[1] Arizona State University, Dataset: YouTube, http://socialcomputing.asu.edu/datasets/YouTube

[2] Stephen G. Kobourov, University of Arizona, Force-Directed Drawing Algorithms: 12.2, Spring Systems and Electrical Forces, http://cs.brown.edu/~rt/gdhandbook/chapters/force-directed.pdf
