Ego-network analysis with R


Upload: rvacca

Post on 13-Dec-2015


DESCRIPTION

Sample materials for Ego-network analysis with R

TRANSCRIPT

Page 1: Ego-network analysis with R

1 Data and R code

This handout shows and discusses several pieces of R code. You should have downloaded R_intro.R, which includes all the code shown in these pages. You can access and run the code by opening R_intro.R in your R GUI (e.g. RStudio).

This code uses real-world data collected in 2012 with a personal network survey among 102 Sri Lankan immigrants in Milan, Italy. The data files include ego-level data (sex, age, educational level, etc. for each Sri Lankan respondent), alter attributes (alter’s nationality, country of residence, emotional closeness to ego etc.), and information on alter-alter ties in the form of adjacency matrices. Each personal network has a fixed size of 45 alters. The relevant data files are all located in the Data folder that you downloaded. The data files used in this document (and in R_intro.R) are the following:

• ego_data_trimmed.csv: These are the first few rows of an ego-level csv data file.

• adj_28.csv: This is the adjacency matrix for ego ID 28’s personal network.

• elist_28.csv: The edge list for ego ID 28’s personal network.

• alter.data_28.csv: The alter attributes in ego ID 28’s personal network.

Information about data variables and categories is available in Codebook.xlsx, also located in the Data folder.

NOTE: Before running the R code in R_intro.R, you should make sure that the Data folder with all the data files listed above is in your current working directory (use getwd() and setwd() to check/set your working directory – see examples below). Also make sure to run the code in R_intro.R line by line: if you skip one or more lines, the following lines may return errors.

2 Starting R

Upon opening R, you normally want to do two things:

• Check and/or set your current working directory: R will look for files and save files in this directory by default.

– E.g. to set your working directory to “/Users/Mario/Documents/Rworkshop”, just run setwd("/Users/Mario/Documents/Rworkshop").

– Windows users: Note that you have to input the directory path with forward slashes (or double backslashes), not with single backslashes as in a typical Windows path. I.e. setwd("C:/Users/Mario/Documents/Rworkshop") or setwd("C:\\Users\\Mario\\Documents\\Rworkshop") will work; setwd("C:\Users\Mario\Documents\Rworkshop") won’t work.
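The working-directory steps above can be tried out with the short sketch below. It is only an illustration: it uses R's temporary directory (tempdir()) as a stand-in for a real project folder, and the object names old.wd, project.dir, new.wd and win.style are made up for this example.

```r
# Remember where we are, so that we can go back at the end.
old.wd <- getwd()

# A stand-in for your project folder (assumption: we use R's temporary
# directory so that the example runs on any machine).
project.dir <- tempdir()

# Set the working directory, then check that it changed.
setwd(project.dir)
new.wd <- getwd()
new.wd

# file.path() joins path pieces with forward slashes on every platform.
file.path("Data", "elist_28.csv")

# normalizePath() with winslash="/" returns a path without backslashes,
# which is safe to paste into setwd() on Windows too.
win.style <- normalizePath(project.dir, winslash = "/")
win.style

# Go back to the original working directory.
setwd(old.wd)
```

Building paths with file.path() avoids typing slashes by hand, which is where most Windows path errors come from.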


Page 2: Ego-network analysis with R

3.2 Vector and matrix objects

• Vectors are the most basic objects you use in R. Vectors can be numeric (numerical data), logical (TRUE/FALSE data), character (string data).

• The basic function to create a vector is c() (concatenate).

• Other useful functions to create vectors: rep() and seq(). Also keep in mind the “:” shortcut: c(1, 2, 3, 4) is the same as 1:4.

• The length (number of elements) is a basic property of vectors (length()).

• When we print() vectors, the numbers in square brackets indicate the positions of vector elements.

• To create a matrix: matrix(). Its main arguments are: the cell values (within c()), number of rows (nrow) and number of columns (ncol). Values are arranged in an nrow x ncol matrix by column. See ?matrix.

• When we print() matrices, the numbers in square brackets indicate the row and column numbers.

# Let's create a simple vector.
x <- c(1, 2, 3, 4)

# Display it.
x

## [1] 1 2 3 4

# Shortcut for the same thing.
y <- 1:4

y

## [1] 1 2 3 4

# What's the length of x?
length(x)

## [1] 4

# The function rep() replicates values into a vector.
rep(1, times= 10)

## [1] 1 1 1 1 1 1 1 1 1 1


Page 3: Ego-network analysis with R

# (NOTE that we didn't assign the vector above, it was just
# printed and lost).

# Also vectors themselves can be repeated.
x

## [1] 1 2 3 4

rep(x, times= 10)

##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
## [29] 1 2 3 4 1 2 3 4 1 2 3 4

# Note the difference between "times" and "each" in rep().
rep(x, each= 10)

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
## [29] 3 3 4 4 4 4 4 4 4 4 4 4

# Repeat x's values the exact amount of times that gives a vector
# of length 10.
rep(x, length.out= 10)

## [1] 1 2 3 4 1 2 3 4 1 2

# seq() is another very useful function to create vectors.
seq(from=1, to= 10, by= 2)

## [1] 1 3 5 7 9

# Note that the following vectors are the same.
1:10

## [1] 1 2 3 4 5 6 7 8 9 10

seq(from= 1, to= 10, by= 1)

## [1] 1 2 3 4 5 6 7 8 9 10

# A vector of 30 elements, whose values go from 1 to 100
seq(1, 100, length.out=30)

##  [1]   1.000000   4.413793   7.827586  11.241379  14.655172
##  [6]  18.068966  21.482759  24.896552  28.310345  31.724138
## [11]  35.137931  38.551724  41.965517  45.379310  48.793103
## [16]  52.206897  55.620690  59.034483  62.448276  65.862069
## [21]  69.275862  72.689655  76.103448  79.517241  82.931034
## [26]  86.344828  89.758621  93.172414  96.586207 100.000000
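The matrix() bullets in section 3.2 can be illustrated with a minimal sketch. The object names m and m.byrow are invented for this example; byrow is a standard matrix() argument that is not mentioned in the text above.

```r
# Create a 2 x 3 matrix. Values are filled in by column: the first two
# values go to column 1, the next two to column 2, and so on.
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# Display it. The [1,] and [,1] labels are row and column numbers.
m

# Number of rows and columns.
dim(m)

# The element in row 1, column 2.
m[1, 2]

# With byrow = TRUE, the same values are arranged by row instead.
m.byrow <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
m.byrow[1, 2]
```

Comparing m[1, 2] with m.byrow[1, 2] is a quick way to check which filling order was used.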


Page 4: Ego-network analysis with R

## [1] "list"

# How many rows?
nrow(airquality)

## [1] 153

# How many columns?
length(airquality)

## [1] 6

# Variable names.
names(airquality)

## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"
## [6] "Day"

# What kind of variables are those?
str(airquality)

## 'data.frame': 153 obs. of 6 variables:
##  $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int 5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...

7.3 Indexing and subsetting data frames

• Matrix notation. Data frames can be indexed like a matrix, with the [ , ] notation: df[2,3], df[2, ], df[ ,3].

• List notation. Because data frames are lists, anything that applies to lists applies to data frames too. So just like any list, a data frame can be indexed with the [ ] (no comma), the [[ ]], or the $ notation (see section 7.1).

• When you extract a data frame’s column (or row), be aware of the difference between extracting the column itself as a vector versus extracting another data frame of one column. df[[i]], df$column_name and df[,i] (with the comma) will return the element (variable) itself, which may be a numeric vector, a character vector, a factor etc. In contrast, df[i] will return another data frame.
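The indexing rules above can be checked on the built-in airquality data frame used earlier in this section. This is a minimal sketch; the object names (df, col.vec, one.col.df, etc.) are chosen just for this example.

```r
# Work on a copy of the built-in data frame.
df <- airquality

# Matrix notation: one cell, one row, one column.
cell <- df[2, 3]       # row 2, column 3 (the Wind variable)
row2 <- df[2, ]        # a whole row: still a data frame
col.vec <- df[, 3]     # a whole column: a numeric vector

# List notation: both of these return the column itself, as a vector.
same.vec1 <- df[[3]]
same.vec2 <- df$Wind

# Single brackets with no comma return a data frame of one column.
one.col.df <- df[3]

# Compare the classes of the two kinds of result.
class(col.vec)
class(one.col.df)

# drop = FALSE keeps the data frame structure with matrix notation.
still.df <- df[, 3, drop = FALSE]
class(still.df)
```

Which form you need depends on what comes next: functions like mean() expect the vector, while functions that work on data frames expect the one-column data frame.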


Page 5: Ego-network analysis with R

[Figure: plot of the graph gr, with vertices labelled mark, paul, anna, theo, kelsie and mario.]

# Print the graph (summary info).
gr

## IGRAPH DN-- 6 6 --
## + attr: name (v/c)

# The graph is Directed, Named. It has 6 vertices and 6 edges. It
# has a vertex attribute called "name".

# Get a graph from an external adjacency matrix.
## Read in the adjacency matrix. This is a personal network
## adjacency matrix.
adj <- as.matrix(read.csv("./Data/Alter_ties/adj_28.csv",
                          row.names=1))
head(adj)

##      X2801 X2802 X2803 X2804 X2805 X2806 X2807 X2808 X2809
## 2801    NA     1     1     1     1     1     1     1     1
## 2802     1    NA     1     1     1     1     1     1     1
## 2803     1     1    NA     1     1     1     1     1     1
## 2804     1     1     1    NA     1     1     1     1     1
## 2805     1     1     1     1    NA     1     1     1     1
## 2806     1     1     1     1     1    NA     1     1     1
##      X2810 X2811 X2812 X2813 X2814 X2815 X2816 X2817 X2818
## 2801     1     1     1     1     1     1     0     0     2
## 2802     1     1     1     1     0     1     0     0     0
## 2803     1     1     1     0     0     0     0     0     0
## 2804     1     2     0     0     0     0     0     0     0
## 2805     1     1     0     0     0     0     0     0     0
## 2806     1     1     1     0     0     0     0     0     0
##      X2819 X2820 X2821 X2822 X2823 X2824 X2825 X2826 X2827


Page 6: Ego-network analysis with R

[Figure: plot of the personal network graph, with the 45 vertices labelled by alter ID, 2801 to 2845.]

# Print the graph.
gr

## IGRAPH UNW- 45 259 --
## + attr: name (v/c), weight (e/n)

# The graph is Undirected, Named, Weighted. It has 45 vertices
# and 259 edges. It has a vertex attribute called "name", and an
# edge attribute called "weight".

# Get the graph from an external edge list, plus a data set with
# vertex attributes.
## Read in the edge list. This is a personal network edge list.
elist <- read.csv("./Data/elist_28.csv")
head(elist)

##   from   to weight
## 1 2801 2802      1
## 2 2801 2803      1
## 3 2801 2804      1
## 4 2801 2805      1
## 5 2801 2806      1
## 6 2801 2807      1

## Read in the vertex attribute data set. This is an alter
## attribute data set.
vert.attr <- read.csv("./Data/Alter_attributes/alter.data_28.csv")
head(vert.attr)

##   alter_ID ego_ID alter_num sex relation nationality
## 1     2801     28         1   2        1           1
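The adjacency matrix and the edge list shown above are two representations of the same ties. The base R sketch below illustrates how one maps to the other, using a toy 3 x 3 matrix whose values are invented for this example (they are not taken from adj_28.csv). The object names adj.small, idx and elist.small are also hypothetical.

```r
# A toy undirected, weighted adjacency matrix in the same style as adj:
# NA on the diagonal, 0 for no tie, 1 or 2 for tie weights.
# (Values are made up for illustration.)
ids <- c("2801", "2802", "2803")
adj.small <- matrix(c(NA, 1, 2,
                       1, NA, 0,
                       2, 0, NA),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(ids, ids))

# The matrix is symmetric, so each tie appears twice. Keep only the
# upper triangle, and only the cells with a tie (value above 0).
idx <- which(upper.tri(adj.small) & adj.small > 0, arr.ind = TRUE)

# Build an edge list with the same columns as elist: from, to, weight.
elist.small <- data.frame(from = rownames(adj.small)[idx[, "row"]],
                          to = colnames(adj.small)[idx[, "col"]],
                          weight = adj.small[idx])
elist.small
```

Each row of elist.small corresponds to one non-zero cell above the diagonal of adj.small, which is why an undirected graph with E edges has an edge list of E rows.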


Page 7: Ego-network analysis with R

3 Writing functions in R

• In R, a function is a piece of code that operates on one or multiple arguments, and returns an output (the function value in R terminology).

• In R, everything that happens is a function call: whatever R does, it does through functions. Many R functions have default values for their arguments: if you don’t specify the argument’s value, the function will use the default.

• Besides using existing functions, you can write your own R functions. This makes R extremely powerful and flexible. Once you write a function and define its arguments, you can run that function on any argument values you want, provided that the function actually works on those argument values. For example, if a function takes an igraph network as an argument, you’ll be able to run that function on any network you like, provided that the network is an igraph object.

• Functions, combined with loops or with other methods (more on this in the following sections), are the best way to run exactly the same code on many different objects (e.g. many different networks). Functions are crucial for code reproducibility in R. If you write functions, you won’t need to re-write (copy and paste) the same code over and over again: you just write it once in a function, then run the function any time and on any arguments you need. This yields clearer, shorter and more readable code.

• New functions are also commonly used to redefine existing functions by pre-setting the value of specific arguments. For example, if you want all your plots to have red as color, you can take R’s existing plotting function plot, and wrap it in a new function that always executes plot with the argument col="red". Your function would be something like my.plot <- function(...) plot(..., col="red")

• Tips and tricks with functions:

– stopifnot() is useful to check that function arguments are of the type that was intended by the function author. It stops the function if a certain condition is not met by a function argument (e.g. argument is not of a certain object class).

– return() allows you to explicitly set the output that the function will return (clearer code). It is also used to stop function execution under certain conditions.

– if is a flow control tool that is frequently used within functions: it specifies what the function should do if a certain condition is met at one point.

– When you want to write a function, it’s a good idea to first try the code on a “real” existing object in your workspace. If the code

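The tips on stopifnot(), return(), if and argument pre-setting can be seen together in a minimal sketch. Both functions below, avg.above and paste.nosep, are hypothetical examples written for this illustration; they are not part of R_intro.R.

```r
# A function with a default argument value. It returns the mean of the
# values of x that are above a threshold.
avg.above <- function(x, threshold = 0) {
    # Stop with an error if the arguments are not numeric.
    stopifnot(is.numeric(x), is.numeric(threshold))

    # Keep the values above the threshold.
    above <- x[x > threshold]

    # If no value is above the threshold, exit early and return NA.
    if (length(above) == 0) {
        return(NA_real_)
    }

    # Otherwise, return the mean (the last value is the function value).
    mean(above)
}

# Use the default threshold, then set it explicitly.
avg.above(c(1, 5, 9))
avg.above(c(1, 5, 9), threshold = 4)

# Redefine an existing function by pre-setting one of its arguments:
# paste() with sep always set to "".
paste.nosep <- function(...) paste(..., sep = "")
paste.nosep("plot.", 28, ".png")
```

Calling avg.above("a") stops with an error from stopifnot(), which is usually easier to debug than a wrong result further down in the code.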

Page 8: Ego-network analysis with R

# In fact, whatever we can do on an element of a list, we can put
# in a function and do it simultaneously on all the elements of
# that list via lapply/sapply.

# Put the code above into a function.
family.deg <- function(gr) {
    deg <- degree(gr, V(gr)[relation==1])
    mean(deg)
}
## Run the function on all the 102 personal networks.
family.degrees <- sapply(graph.list, family.deg)

# This generated a vector with average degree of close family
# members for all egos. Vector names are ego IDs.
head(family.degrees)

##       28       29       33       35       39       40
## 14.80000 14.80000 19.33333 24.25000 14.50000 21.00000

# ***** EXERCISE 1:
# Write a function to calculate the max betweenness of an alter
# in a personal network. Use a graph in graph.list to try the
# function. Then lapply the function to graph.list. Finally,
# sapply the function to graph.list. Which is better in this
# case, lapply or sapply? HINTS: ?betweenness
# *****

# ***** EXERCISE 2:
# Write a function that returns the nationality of the alter with
# maximum betweenness in a personal network. Use a graph in
# graph.list to try the function. Sapply the function to
# graph.list. HINT: If multiple alters have betweenness equal to
# the maximum betweenness, just pick the first alter.
# *****
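As a reminder of how lapply() and sapply() differ in what they return, here is a minimal base R sketch on a toy list. The list score.list and the function max.score are invented for this example; they stand in for graph.list and for a function that works on a graph.

```r
# A toy named list: one numeric vector per (made-up) ego ID.
score.list <- list("28" = c(1, 4, 7),
                   "29" = c(2, 2),
                   "33" = c(10, 0, 5, 5))

# A function that works on ONE element of the list.
max.score <- function(x) {
    max(x)
}

# Try the function on a single element first.
max.score(score.list[[1]])

# lapply() always returns a list, with one element per input element.
res.list <- lapply(score.list, max.score)
class(res.list)

# sapply() tries to simplify the result: here each output is a single
# number, so we get a named numeric vector.
res.vec <- sapply(score.list, max.score)
res.vec
```

The names of the input list are carried over to the output in both cases, which is how family.degrees above ends up named by ego ID.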

5 Split-apply-combining in R

• In many different scenarios of data analysis we need to apply what has been called the “Split-Apply-Combine” strategy. Split-apply-combining is what we do whenever we have a single object with all our data and:

1. We split the object into pieces according to one or multiple (combinations of) categorical variables (split).

2. We apply exactly the same kind of calculation on each piece, identically and independently (apply).

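The split and apply steps listed above, followed by putting the results back together, can be sketched in base R as follows. The data frame alter.toy and its column alter.age are invented for this illustration; they are not variables from the workshop data. The same result is also obtained with aggregate(), which does the three steps in one call.

```r
# A toy alter-level data frame: 3 alters for ego 28, 2 for ego 29.
# (Made-up values; alter.age is a hypothetical variable.)
alter.toy <- data.frame(ego_ID = c(28, 28, 28, 29, 29),
                        alter.age = c(30, 45, 52, 27, 33))

# SPLIT: one vector of alter ages per ego.
pieces <- split(alter.toy$alter.age, alter.toy$ego_ID)
pieces

# APPLY the same calculation to each piece; sapply() then COMBINES the
# results into a named vector (names are ego IDs).
max.age <- sapply(pieces, max)
max.age

# The same thing in a single step, returning a data frame.
agg <- aggregate(alter.age ~ ego_ID, data = alter.toy, FUN = max)
agg
```

Writing out the split and apply steps by hand on one or two egos first is a good way to check that a one-step call returns what you expect.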

Page 9: Ego-network analysis with R

# Extract the values of alter.attr.df$closeness corresponding to
# ego_ID 28. Calculate the mean of these values. Based on this
# code, run aggregate() to calculate the mean alter closeness for
# all egos.
# *****

# ***** EXERCISE 2:
# Subset alter.attr.df to the rows corresponding to ego_ID 28.
# Using the result data frame, calculate the average $closeness
# of Italian alters (i.e. $nationality==3). Based on this code,
# run ddply() to calculate the average closeness of Italian
# alters for all egos.
# *****

# ***** EXERCISE 3:
# Get the graph for ego ID 28 from graph.list. Calculate the
# max betweenness of Italian alters on this graph. Based on this
# code, use ldply() to calculate that for all egos, and put the
# results into a data frame.
# *****

6 A case study: Sri Lankan immigrants’ embeddedness in Italian society

In this section, we’ll apply the tools presented in the previous sections to a specific case study about the embeddedness of Sri Lankan immigrants in Italian society. We will demonstrate the use of R in the four main tasks of personal network analysis:

1. Visualization
We’ll write an R function that plots a personal network with specific graphical parameters. Using a for loop, we’ll then run this function on all our personal networks, and we’ll export the output to external figure files.

2. Compositional analysis
We’ll calculate multiple measures on network composition for all our personal networks simultaneously using plyr.

3. Structural analysis
We’ll write an R function that calculates multiple structural measures on a personal network, and we’ll run it simultaneously on all the personal networks using lapply and plyr.

4. Association with ego-level variables
We’ll merge the results of personal network compositional and structural


Page 10: Ego-network analysis with R

# Let's now plot the first 10 personal networks. We won't print
# the plots in the GUI, but we'll export each of them to a
# separate png file.
for (i in 1:10) {

    # Get the graph
    gr <- graph.list[[i]]

    # Get the graph's ego ID
    ego_ID <- names(graph.list)[i]

    # Set seed for reproducibility (so we always get the same
    # network layout).
    set.seed(613)

    # Open png device to print the plot to an external png
    # file (note that the ego ID is written to the file
    # name).
    png(file= paste("plot.", ego_ID, ".png", sep=""),
        width= 800, height= 800)

    # Run plotting function
    my.plot(gr)

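The general pattern of the loop above is: open a graphics device, plot, then close the device so that the file is written. It can be sketched on toy data as follows. This is an illustration under stated assumptions, not the handout's own code: toy.list stands in for graph.list, base plot() stands in for my.plot(), files go to R's temporary directory, and the sketch uses pdf() instead of png() because the pdf device is available in every R installation.

```r
# A toy named list standing in for graph.list (made-up values).
toy.list <- list("28" = c(1, 3, 2), "29" = c(4, 1, 5))

# Write the files to R's temporary directory.
out.dir <- tempdir()

# Keep track of the files we create.
out.files <- character(0)

for (i in seq_along(toy.list)) {

    # Get the ego ID from the list names.
    ego_ID <- names(toy.list)[i]

    # Build the file name (the ego ID is written to the file name).
    out.file <- file.path(out.dir,
                          paste("plot.", ego_ID, ".pdf", sep = ""))

    # Open the device, plot, and close the device. The file is only
    # complete once dev.off() has been called.
    pdf(file = out.file, width = 7, height = 7)
    plot(toy.list[[i]], main = paste("Ego", ego_ID))
    dev.off()

    out.files <- c(out.files, out.file)
}

# Check that one file per list element was created.
file.exists(out.files)
```

A device that is opened but never closed leaves an empty or unreadable file, so every call that opens a device inside the loop needs a matching dev.off().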

Page 11: Ego-network analysis with R

[Figure: scatterplot of avg.deg.ita (y axis, 0 to 15) against N.ita (x axis, 0 to 20); legend: as.factor(educ), with levels 1, 2, 3.]

## Print plot to external png file
png("N.avg.deg.ita.3.png", width= 800, height= 600)
print(p)
dev.off()

## pdf
##   2

# Do Sri Lankans who arrived in Italy more recently have
# fewer Italians in their personal network?
## To avoid warnings from ggplot(), let's remove cases with NA
## values on the relevant variables.
index <- complete.cases(ego.data[,c("arr_it", "prop.ita")])
data <- ego.data[index,]
## Get and save plot
p <- ggplot(data= data, aes(x= arr_it, y= prop.ita)) +
    geom_jitter(w= 1, h= 0.01, shape= 1, size= 3) +
    geom_smooth(method="loess")

## Print plot in R GUI
print(p)


Page 12: Ego-network analysis with R

[Figure: scatterplot of prop.ita (y axis, 0.0 to 0.4) against arr_it (x axis, 1980 to 2010).]

## Print plot to external png file
png("prop.ita.arr.png", width= 800, height= 600)
print(p)
dev.off()

## pdf
##   2

# Do Sri Lankans with more Italians in their personal network
# have a higher income?
## To avoid warnings from ggplot(), let's remove cases with NA
## values on the relevant variables.
index <- complete.cases(ego.data[,c("prop.ita", "income")])
data <- ego.data[index,]
## Get and save plot
p <- ggplot(data= data, aes(x= prop.ita, y= income)) +
    geom_jitter(w= 0.01, h= 50, shape= 1, size= 3) +
    geom_smooth(method="loess")

## Print plot in R GUI
print(p)
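The complete.cases() step used before each plot can be seen in isolation on a toy data frame. The data frame ego.toy and its values are invented for this sketch; only the column names prop.ita and income mirror the variables used above.

```r
# A toy ego-level data frame with some missing values (made-up data).
ego.toy <- data.frame(ego_ID = c(28, 29, 33, 35),
                      prop.ita = c(0.10, NA, 0.25, 0.05),
                      income = c(1200, 900, NA, 1500))

# complete.cases() returns a logical vector: TRUE for the rows with no
# NA on the selected columns.
index <- complete.cases(ego.toy[, c("prop.ita", "income")])
index

# Keep only the complete rows.
data.cc <- ego.toy[index, ]
data.cc
```

Only the columns passed to complete.cases() are checked, so a row with NA on some other variable is still kept.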
