Collective Classification in Network Data

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad

Many real-world applications produce networked data such as the worldwide web (hypertext documents connected through hyperlinks), social networks (such as people connected by friendship links), communication networks (computers connected through communication links), and biological networks (such as protein interaction networks). A recent focus in machine-learning research has been to extend traditional machine-learning classification techniques to classify nodes in such networks. In this article, we provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.


Networks have become ubiquitous. Communication networks, financial transaction networks, networks describing physical systems, and social networks are all becoming increasingly important in our day-to-day life. Often, we are interested in models of how nodes in the network influence each other (for example, who infects whom in an epidemiological network), models for predicting an attribute of interest based on observed attributes of objects in the network (for example, predicting political affiliations based on online purchases and interactions), or we might be interested in identifying important nodes in the network (for example, critical nodes in communication networks). In most of these scenarios, an important step in achieving our final goal is classifying, or labeling, the nodes in the network.

Given a network and a node v in the network, there are three distinct types of correlations that can be utilized to determine the classification, or label, of v: (1) the correlations between the label of v and the observed attributes of v; (2) the correlations between the label of v and the observed attributes (including observed labels) of nodes in the neighborhood of v; and (3) the correlations between the label of v and the unobserved labels of objects in the neighborhood of v. Collective classification refers to the combined classification of a set of interlinked objects using all three types of information just described.

Many applications produce data with correlations between labels of interconnected nodes. The simplest types of correlation can be the result of homophily (nodes with similar labels are more likely to be linked) or the result of social influence (nodes that are linked are more likely to have similar labels), but more complex dependencies among labels often exist.

Within the machine-learning community, classification is typically done on each object independently, without taking into account any underlying network that connects the nodes. Collective classification does not fit well into this setting. For instance, in the web page classification problem, where web pages are interconnected with hyperlinks and the task is to assign each web page a label that best indicates its topic, it is common to assume that the labels on interconnected web pages are correlated. Such interconnections occur naturally in data from a variety of applications such as bibliographic data, email networks, and social networks. Traditional classification techniques would ignore the correlations represented by these interconnections and would be hard pressed to produce the classification accuracies possible using a collective classification approach.

Although traditional exact probabilistic inference algorithms such as variable elimination and the junction tree algorithm harbor the potential to perform collective classification, they are practical only when the graph structure of the network satisfies certain conditions. In general, exact inference is known to be NP-hard, and there is no guarantee that real-world network data satisfy the conditions that make exact inference tractable for collective classification. As a consequence, most of the research in collective classification has been devoted to the development of approximate inference algorithms.

In this article we provide an introduction to four popular approximate inference algorithms used for collective classification: iterative classification, Gibbs sampling, loopy belief propagation, and mean-field relaxation labeling. We outline the basic algorithms with pseudocode, explain how one could apply them to real-world data, provide theoretical justifications (where they exist), and discuss issues such as feature construction and various heuristics that may lead to improved classification accuracy. We provide case studies, on both real-world and synthetic data, to demonstrate the strengths and weaknesses of these approaches. All of these algorithms have a rich history of development and application to various problems relating to collective classification, and we briefly discuss this history when we examine related work. Collective classification has been an active field of research for the past decade, and as a result there are numerous other approximate inference algorithms besides the four we describe here; we provide pointers to these in the related work section. In the next section, we begin by introducing the required notation and defining the collective classification problem formally.

Collective Classification: Notation and Problem Definition

Collective classification is a combinatorial problem in which we are given a set of nodes, V = {V1, ..., Vn}, and a neighborhood function N, where Ni ⊆ V \ {Vi}, which describes the underlying network structure. Each node in V is a random variable that can take a value from an appropriate domain. V is further divided into two sets of nodes: X, the nodes for which we know the correct values (observed variables), and Y, the nodes whose values need to be determined. Our task is to label the nodes Yi ∈ Y with one of a small number of labels, L = {L1, ..., Lq}; we will use the shorthand yi to denote the label of node Yi.
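To make the notation concrete, here is a minimal Python sketch of one way such a problem instance might be represented in code; all identifiers (nodes, neighbors, observed, unobserved, label_set) are our own illustrative choices, not part of the article's formalism.

    # A minimal sketch of the collective classification setup.
    # All identifiers here are illustrative, not from the article.

    # Nodes V = {V1, ..., Vn}, identified by integer ids.
    nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9]

    # Neighborhood function N: node id -> set of neighboring node ids,
    # where Ni is a subset of V \ {Vi}.
    neighbors = {
        1: {2, 3, 4, 6, 8},
        2: {1, 5, 6},
        # ... remaining nodes
    }

    # Observed variables X: values we know (word attributes or known labels).
    observed = {3: "ST", 4: "CO", 5: "CU", 6: "CH", 7: "AI", 8: "SH", 9: "ST"}

    # Unobserved variables Y, to be labeled from L = {L1, ..., Lq}.
    unobserved = [1, 2]
    label_set = ["SH", "CH"]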

We explain the notation further using a web page classification example, motivated by Craven and colleagues (1998), that will serve as a running example throughout the article. Figure 1 shows a network of web pages with hyperlinks. In this example, we will use the words (and phrases) contained in the web pages as node attributes. For brevity, we abbreviate the node attributes: "ST" stands for "student," "CO" stands for "course," "CU" stands for "curriculum," and "AI" stands for "artificial intelligence." Each web page is indicated by a box, the corresponding topic of the web page is indicated by an ellipse inside the box, and each word in the web page is represented by a circle inside the box. The observed random variables X are shaded, whereas the unobserved ones Y are not. We will assume that the domain of the unobserved label variables, L, in this case is a set of two values: "student homepage" (abbreviated "SH") and "course homepage" (abbreviated "CH"). Figure 1 shows a network with two unobserved variables (Y1 and Y2), which require prediction, and seven observed variables (X3, X4, X5, X6, X7, X8, and X9). Note that some of the observed variables happen to be labels of web pages (X6 and X8) for which we know the correct values. Thus, from the figure, it is easy to see that web page W1, whose unobserved label variable is represented by Y1, contains the two words "ST" and "CO" and hyperlinks to web pages W2, W3, and W4.

As mentioned in the introduction, due to the large body of work done in this area of research, we have a number of approaches for collective classification. At a broad level of abstraction, these approaches can be divided into two distinct types: one in which we use a collection of unnormalized local conditional classifiers, and one in which we define the collective classification problem as one global objective function to be optimized. We next describe these two approaches, and for each approach we describe two approximate inference algorithms. For each topic of discussion, we mention the relevant references so that the interested reader can follow up for a more in-depth view.

Approximate Inference Algorithms for Approaches Based on Local Conditional Classifiers

Two of the most commonly used approximate inference algorithms following this approach are the iterative classification algorithm (ICA) (Neville and Jensen 2000; Lu and Getoor 2003) and Gibbs sampling (GS), and we next describe these in turn.

Iterative Classification

The basic premise behind ICA is extremely simple. Consider a node Yi ∈ Y whose value we need to determine, and suppose we know the values of all the other nodes in its neighborhood Ni (note that Ni can contain both observed and unobserved variables). Then, ICA assumes that we are given a local classifier f that takes the values of Ni as arguments and returns the best label value for Yi from the class label set L. For local classifiers f that do not return a class label but a goodness or likelihood value given a set of attribute values and a label, we simply choose the label that corresponds to the maximum goodness or likelihood value; in other words, we replace f with argmax_{l∈L} f. This makes the local classifier f an extremely flexible function, and we can use anything ranging from a decision tree to an SVM in its place. Unfortunately, it is rare in practice that we know all values in Ni, which is why we need to repeat the process iteratively, in each iteration labeling each Yi using the current best estimates of Ni and the local classifier f, and continuing to do so until the assignments to the labels stabilize.
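As an illustration of replacing f with the argmax over L, the following hedged Python sketch wraps a probabilistic classifier into one that returns a single best label. It assumes a scikit-learn-style predict_proba method and that the model's class ordering matches label_set; neither assumption comes from the article.

    import numpy as np

    def make_label_classifier(model, label_set):
        """Wrap a goodness/likelihood-producing classifier into one that
        returns the single best label, i.e., replace f with argmax over L.
        Assumes `model` exposes predict_proba() (scikit-learn style) and
        that its class order matches label_set."""
        def f(feature_vector):
            scores = model.predict_proba(np.asarray([feature_vector]))[0]
            return label_set[int(np.argmax(scores))]
        return f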

Most local classifiers are defined as functions whose argument consists of a fixed-length vector of attribute values. Going back to the example we introduced in the last section in figure 1, assume that we are looking at a snapshot of the state of the labels after a few ICA iterations have elapsed and that the label values assigned to Y1 and Y2 in the last iteration are "SH" and "CH," respectively. In figure 1, a′1 denotes one attempt to pool all values of N1 into one vector. Here, the first entry in a′1, which corresponds to the first neighbor of Y1, is a "1," denoting that Y1 has a neighbor that is the word "ST" (X3), and so on. Unfortunately, this not only requires putting an ordering on the neighbors, but since Y1 and Y2 have a different number of neighbors, this type of encoding results in a′1 consisting of a different number of entries than a′2. Since the local classifier can take only vectors of a fixed length, this means we cannot use the same local classifier to classify both a′1 and a′2.

A common approach to circumvent such a situation is to use an aggregation operator such as count, mode, or prop. Figure 1 shows the two vectors a1 and a2 that are obtained by applying the count operator to a′1 and a′2, respectively. The count operator simply counts the number of neighbors assigned "SH" and "CH" and adds these entries to the vectors, while the prop operator returns the proportion of neighbors with each label. Thus we get one new value for each entry in the set of label values. Assuming that the set of label values does not change from one unobserved node to another, this results in fixed-length vectors that can now be classified using the same local classifier. Thus a1, for instance, contains two entries besides the entries corresponding to the local word attributes, encoding that Y1 has one neighbor labeled "SH" (X8) and two neighbors currently labeled "CH" (Y2 and X6). Algorithm 1 depicts the ICA algorithm as pseudocode, where we use ai to denote the vector encoding the values in Ni obtained after aggregation. Note that in the first ICA iteration all labels yi are undefined; to initialize them, we simply apply the local classifier to the observed attributes in the neighborhood of Yi. This is referred to as "bootstrapping" in algorithm 1.

[Figure 1. A Small Web Page Classification Problem. Each box denotes a web page, each directed edge between a pair of boxes denotes a hyperlink, each oval node denotes a random variable, each shaded oval an observed variable, and each unshaded oval an unobserved variable whose value needs to be predicted. The set of label values is L = {SH, CH}. The figure shows a snapshot during a run of ICA in which the last labeling iteration chose y1 = SH and y2 = CH. The naive encodings a′1 and a′2 are variable-length vectors; the vectors a1 and a2, obtained after applying count aggregation, show one way of getting around this issue to obtain fixed-length vectors. See the article for more explanation.]

Algorithm 1. Iterative Classification Algorithm (ICA).

    for each node Yi ∈ Y do                  // bootstrapping:
        compute ai using only the observed nodes in Ni
        yi ← f(ai)                           // label using observed attributes only
    end for
    repeat                                   // iterative classification
        generate ordering O over nodes in Y
        for each node Yi ∈ O do
            compute ai using current assignments to Ni
            yi ← f(ai)                       // new estimate of yi
        end for
    until all class labels have stabilized or a threshold number of iterations have elapsed
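A minimal Python sketch of algorithm 1 follows; the helper names classifier and aggregate_features are our own assumptions (any local classifier and any aggregation operator, as discussed above, could be plugged in).

    import random

    def ica(unobserved, observed_attrs, neighbors, classifier,
            aggregate_features, max_iters=100):
        """Sketch of the iterative classification algorithm (algorithm 1).
        classifier(ai) returns a label; aggregate_features builds the
        fixed-length vector ai from a node's current neighborhood values."""
        labels = {}
        # Bootstrapping: label each node using only its observed neighbors.
        for yi in unobserved:
            ai = aggregate_features(yi, neighbors[yi], observed_attrs, {})
            labels[yi] = classifier(ai)
        # Iterative classification until the labels stabilize.
        for _ in range(max_iters):
            order = list(unobserved)
            random.shuffle(order)            # one simple ordering strategy
            changed = False
            for yi in order:
                ai = aggregate_features(yi, neighbors[yi], observed_attrs, labels)
                new_label = classifier(ai)
                changed |= (new_label != labels[yi])
                labels[yi] = new_label
            if not changed:                  # all class labels have stabilized
                break
        return labels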

Gibbs Sampling

Gibbs sampling (GS) (Gilks, Richardson, and Spiegelhalter 1996) is widely regarded as one of the most accurate approximate inference procedures. It was originally proposed by Geman and Geman (1984) in the context of image restoration. Unfortunately, it is very slow, and a common issue when implementing GS is determining when the procedure has converged. Even though there are tests that can help one determine convergence, they are usually expensive or complicated to implement.

Due to the issues with traditional GS, researchers in collective classification (Macskassy and Provost 2007; McDowell, Gupta, and Aha 2007) use a simplified version where they assume, just as in the case of ICA, that we have access to a local classifier f that can be used to estimate the conditional probability distribution of the label of Yi given all the values for the nodes in Ni. Note that, unlike traditional GS, there is no guarantee that this conditional probability distribution is the correct conditional distribution to sample from. At best, we can only assume that the conditional probability distribution given by the local classifier f is an approximation of the correct conditional probability distribution. Neville and Jensen (2007) provide more discussion and justification for this line of thought in the context of relational dependency networks, where they use a similar form of GS for inference.

The pseudocode for GS is shown in algorithm 2. The basic idea is to sample for the best label estimate for Yi, given all the values for the nodes in Ni, using the local classifier f for a fixed number of iterations (a period known as "burn-in"). After that, not only do we sample for labels for each Yi ∈ Y, but we also maintain count statistics of how many times we sampled label l for node Yi. After collecting a predefined number of such samples, we output the best label assignment for node Yi by choosing the label that was assigned the maximum number of times to Yi while collecting samples. For all the experiments we report later, we set burn-in to 200 iterations and collected 800 samples.

Algorithm 2. Gibbs Sampling Algorithm (GS).

    for each node Yi ∈ Y do                  // bootstrapping:
        compute ai using only the observed nodes in Ni
        yi ← f(ai)
    end for
    for n = 1 to B do                        // burn-in
        generate ordering O over nodes in Y
        for each node Yi ∈ O do
            compute ai using current assignments to Ni
            yi ← f(ai)
        end for
    end for
    for each node Yi ∈ Y do                  // initialize sample counts
        for each label l ∈ L do
            c[i, l] ← 0
        end for
    end for
    for n = 1 to S do                        // collect samples
        generate ordering O over nodes in Y
        for each node Yi ∈ O do
            compute ai using current assignments to Ni
            yi ← f(ai)
            c[i, yi] ← c[i, yi] + 1
        end for
    end for
    for each node Yi ∈ Y do                  // compute final labels
        yi ← argmax_{l∈L} c[i, l]
    end for
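The following hedged Python sketch mirrors algorithm 2. It assumes the local classifier is exposed as a predict_proba function returning a label-to-probability dictionary, which is our own interface choice rather than anything prescribed by the article.

    import random
    from collections import Counter

    def sample(dist):
        """Draw one label from a dict mapping label -> probability."""
        labels, probs = zip(*dist.items())
        return random.choices(labels, weights=probs, k=1)[0]

    def gibbs_sampling(unobserved, observed_attrs, neighbors, predict_proba,
                       aggregate_features, burn_in=200, num_samples=800):
        """Sketch of Gibbs sampling for collective classification (algorithm 2).
        predict_proba(ai) is assumed to return {label: probability}."""
        labels = {}
        counts = {yi: Counter() for yi in unobserved}
        for yi in unobserved:    # bootstrapping from observed neighbors only
            ai = aggregate_features(yi, neighbors[yi], observed_attrs, {})
            labels[yi] = sample(predict_proba(ai))
        for step in range(burn_in + num_samples):
            order = list(unobserved)
            random.shuffle(order)
            for yi in order:
                ai = aggregate_features(yi, neighbors[yi], observed_attrs, labels)
                labels[yi] = sample(predict_proba(ai))
                if step >= burn_in:          # count samples only after burn-in
                    counts[yi][labels[yi]] += 1
        # Final label: the most frequently sampled label for each node.
        return {yi: counts[yi].most_common(1)[0][0] for yi in unobserved}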

Feature Construction and Further Optimizations

One of the benefits of both ICA and GS is that it is fairly simple to make use of any local classifier. There is some evidence to indicate that some local classifiers tend to produce higher accuracies than others, at least in the application domains where such experiments have been conducted. For instance, Lu and Getoor (2003) report that on bibliography data sets and web page classification problems, logistic regression outperforms naïve Bayes.

Recall that, to represent the values of Ni, we described the use of an aggregation operator. In the example, we used the count operator to aggregate the values of the labels in the neighborhood, but count is by no means the only aggregation operator available. Past research has used a variety of aggregation operators, including minimum, maximum, mode, exists, and proportion. The choice of aggregation operator depends on the application domain and relates to the larger question of relational feature construction, where we are interested in determining which features to use so that classification accuracy is maximized. In addition, values derived from the graph structure of the network in the data, such as the clustering coefficient and betweenness centrality, may be beneficial to the accuracy of the classification task. Within the inductive logic programming community, aggregation has been studied as a means for propositionalizing a relational classification problem (Knobbe, deHaas, and Siebes 2001; Kramer, Lavrac, and Flach 2001). Within the statistical relational learning community, Perlich and Provost (2003, 2006) have studied aggregation extensively, and Popescul and Ungar (2003) have worked on feature construction using techniques from inductive logic programming. In addition, Macskassy and Provost (2007) have investigated approaches that make use only of the labels of neighbors.
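As a small illustration (with our own function names), the count and prop operators from the running example might be sketched in Python as follows.

    from collections import Counter

    def count_aggregate(neighbor_labels, label_set):
        """count operator: number of neighbors with each label value."""
        c = Counter(neighbor_labels)
        return [c[l] for l in label_set]

    def prop_aggregate(neighbor_labels, label_set):
        """prop operator: proportion of neighbors with each label value."""
        total = max(len(neighbor_labels), 1)
        c = Counter(neighbor_labels)
        return [c[l] / total for l in label_set]

    # For example, with neighbors currently labeled ["CH", "CH", "SH"] and
    # label_set = ["SH", "CH"]: count -> [1, 2], prop -> [0.33..., 0.66...].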

Other aspects of ICA that have been investigated include the ordering strategy used to determine in which order to visit (and relabel) the nodes in each ICA iteration. There is some evidence to suggest that ICA is fairly robust to a number of simple ordering strategies, such as random ordering, visiting nodes in ascending order of the diversity of their neighborhood class labels, and labeling nodes in descending order of label confidence. However, there is also some evidence that certain modifications to the basic ICA procedure tend to produce improved classification accuracies. For instance, both Neville and Jensen (2005) and McDowell, Gupta, and Aha (2007) propose a strategy where only a subset of the unobserved variables are utilized as inputs for feature construction. More specifically, in each iteration they choose the top-k most confident predicted labels and use only those unobserved variables in the following iteration's predictions, ignoring the less confident predicted labels. In each subsequent iteration they increase the value of k so that in the last iteration all nodes are used for prediction. McDowell and colleagues report that such a "cautious" approach leads to improved accuracies.

Approximate Inference Algorithms for Approaches Based on Global Formulations

An alternate approach to performing collective classification is to define a global objective function to optimize. In what follows, we describe one common way of defining such an objective function, which requires some more notation.

We begin by defining a pairwise Markov random field (pairwise MRF) (Taskar, Abbeel, and Koller 2002). Intuitively, a pairwise MRF is a graph where each node denotes a random variable and each edge denotes a correlation between a pair of random variables. Let G = (V, E) denote a graph of random variables as before, where V consists of two types of random variables: the unobserved variables, Y, which need to be assigned values from the label set L, and the observed variables X, whose values we know.

To quantify the exact nature of each correlation in an MRF, we use small functions usually referred to as clique potentials. A clique potential is simply a function defined over a small set of random variables that maps joint assignments of those random variables to goodness values, numbers that reflect how favorable the assignment is. Let Ψ denote a set of clique potentials. Ψ contains three distinct types of functions:

(1) For each Yi ∈ Y, ψi ∈ Ψ is a mapping ψi : L → ℝ≥0, where ℝ≥0 is the set of nonnegative real numbers.

(2) For each (Yi, Xj) ∈ E, ψij ∈ Ψ is a mapping ψij : L → ℝ≥0.

(3) For each (Yi, Yj) ∈ E, ψij ∈ Ψ is a mapping ψij : L × L → ℝ≥0.

Let x denote the values assigned to all the observed variables in G and let xi denote the value assigned to Xi. Similarly, let y denote any assignment to all the unobserved variables in G and let yi denote a value assigned to Yi. For brevity of notation, we will denote by φi the clique potential obtained by computing

    φi(yi) = ψi(yi) ∏_{(Yi,Xj)∈E} ψij(yi).

We are now in a position to define a pairwise MRF.

Definition 1. A pairwise Markov random field (MRF) is given by a pair ⟨G, Ψ⟩, where G is a graph and Ψ is a set of clique potentials with φi and ψij as defined previously. Given an assignment y to all the unobserved variables Y, the pairwise MRF is associated with the probability distribution

    P(y | x) = (1 / Z(x)) ∏_{Yi∈Y} φi(yi) ∏_{(Yi,Yj)∈E} ψij(yi, yj),

where x denotes the observed values of X and

    Z(x) = Σ_{y′} ∏_{Yi∈Y} φi(y′i) ∏_{(Yi,Yj)∈E} ψij(y′i, y′j).

In figure 2, we show our running example augmented with clique potentials. MRFs are defined on undirected graphs, and thus we have dropped the directions on all the hyperlinks in the example. ψ1 and ψ2 denote the two clique potentials defined on the unobserved variables (Y1 and Y2) in the network. Similarly, we have one ψ defined for each edge that involves at least one unobserved variable as an end point. For instance, ψ13 defines a mapping from L (which is set to {SH, CH} in the example) to nonnegative real numbers. There is only one edge between the two unobserved variables in the network, and this edge is associated with the clique potential ψ12, which is a function over two arguments. Figure 2 also shows how to compute the φ clique potentials. Essentially, given an unobserved variable Yi, one collects all the edges that connect it to observed variables in the network and multiplies the corresponding clique potentials along with the clique potential defined on Yi itself. Thus, as the figure shows, φ2 = ψ2 · ψ25 · ψ26.

Given a pairwise MRF, it is conceptually simple to extract the best assignments to each unobserved variable in the network. For instance, we may adopt the criterion that the best label value for Yi is simply the one corresponding to the highest marginal probability obtained by summing over all other variables in the probability distribution associated with the pairwise MRF. Computationally, however, this is difficult to achieve, since computing one marginal probability requires summing over an exponentially large number of terms, which is why we need approximate inference algorithms.
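To make the computational difficulty concrete, here is a brute-force Python sketch (our own construction, not from the article) that computes exact marginals by enumerating all |L|^|Y| joint assignments; the exponential loop is precisely what the approximate algorithms below avoid. Edges here connect unobserved nodes only, since edges to observed variables are folded into the φ potentials as defined above.

    from itertools import product

    def exact_marginals(unobserved, phi, psi, edges, label_set):
        """Brute-force marginals for a pairwise MRF: sums over all |L|^|Y|
        joint assignments, so it is feasible only for tiny networks.
        phi[i][y] and psi[(i, j)][(yi, yj)] hold the clique potentials."""
        marginals = {i: {y: 0.0 for y in label_set} for i in unobserved}
        z = 0.0
        for assignment in product(label_set, repeat=len(unobserved)):
            y = dict(zip(unobserved, assignment))
            score = 1.0
            for i in unobserved:             # node potentials phi_i
                score *= phi[i][y[i]]
            for (i, j) in edges:             # edge potentials psi_ij
                score *= psi[(i, j)][(y[i], y[j])]
            z += score
            for i in unobserved:
                marginals[i][y[i]] += score
        # Normalize by the partition function Z(x).
        return {i: {l: v / z for l, v in m.items()}
                for i, m in marginals.items()}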


[Figure 2. A Small Web Page Classification Problem Expressed as a Pairwise Markov Random Field with Clique Potentials. Potentials are defined on the unobserved nodes (ψ1, ψ2) and on the edges (ψ12, ψ13, ψ14, ψ16, ψ18, ψ25, ψ26); the aggregated potentials are φ1 = ψ1 · ψ13 · ψ14 · ψ16 · ψ18 and φ2 = ψ2 · ψ25 · ψ26. The figure also shows the message-passing steps followed by LBP: m1→2(y2) = Σ_{y1} φ1(y1) ψ12(y1, y2) and m2→1(y1) = Σ_{y2} φ2(y2) ψ12(y1, y2). See text for more explanation.]


We describe two such approximate inference algorithms in this article. Both adopt a similar approach to avoiding the computational complexity of computing marginal probability distributions: instead of working directly with the probability distribution associated with the pairwise MRF (definition 1), they use a simpler "trial" distribution. The idea is to design the trial distribution so that, once we fit it to the MRF distribution, it is easy to extract marginal probabilities from the trial distribution (as easy as reading them off). This general principle forms the basis of a class of approximate inference algorithms known as variational methods (Jordan et al. 1999).

We are now in a position to discuss loopy belief propagation (LBP) and mean-field relaxation labeling (MF).

Loopy Belief Propagation

Intuitively, loopy belief propagation (LBP) is an iterative message-passing algorithm in which each random variable Yi in the MRF sends a message mi→j to a neighboring random variable Yj depicting its belief of what Yj's label should be. Thus, a message mi→j is simply a mapping from L to ℝ≥0. The point that differentiates LBP from other gossip-based message protocols is that, when computing a message mi→j to send from Yi to Yj, LBP discounts the message mj→i (the message Yj sent to Yi in the previous iteration) in an attempt to stop a message computed by Yj from going back to Yj. LBP applied to a pairwise MRF ⟨G, Ψ⟩ can be concisely expressed as the following set of equations:

    mi→j(yj) = α Σ_{yi∈L} ψij(yi, yj) φi(yi) ∏_{Yk∈Ni∩Y\{Yj}} mk→i(yi),  ∀ yj ∈ L    (1)

    bi(yi) = α φi(yi) ∏_{Yj∈Ni∩Y} mj→i(yi),  ∀ yi ∈ L    (2)

where mi→j is a message sent by Yi to Yj and α denotes a normalization constant that ensures that each message and each set of marginal probabilities sum to 1; more precisely, Σ_{yj} mi→j(yj) = 1 and Σ_{yi} bi(yi) = 1.

The algorithm proceeds by making each Yi ∈ Y communicate messages with its neighbors in Ni ∩ Y until the messages stabilize (equation 1). After the messages stabilize, we can calculate the marginal probability of assigning Yi the label yi by computing bi(yi) using equation 2. The algorithm is described more precisely in algorithm 3. Figure 2 shows a sample round of message-passing steps followed by LBP on the running example.

LBP has been shown to be an instance of a variational method. Let bi(yi) denote the marginal probability associated with assigning unobserved variable Yi the value yi, and let bij(yi, yj) denote the marginal probability associated with labeling the edge (Yi, Yj) with values (yi, yj). Then Yedidia, Freeman, and Weiss (2005) showed that choosing the trial distribution

    b(y) = ∏_{(Yi,Yj)∈E} bij(yi, yj) / ∏_{Yi∈Y} bi(yi)^{|Ni∩Y|−1}

and subsequently minimizing the Kullback-Leibler divergence between the trial distribution and the distribution associated with the pairwise MRF gives us the LBP message-passing algorithm, with some qualifications. Note that the trial distribution explicitly contains the marginal probabilities as variables; thus, once we fit the distribution, extracting the marginal probabilities is as easy as reading them off.

Algorithm 3. Loopy Belief Propagation (LBP).

    for each (Yi, Yj) ∈ E(G) such that Yi, Yj ∈ Y do    // initialize messages
        for each yj ∈ L do
            mi→j(yj) ← 1
        end for
    end for
    repeat                                              // perform message passing
        for each (Yi, Yj) ∈ E(G) such that Yi, Yj ∈ Y do
            for each yj ∈ L do
                mi→j(yj) ← α Σ_{yi∈L} ψij(yi, yj) φi(yi) ∏_{Yk∈Ni∩Y\{Yj}} mk→i(yi)
            end for
        end for
    until the messages mi→j(yj) stop showing any change
    for each Yi ∈ Y do                                  // compute beliefs
        for each yi ∈ L do
            bi(yi) ← α φi(yi) ∏_{Yj∈Ni∩Y} mj→i(yi)
        end for
    end for
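For concreteness, a hedged Python sketch of the message-passing loop in algorithm 3 follows; the dictionary-based representation of potentials and messages is our own choice, not the article's.

    def loopy_bp(unobserved, phi, psi, edges, label_set,
                 max_iters=100, tol=1e-6):
        """Sketch of loopy belief propagation (algorithm 3). Messages
        m[(i, j)][yj] are passed along edges between unobserved nodes;
        psi is stored once per undirected edge."""
        directed = list(edges) + [(j, i) for (i, j) in edges]
        m = {(i, j): {y: 1.0 for y in label_set} for (i, j) in directed}

        def pot(i, j, yi, yj):
            # Look up the undirected edge potential in either orientation.
            return psi[(i, j)][(yi, yj)] if (i, j) in psi else psi[(j, i)][(yj, yi)]

        for _ in range(max_iters):
            delta = 0.0
            for (i, j) in directed:              # update message i -> j
                new = {}
                for yj in label_set:
                    total = 0.0
                    for yi in label_set:
                        prod = phi[i][yi]
                        for (k, dst) in directed:  # messages into i, except from j
                            if dst == i and k != j:
                                prod *= m[(k, i)][yi]
                        total += pot(i, j, yi, yj) * prod
                    new[yj] = total
                z = sum(new.values()) or 1.0     # alpha: normalize the message
                new = {y: v / z for y, v in new.items()}
                delta = max(delta,
                            max(abs(new[y] - m[(i, j)][y]) for y in label_set))
                m[(i, j)] = new
            if delta < tol:                      # messages stopped changing
                break

        beliefs = {}
        for i in unobserved:                     # compute beliefs (equation 2)
            b = {y: phi[i][y] for y in label_set}
            for (k, dst) in directed:
                if dst == i:
                    for y in label_set:
                        b[y] *= m[(k, i)][y]
            z = sum(b.values()) or 1.0
            beliefs[i] = {y: v / z for y, v in b.items()}
        return beliefs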

Relaxation Labeling through the Mean-Field Approach

Another approximate inference algorithm that can be applied to pairwise MRFs is mean-field relaxation labeling (MF). The basic algorithm can be described by the following fixed-point equation:

    bj(yj) = α φj(yj) ∏_{Yi∈Nj∩Y} ∏_{yi∈L} ψij(yi, yj)^{bi(yi)},  ∀ yj ∈ L,

where bj(yj) denotes the marginal probability of assigning Yj ∈ Y the label yj and α is a normalization constant that ensures Σ_{yj} bj(yj) = 1.

The algorithm simply computes the fixed-point equation for every node Yj and keeps doing so until the marginal probabilities bj(yj) stabilize. When they do, we simply return bj(yj) as the computed marginals. The pseudocode for MF is shown in algorithm 4.

Algorithm 4. Mean-Field Relaxation Labeling (MF).

    for each Yj ∈ Y do                       // initialize marginals
        for each yj ∈ L do
            bj(yj) ← 1
        end for
    end for
    repeat                                   // iterate the fixed-point equation
        for each Yj ∈ Y do
            for each yj ∈ L do
                bj(yj) ← α φj(yj) ∏_{Yi∈Nj∩Y} ∏_{yi∈L} ψij(yi, yj)^{bi(yi)}
            end for
        end for
    until all bj(yj) stop changing

MF can also be justified as a variational method in almost exactly the same way as LBP. In this case, however, we choose a simpler trial distribution:

    b(y) = ∏_{Yi∈Y} bi(yi).

We refer the interested reader to the work of Weiss and colleagues (Weiss 2001; Yedidia, Freeman, and Weiss 2005) for more details.
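A corresponding hedged Python sketch of the mean-field fixed-point iteration in algorithm 4 is shown below; as with the LBP sketch, the data representation is our own assumption, and edges again connect unobserved nodes only.

    def mean_field(unobserved, phi, psi, edges, label_set,
                   max_iters=100, tol=1e-6):
        """Sketch of mean-field relaxation labeling (algorithm 4): iterate
        the fixed-point equation until the marginals b stabilize."""
        b = {j: {y: 1.0 / len(label_set) for y in label_set}
             for j in unobserved}

        def pot(i, j, yi, yj):
            return psi[(i, j)][(yi, yj)] if (i, j) in psi else psi[(j, i)][(yj, yi)]

        for _ in range(max_iters):
            delta = 0.0
            for j in unobserved:
                new = {}
                for yj in label_set:
                    val = phi[j][yj]
                    for (u, v) in edges:         # unobserved neighbors of j
                        i = u if v == j else (v if u == j else None)
                        if i is None:
                            continue
                        for yi in label_set:     # psi^{b_i(y_i)} factors
                            val *= pot(i, j, yi, yj) ** b[i][yi]
                    new[yj] = val
                z = sum(new.values()) or 1.0     # alpha: normalize
                new = {y: v / z for y, v in new.items()}
                delta = max(delta,
                            max(abs(new[y] - b[j][y]) for y in label_set))
                b[j] = new
            if delta < tol:                      # marginals stabilized
                break
        return b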

Experiments

In our experiments, we compared the four collective classification (CC) algorithms discussed in the previous sections and a content-only classifier (CO), which does not take the network into account, along with two choices of local classifier, on document classification tasks. The two local classifiers we tried were naïve Bayes (NB) and logistic regression (LR). This gave us eight different classifiers: CO with NB, CO with LR, ICA with NB, ICA with LR, GS with NB, GS with LR, MF, and LBP. The data sets we used for the experiments included both real-world and synthetic data sets.

Features Used

For the CO classifiers, we used the words in the documents as observed attributes. In particular, we used a binary value to indicate whether or not a word appears in the document. In ICA and GS, we used the same local attributes (that is, words), followed by count aggregation to count the number of each label value in a node's neighborhood. Finally, for LBP and MF, we used a pairwise MRF with clique potentials defined on the edges and unobserved nodes in the network.

Experimental Setup

Because we are dealing with network data, traditional approaches to experimental data preparation, such as using random sampling to construct training and test sets, may not be directly applicable. To test the effectiveness of collective classification, we want splits in which a reasonable number of neighbors are unlabeled. To achieve this, we use a snowball sampling (SS) evaluation strategy, in which we construct test splits by randomly selecting an initial node and expanding around it. We do not expand randomly; instead, we select nodes based on the class distribution of the whole corpus; that is, the test data is stratified. The nodes selected by SS are used as the test set, while the rest are used for training. We repeat this process k times to obtain k train-test pairs of splits. Besides experimenting on test splits created using SS, we also experimented with splits created using the standard k-fold cross-validation methodology, where we choose nodes randomly to create splits; we refer to this as RS.
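A rough Python sketch of one way such a stratified snowball split might be implemented follows; the selection heuristic shown (growing toward the most underrepresented class) is our own reading of the strategy, not code from the article.

    import random

    def snowball_split(nodes, neighbors, labels, test_size, class_dist):
        """Sketch of a stratified snowball sample: grow a test set from a
        random seed, preferring frontier nodes whose class is most
        underrepresented relative to the corpus distribution class_dist.
        All helper names here are our own."""
        seed = random.choice(list(nodes))
        test, frontier = {seed}, set(neighbors[seed])
        counts = {l: 0 for l in class_dist}
        counts[labels[seed]] += 1
        while len(test) < test_size and frontier:
            def deficit(n):
                l = labels[n]
                return class_dist[l] - counts[l] / len(test)
            nxt = max(frontier, key=deficit)     # stratified expansion
            frontier.remove(nxt)
            test.add(nxt)
            counts[labels[nxt]] += 1
            frontier |= set(neighbors[nxt]) - test
        return test  # the training set is nodes - test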

When using SS, some of the objects may appear in more than one test split. In that case, we need to adjust the accuracy computation so that we do not overcount those objects. A simple strategy is to average the accuracy for each instance first and then take the average of the averages. Further, to help the reader compare the SS results with the RS results, we also provide accuracies averaged per instance and across all instances that appear in at least one SS split. We denote these numbers using the term matched cross-validation (M).
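In code, the per-instance averaging might look like the following sketch (our own helper, with predictions assumed to be a flat list of per-split outcomes).

    from collections import defaultdict

    def matched_accuracy(predictions):
        """Average accuracy per instance first, then across instances, so
        nodes appearing in several snowball test splits are not overcounted.
        `predictions` is a list of (node_id, correct: bool) over all splits."""
        per_node = defaultdict(list)
        for node_id, correct in predictions:
            per_node[node_id].append(1.0 if correct else 0.0)
        node_averages = [sum(v) / len(v) for v in per_node.values()]
        return sum(node_averages) / len(node_averages)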

Learning the Classifiers

One aspect of the collective classification problem that we have not discussed so far is how to learn the various classifiers described in the previous sections. Learning refers to the problem of determining the parameter values for the local classifier, in the case of ICA and GS, and the entries in the clique potentials, in the case of LBP and MF, which can then subsequently be used to classify unseen test data. For all our experiments, we learned the parameter values from fully labeled data sets created through the splits generation methodology described above, using gradient-based optimization approaches. For a more detailed discussion see, for example, Taskar, Abbeel, and Koller (2002) and Sen and Getoor (2007).

Real-World Data Sets

We experimented with two real-world bibliographic data sets: Cora (McCallum et al. 2000) and CiteSeer (Giles, Bollacker, and Lawrence 1998). The Cora data set contains a number of machine-learning papers, each assigned to one of seven classes, while the CiteSeer data set has six class labels. For both data sets, we performed stemming and stop word removal, and we removed words with document frequency less than 10. The final corpus has 2,708 documents, 1,433 distinct words in the vocabulary, and 5,429 links in the case of Cora; and 3,312 documents, 3,703 distinct words in the vocabulary, and 4,732 links in the case of CiteSeer. For each data set, we performed both RS evaluation (with 10 splits) and SS evaluation (averaged over 10 runs).

Results. The accuracy results for the real-world data sets are shown in table 1, separated by sampling method and base classifier. We performed a t-test (paired where applicable, and a Welch t-test otherwise) to test the statistical significance of the differences between results. Here are the main findings:

Table 1. Accuracy Results for the Cora and CiteSeer Data Sets.

                      Cora                        CiteSeer
    Algorithm   SS      RS      M           SS      RS      M
    CO-NB       0.7285  0.7776  0.7476      0.7427  0.7487  0.7646
    ICA-NB      0.8054  0.8478  0.8271      0.7540  0.7683  0.7752
    GS-NB       0.7613  0.8404  0.8154      0.7596  0.7680  0.7737
    CO-LR       0.7356  0.7695  0.7393      0.7334  0.7321  0.7532
    ICA-LR      0.8457  0.8796  0.8589      0.7629  0.7732  0.7812
    GS-LR       0.8495  0.8810  0.8617      0.7574  0.7699  0.7843
    LBP         0.8554  0.8766  0.8575      0.7663  0.7759  0.7843
    MF          0.8555  0.8836  0.8631      0.7657  0.7732  0.7888

    For Cora, the CC algorithms outperformed their CO counterparts significantly, and the LR versions significantly outperformed the NB versions. ICA-NB outperformed GS-NB for SS and M; the other differences between ICA and GS (both NB and LR versions) were not significant. Even though MF outperformed ICA-LR, GS-LR, and LBP, the differences were not statistically significant. For CiteSeer, the CC algorithms significantly outperformed their CO counterparts, except for ICA-NB and GS-NB under matched cross-validation. CO and CC algorithms based on LR outperformed the NB versions, but the differences were not significant. ICA-NB outperformed GS-NB significantly for SS, but the remaining differences between the LR versions of ICA and GS, LBP, and MF were not significant.

First, do CC algorithms improve over their CO counterparts? In both data sets, the CC algorithms outperformed their CO counterparts under all evaluation strategies (SS, RS, and M). The performance differences were significant for all comparisons except the NB (M) results for CiteSeer.

Second, does the choice of base classifier affect the results of the CC algorithms? We observed a similar trend in the comparison between NB and LR: LR (and the CC algorithms that used LR as a base classifier) outperformed the NB versions in both data sets, and the difference was statistically significant for Cora.

Third, is there a CC algorithm that dominates the others? The results comparing the CC algorithms are less clear. In the NB partition, ICA-NB outperformed GS-NB significantly for Cora using SS and M, while GS-NB outperformed ICA-NB for CiteSeer using SS; thus, there was no clear winner between ICA-NB and GS-NB in terms of performance. In the LR partition, the differences between ICA-LR and GS-LR were again not significant for either data set. As for LBP and MF, they often slightly outperformed ICA-LR and GS-LR, but the differences were not significant.

Fourth, how do the SS and RS results compare? Finally, we take a look at the numbers in the columns labeled M. We remind the reader that even though we are comparing results on the test set that is the intersection of the two evaluation strategies (SS and RS), different training data could potentially have been used for each test instance, so the comparison can be questioned. Nonetheless, we expected the matched cross-validation (M) results to outperform the SS results simply because each instance had more labeled data around it from RS splitting. The differences were not big (around 1 or 2 percent), but they were significant. These results tell us that the evaluation strategy can have a big impact on the final results, and care must be taken when designing an experimental setup for evaluating CC algorithms on network data (Gallagher and Eliassi-Rad 2007).

Synthetic Data

We implemented a synthetic data generator following Sen and Getoor (2007). The process proceeds as follows. At each step, based on the ld parameter (a higher ld value means higher link density, that is, more links in the graph), we either add a link between two existing nodes or create a new node and link it to an existing node. When adding a link, we choose the source node randomly, but we choose the destination node using the dh parameter (which varies homophily by specifying the percentage, on average, of a node's neighbors that are of the same type) as well as the degree of the candidates (preferential attachment). When generating a node, we sample a class for it and generate its attributes based on this class using a binomial distribution; we then add a link between the new node and one of the existing nodes, again using the homophily parameter and the degrees of the existing nodes. In all of these experiments, we generated 1,000 nodes, where each node is labeled with one of five possible class labels and has 10 attributes. We experimented with varying degrees of homophily and link density in the generated graphs.
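The following Python sketch gives one plausible reading of this generator; parameters such as attr_p and the exact class-conditional attribute distribution are our own assumptions, since the article does not specify them.

    import random

    def generate_synthetic(num_nodes=1000, num_classes=5, num_attrs=10,
                           ld=0.5, dh=0.5, attr_p=0.3):
        """Sketch of the synthetic generator: at each step either add a link
        (probability ld) or create a new node and link it in; destinations
        are chosen by homophily (dh) and degree (preferential attachment).
        attr_p and the class-conditional attribute model are assumptions."""
        labels, attrs, edges, degree = {}, {}, [], {}

        def new_node(i):
            labels[i] = random.randrange(num_classes)
            # Attributes drawn from a class-dependent binomial distribution.
            p = attr_p * ((labels[i] + 1) / num_classes)
            attrs[i] = [int(random.random() < p) for _ in range(num_attrs)]
            degree[i] = 0

        def pick_destination(src):
            same = random.random() < dh      # homophily: same-class neighbor?
            candidates = [n for n in labels
                          if n != src and (labels[n] == labels[src]) == same]
            candidates = candidates or [n for n in labels if n != src]
            weights = [degree[n] + 1 for n in candidates]  # pref. attachment
            return random.choices(candidates, weights=weights, k=1)[0]

        new_node(0)
        i = 1
        while i < num_nodes:
            if len(labels) > 1 and random.random() < ld:
                src = random.choice(list(labels))   # link two existing nodes
                dst = pick_destination(src)
            else:
                new_node(i)                         # create and link new node
                src, dst = i, pick_destination(i)
                i += 1
            edges.append((src, dst))
            degree[src] += 1
            degree[dst] += 1
        return labels, attrs, edges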

Results. The results for varying values of dh, homophily, are shown in figure 3a. When homophily in the graph was low, both CO and CC algorithms performed equally well, which was the expected result based on similar work. As we increased the amount of homophily, all CC algorithms drastically improved their performance over the CO classifiers. With homophily at dh = 0.9, for example, the difference between our best performing CO algorithm and our best performing CC algorithm is about 40 percent. Thus, for data sets that demonstrate some level of homophily, using CC can significantly improve performance.

We present the results for varying the ld, link density, parameter in figure 3b. As we increased the link density of the graphs, the accuracies of all algorithms went up, possibly because the relational information became more significant and useful. However, the LBP accuracy dropped suddenly when the graph became very dense. The reason behind this result is the well-known fact that LBP has convergence issues when there are many short loops in the graph.

[Figure 3. Algorithm Accuracy. (a) Accuracy for different values of dh, varying the level of homophily: when homophily is very low, both CO and CC algorithms perform equally well, but as homophily increases, the CC algorithms improve over the CO classifier. (b) Accuracy for different values of ld, varying the link density: as link density increases, ICA and GS improve their performance the most, followed by MF; LBP, however, has convergence issues due to the increased number of cycles and in fact performs worse than CO at high link density.]

Practical Issues

In this section, we discuss some of the practical issues to consider when applying the various CC algorithms. First, although MF and LBP performance is in some cases a bit better than that of ICA and GS, MF and LBP were also the most difficult to work with, in both learning and inference. Choosing the initial weights so that the weights converge during training is nontrivial; most of the time, we had to initialize the weights with the weights we got from ICA in order to get the algorithms to converge. Thus, the results reported for MF and LBP needed some extra help from ICA to get started. We also note that, of the two, we had the most trouble with MF: it was often unable to converge or, when it did converge, it did not converge to the global optimum. Our difficulty with MF and LBP is consistent with previous work (Weiss 2001; Mooij and Kappen 2004; Yanover and Weiss 2002) and should be taken into consideration when choosing to apply these algorithms.

In contrast, the ICA and GS parameter initializations worked for all the data sets we used, and we did not have to tune the initializations for these two algorithms. They were the easiest to train and test among all the collective classification algorithms evaluated.

Finally, while ICA and GS produced very similar results in almost all experiments, ICA is a much faster algorithm than GS. On our largest data set, CiteSeer, for example, ICA-NB took 14 minutes to run, while GS-NB took over 3 hours. The large difference is due to the fact that ICA converges in just a few iterations, whereas GS has to go through significantly more iterations per run, due both to the initial burn-in stage (200 iterations) and to the need to run a large number of iterations to get a sufficiently large sampling (800 iterations).

Related Work

Even though collective classification has gained attention only in the past five to seven years, initiated by the work of Jennifer Neville and David Jensen (Neville and Jensen 2000; Jensen, Neville, and Gallagher 2004; Neville and Jensen 2007) and the work of Ben Taskar and colleagues (Taskar, Segal, and Koller 2001; Getoor et al. 2001; Taskar, Abbeel, and Koller 2002; Taskar, Guestrin, and Koller 2003), the general problem of inference for structured output spaces has received attention for a considerably longer period of time from various research communities, including computer vision, spatial statistics, and natural language processing. In this section, we attempt to describe some of the work most closely related to the work described in this article; however, due to the widespread interest in collective classification, our list is sure to be incomplete.

One of the earliest principled approximate inference algorithms, relaxation labeling (Hummel and Zucker 1983), was developed by researchers in computer vision in the context of object labeling in images. Due to its simplicity and appeal, relaxation labeling was a topic of active research for some time, and many researchers developed different versions of the basic algorithm (Li, Wang, and Petrou 1994). Mean-field relaxation labeling (Weiss 2001; Yedidia, Freeman, and Weiss 2005), discussed in this article, is a simple instance of this general class of algorithms. Besag (1986) also considered the statistical analysis of images and proposed a particularly simple approximate inference algorithm called iterated conditional modes, which is one of the earliest descriptions, and a specific version, of the iterative classification algorithm presented in this article. Besides computer vision, researchers working with an iterative decoding scheme known as "Turbo Codes" came up with the idea of applying Pearl's belief propagation algorithm to networks with loops. This led to the development of the approximate inference algorithm that we refer to in this article as loopy belief propagation (LBP), also known as the sum-product algorithm.

Another area that often uses collective classification techniques is document classification. Chakrabarti, Dom, and Indyk (1998) were among the first to apply collective classification to a corpus of patents linked by hyperlinks, and they reported that considering the attributes of neighboring documents actually hurts classification performance. Slattery and Craven (1998) also considered the problem of document classification, constructing features from neighboring documents using an inductive logic programming rule learner. Yang, Slattery, and Ghani (2002) conducted an in-depth investigation over multiple data sets commonly used for document classification experiments and identified different patterns. Since then, collective classification has also been applied to various other applications, such as part-of-speech tagging (Lafferty, McCallum, and Pereira 2001), classification of hypertext documents using hyperlinks (Taskar, Abbeel, and Koller 2002), link prediction in friend-of-a-friend networks (Taskar et al. 2003), classification of email "speech acts" (Carvalho and Cohen 2005), counterterrorism analysis (Macskassy and Provost 2005), and targeted marketing (Hill, Provost, and Volinsky 2007).

Besides the four approximate inference algorithms discussed in this article, there are other algorithms that we did not discuss, such as graph-cuts-based formulations and formulations based on linear programming relaxations. More recently, there have been attempts to extend collective classification techniques to the semisupervised learning scenario (Xu et al. 2006; Macskassy 2007).

Conclusion

In this article, we gave a brief description of four popular collective classification algorithms. We explained the algorithms, showed how to apply them to various applications using examples, and highlighted various issues that have been the subject of investigation in the past. Most of the inference algorithms available for practical tasks relating to collective classification are approximate. We believe that a better understanding of when these algorithms perform well will lead to more widespread application of these algorithms to more real-world tasks, and that this should be a subject of future research. Most of the current applications of these algorithms have been on homogeneous networks with a single type of unobserved variable that shares a common domain. Even though extending these ideas to heterogeneous networks is conceptually simple, we believe that further investigation into techniques that do so will lead to novel approaches to feature construction and a deeper understanding of how to improve the classification accuracies of approximate inference algorithms. Collective classification has been a topic of active research for the past decade, and we hope that articles such as this one will help more researchers gain an introduction to this area, thus promoting further research into the understanding of existing approximate inference algorithms and perhaps helping to develop new, improved inference algorithms.

Acknowledgements

We would like to thank Luke McDowell for his useful and detailed comments. We would also like to thank Jen Neville, David Jensen, Foster Provost, and the reviewers for their comments. This material is based upon work supported in part by the National Science Foundation under Grant No. 0308030. In addition, this work was partially performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

References

Besag, J. 1986. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society Series B 48(3): 259–302.

Carvalho, V., and Cohen, W. W. 2005. On the Collective Classification of Email Speech Acts. In Proceedings of the 28th Annual International ACM Special Interest Group on Information Retrieval Conference on Research and Development in Information Retrieval, 345–352. New York: Association for Computing Machinery.

Chakrabarti, S.; Dom, B.; and Indyk, P. 1998. Enhanced Hypertext Categorization Using Hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 308–318. New York: Association for Computing Machinery.

Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Gallagher, B., and Eliassi-Rad, T. 2007. An Evaluation of Experimental Methodology for Classifiers of Relational Data. Paper presented at the Workshop on Mining Graphs and Complex Structures, IEEE International Conference on Data Mining, Omaha, NE, 28–31 October.

Geman, S., and Geman, D. 1984. Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6): 721–742.

Getoor, L.; Segal, E.; Taskar, B.; and Koller, D. 2001. Probabilistic Models of Text and Link Structure for Hypertext Classification. Paper presented at the IJCAI Workshop on Text Learning: Beyond Supervision, Seattle, WA, 6 August.

Giles, C. L.; Bollacker, K.; and Lawrence, S. 1998. CiteSeer: An Automatic Citation Indexing System. In Digital Libraries 98: The Third ACM Conference on Digital Libraries. New York: Association for Computing Machinery.


Prithviraj Sen (www.cs.umd.edu/~sen) is a Ph.D. student in the Department of Computer Science at the University of Maryland, College Park. His primary interests lie in the fields of machine learning and probabilistic databases. He received a BTech (Comp. Sc.) from the Indian Institute of Technology, Bombay, in 2003 and an MS (Comp. Sc.) from the University of Maryland in 2006.

Galileo Mark Namata (www.cs.umd.edu/~namatag) is a Ph.D. student at the University of Maryland, College Park. He received his B.S. in computer science from Georgetown University in 2003 and his M.S. in computer science from the University of Maryland in 2008. His research interests include artificial intelligence, machine learning, knowledge discovery, and data mining, with a particular focus on network data.

Mustafa Bilgic received his MS degree in computer science from the University of Maryland at College Park in 2006, where he is currently a Ph.D. candidate. His research interests are in the areas of machine learning, information acquisition, and decision theory.

Lise Getoor (www.cs.umd.edu/~getoor) received her Ph.D. in computer science from Stanford University in 2001. She is an associate professor in the Department of Computer Science and a member of the Institute for Advanced Computer Studies, University of Maryland, College Park. Her research interests include machine learning, reasoning under uncertainty, databases, and artificial intelligence.

Brian Gallagher is a computer scientist in the Science and Technology Computing Division at Lawrence Livermore National Laboratory. He received a B.A. in computer science from Carleton College in 1998 and an M.S. in computer science from the University of Massachusetts Amherst in 2004. His research interests include artificial intelligence, machine learning, and knowledge discovery and data mining. His current research focuses on classification and analysis in network-structured data.

Tina Eliassi-Rad is a staff computer scientist at the Center for Applied Scientific Computing within Lawrence Livermore National Laboratory. She received her Ph.D. in computer sciences at the University of Wisconsin-Madison in 2001. Broadly speaking, her research interests include artificial intelligence, machine learning, knowledge discovery, and data mining. Her work has been applied to the world wide web, scientific simulation data, and complex networks.
