
Infinite Hidden Relational Models

Zhao Xu
Institute for Computer Science, University of Munich, Munich, Germany

Volker Tresp, Kai Yu
Information and Communications, Corporate Technology, Siemens AG, Munich, Germany

Hans-Peter Kriegel
Institute for Computer Science, University of Munich, Munich, Germany

Proc. 22nd Conf. on Uncertainty in Artificial Intelligence (UAI'06), Cambridge, MA, 2006.

Abstract

Relational learning analyzes the probabilistic constraints between the attributes of entities and relationships. We extend the expressiveness of relational models by introducing for each entity (or object) an infinite-dimensional latent variable as part of a Dirichlet process (DP) mixture model. We discuss inference in the model, which is based on a DP Gibbs sampler, i.e., the Chinese restaurant process. We extended the Chinese restaurant process to be applicable to relational modeling. We discuss how information is propagated in the network of latent variables, reducing the necessity for extensive structural learning. In the context of a recommendation engine our approach realizes a principled solution for recommendations based on features of items, features of users and relational information. Our approach is evaluated in three applications: a recommendation system based on the MovieLens data set, the prediction of gene function using relational information and a medical recommendation system.

1 Introduction

Relational learning (Dzeroski & Lavrac, 2001; Raedt & Kersting, 2003; Wrobel, 2001; Friedman et al., 1999) is an object-oriented approach that clearly distinguishes between entities (e.g., objects), relationships and their respective attributes, and it represents an area of growing interest in machine learning. A simple example of a relational system is a recommendation system: based on the attributes of two entities, i.e., of the user and the item, one wants to predict relational attributes like the preference (rating, willingness to purchase, ...) of this user for this item. In many circumstances, the attributes of the entities are rather weak predictors, in which case one can exploit the known relationship attributes to predict unknown entity or relationship attributes (Yu et al., 2004). In recommendation systems, the latter situation is often referred to as collaborative filtering. Although the unique identifier of an entity to which a relationship exists might often be used as a feature, it has the disadvantage that it does not permit the generalization to new entities. From this point of view it is more advantageous to introduce a latent variable representing unknown attributes of the entities, which is the approach pursued in this paper. Attributes of entities are now children of the corresponding entity latent variable, and attributes of relationships are children of the latent variables of the entities participating in the relationship. By introducing the latent variables, the ground network forms a relational network of latent variables. Thus, our hidden relational model can be viewed as a direct generalization of hidden Markov models used in speech or hidden Markov random fields used in vision (such models are discussed, for example, in Yedidia et al., 2005). As in those models, information can propagate across the network of latent variables in the hidden relational model, which reduces the need for extensive structural model selection. Structural model selection is a major problem in relational learning due to the exponentially many features an attribute might depend on. Thus, information about my grandfather can propagate to me via the latent variable of my father.

Since each entity class might have a different number of states in its latent variables, it is natural to allow the model to determine the appropriate number of latent states in a self-organized way. This is possible by embedding the model in Dirichlet process (DP) mixture models, which can be interpreted as mixture models with an infinite number of mixture components, but where the model, based on the data, automatically reduces the complexity to an appropriate finite number of components. The DP mixture model also allows us to view our infinite hidden relational model as a generalization of nonparametric hierarchical Bayesian modeling to relational models (compare also Xu et al., 2005). The combination of the hidden relational model and the DP mixture model is the infinite hidden relational model.

After presenting related work we will briefly introduce our preferred framework for describing relational models, i.e., the directed acyclic probabilistic entity-relationship (DAPER) model. In Section 4 we will describe infinite hidden relational models and in Section 5 we introduce a modified Chinese restaurant sampling process to accommodate the relational structure. In the subsequent sections we describe experimental results applying infinite hidden relational models to encode movie preferences for users, to predict the functional class of a gene, and on a medical example, modeling the relationship between patients, diagnoses and procedures. In Section 9 we will present conclusions.

2 Related Work

Our approach can be related to some existing work. (Getoor et al., 2000) refined probabilistic relational models with class hierarchies, which specialize distinct probabilistic dependencies for each subclass. (Rosen-Zvi et al., 2004) introduced an author-topic model for documents. The model implicitly explores the two relationships between documents and authors and between documents and words. (Kemp et al., 2004) showed a relational model with latent classes. (Carbonetto et al., 2005) introduced the nonparametric BLOG model, which specifies nonparametric probabilistic distributions over possible worlds defined by first-order logic. These models demonstrated good performance in certain applications. However, most are restricted to domains with simple relations. The proposed model goes beyond that by considering multiple related entities. In addition, its nonparametric nature allows the complexity of the model to be tuned by the model itself based on the available data set.

3 The DAPER Model

The DAPER model (Heckerman et al., 2004) formulates a probabilistic framework for an entity-relationship database model. The DAPER model consists of entity classes, relationship classes, attribute classes and arc classes, as well as local distribution classes and constraint classes. Figure 1 shows an example of a DAPER model for a universe of students, courses and grades. The entity classes specify classes of objects in the real world, e.g., Student and Course, shown as rectangles in Figure 1.

Figure 1: An example of a DAPER model over the university domain, from Heckerman et al. (2004).

The relationship class represents interactions among entity classes. It is shown as a diamond-shaped node with dashed lines linked to the related entity classes. For example, the relationship Take(s, c) indicates that a student s takes a course c. Attribute classes describe properties of entities or relationships. Attribute classes are connected to the corresponding entity/relationship class by a dashed line. For example, associated with courses is the attribute class Course.Difficulty. The attribute class θ in Figure 1 represents the parameters specifying the probability of a student's grade in different configurations (i.e., the course's difficulty and the student's IQ). The arc classes, shown as solid arrows from "parent" to "child", represent probabilistic dependencies among corresponding attributes. For example, the solid arrow from Student.IQ to Course.Grade specifies the fact that a student's grade probabilistically depends on the student's IQ. For more details please refer to (Heckerman et al., 2004). A relationship attribute might have the special attribute Exist, with Exist = 0 indicating that the relationship does not exist (Getoor et al., 2003). Given particular instantiations of entities and relationships, a ground Bayesian network can be formed which consists of all attributes in the domain linked by the resulting arc classes.
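To make the DAPER notions concrete, the following minimal sketch (in Python; the container classes EntityClass and RelationshipClass are our own illustrative names, not part of the DAPER formalism) encodes the university example of Figure 1 as plain data structures.

from dataclasses import dataclass, field
from typing import List

# Minimal illustrative containers for a DAPER-style schema: entity classes with
# attribute classes, and relationship classes linking entity classes.

@dataclass
class EntityClass:
    name: str                                   # e.g. "Student", "Course"
    attributes: List[str] = field(default_factory=list)

@dataclass
class RelationshipClass:
    name: str                                   # e.g. "Take"
    entities: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)

# The university example of Figure 1.
student = EntityClass("Student", ["IQ"])
course = EntityClass("Course", ["Difficulty"])
take = RelationshipClass("Take", ["Student", "Course"], ["Grade", "Exist"])

# Arc classes: probabilistic dependencies among attribute classes,
# here the grade depends on Student.IQ and Course.Difficulty.
arcs = [("Student.IQ", "Grade"), ("Course.Difficulty", "Grade")]
print(student, course, take, arcs)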

4 Infinite Hidden Relational Models

4.1 Hidden Relational Models

Figure 2: Infinite hidden relational model for movie recommendation.


An example of a hidden relational model is shown in Figure 2. The example shows a movie recommendation system with entity classes User and Movie and relationship class Like. Furthermore, there are the attributes UserAttributes, MovieAttributes and R (rating) and various parameters and hyperparameters. The first innovation in our approach is to introduce for each entity a latent variable, in the example denoted as $Z^u$ and $Z^m$. They can be thought of as unknown attributes of the entities and are the parents of both the entity attributes and the relationship attributes. The underlying assumption is that if the latent variable were known, both entity attributes and relationship attributes could be well predicted. The most important result from introducing the latent variables is that now information can propagate through the ground network of interconnected latent variables. Let us consider the prediction of the relationship attribute R. If both the associated user and movie have strong known attributes, those will determine the states of the latent variables and the prediction for R is mostly based on the entity attributes. In terms of a recommender system we would obtain a content-based recommendation system. Conversely, if the known attributes are weak, then the state of the latent variable for the user might be determined by the relationship attributes in relation to other movies and the states of those movies' latent variables. With the same argument, the state of the latent variable for the movie might be determined by the relationship attributes in relation to other users and the states of those users' latent variables. So by introducing the latent variables, information can globally distribute in the ground network defined by the relationship structure. This reduces the need for extensive structural learning, which is particularly difficult in relational models due to the huge number of potential parents. Note that a similar propagation of information can be observed in hidden Markov models in speech systems or in the hidden Markov random fields used in image analysis (Yedidia et al., 2005). In fact the hidden relational model can be viewed as a generalization of both for relational structures.

We now complete the model by introducing the parameters. First we consider the user parameters. Assume that $Z^u$ has $r^u$ states and that $\pi^u = (\pi^u_1, \ldots, \pi^u_{r^u})$ are multinomial parameters with $P(Z^u = i) = \pi^u_i$ ($\pi^u_i \ge 0$, $\sum_i \pi^u_i = 1$). The multinomial parameters are drawn from a Dirichlet prior with $\pi^u \sim \mathrm{Dir}(\cdot \,|\, \alpha^u_0/r^u, \ldots, \alpha^u_0/r^u)$.

In the experiments all user attributes are assumed to be discrete and independent given $Z^u$. Thus, a particular user attribute $A^u$ with $r$ states is a sample from a multinomial distribution with $P(A^u = i) = \phi_i$ and $(\phi_1, \ldots, \phi_r) \sim G^u_0 = \mathrm{Dir}(\cdot \,|\, \beta^*_1, \ldots, \beta^*_r)$.

It is also convenient to re-parameterize

$$\beta_0 = \sum_{k=1}^{r} \beta^*_k, \qquad \beta_k = \frac{\beta^*_k}{\beta_0}, \quad k = 1, \ldots, r,$$

and $\beta = \{\beta_1, \ldots, \beta_r\}$. In the application, we assume a neutral prior with $\beta_k = 1/r$, which represents our prior belief that the multinomial parameters should be equal. $\beta_0$ is a parameter indicating how strongly we believe that the prior distribution represented by $\beta$ should be true. We fine-tune $\beta_0$ using a cross-validation procedure.

The parameters for the entity class Movie and the relationship class Like are defined in an equivalent way. Note that for the relationship attribute R, $r^u \times r^m$ parameter vectors $\gamma$ are generated.
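As a small numerical illustration of this parameterization (a sketch assuming NumPy; the values of $r^u$, $r$, $\alpha^u_0$ and $\beta_0$ are arbitrary), the following draws $\pi^u$ from its symmetric Dirichlet prior, draws one attribute parameter vector $\phi$ per latent state from $G^u_0 = \mathrm{Dir}(\cdot\,|\,\beta_0\beta_1, \ldots, \beta_0\beta_r)$ with the neutral $\beta_k = 1/r$, and then samples a latent state and an attribute value.

import numpy as np

rng = np.random.default_rng(0)
r_u, alpha_u0 = 4, 10.0          # number of latent user states, Dirichlet concentration
r, beta_0 = 3, 5.0               # states of one user attribute, prior strength

# pi^u ~ Dir(alpha_u0/r_u, ..., alpha_u0/r_u): weights over the latent states.
pi_u = rng.dirichlet(np.full(r_u, alpha_u0 / r_u))

# Neutral prior beta_k = 1/r; one phi vector per latent state, phi ~ Dir(beta_0 * beta).
beta = np.full(r, 1.0 / r)
phi = rng.dirichlet(beta_0 * beta, size=r_u)

# Sample a latent state, then the attribute given that state's parameters.
z_u = rng.choice(r_u, p=pi_u)
a_u = rng.choice(r, p=phi[z_u])
print(pi_u, z_u, a_u)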

4.2 Infinite Hidden Relational Models

The latent variables play a key role in our model, and in many applications we would expect that the model might require a large number of states for the latent variables. Consider again the movie recommendation system. With little information about past ratings all users might look the same (movies are globally liked or disliked); with more information available, one might discover certain clusters in the users (action movie aficionados, comedy aficionados, ...), but with an increasing number of past ratings the clusters might show increasingly detailed structure, ultimately indicating that everyone is an individual. It thus makes sense to permit an arbitrary number of latent states by using a Dirichlet process mixture model. This permits the model to decide itself about the optimal number of states for the latent variables. In addition, the infinite hidden relational model can now also be viewed as a direct generalization of a nonparametric hierarchical Bayesian approach. For an introduction to Dirichlet processes see, for example, Teh et al. (2004) and Tresp (2006). For our discussion it suffices to say that we obtain an infinite hidden relational model by simply letting the number of states approach infinity, $r^u \to \infty$, $r^m \to \infty$. Although a model with infinite numbers of states and parameters cannot be represented, it turns out that sampling in such a model is elegant and simple, as shown in the next section.

In the Dirichlet mixture model, $\alpha_0$ determines the tendency of the model to use either a large or a small number of states in the latent variables, which is also apparent from the sampling procedures described below. In our experiments, we found that $\alpha_0$ is rather uncritical and was fixed for all models to be equal to ten.
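The effect of $\alpha_0$ can be seen in a quick simulation of the Chinese restaurant process (a sketch assuming NumPy; the sample sizes are arbitrary): the larger $\alpha_0$, the more clusters tend to be occupied.

import numpy as np

def crp_num_clusters(n, alpha0, rng):
    # Simulate a Chinese restaurant process with n customers and return
    # the number of occupied tables (clusters).
    counts = []                                  # customers per table
    for i in range(n):
        probs = np.array(counts + [alpha0], dtype=float)
        probs /= i + alpha0                      # existing tables: N_k/(i+alpha0); new table: alpha0/(i+alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                     # open a new table
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
for alpha0 in (1.0, 10.0, 100.0):
    ks = [crp_num_clusters(1000, alpha0, rng) for _ in range(20)]
    print(alpha0, np.mean(ks))                   # larger alpha0 -> more clusters on average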


5 Sampling in the Infinite Hidden Relational Model

Although a Dirichlet process mixture model contains an infinite number of parameters and states, the sampling procedure only deals with a growing but finite representation. This sampling procedure is based on the Chinese restaurant process (CRP), where a state of a latent variable is identified as a cluster, i.e., a table in a restaurant. We will now describe how the CRP is applied to the infinite hidden relational model. The procedure differs from the standard model by the sampling of the relational attribute, where two CRPs are coupled. We omit the description of the sampling of the attributes, which is straightforward given parameter samples.

• The first user is assigned to user cluster 1, $Z^u_1 = 1$; an associated parameter is generated, $\phi_1 \sim G^u_0$. Similarly, the first movie is assigned to movie cluster 1, $Z^m_1 = 1$; an associated parameter is generated, $\theta_1 \sim G^m_0$.

• The parameter $\gamma_{1,1} \sim G^R_0$ is drawn for the attributes of the relation R between the first user cluster and the first movie cluster.

• With probability $1/(1 + \alpha^u_0)$, the second user is also assigned to user cluster 1, $Z^u_2 = 1$, and inherits $\phi_1$ and $\gamma_{1,1}$; with probability $\alpha^u_0/(1 + \alpha^u_0)$ the user is assigned to cluster 2, $Z^u_2 = 2$, and new parameters are generated, $\phi_2 \sim G^u_0$ and $\gamma_{2,1} \sim G^R_0$.

• Equivalently, the second movie is generated. With probability $1/(1 + \alpha^m_0)$, the second movie is assigned to movie cluster 1, $Z^m_2 = 1$, and inherits $\theta_1$ and $\gamma_{1,1}$; with probability $\alpha^m_0/(1 + \alpha^m_0)$ the movie is assigned to cluster 2, $Z^m_2 = 2$, and new parameters are generated, $\theta_2 \sim G^m_0$, $\gamma_{1,2} \sim G^R_0$ and $\gamma_{2,2} \sim G^R_0$ (if the second user cluster has appeared so far).

• We continue this process. After $N^u$ users and $N^m$ movies have been generated, $K^u$ user clusters and $K^m$ movie clusters appear, $N^u_i$ users are assigned to user cluster $i$, and $N^m_j$ movies are assigned to movie cluster $j$.

• New user:

– User $N^u + 1$ is assigned with probability $\frac{N^u_i}{N^u + \alpha^u_0}$ to a previously existing cluster $i$ and inherits $\phi_i$ and $\{\gamma_{i,l}\}_{l=1}^{K^m}$. Thus: $Z^u_{N^u+1} = i$, $N^u_i \leftarrow N^u_i + 1$.

– With probability $\frac{\alpha^u_0}{N^u + \alpha^u_0}$ the user is assigned to a new cluster $K^u + 1$. Thus: $Z^u_{N^u+1} = K^u + 1$, $N^u_{K^u+1} = 1$.

– For the new user cluster, new parameters are generated: $\phi_{K^u+1} \sim G^u_0$ and $\gamma_{K^u+1,l} \sim G^R_0$, $l = 1, \ldots, K^m$. $K^u \leftarrow K^u + 1$.

• New movie:

– The generative process for movie $N^m + 1$ is equivalent. Movie $N^m + 1$ is assigned with probability $\frac{N^m_j}{N^m + \alpha^m_0}$ to a previously existing cluster $j$ and inherits $\theta_j$ and $\{\gamma_{l,j}\}_{l=1}^{K^u}$. Thus: $Z^m_{N^m+1} = j$, $N^m_j \leftarrow N^m_j + 1$.

– With probability $\frac{\alpha^m_0}{N^m + \alpha^m_0}$ the movie is assigned to a new cluster $K^m + 1$. Thus: $Z^m_{N^m+1} = K^m + 1$, $N^m_{K^m+1} = 1$.

– For the new movie cluster, new parameters are generated: $\theta_{K^m+1} \sim G^m_0$ and $\gamma_{l,K^m+1} \sim G^R_0$, $l = 1, \ldots, K^u$. $K^m \leftarrow K^m + 1$.
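A condensed sketch of the user side of this coupled process is given below (in Python with NumPy; the base measures $G^u_0$ and $G^R_0$ are simplified to a symmetric Dirichlet and a Beta distribution, and all function names are ours). A new user either joins an existing cluster, inheriting $\phi_i$ and the row $\{\gamma_{i,l}\}$, or opens a new cluster with freshly drawn parameters; movies are handled symmetrically.

import numpy as np

rng = np.random.default_rng(1)
alpha_u0, r_attr = 10.0, 3           # DP concentration for users, attribute arity

def draw_phi():                      # phi ~ G_0^u, here a symmetric Dirichlet
    return rng.dirichlet(np.ones(r_attr))

def draw_gamma():                    # gamma ~ G_0^R for one (user cluster, movie cluster) pair
    return rng.beta(1.0, 1.0)        # probability that R = 1

def add_user(z_u, counts_u, phi, gamma, K_m):
    # Assign user N^u+1 to an existing or a new user cluster (one CRP step).
    n_u = len(z_u)
    probs = np.array(counts_u + [alpha_u0]) / (n_u + alpha_u0)
    k = rng.choice(len(probs), p=probs)
    if k == len(counts_u):                       # new cluster K^u + 1
        counts_u.append(1)
        phi.append(draw_phi())
        gamma.append([draw_gamma() for _ in range(K_m)])
    else:                                        # existing cluster: inherit its parameters
        counts_u[k] += 1
    z_u.append(k)

z_u, counts_u, phi, gamma = [], [], [], []
K_m = 2                                          # assume two movie clusters already exist
for _ in range(10):
    add_user(z_u, counts_u, phi, gamma, K_m)
print(z_u, counts_u)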

The previous procedure generates samples from the generative model. Now we consider sampling from a model given data, i.e., given a set of movie attributes, user attributes and ratings. We assume that the model has U users and M movies and that some instances of $A^u$, $A^m$ and R are known. The goal is now to generate samples of the parameters $\phi, \theta, \gamma$ and the latent variables $Z^u$ and $Z^m$, which allows us to then make predictions about unknown attributes. We exploit Gibbs sampling inference based on the Chinese restaurant procedure as described in the Appendix. Note that since the attributes appear as children, unknown attributes can be marginalized out and thus removed from the model, greatly reducing the complexity. Although the DP mixture model contains an infinite number of states, in the Gibbs sampling procedure only a finite number of states is ever occupied, providing an estimate of the true underlying number of clusters (Tresp, 2006). Details on the Gibbs sampler can be found in the Appendix.

6 Experiment on MovieLens

We first evaluate our model on the MovieLens data, which contains movie ratings from a large number of users (Sarwar et al., 2000). The task is to predict whether a user likes a movie. There are two entity classes (User and Movie) and one relationship class (Like: users like movies). The User class has several attribute classes such as Age, Gender and Occupation. The Movie class has attribute classes such as Published-year, Genres and so on. The relationship has an additional attribute R with two states: R = 1 indicates that the user likes the movie and R = 0 indicates otherwise. The model is shown in Figure 2. The data set contains 943 users and 1680 movies in total. User ratings on movies are originally recorded on a five-point scale, ranging from 1 to 5. We convert the ratings to binary values: a rating is treated as positive (R = 1) if it is higher than the user's average rating, and as negative (R = 0) otherwise. Model performance is evaluated using prediction accuracy. The experimental results are shown in Table 1. First we performed experiments ignoring the attributes of the users and the items.
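The per-user binarization described above can be sketched as follows (assuming pandas; the column names are placeholders, not those of the MovieLens files).

import pandas as pd

# Hypothetical MovieLens-style ratings table; column names are illustrative.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "movie_id": [10, 11, 12, 10, 13],
    "rating":   [5, 3, 2, 4, 4],
})

# R = 1 if the rating is above the user's own average rating, else R = 0.
user_mean = ratings.groupby("user_id")["rating"].transform("mean")
ratings["R"] = (ratings["rating"] > user_mean).astype(int)
print(ratings)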

Proc. 22nd Conf. on Uncertainty in Artificial Intelligence (UAI'06) Cambridge, MA, 2006.

Page 5: Infinite Hidden Relational Models · respective attributes and represents an area of growing interest in machine learning. A simple example of a relational system is a recommendation

Table 1: The accuracy of predicting relationships between Users and Movies

Method                                                      Accuracy (%)
E1: Collaborative filtering 1                               64.22
E2: Collaborative filtering 2                               64.66
E3: Infinite hidden relational model without attributes     69.97
E4: Infinite hidden relational model                        70.3
E5: Content based SVM                                       54.85

We achieved an accuracy of 69.97% (E3). This is significantly better in comparison to approaches using one-sided collaborative filtering by generalizing across users (E1), leading to an accuracy of 64.22%, or by generalizing across items (E2), leading to an accuracy of 64.66%. When we added information about the attributes of the users and the items, the prediction accuracy only improved insignificantly to 70.3% (E4): the reason is that the attributes are weak predictors of preferences, as indicated by the poor performance of the SVM prediction (54.85% accuracy, E5), which is solely based on the attributes of the users and the items.

7 Experiment on Medical Data

The second experiment is concerned with a medical domain. The proposed model is shown in Figure 3(a). The domain includes three entity classes (Patient, Diagnosis and Procedure) and two relationship classes (Make: a physician makes a diagnosis for a patient, and Take: a patient takes a procedure). A patient typically has both multiple procedures and multiple diagnoses. The Patient class has several attribute classes including Age, Gender and PrimaryComplaint. To reduce the complexity of Figure 3(a), patient characteristics are grouped together as PatientAttributes (these attributes are not aggregated in learning and inference). The DiagnosisAttributes contain the class of the diagnosis as specified in the ICD-9 code, and the ProcedureAttributes contain the class of the procedure as specified in the CPT4 code. Both the relationships between the patients and the procedures and the relationships between the patients and the diagnoses are modeled with existence uncertainty. $R_{pa,pr} = 1$ means that the patient received the procedure and $R_{pa,pr} = 0$ indicates otherwise. Equivalently, $R_{pa,dg} = 1$ means that the patient received the diagnosis and $R_{pa,dg} = 0$ indicates otherwise. The data contain 14062 patients, 703 diagnoses and 367 procedures in total. The infinite hidden relational model contains three DPs, one for each entity class. We compare our approach with two models. The first one is a relational model using reference uncertainty (Getoor et al., 2003) without a latent variable structure. The second comparison model is a content-based Bayesian network. In this model, only the attributes of patients and procedures determine whether a procedure is prescribed.

We test model performance by predicting the application of procedures. ROC curves are used as the evaluation criterion. In the experiment we selected the top N procedures recommended by the various models. Sensitivity indicates what percentage of the procedures actually performed were correctly proposed by the model. (1-specificity) indicates how many of the procedures that were not actually performed were recommended by the model. Along the curves, N was varied from left to right as N = 5, 10, . . . , 50.
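A sketch of this evaluation for a single patient (assuming NumPy; the names and the toy data are illustrative): rank the procedures by predicted score, recommend the top N, and compute sensitivity and (1-specificity) against the procedures actually performed.

import numpy as np

def sens_one_minus_spec(scores, actual, n):
    # scores: predicted score per procedure; actual: 0/1 indicator of procedures
    # actually performed; n: number of recommended procedures.
    top_n = np.argsort(-scores)[:n]              # indices of the N highest-scored procedures
    recommended = np.zeros_like(actual)
    recommended[top_n] = 1
    tp = np.sum((recommended == 1) & (actual == 1))
    fp = np.sum((recommended == 1) & (actual == 0))
    sensitivity = tp / max(actual.sum(), 1)
    one_minus_specificity = fp / max((1 - actual).sum(), 1)
    return sensitivity, one_minus_specificity

rng = np.random.default_rng(0)
scores = rng.random(367)                         # one predicted score per procedure
actual = (rng.random(367) < 0.02).astype(int)    # a few procedures actually performed
for n in (5, 10, 20, 50):
    print(n, sens_one_minus_spec(scores, actual, n))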

In the experiment we predict a relation between a patient and a procedure given her first procedure. The corresponding ROC curves (averaged over all patients) for the experiments are shown in Figure 4. The infinite hidden relational model (E3), exploiting all relational information and all attributes, gave the best performance. When we remove the attributes of the entities, the performance degrades (E2). If, in addition, we only consider the one-sided collaborative effect, the performance is even worse (E1). (E5) is the pure content-based approach using the Bayesian network. The results show that entity attributes are a reasonable predictor, but the performance of the full model cannot be achieved. (E4) shows the results of the relational model using reference uncertainty, which gave good results but did not achieve the performance of the infinite hidden relational model. Figure 5 shows the corresponding plots for a selected class of patients, namely patients whose primary complaint is a respiratory problem. The results exhibit similar trends.

Figure 4: ROC curves for predicting procedures.

8 Experiment on Gene Data

The third evaluation is performed on the yeast genome data set of KDD Cup 2001 (Cheng et al., 2002).


Figure 3: Infinite hidden relational model for (a) a medical database and (b) a gene database.

Figure 5: ROC curves for predicting procedures on a subset of patients whose primary complaint is a respiratory problem.

The goal is to evaluate the proposed algorithm on modeling the dependencies between relationships instantiated from various relationship classes. The genomes of several organisms have been sequenced. Genes code for proteins; the proteins localize in various parts of the cells and interact with one another to perform crucial functions. Traditionally, the functions of genes/proteins are predicted by comparison with characterized genes/proteins based on sequence similarity. But only 52% of the 6449 yeast proteins have been characterized. Of the remaining ones, only 4% show strong similarity with the known ones at the sequence level. It is therefore necessary to integrate other information to characterize genes/proteins. The goal of our experiment is to predict functions of genes based on information not only at the gene level but also at the protein level. The data set consists of two relational tables that are produced from the original seven relational tables. One table specifies a variety of properties of genes or proteins. These properties include chromosome, essential, phenotype, motif, class, complex and function. Chromosome expresses the chromosome on which the gene appears. Essential specifies whether organisms with a mutation in this gene can survive. Phenotype represents the observed characteristics of organisms with differences in this gene. Class means the structural category of the protein for which this gene codes. Motif expresses the information about the amino acid sequence of the protein. The value of the property complex specifies how the expression of the gene can complex with others to form a larger protein. The other table contains the information about interactions between genes. A gene typically has multiple complexes, phenotypes, classes, motifs and functions, but only one essential property and one chromosome. An example gene is shown in Table 2. To keep the multi-relational nature of the data, we restore the original data structure. There are six entity classes (Gene, Complex, Phenotype, Class, Motif and Function) and six relationship classes (Interact: genes interact with each other, Have: genes have functions, Observe: phenotypes are observed for genes, Form: genes form complexes, Belong: genes belong to structural classes, Contain: genes contain characteristic motifs). The Gene class has attribute classes such as Essential, Chromosome, etc. The attributes of the other entity classes are not available in the data set. A hidden attribute is added to each entity class. All relationships are modeled with existence uncertainty. Thus each relationship class has an additional attribute R with two states. The state of R indicates whether the relationship exists or not. The task of function prediction of genes is therefore transformed into the prediction of relationships between genes and functions.


Table 2: An example gene

Attribute      Value
Gene ID        G234070
Essential      Non-Essential
Class          1. ATPases; 2. Motor proteins
Complex        Cytoskeleton
Phenotype      Mating and sporulation defects
Motif          PS00017
Chromosome     1
Function       1. Cell growth, cell division and DNA synthesis; 2. Cellular organization; 3. Cellular transport and transport mechanisms
Localization   Cytoskeleton

The data set contains 1243 genes in total. A subset (381 genes) was withheld for testing in the KDD Cup 2001. The remaining 862 genes are provided to participants. In the data, there are 56 complexes, 11 phenotypes, 351 motifs, 24 classes and 14 functions. There are two main challenges in the gene data set. First, there are many types of relationships. Second, there are large numbers of objects, but only a small number of known relationships.
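A minimal sketch (plain Python; the identifiers are illustrative, with only the gene and attribute values of Table 2 taken from the data description) of how the six relationship classes can be stored uniformly as existence attributes, so that function prediction amounts to predicting unknown Have pairs.

# Each relationship class is stored as a set of entity-ID pairs for which R = 1;
# all pairs not listed have R = 0 (existence uncertainty). Identifiers are illustrative.
relations = {
    "Interact": {("G234070", "G235065")},                          # gene - gene (second ID made up)
    "Have":     {("G234070", "Cell growth, cell division and DNA synthesis")},  # gene - function (target)
    "Observe":  {("G234070", "Mating and sporulation defects")},   # gene - phenotype
    "Form":     {("G234070", "Cytoskeleton")},                     # gene - complex
    "Belong":   {("G234070", "ATPases")},                          # gene - structural class
    "Contain":  {("G234070", "PS00017")},                          # gene - motif
}

def r(relation, pair):
    # Existence attribute R for a pair of entities under a relationship class.
    return 1 if pair in relations[relation] else 0

# Function prediction = predicting unknown Have pairs; the observed ones are training data.
print(r("Have", ("G234070", "Cell growth, cell division and DNA synthesis")),
      r("Have", ("G234070", "Transcription")))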

The proposed model applied to the gene data is shown in Figure 3(b). The existence of any relationship depends on the hidden states of the corresponding entities. The information about the various relationships of a gene is propagated via the hidden attribute of Gene. The model is optimized using the 862 training genes and is then applied to the test data. The experimental results are shown in Table 3. There were 41 groups that participated in the KDD Cup 2001 contest. The algorithms included naive Bayes, k-nearest neighbor, decision trees, neural networks, SVMs and Bayesian networks, and techniques such as feature selection, boosting and cross validation were employed. The performance of our model is comparable to the best results. The winning algorithm is a relational model based on inductive logic programming. The infinite hidden relational model is only slightly (probably not significantly) worse than the winning algorithm.

Table 3: Prediction of gene functions

Model             Accuracy (%)   True Positive Rate (%)
Infinite model    93.18          72.8
KDD Cup winner    93.63          71.0

In the second set of experiments, we investigated the influence of the various types of relationships on the prediction of functions. We performed the experiments by ignoring one specific type of known relationship at a time. The results are shown in Table 4. When a specific type of known relationship is ignored, a lower accuracy indicates a higher importance of this type of relationship. One observation is that the most important relationship is Complex, specifying how genes complex with other genes to form larger proteins. The second one is the interaction relationship between genes. This coincides with the lesson learned from KDD Cup 2001 that protein interaction information is less important in function prediction. This lesson is somewhat surprising since there is a general belief in biology that knowledge about regulatory pathways is helpful to determine the functions of genes.

Table 4: The importance of a variety of relationships in function prediction of genes

Ignored relationships    Accuracy (%)   Importance
Complex                  91.13          197
Interaction              92.14          100
Class                    92.61          55
Phenotype                92.71          45
Attributes of gene       93.08          10
Motif                    93.12          6

9 Conclusions and Extensions

We have introduced the infinite hidden relational model. The model showed encouraging results on a number of data sets. We hope that the infinite hidden relational model will be a useful addition to relational modeling by allowing for flexible inference in a relational network, reducing the need for extensive structural model search. Implicitly, we have assumed a particular sampling scheme, i.e., that entities are independently sampled out of unspecified populations. In this context our model permits generalization, but it might fail if this assumption is not reasonable or if the sampling procedure changes in the test set. We have focused on an explicit modeling of the relation between pairs of entities, but our model can easily be generalized if more than two entities are involved in a relation. As part of our future work we will explore and compare different approximate inference algorithms.

References

P. Carbonetto, J. Kisynski, N. de Freitas, and D. Poole (2005). Nonparametric Bayesian logic. In Proc. 21st UAI.

J. Cheng, C. Hatzis, H. Hayashi, M. Krogel, S. Morishita, D. Page, and J. Sese (2002). KDD Cup 2001 report. SIGKDD Explorations, 3(2):47-64.

S. Dzeroski and N. Lavrac (eds.) (2001). Relational Data Mining. Springer.

N. Friedman, L. Getoor, D. Koller, and A. Pfeffer (1999). Learning probabilistic relational models. In Proc. 16th IJCAI, 1300-1309. Morgan Kaufmann.

L. Getoor, N. Friedman, D. Koller, and B. Taskar (2003). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679-707.

L. Getoor, D. Koller, and N. Friedman (2000). From instances to classes in probabilistic relational models. In Proc. ICML 2000 Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries. Stanford, CA.

D. Heckerman, C. Meek, and D. Koller (2004). Probabilistic Models for Relational Data (Technical Report MSR-TR-2004-30). Microsoft.

C. Kemp, T. Griffiths, and J. R. Tenenbaum (2004). Discovering Latent Classes in Relational Data (Technical Report AI Memo 2004-019).

L. De Raedt and K. Kersting (2003). Probabilistic logic learning. SIGKDD Explorations Newsletter, 5:31-48.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth (2004). The author-topic model for authors and documents. In Proc. 20th UAI, 487-494. AUAI Press.

B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl (2000). Analysis of recommender algorithms for e-commerce. In Proc. ACM E-Commerce Conference, 158-167. ACM Press.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei (2004). Hierarchical Dirichlet Processes (Technical Report 653). UC Berkeley Statistics.

V. Tresp and K. Yu (2004). An introduction to nonparametric hierarchical Bayesian modelling. In Proc. of the Hamilton Summer School on Switching and Learning in Feedback Systems, 290-312. Springer.

V. Tresp (2006). Dirichlet processes and nonparametric Bayesian modelling. Online tutorial.

S. Wrobel (2001). Inductive logic programming for knowledge discovery in databases. In S. Dzeroski and N. Lavrac (eds.), Relational Data Mining, 74-101. Springer.

Z. Xu, V. Tresp, K. Yu, S. Yu, and H.-P. Kriegel (2005). Dirichlet enhanced relational learning. In Proc. 22nd ICML, 1004-1011. ACM Press.

J. Yedidia, W. Freeman, and Y. Weiss (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312.

K. Yu, V. Tresp, and S. Yu (2004). A nonparametric hierarchical Bayesian framework for information filtering. In Proc. 27th SIGIR, 353-360. ACM Press.

Appendix: Inference based on Gibbs Sampling

We assume that users are assigned to the first $K^u$ states of $Z^u$ and movies are assigned to the first $K^m$ states of $Z^m$. We can do this without loss of generality by exploiting exchangeability. Note that $K^u \le U$ and $K^m \le M$. If during sampling a state becomes unoccupied, that state is removed from the model and indices are re-assigned. To simplify the description of sampling we will assume that this does not occur and that currently no state is occupied by exactly one item (just to simplify bookkeeping).

Gibbs sampling updates the assignments of users and movies to the states of the latent variables and re-samples the parameters. In detail:

1. Pick a random user j. Assume that for $N^u_k$ users, $Z^u = k$, without counting user j.

(a) Then, we assign state $Z^u_j = k$ with $N^u_k > 0$ with probability proportional to
$$P(Z^u_j = k \,|\, \{Z^u_l\}^U_{l \neq j}, D^u_j, \phi, \gamma, Z^m) \propto N^u_k\, P(D^u_j \,|\, \phi_k, \gamma_{k,*}, Z^m).$$

(b) Instead, a new state $K^u + 1$ is generated with probability proportional to
$$P(Z^u_j = K^u + 1 \,|\, \{Z^u_l\}^U_{l \neq j}, D^u_j, \phi, \gamma, Z^m) \propto \alpha^u_0\, P(D^u_j).$$

(c) In the first case, user j inherits the parameters assigned to state k: $\phi_k, \gamma_{k,1}, \ldots, \gamma_{k,K^m}$.

(d) In the latter case, new parameters are generated following $P(\phi_{K^u+1} \,|\, D^u_j)$ and $P(\gamma_{K^u+1,l} \,|\, D^u_j, Z^m)$, $l = 1, \ldots, M$.

2. Pick a random movie i and update its latent variable $Z^m_i$. The sampling is equivalent to the sampling of $Z^u$ above.

3. Occasionally (typically less often than the updates for the latent variables): update the parameters $\phi, \theta, \gamma$ from the posterior distribution based on all the data assigned to the corresponding state, resp. pairs of states.

In the algorithm, we used the following definitions (terms involving entity attributes or relationship attributes which are not known drop out of the equations):
$$P(D^u_j \,|\, \phi_k, \gamma, Z^m) = P(A^u_j \,|\, \phi_k) \prod_{l=1}^{M} P(R_{j,l} \,|\, \gamma_{k, Z^m_l})$$
$$P(\phi \,|\, D^u_j) \propto P(A^u_j \,|\, \phi)\, G^u_0(\phi)$$
$$P(\gamma \,|\, D^u_j) \propto \prod_{l=1}^{M} P(R_{j,l} \,|\, \gamma)\, G^R_0(\gamma)$$

The algorithm easily generalizes to multiple relations as described in Section 7 and Section 8. $A^u_j$ denotes all known attributes of user j. Definitions for the movies are equivalent.

The most expensive computation in the algorithm is step 1(a), which scales proportionally to the number of known entity and relational attributes of the involved entity and to the number of occupied states.
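The following is a condensed sketch of step 1 (in Python with NumPy; the names are ours, the relation likelihood is a Bernoulli term per observed rating, and, as a simplification, the new-cluster term uses parameters freshly drawn from the prior instead of marginalizing them out).

import numpy as np

rng = np.random.default_rng(0)

def user_likelihood(a_u, R_j, phi_k, gamma_k, z_m):
    # P(D_j^u | phi_k, gamma_{k,*}, Z^m): attribute term times Bernoulli relation
    # terms over all movies l with an observed rating R_{j,l}.
    lik = phi_k[a_u] if a_u is not None else 1.0
    for l, r_jl in R_j.items():              # R_j maps movie index -> observed rating (0/1)
        p = gamma_k[z_m[l]]                   # P(R = 1) for this (user cluster, movie cluster)
        lik *= p if r_jl == 1 else (1.0 - p)
    return lik

def resample_user(j, z_u, counts, phi, gamma, a_u, R_j, z_m, alpha_u0):
    # One Gibbs step for Z_j^u (empty-cluster cleanup omitted for brevity).
    counts[z_u[j]] -= 1                       # remove user j from its current cluster
    phi_new = rng.dirichlet(np.ones(len(phi[0])))
    gamma_new = rng.beta(1.0, 1.0, size=len(gamma[0]))
    weights = [counts[k] * user_likelihood(a_u, R_j, phi[k], gamma[k], z_m)
               for k in range(len(counts))]
    weights.append(alpha_u0 * user_likelihood(a_u, R_j, phi_new, gamma_new, z_m))
    probs = np.array(weights) / np.sum(weights)
    k = rng.choice(len(probs), p=probs)
    if k == len(counts):                      # open a new user cluster
        counts.append(1); phi.append(phi_new); gamma.append(gamma_new)
    else:
        counts[k] += 1
    z_u[j] = k

# Tiny usage example: two user clusters, two movie clusters, two users.
phi = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]    # per-user-cluster attribute parameters
gamma = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # P(R=1) per (user cluster, movie cluster)
z_u, counts, z_m = [0, 1], [1, 1], [0, 1]
resample_user(0, z_u, counts, phi, gamma, a_u=0, R_j={0: 1, 1: 0}, z_m=z_m, alpha_u0=10.0)
print(z_u, counts)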

We are currently exploring various sampling schemes and deterministic approximations. An extensive comparison will be available on the web shortly. The results reported in this paper were obtained by using the deterministic approximation described in Tresp and Yu (2004).
