Graph-based Analytics

Wei Wang

Department of Computer Science

Scalable Analytics Institute

UCLA

[email protected]

Graphs/Networks

• FFSM (ICDM03), SPIN (KDD04), GDIndex (ICDE07)
• MotifMining (PSB04, RECOMB04, ProteinScience06, SSDBM07, BIBM08)
• COM (CIKM09), GAIA (SIGMOD10), LTS (ICDE11)
• CGC (KDD13)

Graphs are everywhere

• Frequent subgraphs
• Discriminative subgraphs
• Graph classification
• Graph clustering

Graph Clustering

• Graph clustering
  - Decompose a network into sub-networks based on some topological properties
  - Usually we look for dense sub-networks

Detect protein functional modules in a PPI network

from Nataša Pržulj – Introduction to Bioinformatics. 2011.

Community Detection in Social Network

Collaboration network between scientists

from Santo Fortunato – Community detection in graphs

Multi-view Graph clustering

• Graphs collected from multiple sources/domains

• Multi-view graph clustering
  - Refine clustering
  - Resolve ambiguity

Motivation

• Multi-view
  - Exact one-to-one
  - Complete mapping
  - The same size

• More common cases
  - Many-to-many
  - Tolerate partial mapping
  - Different sizes
  - Mappings are associated with weights (confidence)

Motivation

• Objective: design an algorithm that is
  - Flexible
  - Robust
  - Suitable for common cases: many-to-many weighted partial mappings for multi-domain graph clustering

Flexibility and Robustness

Noisy graphs have little influence on others

Problem Formulation

$A^{(1)}$, $A^{(2)}$, $A^{(3)}$: affinity matrices

$S^{(i,j)}_{a,b}$ denotes the weight between the a-th instance in $D_j$ and the b-th instance in $D_i$.

To partition each $A^{(\pi)}$ into $k_\pi$ clusters while considering the co-regularized constraints implicitly encoded in the cross-domain relationships in $S$.

Here, $H^{(\pi)} = [h^{(\pi)}_{1*}, h^{(\pi)}_{2*}, \ldots, h^{(\pi)}_{n_\pi *}]^T \in \mathbb{R}_+^{n_\pi \times k_\pi}$, where each $h^{(\pi)}_{a*}$ represents the cluster assignment of the a-th instance in domain $D_\pi$.

Co-regularized multi-domain graph clustering (CGC)

• Single-domain clustering
  - Symmetric non-negative matrix factorization (NMF)
  - Minimizing:

$$L^{(\pi)} = \| A^{(\pi)} - H^{(\pi)} (H^{(\pi)})^T \|_F^2 \quad \text{s.t.} \quad H^{(\pi)} \ge 0$$
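As a rough illustration of the single-domain step, the sketch below minimizes $\|A - HH^T\|_F^2$ with a damped multiplicative update that is commonly used for symmetric NMF. The slides do not state which update rule CGC uses, so this rule is an assumed stand-in; the names `symnmf_loss` and `symnmf`, the iteration count, and the toy two-block affinity matrix are illustrative only.

```python
import numpy as np

def symnmf_loss(A, H):
    """Squared Frobenius reconstruction error ||A - H H^T||_F^2."""
    return float(np.linalg.norm(A - H @ H.T) ** 2)

def symnmf(A, k, iters=200, seed=0, eps=1e-12):
    """Factor a symmetric non-negative A as H H^T with H >= 0 (assumed update rule)."""
    rng = np.random.default_rng(seed)
    H = rng.uniform(0.1, 1.0, size=(A.shape[0], k))
    for _ in range(iters):
        numer = A @ H
        denom = H @ (H.T @ H) + eps  # eps guards against division by zero
        # Damped multiplicative update; H stays non-negative by construction.
        H *= 0.5 + 0.5 * numer / denom
    return H

# Toy affinity matrix: two fully connected groups of four nodes each.
A = np.kron(np.eye(2), np.ones((4, 4)))
H_init = np.random.default_rng(0).uniform(0.1, 1.0, size=(8, 2))  # same start as symnmf(seed=0)
H_fit = symnmf(A, k=2)
labels = H_fit.argmax(axis=1)  # hard label = largest entry in each row of H
```

Reading a hard label off each row with `argmax` is one simple convention for turning the soft assignments in `H_fit` into clusters; other conventions are possible.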


• Cross-domain co-regularization
  - Residual sum of squares (RSS) loss (when the number of clusters is the same for different domains)
  - Clustering disagreement (CD) loss (when the number of clusters is the same or different)


• Residual sum of squares (RSS) loss
  - Directly compare the $H^{(\pi)}$ inferred in different domains.
  - To penalize the inconsistency of cross-domain cluster partitions for the l-th cluster in $D_i$, the loss for the b-th instance is

$$J^{(i,j)}_{b,l} = \left( E^{(i,j)}(x^{(j)}_b, l) - h^{(j)}_{b,l} \right)^2$$

where

$$E^{(i,j)}(x^{(j)}_b, l) = \frac{1}{|N^{(i,j)}(x^{(j)}_b)|} \sum_{a \in N^{(i,j)}(x^{(j)}_b)} S^{(i,j)}_{b,a} \, h^{(i)}_{a,l}$$

Here $N^{(i,j)}(x^{(j)}_b)$ denotes the set of indices of instances in $D_i$ that are mapped to $x^{(j)}_b$, and $|N^{(i,j)}(x^{(j)}_b)|$ is its cardinality.

  - The RSS loss is

$$J^{(i,j)}_{RSS} = \sum_{l=1}^{k} \sum_{b=1}^{n_j} J^{(i,j)}_{b,l} = \| S^{(i,j)} H^{(i)} - H^{(j)} \|_F^2$$
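The RSS loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: `rss_loss` is a hypothetical name, each row of `S` is divided by the number of mapped instances $|N^{(i,j)}(x^{(j)}_b)|$ so that the matrix form matches the per-instance definition, and instances with no mapping are skipped, which is an assumption about how partial mappings are handled.

```python
import numpy as np

def rss_loss(S, H_i, H_j):
    """RSS loss between domain i and domain j.

    S   : (n_j, n_i) non-negative mapping weights (row b = instance b of domain j)
    H_i : (n_i, k) cluster assignments in domain i
    H_j : (n_j, k) cluster assignments in domain j
    """
    counts = (S > 0).sum(axis=1)              # |N(x_b)| for every instance b
    mapped = counts > 0                       # assumption: unmapped rows are skipped
    S_norm = np.zeros_like(S, dtype=float)
    S_norm[mapped] = S[mapped] / counts[mapped, None]
    E = S_norm @ H_i                          # E[b, l] = averaged weighted assignment
    return float(np.sum((E[mapped] - H_j[mapped]) ** 2))

# Small hypothetical example: 2 instances in domain i, 3 in domain j.
S = np.array([[0.6, 0.0],
              [0.9, 0.8],
              [0.0, 0.0]])                    # third instance has no mapping
H_i = np.array([[0.8, 0.2],
                [0.7, 0.3]])
H_j = np.array([[0.8, 0.2],
                [0.7, 0.3],
                [0.1, 0.9]])
loss = rss_loss(S, H_i, H_j)
```

Because the loss compares `E` and `H_j` column by column, both domains need the same number of clusters, as the slide notes.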

Example (dimensions as shown on the slide):

$H^{(1)}$: 12×2    $H^{(2)}$: 19×2    $H^{(3)}$: 7×2    $S^{(1,2)}H^{(1)}$: 19×2    $S^{(3,2)}H^{(3)}$: 19×2

$H^{(1)}$
      C1    C2
A     0.8   0.2
B     0.7   0.3
…     …     …
C     0.1   0.9

$S^{(1,2)}$
      A     B     …     C
1     0.6   0     …     0
2     0.9   0.8   …     0
…     …     …     …     …
3     0     0.1   …     0
4     0     0     …     0.6
5     0     0     …     0

$H^{(2)}$
      C1    C2
1     0.8   0.2
2     0.7   0.3
…     …     …
3     0.1   0.9
4
5

$S^{(3,2)}$
      1     2     …     3     4     5
a     0     0     …     0     0     0.4
…     …     …     …     …     …     …

$H^{(3)}$
      C1    C2
a     0.8   0.2
…     …     …


• Clustering disagreement (CD) loss
  - Indirectly measure the clustering inconsistency of cross-domain cluster partitions.
  - Intuition: Ⓐ and Ⓑ are mapped to ②, and Ⓒ is mapped to ④. If the similarity between the cluster assignments for ② and ④ is small, then the similarity of the cluster assignments between Ⓐ and Ⓒ and the similarity between Ⓑ and Ⓒ should also be small.
  - The CD loss is

$$J^{(i,j)}_{CD} = \| S^{(i,j)} H^{(i)} (S^{(i,j)} H^{(i)})^T - H^{(j)} (H^{(j)})^T \|_F^2$$
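A short NumPy sketch of the CD loss follows. It compares pairwise similarity matrices rather than the assignment matrices themselves, so the two domains may use different numbers of clusters and the loss does not depend on how clusters are labeled. The name `cd_loss` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def cd_loss(S, H_i, H_j):
    """J_CD = ||(S H_i)(S H_i)^T - H_j H_j^T||_F^2."""
    P = S @ H_i  # domain-i assignments projected onto domain-j instances
    return float(np.linalg.norm(P @ P.T - H_j @ H_j.T) ** 2)

S = np.eye(3)  # hypothetical one-to-one mapping, used only to keep the example small
H_i = np.array([[0.8, 0.2],
                [0.7, 0.3],
                [0.1, 0.9]])

same = cd_loss(S, H_i, H_i)               # identical partitions
swapped = cd_loss(S, H_i[:, ::-1], H_i)   # same partition, cluster labels permuted

# Domain j with three clusters: the loss is still well defined.
H_j3 = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])
different_k = cd_loss(S, H_i, H_j3)
```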


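• Objective function (joint matrix optimization):

$$\min_{H^{(\pi)} \ge 0 \; (1 \le \pi \le d)} \; O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}$$

where $J^{(i,j)}$ is the RSS or CD loss between domains $D_i$ and $D_j$, and $\lambda^{(i,j)}$ weights that cross-domain penalty.

• Can be solved with an alternating scheme: optimize the objective with respect to one variable while fixing the others.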
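The alternating scheme can be sketched as follows, using the RSS penalty. The slides do not give the per-variable update, so this sketch substitutes a projected gradient step with backtracking for whatever update rule CGC actually uses. All names (`objective`, `gradient`, `alternate`), the step sizes, and the toy data are assumptions, and `S` is assumed to be already normalized.

```python
import numpy as np

def objective(A, H, S, lam):
    """O = sum_i ||A_i - H_i H_i^T||_F^2 + sum_(i,j) lam_ij * ||S_ij H_i - H_j||_F^2."""
    total = sum(np.linalg.norm(A[i] - H[i] @ H[i].T) ** 2 for i in A)
    for (i, j), S_ij in S.items():
        total += lam[(i, j)] * np.linalg.norm(S_ij @ H[i] - H[j]) ** 2
    return float(total)

def gradient(A, H, S, lam, p):
    """Gradient of O with respect to H[p], all other factors held fixed."""
    g = -4.0 * (A[p] - H[p] @ H[p].T) @ H[p]      # valid for symmetric A[p]
    for (i, j), S_ij in S.items():
        R = S_ij @ H[i] - H[j]
        if i == p:
            g += 2.0 * lam[(i, j)] * S_ij.T @ R
        if j == p:
            g -= 2.0 * lam[(i, j)] * R
    return g

def alternate(A, H, S, lam, sweeps=50, step0=1e-2):
    """Update one H[p] at a time; a step is accepted only if O decreases."""
    H = {p: h.copy() for p, h in H.items()}
    for _ in range(sweeps):
        for p in list(H):
            current = objective(A, H, S, lam)
            g = gradient(A, H, S, lam, p)
            step = step0
            for _ in range(30):                   # backtracking: halve step until O drops
                trial = dict(H)
                trial[p] = np.maximum(H[p] - step * g, 0.0)  # project onto H >= 0
                if objective(A, trial, S, lam) < current:
                    H = trial
                    break
                step *= 0.5
    return H

# Toy problem: domain 1 has 6 nodes, domain 2 has 4 nodes, two groups each.
A = {1: np.kron(np.eye(2), np.ones((3, 3))), 2: np.kron(np.eye(2), np.ones((2, 2)))}
S = {(1, 2): np.kron(np.eye(2), np.full((2, 3), 1.0 / 3.0))}  # shape (4, 6)
lam = {(1, 2): 1.0}
rng = np.random.default_rng(0)
H0 = {1: rng.uniform(0.1, 1.0, (6, 2)), 2: rng.uniform(0.1, 1.0, (4, 2))}
H_fit = alternate(A, H0, S, lam)
```

Accepting a step only when the objective drops keeps the value of O non-increasing across sweeps, at the cost of extra objective evaluations.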

Experimental Study

• Data sets:
  - UCI (Iris, Wine, Ionosphere, WDBC). Construct two cross-domain relationships, Iris-Wine and Ionosphere-WDBC (positive/negative instances are only mapped to positive/negative instances in the other domain).
  - Newsgroups data (from 20 Newsgroups): comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
  - Protein-protein interaction (PPI) networks (from BioGrid), gene co-expression networks (from Gene Expression Omnibus), genetic interaction network (from TEAM)

Experimental Study

• Effectiveness (UCI data set)

Experimental Study

• Robustness Evaluation (UCI)

Experimental Study

• Performance Evaluation

Experimental Study

• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data

  - 5412 genes
  - 490032 genetic markers across 4890 (1952 disease and 2938 healthy) samples
  - We use the 1 million top-ranked genetic marker pairs to construct the network, with the test statistics as the weights on the edges

Experimental Study

Protein Module Detection:

• Evaluation: standard Gene Set Enrichment Analysis (GSEA)
  - We identify the most significantly enriched Gene Ontology categories
  - Significance (p-value) is determined by Fisher’s exact test
  - Raw p-values are further calibrated to correct for the multiple testing problem

GSEA

• The hypergeometric distribution is used to model the probability of observing at least k genes from a cluster of size n by chance in a category containing f genes from a total genome size of g genes.

• For example, if the majority of genes in a cluster come from one category, it is unlikely that this happened by chance, and the category’s p-value would be close to 0.
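The hypergeometric tail probability described above can be computed directly with exact integer arithmetic. This is a minimal sketch using the slide's symbols k, n, f and g; the function name `enrichment_p_value` is hypothetical, and the multiple-testing correction mentioned earlier is not included.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """P(X >= k): chance of seeing at least k category genes in a cluster of size n,
    when the category holds f of the g genes in the genome."""
    upper = min(n, f)
    # comb returns 0 when the second argument exceeds the first, so impossible
    # combinations contribute nothing to the sum.
    tail = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return tail / comb(g, n)

# A cluster of 5 genes, all from a 5-gene category, in a 10-gene genome.
p_all = enrichment_p_value(5, 5, 5, 10)   # 1 / 252
p_any = enrichment_p_value(0, 5, 5, 10)   # "at least 0" is certain
```

A cluster drawn entirely from one small category gets a small p-value, while weaker overlaps move the value toward 1.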

Experimental Study

• Protein Module Detection:

Comparison of CGC and single-domain graph clustering (k = 100)

Experimental Study

• Protein Module Detection:

Summary

• In this project, we developed a flexible co-regularized method, CGC, to tackle many-to-many, weighted, partial mappings for multi-domain graph clustering.
  - CGC uses cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
  - CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.

• SIGKDD’13

Comments and Questions

• [email protected]
