

Page 1: Graph-based Analytics Wei Wang Department of Computer Science Scalable Analytics Institute UCLA weiwang@cs.ucla.edu

Graph-based Analytics

Wei Wang

Department of Computer Science

Scalable Analytics Institute

UCLA

weiwang@cs.ucla.edu

Page 2:

Graphs/Networks
FFSM (ICDM03), SPIN (KDD04), GDIndex (ICDE07)
MotifMining (PSB04, RECOMB04, ProteinScience06, SSDBM07, BIBM08)
COM (CIKM09), GAIA (SIGMOD10), LTS (ICDE11)
CGC (KDD13)

Graphs are everywhere

• Frequent subgraphs
• Discriminative subgraphs
• Graph classification
• Graph clustering

Page 3:

Graph Clustering

• Graph clustering
  Decompose a network into sub-networks based on some topological properties
  Usually we look for dense sub-networks
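To make "dense sub-network" concrete, one common measure is edge density: the fraction of all possible edges among a node set that are actually present. The slides do not fix a definition, so this choice, the function name `edge_density`, and the toy graph below are assumptions for illustration only.

```python
from itertools import combinations

def edge_density(edges, nodes):
    """Fraction of possible undirected edges that are present among `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    present = sum(1 for u, v in combinations(sorted(nodes), 2)
                  if frozenset((u, v)) in edge_set)
    return present / (n * (n - 1) / 2)

# Toy graph (hypothetical): a triangle {a, b, c} attached to a path c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
dense = edge_density(edges, ["a", "b", "c"])    # all 3 possible edges are present
sparse = edge_density(edges, ["c", "d", "e"])   # 2 of the 3 possible edges are present
```

A clustering method that looks for dense sub-networks would prefer the triangle over the path as a candidate cluster.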

Page 4:

Detect protein functional modules in a PPI network

from Nataša Pržulj – Introduction to Bioinformatics. 2011.

Page 5:

Community Detection in Social Network

Collaboration network between scientists
from Santo Fortunato – Community detection in graphs

Page 6:

Multi-view Graph Clustering

• Graphs collected from multiple sources/domains

• Multi-view graph clustering
  Refine clustering
  Resolve ambiguity

Page 7:

Motivation

• Multi-view
  Exact one-to-one
  Complete mapping
  The same size

• More common cases
  Many-to-many
  Tolerate partial mapping
  Different sizes
  Mappings are associated with weights (confidence)

Page 8:

Motivation

• Objective: design an algorithm that is flexible and robust.
  Suitable for common cases: many-to-many, weighted, partial mappings for multi-domain graph clustering.
  Flexibility and robustness: noisy graphs have little influence on others.

Page 9:

Problem Formulation

$A^{(1)}$, $A^{(2)}$, $A^{(3)}$: affinity matrices

$S^{(i,j)}_{a,b}$ denotes the weight between the $a$-th instance in $D_j$ and the $b$-th instance in $D_i$.

Goal: partition each $A^{(\pi)}$ into $k_\pi$ clusters while considering the co-regularized constraints implicitly encoded in the cross-domain relationships in $S$.

Page 10:

Co-regularized multi-domain graph clustering (CGC)

• Single-domain clustering
  Symmetric non-negative matrix factorization (NMF).
  Minimizing:

$$L^{(\pi)} = \left\| A^{(\pi)} - H^{(\pi)} \left( H^{(\pi)} \right)^{T} \right\|_F^2 \quad \text{s.t. } H^{(\pi)} \ge 0$$

Here, $H^{(\pi)} = [h^{(\pi)}_{1*}, h^{(\pi)}_{2*}, \ldots, h^{(\pi)}_{n_\pi *}]^{T} \in \mathbb{R}^{n_\pi \times k_\pi}$, where each $h^{(\pi)}_{a*}$ represents the cluster assignment of the $a$-th instance in domain $D_\pi$.
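The single-domain objective $\|A - HH^{T}\|_F^2$ with $H \ge 0$ can be sketched in a few lines of NumPy. The slides do not say which solver is used, so the damped multiplicative update below (a rule commonly used for symmetric NMF), the damping value `beta`, and all function names are assumptions rather than the authors' implementation.

```python
import numpy as np

def symnmf_loss(A, H):
    """Squared Frobenius loss ||A - H H^T||_F^2."""
    return float(np.linalg.norm(A - H @ H.T, "fro") ** 2)

def symnmf(A, k, n_iter=300, beta=0.5, seed=0, eps=1e-12):
    """Approximate a symmetric non-negative A by H H^T with H >= 0."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))            # non-negative random start
    history = [symnmf_loss(A, H)]
    for _ in range(n_iter):
        numer = A @ H
        denom = H @ (H.T @ H) + eps            # eps avoids division by zero
        # Damped multiplicative step: the factor is >= 1 - beta, so H stays non-negative.
        H = H * (1.0 - beta + beta * numer / denom)
        history.append(symnmf_loss(A, H))
    return H, history

# Toy affinity matrix (hypothetical): two fully connected groups of 3 nodes each.
A = np.kron(np.eye(2), np.ones((3, 3)))
H, history = symnmf(A, k=2)
labels = H.argmax(axis=1)                      # hard assignment: largest entry per row
```

Each row of `H` plays the role of $h_{a*}$: a soft cluster assignment for one instance, which `argmax` turns into a hard label.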

Page 11:

Co-regularized multi-domain graph clustering (CGC)

• Cross-domain co-regularization
  Residual sum of squares (RSS) loss (when the number of clusters is the same for different domains).
  Clustering disagreement (CD) loss (when the number of clusters is the same or different).

Page 12:

Co-regularized multi-domain graph clustering (CGC)

• Residual sum of squares (RSS) loss
  Directly compare the $H^{(\pi)}$ inferred in different domains.
  To penalize the inconsistency of cross-domain cluster partitions for the $l$-th cluster in $D_i$, the loss for the $b$-th instance is

$$J^{(i,j)}_{b,l} = \left( E^{(i,j)}(x^{(j)}_b, l) - h^{(j)}_{b,l} \right)^2$$

where

$$E^{(i,j)}(x^{(j)}_b, l) = \frac{1}{|N^{(i,j)}(x^{(j)}_b)|} \sum_{a \in N^{(i,j)}(x^{(j)}_b)} S^{(i,j)}_{b,a} \, h^{(i)}_{a,l}$$

Here $N^{(i,j)}(x^{(j)}_b)$ denotes the set of indices of instances in $D_i$ that are mapped to $x^{(j)}_b$, and $|N^{(i,j)}(x^{(j)}_b)|$ is its cardinality. The RSS loss is

$$J^{(i,j)}_{RSS} = \sum_{l=1}^{k} \sum_{b=1}^{n_j} J^{(i,j)}_{b,l} = \left\| S^{(i,j)} H^{(i)} - H^{(j)} \right\|_F^2$$

(The matrix form holds when the $1/|N^{(i,j)}(x^{(j)}_b)|$ normalization has been folded into the rows of $S^{(i,j)}$.)
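As a sanity check on the RSS definition, the sketch below computes the loss twice: once entry by entry from $J_{b,l}$, and once in matrix form after dividing each row of $S$ by its number of mapped instances. The matrices are small toy values chosen for illustration, and the function names are hypothetical. Every instance in $D_j$ is given at least one mapping here, since the per-instance average is undefined otherwise.

```python
import numpy as np

def rss_loss_elementwise(S, H_i, H_j):
    """Sum of J_{b,l} = (E(x_b, l) - h_{b,l})^2 over instances b and clusters l."""
    total = 0.0
    for b in range(S.shape[0]):
        neighbors = np.flatnonzero(S[b])       # N(x_b): instances of D_i mapped to x_b
        for l in range(H_j.shape[1]):
            expected = sum(S[b, a] * H_i[a, l] for a in neighbors) / len(neighbors)
            total += (expected - H_j[b, l]) ** 2
    return float(total)

def rss_loss_matrix(S_norm, H_i, H_j):
    """Matrix form ||S H_i - H_j||_F^2, with the normalization already folded into S."""
    return float(np.linalg.norm(S_norm @ H_i - H_j, "fro") ** 2)

# Rows index instances of D_j, columns index instances of D_i (toy values).
S = np.array([[0.6, 0.0, 0.0],
              [0.9, 0.8, 0.0],
              [0.0, 0.0, 0.6]])
H_i = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])
H_j = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])

counts = (S != 0).sum(axis=1, keepdims=True)   # |N(x_b)| for each row b
S_norm = S / counts

loss_a = rss_loss_elementwise(S, H_i, H_j)
loss_b = rss_loss_matrix(S_norm, H_i, H_j)
```

The two values agree up to floating-point error, which is the identity stated in the last equation of the slide.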

Page 13:

$H^{(1)}$: 12×2    $H^{(2)}$: 19×2    $H^{(3)}$: 7×2
$S^{(1,2)}H^{(1)}$: 19×2    $S^{(3,2)}H^{(3)}$: 19×2

$H^{(1)}$
      C1    C2
A     0.8   0.2
B     0.7   0.3
…     …     …
C     0.1   0.9

$S^{(1,2)}$
      A     B     …     C
1     0.6   0     …     0
2     0.9   0.8   …     0
…     …     …     …     …
3     0     0.1   …     0
4     0     0     …     0.6
5     0     0     …     0

$S^{(3,2)}$
      1     2     …     3     4     5
a     0     0     …     0     0     0.4
…     …     …     …     …     …     …

$H^{(2)}$
      C1    C2
1     0.8   0.2
2     0.7   0.3
…     …     …
3     0.1   0.9
4
5

$H^{(3)}$
      C1    C2
a     0.8   0.2
…     …     …

Page 14:

Co-regularized multi-domain graph clustering (CGC)

• Clustering disagreement (CD)
  Indirectly measure the clustering inconsistency of cross-domain cluster partitions.
  Intuition:
  • Ⓐ and Ⓑ are mapped to ②, and Ⓒ is mapped to ④. Intuitively, if the similarity between the cluster assignments for ② and ④ is small, then the similarity of the cluster assignments between Ⓐ and Ⓒ and the similarity between Ⓑ and Ⓒ should also be small.
  The CD loss is

$$J^{(i,j)}_{CD} = \left\| S^{(i,j)} H^{(i)} \left( S^{(i,j)} H^{(i)} \right)^{T} - H^{(j)} \left( H^{(j)} \right)^{T} \right\|_F^2$$

[Figure: example instances with weighted cross-domain mappings]
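A short NumPy sketch of the CD loss shows why it tolerates a different number of clusters per domain: it compares pairwise similarity matrices, which are $n_j \times n_j$ regardless of how many columns each $H$ has. The toy matrices and the name `cd_loss` are assumptions for illustration.

```python
import numpy as np

def cd_loss(S, H_i, H_j):
    """||(S H_i)(S H_i)^T - H_j H_j^T||_F^2 with a linear similarity kernel."""
    mapped = S @ H_i                           # assignments of D_i carried over to D_j
    return float(np.linalg.norm(mapped @ mapped.T - H_j @ H_j.T, "fro") ** 2)

S = np.eye(4)                                  # toy one-to-one mapping between 4 instances
H_i = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)         # 2 clusters

# Same partition as H_i, but with permuted labels and an extra (empty) third cluster.
H_j_agree = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]], dtype=float)
# A partition that groups instances {0, 2} and {1, 3} instead.
H_j_disagree = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]], dtype=float)

agree = cd_loss(S, H_i, H_j_agree)
disagree = cd_loss(S, H_i, H_j_disagree)
```

The agreeing partition gives zero loss even though its labels are permuted and it has three columns, while the disagreeing partition is penalized. The RSS loss could not be evaluated on these inputs at all, because $S H^{(i)}$ and $H^{(j)}$ have different shapes.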

Page 15:

Co-regularized multi-domain graph clustering (CGC)

• Objective function (joint matrix optimization):

$$\min_{H^{(\pi)} \ge 0 \; (1 \le \pi \le d)} \; O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}$$

Here $J^{(i,j)}$ is the RSS or CD loss between domains $D_i$ and $D_j$, $I$ is the set of domain pairs that have a cross-domain relationship, and $\lambda^{(i,j)} \ge 0$ weights the corresponding penalty.

Can be solved with an alternating scheme: optimize the objective with respect to one variable while fixing the others.
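The alternating scheme can be illustrated for two domains with the RSS penalty. The slides do not give the update rule, so this sketch substitutes a generic projected gradient step with backtracking for each block. That substitution, the parameter `lam`, the toy data, and all function names are assumptions, not the authors' algorithm.

```python
import numpy as np

def objective(A, H, S, lam):
    """O = sum_i ||A_i - H_i H_i^T||_F^2 + lam * ||S H_1 - H_2||_F^2 (two domains)."""
    fit = sum(np.linalg.norm(A[i] - H[i] @ H[i].T, "fro") ** 2 for i in range(2))
    return float(fit + lam * np.linalg.norm(S @ H[0] - H[1], "fro") ** 2)

def gradient(A, H, S, lam, which):
    """Gradient with respect to H[which], other block fixed (A assumed symmetric)."""
    grad = -4.0 * (A[which] - H[which] @ H[which].T) @ H[which]
    residual = S @ H[0] - H[1]
    if which == 0:
        return grad + 2.0 * lam * S.T @ residual
    return grad - 2.0 * lam * residual

def alternating_minimize(A, S, k, lam=1.0, n_rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    H = [rng.random((A[i].shape[0], k)) for i in range(2)]
    history = [objective(A, H, S, lam)]
    for _ in range(n_rounds):
        for which in (0, 1):                   # update one block, keep the other fixed
            grad = gradient(A, H, S, lam, which)
            step = 1.0
            while step > 1e-10:                # backtrack until the objective decreases
                trial = list(H)
                trial[which] = np.maximum(H[which] - step * grad, 0.0)  # project to H >= 0
                if objective(A, trial, S, lam) < history[-1]:
                    H = trial
                    break
                step *= 0.5
            history.append(objective(A, H, S, lam))
    return H, history

# Toy data (hypothetical): domain 1 has two groups of 3 nodes, domain 2 two groups of 2.
A = [np.kron(np.eye(2), np.ones((3, 3))), np.kron(np.eye(2), np.ones((2, 2)))]
S = np.kron(np.eye(2), np.full((2, 3), 1.0 / 3.0))   # 4x6, rows already normalized
H, history = alternating_minimize(A, S, k=2)
```

Because a step is only accepted when it lowers the objective, `history` is non-increasing by construction; this gives no guarantee of reaching a global optimum.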

Page 16:

Experimental Study

• Data sets:
  UCI (Iris, Wine, Ionosphere, WDBC)
    Construct two cross-domain relationships: Iris-Wine and Ionosphere-WDBC (positive/negative instances are only mapped to positive/negative instances in the other domain)
  Newsgroups data (from 20 Newsgroups)
    comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware
    rec.motorcycles, rec.sport.baseball, rec.sport.hockey
  Protein-protein interaction (PPI) networks (from BioGRID), gene co-expression networks (from Gene Expression Omnibus), genetic interaction network (from TEAM)
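The UCI cross-domain relationships are described as mapping instances only to instances of the matching class in the other domain. One way such a relationship matrix could be generated is sketched below. The slides do not say how mappings were sampled or weighted, so the sampling probability `keep_prob`, the uniform weights, and the function name are all assumptions.

```python
import numpy as np

def label_consistent_mapping(labels_i, labels_j, keep_prob=0.3, seed=0):
    """Return S of shape (n_j, n_i) whose nonzero entries link same-label instances only."""
    rng = np.random.default_rng(seed)
    labels_i = np.asarray(labels_i)
    labels_j = np.asarray(labels_j)
    same_label = labels_j[:, None] == labels_i[None, :]     # allowed mappings
    kept = rng.random(same_label.shape) < keep_prob         # keep a random subset (partial)
    weights = rng.uniform(0.1, 1.0, size=same_label.shape)  # confidence per mapping
    return np.where(same_label & kept, weights, 0.0)

# Hypothetical class labels for two small domains.
labels_i = [0, 0, 0, 1, 1, 1]
labels_j = [0, 0, 1, 1]
S = label_consistent_mapping(labels_i, labels_j)
```

The result is a many-to-many, weighted, partial mapping between domains of different sizes, which is the setting the earlier motivation slides describe.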

Page 17:

Experimental Study

• Effectiveness (UCI data set)

Page 18:

Experimental Study

• Robustness Evaluation (UCI)

Page 19:

Experimental Study

• Performance Evaluation

Page 20:

Experimental Study

• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data

5412 genes
490032 genetic markers across 4890 (1952 disease and 2938 healthy) samples.
We use the 1 million top-ranked genetic marker pairs to construct the network, with the test statistics as the weights on the edges.

Page 21:

Experimental Study

Protein Module Detection:
• Evaluation: standard Gene Set Enrichment Analysis (GSEA)
  We identify the most significantly enriched Gene Ontology categories.
  Significance (p-value) is determined by Fisher's exact test.
  Raw p-values are further calibrated to correct for the multiple testing problem.
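The slide does not name the correction used for multiple testing. As one simple possibility, and purely as an assumption, a Bonferroni correction multiplies each raw p-value by the number of tests and caps the result at 1:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1, for m simultaneous tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.001, 0.02, 0.4]          # hypothetical raw p-values for three categories
adjusted = bonferroni(raw)
```

Less conservative procedures exist (for example, false discovery rate control); which one the authors applied is not stated here.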

Page 22:

GSEA

• The hypergeometric distribution is used to model the probability of observing at least k genes from a cluster of size n by chance in a category containing f genes from a total genome size of g genes.

• For example, if the majority of genes in a cluster come from one category, then it is unlikely that this happens by chance, and the category's p-value would be close to 0.
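The tail probability described above can be computed directly from binomial coefficients. The sketch below uses the slide's symbols k, n, f, and g; the function name and the example numbers (other than the genome size of 5412 genes quoted earlier) are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """P(at least k of the n cluster genes fall in a category of f genes, genome size g)."""
    upper = min(n, f)                          # cannot observe more than n or f genes
    tail = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return tail / comb(g, n)

# Small case that can be checked by hand: (6*6 + 4*1) / 120 = 1/3.
p_small = enrichment_p_value(k=2, n=3, f=4, g=10)

# A cluster of 20 genes with 15 of them in a 50-gene category, out of 5412 genes.
p_enriched = enrichment_p_value(k=15, n=20, f=50, g=5412)
```

In the second case most of the cluster comes from one small category, so the p-value is extremely close to 0, matching the intuition in the second bullet.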

Page 23:

Experimental Study

• Protein Module Detection:
  Comparison of CGC and single-domain graph clustering (k = 100)

Page 24:

Experimental Study

• Protein Module Detection:

Page 25:

Summary

• In this project, we developed a flexible co-regularized method, CGC, to tackle many-to-many, weighted, partial mappings for multi-domain graph clustering.
  CGC utilizes cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
  CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.

• SIGKDD'13

Page 26:

Comments and Questions

weiwang@cs.ucla.edu