Graph-based Analytics
Wei Wang
Department of Computer Science
Scalable Analytics Institute
UCLA
Graphs/Networks
FFSM (ICDM03), SPIN (KDD04), GDIndex (ICDE07)
MotifMining (PSB04, RECOMB04, ProteinScience06, SSDBM07, BIBM08)
COM (CIKM09), GAIA (SIGMOD10), LTS (ICDE11)
CGC (KDD13)
Graphs are everywhere
• Frequent subgraphs
• Discriminative subgraphs
• Graph classification
• Graph clustering
Graph Clustering
• Graph clustering
Decompose a network into sub-networks based on some topological properties
Usually we look for dense sub-networks
Detect protein functional modules in a PPI network
from Nataša Pržulj – Introduction to Bioinformatics. 2011.
Community Detection in Social Network
Collaboration network between scientists
from Santo Fortunato – Community detection in graphs
Multi-view Graph clustering
• Graphs collected from multiple sources/domains
• Multi-view graph clustering
Refine clustering
Resolve ambiguity
Motivation
• Multi-view
Exact one-to-one
Complete mapping
The same size
• More common cases
Many-to-many
Tolerate partial mapping
Different sizes
Mappings are associated with weights (confidence)
Motivation
• Objective: design an algorithm that offers
Flexibility
Robustness
Suitable for common cases: many-to-many weighted partial mappings for multi-domain graph clustering.
Flexibility and Robustness
Noisy graphs have little influence on others
Problem Formulation
$A^{(1)}, A^{(2)}, A^{(3)}$: affinity matrices
$S^{(i,j)}_{a,b}$ denotes the weight between the a-th instance in $D_j$ and the b-th instance in $D_i$.
The goal is to partition each $A^{(\pi)}$ into $k_\pi$ clusters while considering the co-regularized constraints implicitly encoded in the cross-domain relationships in $S$.
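A many-to-many, weighted, partial mapping can be stored as a sparse relationship matrix. The sketch below is only an illustration: the helper name `build_relationship`, the triple format, and the toy weights are assumptions, not part of the original slides. Rows index instances of $D_j$ and columns index instances of $D_i$, matching the definition of $S^{(i,j)}$ above.

```python
import numpy as np

def build_relationship(n_i, n_j, weighted_pairs):
    """Build S^(i,j) with shape (n_j, n_i).

    weighted_pairs holds hypothetical (row_in_Dj, col_in_Di, weight) triples.
    An instance that never appears keeps an all-zero row or column, which is
    how a partial mapping shows up in the matrix.
    """
    S = np.zeros((n_j, n_i))
    for row_j, col_i, weight in weighted_pairs:
        S[row_j, col_i] = weight
    return S

# Toy mapping: D_i has 3 instances, D_j has 5 instances.
pairs = [(0, 0, 0.6), (1, 0, 0.9), (1, 1, 0.8), (3, 2, 0.6)]
S12 = build_relationship(n_i=3, n_j=5, weighted_pairs=pairs)

print(S12.shape)                        # domains may differ in size
print(np.count_nonzero(S12, axis=1))    # number of mapped partners per D_j instance
```

Row 1 has two partners (many-to-many), while rows 2 and 4 have none (partial mapping).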
Co-regularized multi-domain graph clustering (CGC)
• Single-domain clustering
Symmetric non-negative matrix factorization (NMF).
Minimizing:
$$L^{(\pi)} = \| A^{(\pi)} - H^{(\pi)} (H^{(\pi)})^T \|_F^2 \quad \text{s.t. } H^{(\pi)} \ge 0$$
Here, $H^{(\pi)} = [h^{(\pi)}_{1*}, h^{(\pi)}_{2*}, \ldots, h^{(\pi)}_{n_\pi *}]^T \in \mathbb{R}^{n_\pi \times k_\pi}$, where each $h^{(\pi)}_{a*}$ represents the cluster assignment of the a-th instance in domain $D_\pi$.
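The single-domain loss can be evaluated in a few lines. This is a minimal sketch under assumed toy data: the function name `symmetric_nmf_loss` and the two-clique graph are illustrative only.

```python
import numpy as np

def symmetric_nmf_loss(A, H):
    """Squared Frobenius norm of A - H H^T for a non-negative H."""
    residual = A - H @ H.T
    return float(np.sum(residual ** 2))

# Toy affinity matrix: two disconnected cliques of two nodes each.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Each row of H is the cluster assignment of one instance.
H_good = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
H_bad = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

loss_good = symmetric_nmf_loss(A, H_good)
loss_bad = symmetric_nmf_loss(A, H_bad)
labels = H_good.argmax(axis=1)   # hard labels read off the soft assignment
print(loss_good, loss_bad, labels)
```

An assignment that follows the two cliques reconstructs $A$ exactly, while one that splits them leaves a positive residual.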
Co-regularized multi-domain graph clustering (CGC)
• Cross-domain co-regularization
Residual sum of squares (RSS) loss (when the number of clusters is the same for different domains).
Clustering disagreement (CD) loss (when the number of clusters is the same or different).
Co-regularized multi-domain graph clustering (CGC)
• Residual sum of squares (RSS) loss
Directly compare the $H^{(\pi)}$ inferred in different domains.
To penalize the inconsistency of cross-domain cluster partitions for the l-th cluster in $D_i$, the loss for the b-th instance is
$$J_{b,l}^{(i,j)} = \left( E^{(i,j)}(x_b^{(j)}, l) - h_{b,l}^{(j)} \right)^2$$
with
$$E^{(i,j)}(x_b^{(j)}, l) = \frac{1}{|N^{(i,j)}(x_b^{(j)})|} \sum_{a \in N^{(i,j)}(x_b^{(j)})} S^{(i,j)}_{b,a} \, h^{(i)}_{a,l}$$
where $N^{(i,j)}(x_b^{(j)})$ denotes the set of indices of instances in $D_i$ that are mapped to $x_b^{(j)}$, and $|N^{(i,j)}(x_b^{(j)})|$ is its cardinality. The RSS loss is
$$J_{RSS}^{(i,j)} = \sum_{l=1}^{k} \sum_{b=1}^{n_j} J_{b,l}^{(i,j)} = \| S^{(i,j)} H^{(i)} - H^{(j)} \|_F^2$$
Figure: H(1) is 12×2, H(2) is 19×2, H(3) is 7×2; the products S(1,2)H(1) and S(3,2)H(3) are both 19×2.

H(1)
     C1    C2
A    0.8   0.2
B    0.7   0.3
…    …     …
C    0.1   0.9

S(3,2)
     1    2    …    3    4    5
a    0    0    …    0    0    0.4
…    …    …    …    …    …

S(1,2)
     A     B     …    C
1    0.6   0     …    0
2    0.9   0.8   …    0
…    …     …     …    …
3    0     0.1   …    0
4    0     0     …    0.6
5    0     0     …    0

H(2)
     C1    C2
1    0.8   0.2
2    0.7   0.3
…    …     …
3    0.1   0.9
4
5

H(3)
     C1    C2
a    0.8   0.2
…    …     …
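The RSS loss can be computed directly from its per-instance definition. The sketch below is illustrative: the function names, the toy relationship matrix `S`, and the target matrix `H_j` are assumptions, and instances of $D_j$ with no mapped partner are skipped, which is a simplifying choice not stated on the slides. The three rows of `H_i` reuse the values shown for A, B and C.

```python
import numpy as np

def expected_assignment(S, H_i):
    """E(x_b, l): weighted sum of mapped rows of H_i divided by |N(x_b)|."""
    S = np.asarray(S, dtype=float)
    counts = np.count_nonzero(S, axis=1)          # |N(x_b)| for every b
    mapped = counts > 0
    E = np.zeros((S.shape[0], H_i.shape[1]))
    E[mapped] = (S[mapped] @ H_i) / counts[mapped, None]
    return E, mapped

def rss_loss(S, H_i, H_j):
    """Sum of squared gaps between E and H_j over mapped instances only."""
    E, mapped = expected_assignment(S, H_i)
    gap = E[mapped] - H_j[mapped]
    return float(np.sum(gap ** 2))

H_i = np.array([[0.8, 0.2],      # A
                [0.7, 0.3],      # B
                [0.1, 0.9]])     # C
S = np.array([[1.0, 1.0, 0.0],   # first D_j instance maps to A and B
              [0.0, 0.0, 1.0],   # second maps to C
              [0.0, 0.0, 0.0]])  # third is unmapped (partial mapping)
H_j = np.array([[0.75, 0.25],
                [0.90, 0.10],
                [0.50, 0.50]])

loss = rss_loss(S, H_i, H_j)
print(loss)
```

The first instance agrees with the average of A and B and contributes nothing, so the whole loss comes from the second instance, whose assignment contradicts C. Because the comparison is entry by entry, both domains need the same number of clusters.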
Co-regularized multi-domain graph clustering (CGC)
• Clustering disagreement (CD)
Indirectly measure the clustering inconsistency of cross-domain cluster partitions.
Intuition: Ⓐ and Ⓑ are mapped to ②, and Ⓒ is mapped to ④. Intuitively, if the similarity between the cluster assignments for ② and ④ is small, then the similarity of cluster assignments between Ⓐ and Ⓒ and the similarity between Ⓑ and Ⓒ should also be small.
The CD loss is
$$J_{CD}^{(i,j)} = \| S^{(i,j)} H^{(i)} (S^{(i,j)} H^{(i)})^T - H^{(j)} (H^{(j)})^T \|_F^2$$
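Because the CD loss compares pairwise similarities rather than assignment vectors, the two domains may use different numbers of clusters. A minimal sketch follows; the function name `cd_loss` and all toy matrices are assumptions made for illustration.

```python
import numpy as np

def cd_loss(S, H_i, H_j):
    """|| (S H_i)(S H_i)^T - H_j H_j^T ||_F^2."""
    projected = S @ H_i
    diff = projected @ projected.T - H_j @ H_j.T
    return float(np.sum(diff ** 2))

# Domain i: A and B share a cluster, C is in the other one (2 clusters).
H_i = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
# Two instances of domain j: the first maps to A and B, the second to C.
S = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

# Domain j uses 3 clusters, which the RSS loss could not handle.
H_j_agree = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0]])      # the two instances are separated
H_j_disagree = np.array([[1.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0]])   # the two instances are merged

loss_agree = cd_loss(S, H_i, H_j_agree)
loss_disagree = cd_loss(S, H_i, H_j_disagree)
print(loss_agree, loss_disagree)
```

Separating the two instances matches the structure implied by domain i and gives zero loss; merging them is penalized.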
Co-regularized multi-domain graph clustering (CGC)
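• Objective function (joint matrix optimization):
$$\min_{H^{(\pi)} \ge 0 \; (1 \le \pi \le d)} O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}$$
where $I$ is the set of domain pairs that have a cross-domain relationship, $J^{(i,j)}$ is the RSS or CD loss, and $\lambda^{(i,j)} \ge 0$ weights the cross-domain penalty.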
Can be solved with an alternating scheme: optimize the objective with respect to one variable while fixing others.
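One way to realize such an alternating scheme is sketched below. It is not the update rule from the slides: it swaps in projected gradient descent with a backtracking step, uses the RSS loss, and applies a single weight `lam` to every domain pair. All function names and the toy data are assumptions.

```python
import numpy as np

def objective(A, H, S, lam):
    """Sum of single-domain losses plus lam times each RSS penalty."""
    total = sum(np.sum((A[p] - H[p] @ H[p].T) ** 2) for p in A)
    total += sum(lam * np.sum((S_ij @ H[i] - H[j]) ** 2)
                 for (i, j), S_ij in S.items())
    return float(total)

def gradient(p, A, H, S, lam):
    """Gradient with respect to H[p], others fixed (A[p] must be symmetric)."""
    G = -4.0 * (A[p] - H[p] @ H[p].T) @ H[p]
    for (i, j), S_ij in S.items():
        R = S_ij @ H[i] - H[j]
        if p == i:
            G += 2.0 * lam * (S_ij.T @ R)
        if p == j:
            G -= 2.0 * lam * R
    return G

def cgc_alternating(A, S, k, lam=1.0, sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    H = {p: rng.random((A[p].shape[0], k)) for p in A}
    history = [objective(A, H, S, lam)]
    for _ in range(sweeps):
        for p in A:                       # update one domain, fix the rest
            G = gradient(p, A, H, S, lam)
            current, step = objective(A, H, S, lam), 1.0
            while step > 1e-10:           # halve the step until the objective drops
                trial = dict(H)
                trial[p] = np.maximum(H[p] - step * G, 0.0)   # keep H >= 0
                if objective(A, trial, S, lam) < current:
                    H = trial
                    break
                step *= 0.5
        history.append(objective(A, H, S, lam))
    return H, history

# Toy data: two domains with two cliques each (4 and 6 instances).
A = {1: np.kron(np.eye(2), np.ones((2, 2))),
     2: np.kron(np.eye(2), np.ones((3, 3)))}
S = {(1, 2): np.kron(np.eye(2), np.full((3, 2), 0.5))}   # shape (6, 4)

H, history = cgc_alternating(A, S, k=2)
print(history[0], history[-1])
```

Each inner update is accepted only if it lowers the objective, so the recorded objective values never increase. The result still depends on the random initialization, as with other NMF-style methods.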
• Data sets:
UCI (Iris, Wine, Ionosphere, WDBC)
Construct two cross-domain relationships: Iris-Wine and Ionosphere-WDBC (positive/negative instances are mapped only to positive/negative instances in the other domain)
Newsgroups data (from 20 Newsgroups): comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
Protein-protein interaction (PPI) networks (from BioGrid), gene co-expression networks (from Gene Expression Omnibus), genetic interaction network (from TEAM)
Experimental Study
Experimental Study
• Effectiveness (UCI data set)
Experimental Study
• Robustness Evaluation (UCI)
Experimental Study
• Performance Evaluation
Experimental Study
• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data
5412 genes
490032 genetic markers across 4890 (1952 disease and 2938 healthy) samples.
We use 1 million top-ranked genetic marker pairs to construct the network and the test statistics as the weights on the edges.
Experimental Study
Protein Module Detection:
• Evaluation: standard Gene Set Enrichment Analysis (GSEA)
We identify the most significantly enriched Gene Ontology categories
Significance (p-value) is determined by Fisher's exact test
Raw p-values are further calibrated to correct for the multiple testing problem
GSEA
• The hypergeometric distribution is used to model the probability of observing at least k genes from a cluster of size n by chance in a category containing f genes from a total genome size of g genes.
• For example, if the majority of genes in a cluster appear from one category, then it is unlikely that this happens by chance and the category’s p-value would be close to 0.
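The tail probability described above can be written out with the standard library alone. This is a minimal sketch: the function name `enrichment_p_value` and the numbers in the examples are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """P(X >= k) for X ~ Hypergeometric: a cluster of n genes drawn from a
    genome of g genes, of which f belong to the category."""
    upper = min(n, f)
    tail = sum(comb(f, x) * comb(g - f, n - x) for x in range(k, upper + 1))
    return tail / comb(g, n)

# Small case that can be checked by hand: (6*6 + 4*1) / 120 = 1/3.
p_small = enrichment_p_value(k=2, n=3, f=4, g=10)

# Half of a 20-gene cluster falls in a 40-gene category out of 1000 genes.
p_enriched = enrichment_p_value(k=10, n=20, f=40, g=1000)

print(p_small, p_enriched)
```

The second case illustrates the remark above: when a large share of a cluster comes from one small category, the p-value is very close to 0.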
Experimental Study
• Protein Module Detection:
Comparison of CGC and single-domain graph clustering (k = 100)
Experimental Study
• Protein Module Detection:
Summary
• In this project, we developed a flexible co-regularized method, CGC, to tackle the many-to-many, weighted, partial mappings for multi-domain graph clustering.
CGC uses cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.
• SIGKDD’13
Comments and Questions