1
Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU)Spiros Papadimitriou (CMU)Dharmendra Modha (IBM)Christos Faloutsos (CMU and IBM)
2
Problem Definition
[Figure: customers × products matrix, rearranged into customer groups × product groups]
Simultaneously group customers and products (or documents and words, or users and preferences, …)
3
Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large matrices
4
Closely Related Work
Information-Theoretic Co-clustering [Dhillon+/2003]: the number of row and column groups must be specified.
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large matrices
5
Other Related Work
K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and cols simultaneously
“Frequent itemsets” [Agrawal+/1994]: user must specify the “support”
Information Retrieval [Deerwester+/1990, Hofmann/1999]: choosing the number of “concepts”
Graph Partitioning [Karypis+/1998]: number of partitions; measure of imbalance between clusters
6
What makes a cross-association “good”?
[Figure: two groupings of the same matrix, an arbitrary one versus one with a few homogeneous blocks]
Why is this better?
Good Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous blocks give Good Compression:
Good Clustering implies Good Compression
7
Main Idea
Good Clustering implies Good Compression

[Figure: binary matrix partitioned into row and column groups]

Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]
                    + cost of describing n_i^1, n_i^0 and the groups   [Description Cost]

where, for each block i, n_i^1 and n_i^0 count its ones and zeros, p_i^1 = n_i^1 / (n_i^1 + n_i^0), and H is the binary entropy function.
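The code-cost term is straightforward to compute per block. A minimal Python sketch (the function name `block_code_cost` is my own, not from the paper):

```python
import math

def block_code_cost(n1, n0):
    """Code cost in bits of a block with n1 ones and n0 zeros:
    (n1 + n0) * H(p1), where p1 = n1 / (n1 + n0) and H is binary entropy."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # an empty or perfectly homogeneous block costs nothing to encode
    p1 = n1 / n
    return n * (-p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1))

# A homogeneous block is free; a maximally mixed block costs 1 bit per cell.
print(block_code_cost(100, 0))  # 0.0
print(block_code_cost(5, 5))    # 10.0
```

This is what drives the clustering toward homogeneous blocks: any mixing inside a block is paid for in bits.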
8
Examples
One row group, one column group: high code cost, low description cost.
m row groups, n column groups: low code cost, high description cost.

Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]
                    + cost of describing n_i^1, n_i^0 and the groups   [Description Cost]
9
What makes a cross-association “good”?
Why is this better? The good grouping is low on both terms:

Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]
                    + cost of describing n_i^1, n_i^0 and the groups   [Description Cost]

[Figure: the two groupings again; the one with a few homogeneous blocks has the lower total cost]
10
Algorithms
[Figure: the search alternately grows k and l: (k=1, l=2) → (k=2, l=2) → (k=2, l=3) → (k=3, l=3) → (k=3, l=4) → (k=4, l=4) → (k=4, l=5) → … → k = 5 row groups, l = 5 col groups]
11
Algorithms
[Flowchart: start with the initial matrix; find good groups for fixed k and l; choose better values for k and l; repeat, lowering the encoding cost at each step, until the final cross-association (here k = 5, l = 5)]
12
Fixed k and l
[The same flowchart, now highlighting the “find good groups for fixed k and l” step]
13
Fixed k and l
[Figure: matrix partitioned into row and column groups]
Shuffles: for each row, shuffle it to the row group which minimizes the code cost; columns are shuffled analogously.
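One shuffle pass can be sketched in plain Python. This is an illustration of the idea, not the paper's implementation; all function names are mine, and the density clamp is one simple way to handle all-zero or all-one blocks:

```python
import math

def block_densities(A, row_labels, col_labels, k, l):
    """Fraction of ones p[i][j] in each (row group i, column group j) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(A):
        for c, v in enumerate(row):
            ones[row_labels[r]][col_labels[c]] += v
            size[row_labels[r]][col_labels[c]] += 1
    return [[ones[i][j] / size[i][j] if size[i][j] else 0.0 for j in range(l)]
            for i in range(k)]

def row_code_cost(row, col_labels, l, p_i):
    """Bits to encode one binary row under the block densities p_i of a row group."""
    cost = 0.0
    for j in range(l):
        n1 = sum(v for c, v in enumerate(row) if col_labels[c] == j)
        n0 = sum(1 for c in range(len(row)) if col_labels[c] == j) - n1
        p = min(max(p_i[j], 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
        cost -= n1 * math.log2(p) + n0 * math.log2(1 - p)
    return cost

def shuffle_rows(A, row_labels, col_labels, k, l):
    """One shuffle pass: assign each row to the row group minimizing its code cost."""
    p = block_densities(A, row_labels, col_labels, k, l)
    return [min(range(k), key=lambda i: row_code_cost(row, col_labels, l, p[i]))
            for row in A]
```

Column shuffles are symmetric (swap the roles of rows and columns). On a small block-diagonal matrix, starting from the labels [0, 0, 0, 1], a single pass already recovers the two natural row groups.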
15
Choosing k and l
[The same flowchart, now highlighting the “choose better values for k and l” step]
16
Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to a new row group, and set k = k + 1
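The three split steps can be sketched as follows. Again a hedged illustration, not the paper's code: the names are mine, and the per-row entropy here is the average per-row code cost of the group, computed one column group at a time:

```python
import math

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_per_row(rows, col_labels, l):
    """Average code cost per row of a set of binary rows: sum over column
    groups j of n_j * H(density_j), divided by the number of rows."""
    if not rows:
        return 0.0
    total = 0.0
    for j in range(l):
        cols = [c for c in range(len(col_labels)) if col_labels[c] == j]
        n1 = sum(row[c] for row in rows for c in cols)
        n = len(rows) * len(cols)
        if n:
            total += n * H(n1 / n)
    return total / len(rows)

def split(A, row_labels, col_labels, k, l):
    """One split step: pick the row group R with maximum entropy per row, then
    move to a new group (index k) every row whose removal lowers R's entropy
    per row. The caller then sets k = k + 1."""
    groups = {i: [r for r in range(len(A)) if row_labels[r] == i] for i in range(k)}
    R = max(range(k),
            key=lambda i: entropy_per_row([A[r] for r in groups[i]], col_labels, l))
    base = entropy_per_row([A[r] for r in groups[R]], col_labels, l)
    new_labels = list(row_labels)
    for r in groups[R]:
        rest = [A[x] for x in groups[R] if x != r]
        if entropy_per_row(rest, col_labels, l) < base:
            new_labels[r] = k  # send to the new row group
    return new_labels
```

For example, in a single group containing two all-ones rows and one all-zeros row, removing the odd row out lowers the entropy per row, so only that row is sent to the new group.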
18
Algorithms
[Flowchart again: Shuffles find good groups for fixed k and l; Splits choose better values for k and l; both lower the encoding cost]
19
Experiments
“Customer-Product” graph with Zipfian sizes, no noise
[Figure: recovered cross-association, k = 5 row groups, l = 5 col groups]
20
Experiments
“Quasi block-diagonal” graph with Zipfian sizes, noise = 10%
[Figure: recovered cross-association, k = 6 row groups, l = 8 col groups]
21
Experiments
“White Noise” graph: we find the existing spurious patterns
[Figure: k = 2 row groups, l = 3 col groups]
22
Experiments
“CLASSIC” dataset:
• 3,893 documents
• 4,303 words
• 176,347 “dots” (non-zeros)
Combination of 3 sources:
• MEDLINE (medical)
• CISI (information retrieval)
• CRANFIELD (aerodynamics)
[Figure: documents × words matrix]
24
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE (medical) word groups:
• insipidus, alveolar, aortic, death, prognosis, intravenous
• blood, disease, clinical, cell, tissue, patient
25
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE (medical)
CISI (Information Retrieval) word groups:
• providing, studying, records, development, students, rules
• abstract, notation, works, construct, bibliographies
26
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE (medical)
CISI (Information Retrieval)
CRANFIELD (aerodynamics) word group:
• shape, nasa, leading, assumed, thin
27
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE (medical)
CISI (IR)
CRANFIELD (aerodynamics)
• paint, examination, fall, raise, leave, based
28
Experiments
“GRANTS” dataset:
• 13,297 documents (NSF grant proposals)
• 5,298 words (in abstracts)
• 805,063 “dots” (non-zeros)
[Figure: NSF grant proposals × words-in-abstract matrix]
29
Experiments
“GRANTS” graph of documents & words: k=41, l=28
[Figure: NSF grant proposals × words-in-abstract matrix, rearranged]
30
Experiments
“GRANTS” graph of documents & words: k=41, l=28
The Cross-Associations refer to topics:
• Genetics
• Physics
• Mathematics
• …
31
Experiments
“Who-trusts-whom” graph from epinions.com: k=18, l=16
[Figure: Epinions.com user × Epinions.com user matrix, rearranged]
32
Experiments
[Plot: time (secs) vs. number of “dots”, for both Splits and Shuffles]
Linear in the number of “dots”: scalable.
33
Conclusions
Desiderata:
Simultaneously discover row and column groups
Fully Automatic: No “magic numbers”
Scalable to large matrices
34
Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.

Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
37
Fixed k and l
[The same flowchart, with “swaps” annotating the “find good groups for fixed k and l” step]
39
Aim
Given any binary matrix, a “good” cross-association will have low total encoding cost. But how can we find such a cross-association?
[Figure: matrix with k = 5 row groups, l = 5 col groups]
40
Main Idea
Total Encoding Cost = Σ_i size_i · H(p_i)   [Code Cost]
                    + cost of describing the cross-associations   [Description Cost]

Minimize the total cost: Good Compression implies Better Clustering.
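To see the MDL trade-off concretely, here is a rough Python sketch of the total cost. The description-cost terms are simplified (bits for each block's count of ones, plus bits for each row's and column's group label); the paper's exact description cost has more terms:

```python
import math

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def total_cost(A, row_labels, col_labels, k, l):
    """Approximate total encoding cost: per-block code cost n * H(p), plus a
    simplified description cost. Only the dominant terms are kept."""
    code = desc = 0.0
    for i in range(k):
        for j in range(l):
            rows = [r for r in range(len(A)) if row_labels[r] == i]
            cols = [c for c in range(len(col_labels)) if col_labels[c] == j]
            n = len(rows) * len(cols)
            n1 = sum(A[r][c] for r in rows for c in cols)
            if n:
                code += n * H(n1 / n)
            desc += math.log2(n + 1)  # transmit n1, an integer in [0, n]
    desc += len(A) * math.log2(k) + len(col_labels) * math.log2(l)  # group labels
    return code + desc

# MDL picks the grouping with the lower total cost, with no magic numbers:
A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
grouped   = total_cost(A, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
ungrouped = total_cost(A, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)
assert grouped < ungrouped
```

On this block-diagonal example, the k = l = 2 grouping makes every block homogeneous (zero code cost) and still pays less in description cost than the single mixed block pays in code cost, so the total cost correctly prefers it.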