hc-4012, complex network clustering using gpu-based parallel non-negative matrix factorization, by...

Huming Zhu, Maoguo Gong, Baolin Huang, 2013.11

Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization

Xidian University

OpenCL course (ID: 0222277, 0242277): OpenCL Programming, Practice; 2011, 2012, 2013

Contents

1 Complex Network Clustering of NMF
2 Parallel Bayesian NMF on GPU
3 Sparse BNMF on GPU
4 Experiment
5 Conclusion

5 Xidian University 12/7/13 5

* All pictures are from the Internet

Complex Network Clustering



Network clustering aims to divide a network into several communities such that the number of edges linking nodes of the same community is higher than the number of edges joining nodes that belong to different communities.

•  Network clustering is essential for understanding how a network is organized and functions.
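As a minimal illustration of the definition above, the following Python sketch counts the edges that fall inside communities and the edges that cross between them for a given partition. The function name `count_edges` and the toy graph are assumptions made for illustration; they do not come from the slides.

```python
def count_edges(edges, community):
    """Return (intra, inter): edges inside a community vs. edges across communities."""
    intra = inter = 0
    for u, v in edges:
        if community[u] == community[v]:
            intra += 1
        else:
            inter += 1
    return intra, inter


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

intra, inter = count_edges(edges, community)
print(intra, inter)  # 6 1
```

A partition that respects the two triangles keeps six edges inside communities and cuts only one, which is the kind of split network clustering looks for.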


Non-negative Matrix Factorization (NMF)

• Powerful interpretability and a close relationship to clustering methods.

• Needs a lot of computation power.

• The NMF problem is defined as searching for an approximation of the matrix A with respect to some metric (e.g., a norm) by factoring A into the product W × H of two reduced matrices W and H.

• NMF has been applied in many areas, such as image processing.

[1] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
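To make the factorization A ≈ W × H concrete, here is a small pure-Python sketch that uses the classic Lee and Seung multiplicative updates for the squared Frobenius norm. This is the plain (non-Bayesian) NMF, not the Bayesian variant used later in the deck; the helper names, the toy matrix, the iteration count, and the argmax cluster assignment are all assumptions made for illustration.

```python
import random


def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frob_error(A, W, H):
    """Squared Frobenius norm of A - W*H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates; W is n x k, H is k x m, both non-negative."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
             for rh, rp, rq in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(rw, rp, rq)]
             for rw, rp, rq in zip(W, num, den)]
    return W, H


# Hypothetical toy input: two disconnected triangles, with self-loops on the diagonal.
A = [[1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 1]]

W, H = nmf(A, k=2)
# Assumed clustering rule: node j joins the community with the largest entry in column j of H.
labels = [max(range(2), key=lambda c: H[c][j]) for j in range(len(A[0]))]
print(labels)
```

Every update is built from dense matrix products and element-wise operations, which is why the method needs a lot of computation and maps naturally onto a GPU.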


Bayesian NMF

Input: nonnegative data (observation) matrix A; fixed hyperparameters a, b.

Output: nonnegative matrices W and H.

Step 1. Initialize W and H to nonnegative values.

Steps 2-4. [update equations not preserved in the slide text]

Step 5. If converged, stop; otherwise, go to Step 2.



2 Parallel Bayesian NMF on GPU




Parallel Bayesian NMF

• P-BNMF

• Sparse-BNMF


P-BNMF kernels

• Matrix multiplication

• Matrix sum of squares


Matrix multiplication

• Update matrix: W × H

• Kernel: mat_mult_AB


Sum of squares of a matrix
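The two primitives named on these slides can be sketched serially in Python as a reference for what the kernels compute. The function names reuse kernel names that appear in the profiling tables, and the 16 × 16 tile mirrors the work-group size reported there, but the bodies are assumptions: they are not the authors' OpenCL code, and the row or column direction of the two square-sum kernels is a guess based on their names.

```python
TILE = 16  # assumed tile edge, matching the 16 x 16 work-group size


def mat_mult_AB(A, B, tile=TILE):
    """C = A * B, accumulated tile by tile the way a work-group would."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # one "work-group" per output tile
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):      # march tiles along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] += acc
    return C


def mat_squ_sum_row(M):
    """Sum of squared entries of each row (assumed meaning of the kernel name)."""
    return [sum(x * x for x in row) for row in M]


def mat_squ_sum_col(M):
    """Sum of squared entries of each column (assumed meaning of the kernel name)."""
    return [sum(x * x for x in col) for col in zip(*M)]


W = [[1, 2], [3, 4]]
H = [[5, 6], [7, 8]]
print(mat_mult_AB(W, H))      # [[19.0, 22.0], [43.0, 50.0]]
print(mat_squ_sum_row(W))     # [5, 25]
print(mat_squ_sum_col(W))     # [10, 20]
```

On the GPU each output tile would be handled by one work-group, with the tiles of both operands staged in local memory before the inner products are accumulated.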



3 Sparse BNMF on GPU




Sparse-BNMF

Problem

• The GPU has only 1 GB of memory, which limits the network scale that P-BNMF can handle.

Solution

• Store the matrix in a sparse matrix storage format (CSR) and present Sparse-BNMF.


Sparse-BNMF

CSR: Aj (column indices), Av (nonzero values), Ap (row pointers)

CSR by column: Aj_column, Av_column, Ap_column
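The following Python sketch shows how the three CSR arrays can be built from a dense matrix, and how a column-wise copy can be obtained by compressing the transpose. The array names follow the slide; the helper names `to_csr` and `to_csr_column` and the example matrix are assumptions made for illustration.

```python
def to_csr(dense):
    """Return (Ap, Aj, Av): row pointers, column indices, nonzero values."""
    Ap, Aj, Av = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                Aj.append(j)
                Av.append(v)
        Ap.append(len(Aj))  # row r occupies positions Ap[r] .. Ap[r+1]-1
    return Ap, Aj, Av


def to_csr_column(dense):
    """Column-wise variant: compress the transpose, so the index array holds row indices."""
    return to_csr([list(col) for col in zip(*dense)])


# Hypothetical 2 x 3 example matrix.
A = [[0, 2, 0],
     [3, 0, 4]]

Ap, Aj, Av = to_csr(A)
print(Ap, Aj, Av)  # [0, 1, 3] [1, 0, 2] [2, 3, 4]

Ap_column, Aj_column, Av_column = to_csr_column(A)
print(Ap_column, Aj_column, Av_column)  # [0, 1, 2, 3] [1, 0, 1] [3, 2, 4]
```

Only the nonzero entries and their indices are stored. For a network such as Netscience (1461 vertices, 2742 edges), and assuming an undirected adjacency matrix without self-loops, that is 5484 stored values instead of 1461 × 1461 = 2,134,521 dense entries.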


Pseudo-code for the A_WH_csr kernel

    uint row = globalidy;
    if (row < row_num) {
        uint rowStart = Ap[row];      // start position in Aj of this row
        uint rowEnd   = Ap[row + 1];  // end position of this row
        // the work-group size is 16*1
        int index = rowStart + groupidx * 16 + localid;  // position in Aj and Av of this PE (processing element)
        int col = Aj[index];          // column index of the nonzero handled by this PE

        int aStart = widthA * groupidy;
        int aEnd   = aStart + widthA - 1;
        int aStep  = 16;
        float Csub = 0.000001f;       // small offset keeps 1.0/Csub finite
        int bStart = col;
        int bStep  = 16 * widthB;

        for (int a = aStart, b = bStart; a < aEnd; a += aStep, b += bStep) {
            if (rowStart + groupidx * 16 < rowEnd) {  // this group holds at least one nonzero value
                As[localid] = W[a + localid];
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (rowStart + groupidx * 16 + localid < rowEnd) {  // this PE corresponds to a nonzero value
                for (int k = 0; k < 16; k++)
                    Bs[k * 16 + localid] = H[b + k * widthB];
                for (int k = 0; k < 16; k++)
                    Csub += Bs[k * 16 + localid] * As[k];
            }
            if (rowStart + groupidx * 16 + localid < rowEnd)
                Av_result[index] = 1.0f / Csub;
        }
    }
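Read serially, the kernel above appears to compute, for every stored nonzero position (row, col) of A, the reciprocal of the matching entry of W × H. The Python sketch below is a reference for that reading only: the function name `a_wh_csr`, the list-of-lists layout of W and H, and the example data are assumptions, and no GPU tiling or local memory is modelled.

```python
def a_wh_csr(Ap, Aj, W, H, offset=1e-6):
    """For each stored nonzero (row, col) of A, return 1 / (W[row, :] . H[:, col])."""
    Av_result = [0.0] * len(Aj)
    for row in range(len(Ap) - 1):
        for index in range(Ap[row], Ap[row + 1]):   # nonzeros of this row
            col = Aj[index]
            csub = offset                           # the kernel starts its accumulator at 0.000001
            for k in range(len(H)):
                csub += W[row][k] * H[k][col]
            Av_result[index] = 1.0 / csub
    return Av_result


# Hypothetical data: A has nonzeros at (0, 1), (1, 0) and (1, 2).
Ap, Aj = [0, 1, 3], [1, 0, 2]
W = [[1.0, 2.0],
     [3.0, 4.0]]
H = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

print(a_wh_csr(Ap, Aj, W, H, offset=0.0))
```

Because the result is written only at the nonzero positions of A, the output has the same length as Av and the dense product W × H never has to be stored.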



4 Experiment


Machine

• AMD Accelerated Parallel Processing (APP) SDK v2.7, OpenCL 1.2

• Microsoft Visual Studio 2010

Host
Product Name         HP xw9400 workstation
OS                   Windows 7 x64 Edition
CPU                  4× Dual-Core AMD Opteron 2220, 2.80 GHz
Memory               32 GB

Device
Product Name         AMD Radeon HD 7770
Engine Speed         1000 MHz
Processing Elements  640
Memory               1 GB GDDR5
Memory Bandwidth     72 GB/s
PCI                  PCI Express 3.0 x16


Synthetic networks
Data        Vertices  Edges    Q
Benchmark   128       1024     0.450
LFR         500       5135     0.813
LFR         1000      9582     0.904
LFR         5000      38007    0.908
LFR         10000     148470   0.860
LFR         50000     748337   0.900

Real-world networks
Data        Vertices  Edges    Q
Facebook    324       4436     0.620
Email       1133      5451     0.531
Netscience  1461      2742     0.905
Power       4941      6594     0.599
Scientists  6650      59870    0.647
Hep         7610      15751    0.772

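Evaluation: Modularity (Q) [1]

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j)$$

Here A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and δ(C_i, C_j) is 1 if nodes i and j belong to the same community and 0 otherwise.

• A higher Q indicates a better community structure.

[1] M. E. J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2) (2004) 026113.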
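A short Python sketch of the modularity computation follows. It assumes an undirected, unweighted graph given as an edge list with no self-loops or duplicate edges; the function name `modularity` and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman-Girvan modularity Q of an undirected, unweighted graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # (1/2m) * sum_ij A_ij * delta(C_i, C_j): fraction of edges inside communities
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # (1/2m) * sum_ij (k_i k_j / 2m) * delta(C_i, C_j) = sum over communities of (d_c / 2m)^2,
    # where d_c is the total degree of community c
    total_degree = {}
    for node, k in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in total_degree.values())
    return inside - expected


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(round(modularity(edges, community), 4))  # 0.3571
```

For this graph m = 7, six of the seven edges are internal, and each community has total degree 7, so Q = 6/7 - 2 × (7/14)² = 5/14. Putting all nodes in one community gives Q = 0.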


Network demo

[Figures: Netscience (part); Facebook]

• The Netscience network is a co-authorship network of scientists working on network theory and experiment.


Speedup

Data        Vertices  K    BNMF(s)    P-BNMF(s)  Sparse-BNMF(s)  P-Ratio  Sparse-Ratio
Benchmark   128       64   4.165      0.166      0.226           4.37     3.1
LFR         500       128  109.9      0.823      1.096           67.63    51.35
LFR         1000      128  712.5      2.98       2.798           187.58   181.6
LFR         5000      128  31031.5    109.96     71.167          279.39   417.21
LFR         10000     128  186321.7   615.09     334.23          302.92   556.2
LFR         50000     128  *          *          8250.28         *        *
Facebook    324       128  46.25      1.328      1.656           34.82    27.93
Email       1133      128  774.4      3.901      3.042           162.24   189.33
Netscience  1461      128  1253.2     6.725      4.628           166.11   215.81
Power       4941      128  26202.4    108.30     61.787          239.29   404.38
Hep         7610      128  76827.2    271.28     152.66          281.75   491.85
Scientists  6650      128  63254.5    208.2      125.55          303.81   503.84

K is the number of clusters; BNMF(s) is the serial running time in seconds; P-Ratio is the speedup of P-BNMF over BNMF; Sparse-Ratio is the speedup of Sparse-BNMF over BNMF.


Speedup

• Dataset: Netscience

• Cluster number K: 64 to 256

• Speedup: Sparse-BNMF is better


• Using CodeXL to analyze OpenCL kernels on AMD GPUs


Table 1. P-BNMF kernels
Method            GlobalWorkSize   WorkGroupSize  Time
Update_H          {1472 128 1}     {16 16 1}      6.12726
mat_mult_AB       {1472 1472 1}    {16 16 1}      10.73615
mat_dot_div       {1472 1472 1}    {16 16 1}      3.70267
mat_mult_AtB      {1472 128 1}     {16 16 1}      9.72355
mat_dot_mult      {1472 128 1}     {16 16 1}      0.30133
mat_squ_sum_row   {1472 128 1}     {64 1 1}       0.5483
mat_squ_sum_col   {128 1472 1}     {1 64 1}       7.27985
update_invbeta    {128 1 1}        {4 1 1}        0.03763
Update_W          {128 1472 1}     {16 16 1}      6.25437
mat_mult_AB       {1472 1472 1}    {16 16 1}      10.75037
mat_dot_div       {1472 1472 1}    {16 16 1}      3.64148
mat_mult_ABt      {128 1472 1}     {16 16 1}      9.04222
mat_dot_mult      {128 1472 1}     {16 16 1}      0.2843

Table 2. Sparse-BNMF kernels
Method            GlobalWorkSize   WorkGroupSize  Time
Update_H          {1472 128 1}     {16 16 1}      6.11407
A_WH_csr_col      {1472 1472 1}    {1 16 1}       7.76119
mat_mult_A_s_col  {1461 2048 1}    {1 16 1}       5.36341
mat_dot_mult      {1472 128 1}     {16 16 1}      0.2917
mat_squ_sum_row   {1472 128 1}     {64 1 1}       0.55304
mat_squ_sum_col   {128 1472 1}     {1 64 1}       6.99467
update_invbeta    {128 1 1}        {4 1 1}        0.03748
Update_W          {128 1472 1}     {16 16 1}      6.17718
A_WH_csr          {1472 1472 1}    {16 1 1}       6.29185
mat_mult_s_Bt     {2048 1461 1}    {16 1 1}       5.37615
mat_dot_mult      {128 1472 1}     {16 16 1}      0.27763

• Table 1, kernels in bold: W × H, dot multiply, AtB.

• Table 2, sparse kernels: A_WH_csr_col and mat_mult_A_s_col.

• CSR is better.

Kernel information provided by CodeXL


P-BNMF vs. Sparse-BNMF

           P-BNMF            Sparse-BNMF
Size       small (<10000)    big
Speedup    low               high

• The Sparse-BNMF algorithm effectively solves the memory limit problem, which enables it to deal with larger-scale networks.



5 Conclusion


Our work

• Presented P-BNMF and Sparse-BNMF:

    • P-BNMF;

    • Sparse-BNMF, using CSR storage.

• Achieved speedup over the serial BNMF.

Future

• Portability.


Thank you!
