GPU Parallel Hierarchical Clustering using CUDA by Marco Janc

Upload: djmj

Post on 03-Mar-2015

DESCRIPTION

GPU CUDA Hierarchical Clustering update Algorithm

TRANSCRIPT

Page 1: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

GPU Parallel Hierarchical Clustering using CUDA

by Marco Janc

Page 2: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel

Page 3: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Each thread processes one cluster
◦ It calculates the cluster's new value together with its neighbor, which is defined by the relation to the maximum cluster

Input
◦ DDM_(i-1)
◦ Length of DDM_(i-1)
◦ Row and column indices array
◦ Maximum cluster index

Output
◦ DDM_i: its length is one triangular number smaller than the length of DDM_(i-1)
◦ Cluster linear index references: each new cluster consists of two input cluster values, whose minimum cluster index is saved as a reference

Clustering GPU Kernel
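
The length relation between DDM_(i-1) and DDM_i can be sketched on the host in plain C++. This is only an illustration: the names ddmLength and nextDdmLength are hypothetical and do not appear in the slides. It assumes the DDM stores the strictly lower triangle for n clusters, so its length is n * (n - 1) / 2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of entries in the linearized, strictly lower
// triangular DDM for clusterCount clusters.
std::uint64_t ddmLength(std::uint64_t clusterCount) {
    return clusterCount * (clusterCount - 1) / 2;
}

// Hypothetical helper: length of DDM_i given the length of DDM_(i-1).
// The cluster count is recovered by an integer search, so no square root
// is involved.
std::uint64_t nextDdmLength(std::uint64_t prevLength) {
    std::uint64_t n = 1;
    while (ddmLength(n) < prevLength) {
        ++n;
    }
    assert(ddmLength(n) == prevLength && "length must be a triangular number");
    assert(n >= 2 && "at least two clusters are needed for a merge");
    return ddmLength(n - 1);  // one cluster fewer after the merge
}
```

For the five-cluster example used later in the slides, an input of length 10 gives an output of length 6.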

Page 4: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example

Page 5: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Legend

Clustering GPU Kernel Example (1)

maximum cluster

cluster whose row or column is equal to the maximum cluster's row or column

cluster whose row and column are both different from the maximum cluster's row and column; indicates a direct copy of the cluster

linIdx: calculates the linear index of the given row and column with the formula linIdx = (row - 1) * row * 0.5f + column

thread calculates the final value of two clusters that are in a row-wise or column-wise relation with the maximum cluster

theoretically unnecessary duplicate of an identical calculation, caused by parallelism

"-" in the output reference index array: marks the maximum cluster's reference, which is not present in the new DDM and is set to 0; the host knows the maximum cluster index and catches it while building the cluster tree

parent

left child / row

right child / column

CPU host tree node
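
The linIdx formula from the legend can also be evaluated with integers only. The sketch below is an assumption about how a helper such as getMatLinIdx (the name used in the kernel slides) might look; its signature is not given in the source. Since (row - 1) * row is always even, the integer division is exact, which avoids the rounding a 0.5f multiplication could introduce for large rows.

```cpp
#include <cassert>
#include <cstdint>

// Linear index of entry (row, column) in a strictly lower triangular matrix
// stored row by row (rows start at 1, columns run from 0 to row - 1).
// Signature is an assumption; only the formula comes from the slides.
std::uint64_t getMatLinIdx(std::uint64_t row, std::uint64_t column) {
    assert(row >= 1 && column < row);
    return row * (row - 1) / 2 + column;
}
```

With the example's maximum cluster at row 3, column 0, this yields linear index 3.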

Page 6: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Input Data

length: 10, maximum cluster index: 3

cl\cl  0      1      2      3
1      0.938
2      0.000  0.000
3      1.057  0.539  0.000
4      0.662  0.347  0.274  0.000

linearized: 0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0.000

Clustering GPU Kernel Example (2)

Linearized document-document matrix

Row and column indices array
◦ Calculating the row index in the kernel is vulnerable to low floating-point precision, since it involves a square root.
◦ The column index can easily be calculated from a given row index.

Linear Index  0 1 2 3 4 5 6 7 8 9
Row Index     1 2 2 3 3 3 4 4 4 4
Column Index  0 0 1 0 1 2 0 1 2 3
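
One way to avoid the square root is to build the row index array once on the host and derive the column from it. The following is a minimal sketch under that assumption: buildRowIndices is a hypothetical name, and getTriMatCol borrows the name used in the kernel slides, whose actual implementation is not shown in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host helper: row index for every linear index of a DDM with
// clusterCount clusters, built with integer loops only.
std::vector<std::uint32_t> buildRowIndices(std::uint32_t clusterCount) {
    std::vector<std::uint32_t> rows;
    if (clusterCount < 2) {
        return rows;  // no pairs, empty DDM
    }
    rows.reserve(static_cast<std::size_t>(clusterCount) * (clusterCount - 1) / 2);
    for (std::uint32_t row = 1; row < clusterCount; ++row) {
        for (std::uint32_t col = 0; col < row; ++col) {
            rows.push_back(row);  // row r occupies r consecutive entries
        }
    }
    return rows;
}

// Column index from a linear index and its known row (assumed signature):
// the row starts at linear index row * (row - 1) / 2.
std::uint32_t getTriMatCol(std::uint64_t linIdx, std::uint32_t row) {
    const std::uint64_t rowStart = static_cast<std::uint64_t>(row) * (row - 1) / 2;
    assert(linIdx >= rowStart && linIdx - rowStart < row);
    return static_cast<std::uint32_t>(linIdx - rowStart);
}
```

For five clusters, buildRowIndices returns the rows 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, one entry per linear index.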

Page 7: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (3)

Get linear index (1)

Thread Index      0 1 2 3 4 5 6 7 8 9
Row Index         1 2 2 3 3 3 4 4 4 4
New row Index     1 2 2 3 1 2 4 4 4 4   (update row indices: min)
Column Index      0 0 1 0 1 2 0 1 2 3
New column Index  0 0 1 0 0 0 0 1 2 0   (update column indices: min)

Note: two clusters are merged at their minimum index, so each merged cluster takes the minimum of its own and its neighbor's row and column; direct copies keep their indices. Thread 3 holds the maximum cluster and is skipped.

Page 8: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (4)

Get linear index (2)

Thread Index         0 1 2 3 4 5 6 7 8 9
New row Index        1 2 2 3 1 2 4 4 4 4
New column Index     0 0 1 0 0 0 0 1 2 0
Upd new row Index    1 2 2 3 1 2 3 3 3 3
Upd new col. Index   0 0 1 0 0 0 0 1 2 0
Output Linear Index  0 1 2 - 0 1 3 4 5 3   (linIdx)

Note:
I. If the new row is greater than or equal to the maximum row, it is decreased by 1.
II. If I applies, the cluster is a direct copy, and its column is greater than the maximum row, the column is also decreased by 1.

Page 9: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (5)

calculate cluster values

Thread Index         0 1 2 3 4 5 6 7 8 9
Output Linear Index  0 1 2 - 0 1 3 4 5 3

Input DDM_0   0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0
Output DDM_1  0.738 0 0 0.505 0.274 0

avg: the value of each merged cluster is the average of its two input values

Page 10: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (6)

calculate reference indices

Thread Index         0 1 2 3 4 5 6 7 8 9
Output Linear Index  0 1 2 - 0 1 3 4 5 3

Output Index         0 1 2 3 4 5
Output ref Indices   0 1 2 6 7 8   (min)

Note:
• since clusters are merged at their minimum, we only need the minimum of the cluster's and its neighbor's indices

Page 11: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (7)

calculate maximum cluster reference position to decrease CPU cluster tree building costs

thread index         0 1 2 3 4 5 6 7 8 9
row index            1 2 2 3 3 3 4 4 4 4
column index         0 0 1 0 1 2 0 1 2 3
linear output index  0 1 2 - 0 1 3 4 5 3

output index         0 1 2 3 4 5
copy flag            2 2 0 2 0 0

Note:
• clusters are merged at their minimum cluster
• check which child of the cluster equals which child of the maximum cluster
• 1 indicates left (row) child
• 2 indicates right (column) child
• 0 indicates none (direct copy)

Page 12: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Clustering GPU Kernel Example (8)

Input document-document matrix (length: 10)

cl\cl  0      1      2      3
1      0.938
2      0.000  0.000
3      1.057  0.539  0.000
4      0.662  0.347  0.274  0.000

linearized: 0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0.000

Output document-document matrix (length: 6)

cl\cl  0      1      2
1      0.738
2      0.000  0.000
3      0.505  0.274  0.000

linearized: 0.738 0.000 0.000 0.505 0.274 0.000

Page 13: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Update binary CPU Cluster-Tree

update cluster nodes with GPU-calculated values, references and max position values

Original clusters:

index   0      1      2      3      4      5      6      7      8      9
row     1      2      2      3      3      3      4      4      4      4
column  0      0      1      0      1      2      0      1      2      3
value   0.938  0      0      1.057  0.539  0      0.662  0.347  0.274  0.000

GPU output:

new index  0 1 2 3 4 5
orig ref   0 1 2 6 7 8
max pos    2 2 0 2 0 0

[Diagram: new cluster tree nodes with values 0.738, 0.505, 0.274 and 0.000; the maximum cluster node (row 3, column 0, value 1.05) is shown three times]

Note:
• a new CPU thread iterates asynchronously over the new values, takes the cluster defined by the index in “orig ref”, and adds the maximum cluster as the child defined by “max pos”

Page 14: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (CUDA) - parameters

Calculate new values, references and max pos (1)

/*
 * @param inValues_g      const float*             input cluster value array
 * @param inIdxRow_g      const unsigned int*      input row indices array
 * @param inCount_s       unsigned long long int   number of elements [triangular number]
 * @param inMaxIdx_s      unsigned long long int   index of the maximum cluster
 * @param outValues_g     float*                   output value array
 * @param outLinIdxRef_g  unsigned long long int*  output original linear cluster references array
 * @param outMaxPos_g     unsigned int*            output new cluster maximum position
 *                        0 = no relation
 *                        1 = left (row) child is maximum cluster left or right child
 *                        2 = right (column) child is maximum cluster left or right child
 *
 * _g = global memory; _s = shared memory
 */
__global__ void calcClusterNewValuesRefMaxPos(const float* inValues_g,
                                              const unsigned int* inIdxRow_g,
                                              const unsigned long long int inCount_s,
                                              const unsigned long long int inMaxIdx_s,
                                              float* outValues_g,
                                              unsigned long long int* outLinIdxRef_g,
                                              unsigned int* outMaxPos_g)
{
    //... see next slides
}
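
The kernel body on the following slides derives its thread id from a three-dimensional grid of one-dimensional blocks, so the host has to choose grid dimensions that cover at least inCount_s threads. The slides do not show the launch code; the sketch below is one possible way to pick the dimensions in plain C++. GridSize and chooseGrid are hypothetical names, GridSize merely stands in for CUDA's dim3, and the default limit of 65535 blocks per dimension is a conservative assumption (actual limits depend on the device).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 (not part of the slides).
struct GridSize {
    std::uint32_t x, y, z;
};

// Hypothetical helper: smallest x-first grid that covers `count` threads
// with one-dimensional blocks of `threadsPerBlock` threads.
GridSize chooseGrid(std::uint64_t count, std::uint32_t threadsPerBlock,
                    std::uint32_t maxDim = 65535) {
    assert(count > 0 && threadsPerBlock > 0 && maxDim > 0);
    const std::uint64_t blocks = (count + threadsPerBlock - 1) / threadsPerBlock;

    GridSize g{1, 1, 1};
    g.x = static_cast<std::uint32_t>(blocks < maxDim ? blocks : maxDim);

    const std::uint64_t rowsNeeded = (blocks + g.x - 1) / g.x;
    g.y = static_cast<std::uint32_t>(rowsNeeded < maxDim ? rowsNeeded : maxDim);

    const std::uint64_t perLayer = static_cast<std::uint64_t>(g.x) * g.y;
    const std::uint64_t layers = (blocks + perLayer - 1) / perLayer;
    assert(layers <= maxDim && "too many elements for a single launch");
    g.z = static_cast<std::uint32_t>(layers);
    return g;
}
```

The grid may contain more threads than elements; the kernel's early return for tId >= inCount_s discards the surplus.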

Page 15: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (CUDA) – initialize cluster objects

Calculate new values, references and max pos (2)

//linear block id in a 3D grid; widened to 64 bit before multiplying to avoid overflow
const unsigned long long int blockId = (unsigned long long int)blockIdx.y * gridDim.x + blockIdx.x
                                     + (unsigned long long int)gridDim.x * gridDim.y * blockIdx.z;
const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

//maximum cluster is ignored
if(tId >= inCount_s || tId == inMaxIdx_s)
    return;

//get maximum cluster / read row indices to calculate column indices
Idx2D clusterMax = Idx2D(inIdxRow_g[inMaxIdx_s], 0);
clusterMax.column = getTriMatCol(inMaxIdx_s, clusterMax.row);

//get cluster of this thread
ElFloat2D cluster = ElFloat2D(inIdxRow_g[tId], 0, inValues_g[tId]);
cluster.column = getTriMatCol(tId, cluster.row);

//relative cluster, init with cluster
ElFloat2D clusterRel = ElFloat2D(cluster.row, cluster.column, cluster.value);

//0 = direct copy of cluster, no relative
//1 = cluster max will be merged left (row child), 2 = right (column child)
unsigned int copy = 0;

Page 16: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (CUDA) – find neighbor / relative cluster and save max pos

Calculate new values, references and max pos (3)

//find relative cluster
if(cluster.row == clusterMax.column)
{
    copy = 1;
    clusterRel.set(clusterMax.row, cluster.column);
}
else if(cluster.row == clusterMax.row)
{
    copy = 1;
    if(clusterMax.column > cluster.column)
        clusterRel.set(clusterMax.column, cluster.column);
    else
        clusterRel.set(cluster.column, clusterMax.column);
}
else if(cluster.column == clusterMax.column)
{
    copy = 2;
    if(cluster.row > clusterMax.row)
        clusterRel.set(cluster.row, clusterMax.row);
    else
        clusterRel.set(clusterMax.row, cluster.row);
}
else if(cluster.column == clusterMax.row)
{
    copy = 2;
    clusterRel.set(cluster.row, clusterMax.column);
}

Page 17: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (CUDA) – calculate new value and neighbor minimum index

Calculate new values, references and max pos (4)

//merge neighbors at their minimum index and calculate new value
if(copy != 0)
{
    clusterRel.value = inValues_g[getMatLinIdx(clusterRel.row, clusterRel.column)];
    cluster.row = min(cluster.row, clusterRel.row);
    cluster.column = min(cluster.column, clusterRel.column);
    cluster.value = 0.5f * (cluster.value + clusterRel.value); //average-linkage
}

//update row and column indices by reducing them by one
//if they are larger than their cluster max counterparts
if(cluster.row >= clusterMax.row)
{
    cluster.row--;

    //only direct copies can need a column decrease; merged clusters don't
    if(copy == 0 && cluster.column > clusterMax.row)
        cluster.column = max(0, cluster.column - 1);
}

//get minimum reference index
const unsigned long long int minRefIdx = min(tId, getMatLinIdx(clusterRel.row,
                                                               clusterRel.column));

Page 18: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (CUDA) – output new data

Calculate new values, references and max pos (5)

//output at minimum index
if(minRefIdx == tId)
{
    //get output linear index
    const unsigned long long int outLinIdx = getMatLinIdx(cluster.row, cluster.column);

    outValues_g[outLinIdx] = cluster.value;
    outLinIdxRef_g[outLinIdx] = minRefIdx;
    outMaxPos_g[outLinIdx] = copy;
}
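
For checking kernel results on small inputs, the per-thread steps of the preceding slides can be replayed sequentially on the host. The following plain C++ sketch is not part of the original deck: UpdateResult, linIdx and updateClustersHost are hypothetical names, and each loop iteration simply plays the role of one GPU thread under the same average-linkage rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical container for the three kernel outputs.
struct UpdateResult {
    std::vector<float> values;          // DDM_i
    std::vector<std::uint64_t> refs;    // original linear index per output entry
    std::vector<std::uint32_t> maxPos;  // 0 = copy, 1 = row child, 2 = column child
};

static std::uint64_t linIdx(std::uint64_t row, std::uint64_t col) {
    return row * (row - 1) / 2 + col;
}

// Sequential stand-in for the kernel: iteration t acts as thread t.
UpdateResult updateClustersHost(const std::vector<float>& in,
                                const std::vector<std::uint32_t>& rowIdx,
                                std::uint64_t maxIdx) {
    const std::uint64_t count = in.size();
    assert(count > 0 && rowIdx.size() == count && maxIdx < count);
    const std::uint32_t maxRow = rowIdx[maxIdx];
    const std::uint32_t maxCol = static_cast<std::uint32_t>(maxIdx - linIdx(maxRow, 0));

    // The last row index equals clusterCount - 1, the number of removed entries.
    const std::uint64_t outCount = count - rowIdx.back();
    UpdateResult out;
    out.values.assign(outCount, 0.0f);
    out.refs.assign(outCount, 0);
    out.maxPos.assign(outCount, 0);

    for (std::uint64_t t = 0; t < count; ++t) {
        if (t == maxIdx) continue;  // maximum cluster is ignored
        std::uint32_t row = rowIdx[t];
        std::uint32_t col = static_cast<std::uint32_t>(t - linIdx(row, 0));
        std::uint32_t relRow = row, relCol = col, copy = 0;

        // Find the neighbor that shares the other child of the maximum cluster.
        if (row == maxCol)      { copy = 1; relRow = maxRow; relCol = col; }
        else if (row == maxRow) { copy = 1; relRow = std::max(maxCol, col); relCol = std::min(maxCol, col); }
        else if (col == maxCol) { copy = 2; relRow = std::max(row, maxRow); relCol = std::min(row, maxRow); }
        else if (col == maxRow) { copy = 2; relRow = row; relCol = maxCol; }

        const std::uint64_t relIdx = linIdx(relRow, relCol);
        float value = in[t];
        if (copy != 0) {  // merge at the minimum index, average linkage
            value = 0.5f * (value + in[relIdx]);
            row = std::min(row, relRow);
            col = std::min(col, relCol);
        }
        if (row >= maxRow) {  // close the gap left by the removed cluster
            --row;
            if (copy == 0 && col > maxRow) --col;
        }
        if (t <= relIdx) {  // only the minimum reference index writes
            const std::uint64_t o = linIdx(row, col);
            out.values[o] = value;
            out.refs[o] = t;
            out.maxPos[o] = copy;
        }
    }
    return out;
}
```

Comparing the three returned arrays with the arrays copied back from the device is a simple way to validate a kernel run before scaling up.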

Page 19: GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm

Algorithm (Java) – update cpu cluster treenodes

Update binary CPU Cluster-Tree

//original cluster list
ArrayList<Cluster> clusters;

//new cluster list with size one triangular number smaller than size of original
ArrayList<Cluster> clustersNew = new ArrayList<Cluster>(lengthOutput);

//cluster max
Cluster clusterMax = clusters.get(clusterMaxIndex);

//new cluster values; original references, new maximum cluster position
float[] newClusterValues; long[] clusterLinIdxRefs; int[] clusterMaxPos;

//iterate over all new clusters (Java arrays and lists are indexed by int)
for(int i = 0; i < lengthOutput; i++)
{
    //the reference is stored as long, so it is narrowed for the list lookup
    Cluster cluster = clusters.get((int) clusterLinIdxRefs[i]);

    if(clusterMaxPos[i] != 0) //0 indicates direct copy
    {
        cluster.setValue(newClusterValues[i]);

        if(clusterMaxPos[i] == 1) //1 indicates cluster max will be left
            cluster.setCluster1(clusterMax);
        else //2 indicates cluster max will be right
            cluster.setCluster2(clusterMax);
    }

    clustersNew.add(cluster);
}

this.clusters = clustersNew;