High Performance Data Mining on Multi-core Systems


TRANSCRIPT

Page 1: High Performance Data Mining On Multi-core systems

HIGH PERFORMANCE DATA MINING ON MULTI-CORE SYSTEMS

Service Aggregated Linked Sequential Activities:

GOALS: The increasing number of cores is accompanied by a continued data deluge. Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand the software runtime and parallelization method. Use managed code (C#) and package algorithms as services to encourage broad use, assuming experts parallelize the core algorithms.

CURRENT RESULTS: Microsoft CCR supports MPI, dynamic threading and, via DSS, a service model of computing; detailed performance measurements. Speedups of 7.5 or above on 8-core systems for “large problems” with deterministically annealed (avoiding local minima) algorithms for clustering, Gaussian mixtures, and GTM (dimensional reduction); extending to new algorithms/applications.

SALSA Team: Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Seung-Hee Bae (Indiana University)

Technology Collaboration: George Chrysanthakopoulos, Henrik Frystyk Nielsen (Microsoft)

Application Collaboration: Cheminformatics: Rajarshi Guha, David Wild; Bioinformatics: Haixu Tang; Demographics (GIS): Neil Devadasan (Indiana University and IUPUI)

SALSA

Page 2: High Performance Data Mining On Multi-core systems

Speedup = (Number of cores) / (1 + f), where f = (Sum of overheads) / (Computation per core)

Computation ∝ grain size n × number of clusters K. The overheads are:

Synchronization: small with CCR
Load balance: good
Memory bandwidth limit: → 0 as K → ∞
Cache use/interference: important
Runtime fluctuations: dominant at large n, K

All our “real” problems have f ≤ 0.05 and speedups on 8-core systems greater than 7.6.
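As a consistency check on these numbers: substituting the bound f = 0.05 into the model above gives a speedup of 8 / (1 + 0.05) ≈ 7.6 on 8 cores, exactly the lower bound quoted.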

MPI Exchange Latency in µs (with 20–30 µs of computation between messages)

Machine                          OS      Runtime       Grains   Parallelism  MPI Latency (µs)
Intel8c:gf12                     Redhat  MPJE (Java)   Process  8            181
(8 core, 2.33 GHz, in 2 chips)           MPICH2 (C)    Process  8            40.0
                                         MPICH2: Fast  Process  8            39.3
                                         Nemesis       Process  8            4.21
Intel8c:gf20                     Fedora  MPJE          Process  8            157
(8 core, 2.33 GHz)                       mpiJava       Process  8            111
                                         MPICH2        Process  8            64.2
Intel8b                          Vista   MPJE          Process  8            170
(8 core, 2.66 GHz)               Fedora  MPJE          Process  8            142
                                 Fedora  mpiJava       Process  8            100
                                 Vista   CCR (C#)      Thread   8            20.2
AMD4                             XP      MPJE          Process  4            185
(4 core, 2.19 GHz)               Redhat  MPJE          Process  4            152
                                         mpiJava       Process  4            99.4
                                         MPICH2        Process  4            39.3
                                 XP      CCR           Thread   4            16.3
Intel4 (4 core)                  XP      CCR           Thread   4            25.8

[Plot: DA Clustering Performance. Fractional overhead f (y-axis, 0.1 to 0.4) versus 10000/grain size (x-axis, 0 to 4), for K = 10, 20, and 30 clusters]

Runtime fluctuations contribute 2% to 5% overhead.

[Diagram: “Main Thread” with memory M coordinating eight subsidiary threads t = 0 to 7, each with its own memory m_t]

Use data decomposition as in classic distributed memory, but use shared memory for read variables. Each thread uses a “local” array for written variables to get good cache performance; see the sketch below.

Parallel Programming Strategy
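A minimal C# sketch of this strategy, assuming ordinary System.Threading threads rather than CCR; the data, sizes, and names are illustrative and not the SALSA code:

```csharp
using System;
using System.Threading;

// Sketch: data decomposition over point ranges, with shared read-only inputs
// and per-thread "local" arrays for written variables (good cache behavior).
class LocalWriteSketch
{
    static double[,] points;      // shared, read-only: N points in D dimensions
    static int[] assignment;      // shared, read-only: cluster index per point

    static void Main()
    {
        int N = 100000, D = 2, K = 10, nThreads = 8;
        var rnd = new Random(42);
        points = new double[N, D];
        assignment = new int[N];
        for (int i = 0; i < N; i++)
        {
            assignment[i] = rnd.Next(K);
            for (int d = 0; d < D; d++) points[i, d] = rnd.NextDouble();
        }

        // Each thread accumulates into its own array, so no two threads
        // ever write the same cache line during the parallel phase.
        var localSums = new double[nThreads][,];
        var workers = new Thread[nThreads];
        int chunk = N / nThreads;
        for (int t = 0; t < nThreads; t++)
        {
            int id = t;  // capture the loop variable for the closure
            workers[t] = new Thread(() =>
            {
                var sums = new double[K, D];  // thread-local written variables
                int lo = id * chunk, hi = (id == nThreads - 1) ? N : lo + chunk;
                for (int i = lo; i < hi; i++)   // this thread's slice of the data
                    for (int d = 0; d < D; d++)
                        sums[assignment[i], d] += points[i, d];
                localSums[id] = sums;
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        // Sequential merge of the per-thread arrays into the global result.
        var total = new double[K, D];
        for (int t = 0; t < nThreads; t++)
            for (int k = 0; k < K; k++)
                for (int d = 0; d < D; d++)
                    total[k, d] += localSums[t][k, d];
        Console.WriteLine($"Cluster 0 sum: ({total[0, 0]:F1}, {total[0, 1]:F1})");
    }
}
```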

SALSA

Page 3: High Performance Data Mining On Multi-core systems

[Maps: Deterministic Annealing Clustering of Indiana Census Data, two views at resolution T = 0.5; cluster labels r: renters, a: Asian, h: Hispanic, p: total]

Decrease the temperature (distance scale) to discover more clusters.

GTM Projection of 2 clusters of 335 compounds in 155 dimensions

Stop Press: GTM projection of PubChem (10,926,940 compounds in a 166-dimension binary property space) takes 4 days on 8 cores; a 64×64 mesh of GTM clusters interpolates PubChem. We could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry. Bioinformatics: annealed clustering and Euclidean embedding for repetitive sequences and gene/protein families; use GTM to replace PCA in structure analysis.

[Figure: side-by-side projections labeled PCA and GTM]

Linear PCA v. nonlinear GTM on 6 Gaussians in 3D

SALSA
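Since the slide contrasts linear PCA with nonlinear GTM, here is a hedged C# sketch of the linear side only: finding the first principal axis of stretched 3-D data by power iteration on the covariance matrix. The synthetic data and all names are assumptions; GTM is used precisely because such a single linear axis cannot capture curved structure.

```csharp
using System;

// First principal component of 3-D data via power iteration on the covariance matrix.
class PcaSketch
{
    static void Main()
    {
        var rnd = new Random(7);
        int n = 1000;
        var data = new double[n, 3];
        for (int i = 0; i < n; i++)
        {
            double t = rnd.NextDouble() * 10;   // points stretched along (1,1,1)
            for (int d = 0; d < 3; d++) data[i, d] = t + 0.1 * rnd.NextDouble();
        }

        // Center the data and form the 3x3 covariance matrix.
        var mean = new double[3];
        for (int i = 0; i < n; i++)
            for (int d = 0; d < 3; d++) mean[d] += data[i, d] / n;
        var cov = new double[3, 3];
        for (int i = 0; i < n; i++)
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++)
                    cov[a, b] += (data[i, a] - mean[a]) * (data[i, b] - mean[b]) / (n - 1);

        // Power iteration: repeatedly apply cov and normalize to converge on
        // the dominant eigenvector, i.e. the first principal axis.
        var v = new double[] { 1, 0, 0 };
        for (int iter = 0; iter < 100; iter++)
        {
            var w = new double[3];
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++) w[a] += cov[a, b] * v[b];
            double norm = Math.Sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
            for (int a = 0; a < 3; a++) v[a] = w[a] / norm;
        }
        // Expect roughly (0.58, 0.58, 0.58), the normalized (1,1,1) direction.
        Console.WriteLine($"First principal axis ≈ ({v[0]:F2}, {v[1]:F2}, {v[2]:F2})");
    }
}
```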

Page 4: High Performance Data Mining On Multi-core systems

F(T) = −T ∑_{x=1}^{N} a(x) ln{ ∑_{k=1}^{K} g(k) exp[ −0.5 (E(x) − Y(k))² / (T s(k)) ] }

GENERAL FORMULA DAC GM GTM DAGTM DAGM

SALSA

N data points E(x) in D-dimensional space; minimize F by EM.

• Link of CCR and MPI (or cross-cluster CCR)
• Linear algebra for C#: multiplication, SVD, equation solve
• High-performance C# math libraries

Deterministic Annealing Clustering (DAC)

• a(x) = 1/N or generally p(x) with ∑ p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary the cluster center Y(k), but Pk and σ(k) (even the matrix Σ(k)) can be calculated using IDENTICAL formulae to the Gaussian mixtures
• K starts at 1 and is incremented by the algorithm
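To make the DAC recipe concrete, here is a toy one-dimensional C# sketch of the annealing loop using the parameter choices above (a(x) = 1/N, g(k) = 1, s(k) = 0.5, so the exponent in the general formula reduces to −(x − Y(k))²/T). The synthetic data, the 0.95 cooling factor, and the fixed K = 2 are illustrative assumptions; the real algorithm grows K as T falls.

```csharp
using System;
using System.Linq;

// Toy Deterministic Annealing Clustering (DAC) in 1-D:
// responsibilities are softmax(-(x - Y(k))^2 / T); T is lowered toward 1.
class DacSketch
{
    static void Main()
    {
        var rnd = new Random(1);
        // Synthetic data: two well-separated 1-D groups near 0.5 and 5.5.
        double[] x = Enumerable.Range(0, 200)
            .Select(i => (i < 100 ? 0.0 : 5.0) + rnd.NextDouble())
            .ToArray();
        int K = 2;
        double[] y = { 2.0, 3.0 };  // cluster centers Y(k), near-degenerate start

        for (double T = 10.0; T >= 1.0; T *= 0.95)  // cooling schedule down to T ≈ 1
        {
            for (int iter = 0; iter < 20; iter++)   // EM steps at fixed temperature
            {
                var num = new double[K];
                var den = new double[K];
                foreach (double xi in x)
                {
                    // p(k|x) ∝ exp(-(x - Y(k))^2 / T)  (softmax responsibilities)
                    double[] p = y.Select(yk => Math.Exp(-(xi - yk) * (xi - yk) / T)).ToArray();
                    double z = p.Sum();
                    for (int k = 0; k < K; k++)
                    {
                        num[k] += p[k] / z * xi;
                        den[k] += p[k] / z;
                    }
                }
                for (int k = 0; k < K; k++) y[k] = num[k] / den[k];  // M-step: new centers
            }
        }
        Console.WriteLine($"Centers: {y[0]:F2}, {y[1]:F2}");  // ≈ means of the two groups
    }
}
```

At high T every point is shared almost equally by all centers, so there is effectively one cluster; cooling sharpens the responsibilities and the centers separate, which is how annealing avoids local minima.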

Generative Topographic Mapping (GTM)

• a(x) = 1 and g(k) = (1/K)(β/2π)^(D/2)
• s(k) = 1/β and T = 1
• Y(k) = ∑_{m=1}^{M} Wm φm(X(k))
• Choose fixed φm(X) = exp(−0.5 (X − μm)² / σ²)
• Vary Wm and β, but fix the values of M and K a priori
• Y(k), E(x), and Wm are vectors in the original high-dimensional (D) space
• X(k) and μm are vectors in the 2-dim. mapped space
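A small C# sketch evaluating the GTM map Y(k) = ∑_{m=1}^{M} Wm φm(X(k)) with the Gaussian basis functions defined above. For brevity the latent space here is 1-D rather than the 2-D space the slide uses, and all sizes and weight values are hypothetical.

```csharp
using System;

// Evaluate the GTM mapping Y(k) = sum_m Wm * phi_m(X(k)) for one latent point,
// with Gaussian basis functions phi_m(X) = exp(-0.5 (X - mu_m)^2 / sigma^2).
class GtmMapSketch
{
    static void Main()
    {
        int M = 4, D = 3;                        // basis functions, data dimension
        double sigma = 1.0;
        double[] mu = { -1.5, -0.5, 0.5, 1.5 };  // basis centers in the latent space
        var rnd = new Random(5);
        var W = new double[M, D];                // weight vectors Wm (hypothetical values)
        for (int m = 0; m < M; m++)
            for (int d = 0; d < D; d++) W[m, d] = rnd.NextDouble() - 0.5;

        double X = 0.3;                          // latent coordinate X(k)
        var Y = new double[D];
        for (int m = 0; m < M; m++)
        {
            double phi = Math.Exp(-0.5 * (X - mu[m]) * (X - mu[m]) / (sigma * sigma));
            for (int d = 0; d < D; d++) Y[d] += W[m, d] * phi;  // weighted sum of bases
        }
        Console.WriteLine($"Y(k) = ({Y[0]:F3}, {Y[1]:F3}, {Y[2]:F3})");
    }
}
```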

We need: Large Windows Cluster

Deterministic Annealing Gaussian mixture models (DAGM)

• a(x) = 1
• g(k) = {Pk / (2π σ(k)²)^(D/2)}^(1/T)
• s(k) = σ(k)² (taking the case of a spherical Gaussian)
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary Y(k), Pk, and σ(k)
• K starts at 1 and is incremented by the algorithm

DAGTM: GTM has several natural annealing versions based on either DAC or DAGM; these are under investigation.

Traditional Gaussian mixture models (GM): as DAGM, but set T = 1 and fix K.

Principal Component Analysis (PCA)

Near-Term Future Work: Parallel Algorithms for

• Random Projection Metric Embedding (Bourgain); see the sketch at the end of this page
• MDS Dimensional Scaling (EM-like SMACOF)
• Marquardt Algorithm for Newton’s Method

Later: HMM and SVM; other embeddings

Parallel dimensional scaling and metric embedding; generalized cluster analysis
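As a pointer to the metric-embedding item above, here is a hedged C# sketch of a generic Gaussian random projection from the 166-dimension property space down to 20 dimensions, in the Johnson–Lindenstrauss style. The Bourgain construction the slide names instead embeds via distances to random subsets and is not shown; all sizes and names here are assumptions.

```csharp
using System;

// Gaussian random projection: distances are preserved in expectation
// because the entries of R are N(0, 1/d).
class RandomProjectionSketch
{
    static void Main()
    {
        var rnd = new Random(3);
        int D = 166, d = 20, n = 50;   // high dim, target dim, number of points
        var points = new double[n][];
        for (int i = 0; i < n; i++)
        {
            points[i] = new double[D];
            for (int j = 0; j < D; j++) points[i][j] = rnd.NextDouble();
        }

        // Random projection matrix, scaled by 1/sqrt(d).
        var R = new double[d, D];
        for (int a = 0; a < d; a++)
            for (int b = 0; b < D; b++)
                R[a, b] = Gaussian(rnd) / Math.Sqrt(d);

        // Project two points and compare distances before and after.
        double[] p0 = Project(R, points[0]), p1 = Project(R, points[1]);
        Console.WriteLine($"original distance:  {Dist(points[0], points[1]):F3}");
        Console.WriteLine($"projected distance: {Dist(p0, p1):F3}");
    }

    static double[] Project(double[,] R, double[] x)
    {
        int d = R.GetLength(0), D = R.GetLength(1);
        var y = new double[d];
        for (int a = 0; a < d; a++)
            for (int b = 0; b < D; b++) y[a] += R[a, b] * x[b];
        return y;
    }

    static double Dist(double[] u, double[] v)
    {
        double s = 0;
        for (int i = 0; i < u.Length; i++) s += (u[i] - v[i]) * (u[i] - v[i]);
        return Math.Sqrt(s);
    }

    // Box–Muller transform for a standard normal sample.
    static double Gaussian(Random r) =>
        Math.Sqrt(-2 * Math.Log(1 - r.NextDouble())) * Math.Cos(2 * Math.PI * r.NextDouble());
}
```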