High Performance Data Mining on Multi-core Systems


TRANSCRIPT

Page 1: High Performance Data Mining On Multi-core systems

HIGH PERFORMANCE DATA MINING ON MULTI-CORE SYSTEMS

Service Aggregated Linked Sequential Activities:

GOALS: The increasing number of cores is accompanied by a continued data deluge. Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand the software runtime and parallelization method. Use managed code (C#) and package algorithms as services to encourage broad use, assuming experts parallelize the core algorithms.

CURRENT RESULTS: Microsoft CCR supports MPI, dynamic threading and, via DSS, a service model of computing; detailed performance measurements. Speedups of 7.5 or above on 8-core systems for “large problems” with deterministically annealed (avoiding local minima) algorithms for clustering, Gaussian mixtures, and GTM (dimensional reduction); extending to new algorithms/applications.

SALSA Team: Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Seung-Hee Bae (Indiana University)

Technology Collaboration: George Chrysanthakopoulos, Henrik Frystyk Nielsen (Microsoft)

Application Collaboration: Cheminformatics: Rajarshi Guha, David Wild; Bioinformatics: Haixu Tang; Demographics (GIS): Neil Devadasan (Indiana University and IUPUI)

SALSA

Page 2: High Performance Data Mining On Multi-core systems

Speedup = (Number of cores) / (1 + f), where f = (Sum of overheads) / (Computation per core)

Computation ∝ grain size n × number of clusters K. The overheads are:

Synchronization: small with CCR
Load balance: good
Memory bandwidth limit: → 0 as K → ∞
Cache use/interference: important
Runtime fluctuations: dominant at large n, K

All our “real” problems have f ≤ 0.05 and speedups on 8-core systems greater than 7.6.
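As a consistency check on these numbers: substituting the bound f = 0.05 into the model above gives a speedup of 8 / (1 + 0.05) ≈ 7.6 on 8 cores, exactly the lower bound quoted.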

MPI Exchange Latency in µs (with 20–30 µs of computation between messages)

Machine                          OS      Runtime       Grains   Parallelism  MPI Latency (µs)
Intel8c:gf12                     Redhat  MPJE (Java)   Process  8            181
(8 core, 2.33 GHz, in 2 chips)           MPICH2 (C)    Process  8            40.0
                                         MPICH2: Fast  Process  8            39.3
                                         Nemesis       Process  8            4.21
Intel8c:gf20                     Fedora  MPJE          Process  8            157
(8 core, 2.33 GHz)                       mpiJava       Process  8            111
                                         MPICH2        Process  8            64.2
Intel8b                          Vista   MPJE          Process  8            170
(8 core, 2.66 GHz)               Fedora  MPJE          Process  8            142
                                 Fedora  mpiJava       Process  8            100
                                 Vista   CCR (C#)      Thread   8            20.2
AMD4                             XP      MPJE          Process  4            185
(4 core, 2.19 GHz)               Redhat  MPJE          Process  4            152
                                         mpiJava       Process  4            99.4
                                         MPICH2        Process  4            39.3
                                 XP      CCR           Thread   4            16.3
Intel4 (4 core)                  XP      CCR           Thread   4            25.8

[Plot: DA Clustering Performance. Fractional overhead f (y-axis, 0.1 to 0.4) versus 10000/grain size (x-axis, 0 to 4), for K = 10, 20, and 30 clusters]

Runtime fluctuations contribute 2% to 5% overhead.

[Diagram: “Main Thread” with memory M coordinating eight subsidiary threads t = 0 to 7, each with its own memory m_t]

Use data decomposition as in classic distributed memory, but use shared memory for read variables. Each thread uses a “local” array for written variables to get good cache performance; see the sketch below.

Parallel Programming Strategy
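A minimal C# sketch of this strategy, assuming ordinary System.Threading threads rather than CCR; the data, sizes, and names are illustrative and not the SALSA code:

```csharp
using System;
using System.Threading;

// Sketch: data decomposition over point ranges, with shared read-only inputs
// and per-thread "local" arrays for written variables (good cache behavior).
class LocalWriteSketch
{
    static double[,] points;      // shared, read-only: N points in D dimensions
    static int[] assignment;      // shared, read-only: cluster index per point

    static void Main()
    {
        int N = 100000, D = 2, K = 10, nThreads = 8;
        var rnd = new Random(42);
        points = new double[N, D];
        assignment = new int[N];
        for (int i = 0; i < N; i++)
        {
            assignment[i] = rnd.Next(K);
            for (int d = 0; d < D; d++) points[i, d] = rnd.NextDouble();
        }

        // Each thread accumulates into its own array, so no two threads
        // ever write the same cache line during the parallel phase.
        var localSums = new double[nThreads][,];
        var workers = new Thread[nThreads];
        int chunk = N / nThreads;
        for (int t = 0; t < nThreads; t++)
        {
            int id = t;  // capture the loop variable for the closure
            workers[t] = new Thread(() =>
            {
                var sums = new double[K, D];  // thread-local written variables
                int lo = id * chunk, hi = (id == nThreads - 1) ? N : lo + chunk;
                for (int i = lo; i < hi; i++)   // this thread's slice of the data
                    for (int d = 0; d < D; d++)
                        sums[assignment[i], d] += points[i, d];
                localSums[id] = sums;
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        // Sequential merge of the per-thread arrays into the global result.
        var total = new double[K, D];
        for (int t = 0; t < nThreads; t++)
            for (int k = 0; k < K; k++)
                for (int d = 0; d < D; d++)
                    total[k, d] += localSums[t][k, d];
        Console.WriteLine($"Cluster 0 sum: ({total[0, 0]:F1}, {total[0, 1]:F1})");
    }
}
```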

SALSA

Page 3: High Performance Data Mining On Multi-core systems

[Maps: Deterministic Annealing Clustering of Indiana Census Data, two views at resolution T = 0.5; cluster labels r: renters, a: Asian, h: Hispanic, p: total]

Decrease the temperature (distance scale) to discover more clusters.

GTM Projection of 2 clusters of 335 compounds in 155 dimensions

Stop Press: GTM projection of PubChem (10,926,940 compounds in a 166-dimension binary property space) takes 4 days on 8 cores; a 64×64 mesh of GTM clusters interpolates PubChem. We could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry. Bioinformatics: annealed clustering and Euclidean embedding for repetitive sequences and gene/protein families; use GTM to replace PCA in structure analysis.

[Figure: side-by-side projections labeled PCA and GTM]

Linear PCA v. nonlinear GTM on 6 Gaussians in 3D

SALSA
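Since the slide contrasts linear PCA with nonlinear GTM, here is a hedged C# sketch of the linear side only: finding the first principal axis of stretched 3-D data by power iteration on the covariance matrix. The synthetic data and all names are assumptions; GTM is used precisely because such a single linear axis cannot capture curved structure.

```csharp
using System;

// First principal component of 3-D data via power iteration on the covariance matrix.
class PcaSketch
{
    static void Main()
    {
        var rnd = new Random(7);
        int n = 1000;
        var data = new double[n, 3];
        for (int i = 0; i < n; i++)
        {
            double t = rnd.NextDouble() * 10;   // points stretched along (1,1,1)
            for (int d = 0; d < 3; d++) data[i, d] = t + 0.1 * rnd.NextDouble();
        }

        // Center the data and form the 3x3 covariance matrix.
        var mean = new double[3];
        for (int i = 0; i < n; i++)
            for (int d = 0; d < 3; d++) mean[d] += data[i, d] / n;
        var cov = new double[3, 3];
        for (int i = 0; i < n; i++)
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++)
                    cov[a, b] += (data[i, a] - mean[a]) * (data[i, b] - mean[b]) / (n - 1);

        // Power iteration: repeatedly apply cov and normalize to converge on
        // the dominant eigenvector, i.e. the first principal axis.
        var v = new double[] { 1, 0, 0 };
        for (int iter = 0; iter < 100; iter++)
        {
            var w = new double[3];
            for (int a = 0; a < 3; a++)
                for (int b = 0; b < 3; b++) w[a] += cov[a, b] * v[b];
            double norm = Math.Sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
            for (int a = 0; a < 3; a++) v[a] = w[a] / norm;
        }
        // Expect roughly (0.58, 0.58, 0.58), the normalized (1,1,1) direction.
        Console.WriteLine($"First principal axis ≈ ({v[0]:F2}, {v[1]:F2}, {v[2]:F2})");
    }
}
```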

Page 4: High Performance Data Mining On Multi-core systems

F(T) = −T ∑_{x=1}^{N} a(x) ln{ ∑_{k=1}^{K} g(k) exp[ −0.5 (E(x) − Y(k))² / (T s(k)) ] }

GENERAL FORMULA DAC GM GTM DAGTM DAGM

SALSA

N data points E(x) in D-dimensional space; minimize F by EM.

• Link of CCR and MPI (or cross-cluster CCR)
• Linear algebra for C#: multiplication, SVD, equation solve
• High-performance C# math libraries

Deterministic Annealing Clustering (DAC)

• a(x) = 1/N or generally p(x) with ∑ p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary the cluster center Y(k), but Pk and σ(k) (even the matrix Σ(k)) can be calculated using IDENTICAL formulae to the Gaussian mixtures
• K starts at 1 and is incremented by the algorithm
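To make the DAC recipe concrete, here is a toy one-dimensional C# sketch of the annealing loop using the parameter choices above (a(x) = 1/N, g(k) = 1, s(k) = 0.5, so the exponent in the general formula reduces to −(x − Y(k))²/T). The synthetic data, the 0.95 cooling factor, and the fixed K = 2 are illustrative assumptions; the real algorithm grows K as T falls.

```csharp
using System;
using System.Linq;

// Toy Deterministic Annealing Clustering (DAC) in 1-D:
// responsibilities are softmax(-(x - Y(k))^2 / T); T is lowered toward 1.
class DacSketch
{
    static void Main()
    {
        var rnd = new Random(1);
        // Synthetic data: two well-separated 1-D groups near 0.5 and 5.5.
        double[] x = Enumerable.Range(0, 200)
            .Select(i => (i < 100 ? 0.0 : 5.0) + rnd.NextDouble())
            .ToArray();
        int K = 2;
        double[] y = { 2.0, 3.0 };  // cluster centers Y(k), near-degenerate start

        for (double T = 10.0; T >= 1.0; T *= 0.95)  // cooling schedule down to T ≈ 1
        {
            for (int iter = 0; iter < 20; iter++)   // EM steps at fixed temperature
            {
                var num = new double[K];
                var den = new double[K];
                foreach (double xi in x)
                {
                    // p(k|x) ∝ exp(-(x - Y(k))^2 / T)  (softmax responsibilities)
                    double[] p = y.Select(yk => Math.Exp(-(xi - yk) * (xi - yk) / T)).ToArray();
                    double z = p.Sum();
                    for (int k = 0; k < K; k++)
                    {
                        num[k] += p[k] / z * xi;
                        den[k] += p[k] / z;
                    }
                }
                for (int k = 0; k < K; k++) y[k] = num[k] / den[k];  // M-step: new centers
            }
        }
        Console.WriteLine($"Centers: {y[0]:F2}, {y[1]:F2}");  // ≈ means of the two groups
    }
}
```

At high T every point is shared almost equally by all centers, so there is effectively one cluster; cooling sharpens the responsibilities and the centers separate, which is how annealing avoids local minima.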

Generative Topographic Mapping (GTM)

• a(x) = 1 and g(k) = (1/K)(β/2π)^(D/2)
• s(k) = 1/β and T = 1
• Y(k) = ∑_{m=1}^{M} Wm φm(X(k))
• Choose fixed φm(X) = exp(−0.5 (X − μm)² / σ²)
• Vary Wm and β, but fix the values of M and K a priori
• Y(k), E(x), and Wm are vectors in the original high-dimensional (D) space
• X(k) and μm are vectors in the 2-dim. mapped space
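A small C# sketch evaluating the GTM map Y(k) = ∑_{m=1}^{M} Wm φm(X(k)) with the Gaussian basis functions defined above. For brevity the latent space here is 1-D rather than the 2-D space the slide uses, and all sizes and weight values are hypothetical.

```csharp
using System;

// Evaluate the GTM mapping Y(k) = sum_m Wm * phi_m(X(k)) for one latent point,
// with Gaussian basis functions phi_m(X) = exp(-0.5 (X - mu_m)^2 / sigma^2).
class GtmMapSketch
{
    static void Main()
    {
        int M = 4, D = 3;                        // basis functions, data dimension
        double sigma = 1.0;
        double[] mu = { -1.5, -0.5, 0.5, 1.5 };  // basis centers in the latent space
        var rnd = new Random(5);
        var W = new double[M, D];                // weight vectors Wm (hypothetical values)
        for (int m = 0; m < M; m++)
            for (int d = 0; d < D; d++) W[m, d] = rnd.NextDouble() - 0.5;

        double X = 0.3;                          // latent coordinate X(k)
        var Y = new double[D];
        for (int m = 0; m < M; m++)
        {
            double phi = Math.Exp(-0.5 * (X - mu[m]) * (X - mu[m]) / (sigma * sigma));
            for (int d = 0; d < D; d++) Y[d] += W[m, d] * phi;  // weighted sum of bases
        }
        Console.WriteLine($"Y(k) = ({Y[0]:F3}, {Y[1]:F3}, {Y[2]:F3})");
    }
}
```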

We need: Large Windows Cluster

Deterministic Annealing Gaussian mixture models (DAGM)

• a(x) = 1
• g(k) = {Pk / (2π σ(k)²)^(D/2)}^(1/T)
• s(k) = σ(k)² (taking the case of a spherical Gaussian)
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary Y(k), Pk, and σ(k)
• K starts at 1 and is incremented by the algorithm

DAGTM: GTM has several natural annealing versions based on either DAC or DAGM; these are under investigation.

Traditional Gaussian mixture models (GM): as DAGM, but set T = 1 and fix K.

Principal Component Analysis (PCA)

Near-Term Future Work: Parallel Algorithms for

• Random Projection Metric Embedding (Bourgain); see the sketch at the end of this page
• MDS Dimensional Scaling (EM-like SMACOF)
• Marquardt Algorithm for Newton’s Method

Later: HMM and SVM; other embeddings

Parallel dimensional scaling and metric embedding; generalized cluster analysis
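As a pointer to the metric-embedding item above, here is a hedged C# sketch of a generic Gaussian random projection from the 166-dimension property space down to 20 dimensions, in the Johnson–Lindenstrauss style. The Bourgain construction the slide names instead embeds via distances to random subsets and is not shown; all sizes and names here are assumptions.

```csharp
using System;

// Gaussian random projection: distances are preserved in expectation
// because the entries of R are N(0, 1/d).
class RandomProjectionSketch
{
    static void Main()
    {
        var rnd = new Random(3);
        int D = 166, d = 20, n = 50;   // high dim, target dim, number of points
        var points = new double[n][];
        for (int i = 0; i < n; i++)
        {
            points[i] = new double[D];
            for (int j = 0; j < D; j++) points[i][j] = rnd.NextDouble();
        }

        // Random projection matrix, scaled by 1/sqrt(d).
        var R = new double[d, D];
        for (int a = 0; a < d; a++)
            for (int b = 0; b < D; b++)
                R[a, b] = Gaussian(rnd) / Math.Sqrt(d);

        // Project two points and compare distances before and after.
        double[] p0 = Project(R, points[0]), p1 = Project(R, points[1]);
        Console.WriteLine($"original distance:  {Dist(points[0], points[1]):F3}");
        Console.WriteLine($"projected distance: {Dist(p0, p1):F3}");
    }

    static double[] Project(double[,] R, double[] x)
    {
        int d = R.GetLength(0), D = R.GetLength(1);
        var y = new double[d];
        for (int a = 0; a < d; a++)
            for (int b = 0; b < D; b++) y[a] += R[a, b] * x[b];
        return y;
    }

    static double Dist(double[] u, double[] v)
    {
        double s = 0;
        for (int i = 0; i < u.Length; i++) s += (u[i] - v[i]) * (u[i] - v[i]);
        return Math.Sqrt(s);
    }

    // Box–Muller transform for a standard normal sample.
    static double Gaussian(Random r) =>
        Math.Sqrt(-2 * Math.Log(1 - r.NextDouble())) * Math.Cos(2 * Math.PI * r.NextDouble());
}
```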