
glimcher@cse.ohio-state.edu

Computer Science and Engineering

IPDPS’07 Predicting Performance for Grid-Based Datamining

A Performance Prediction Framework for Grid-Based Data Mining Applications

Leonid Glimcher

Gagan Agrawal


Motivating Scenario

Data Repository Clusters

Compute Clusters

User?

3 stages:
• Disk I/O
• Network
• Compute


Remote Data Analysis

• Remote data analysis
  – Grid is a good fit
  – Details can be very tedious
• Middleware abstracts away lots of development details
• Resource selection: crucial to performance
• Performance prediction facilitates resource selection


Presentation Road Map

• Problem statement and motivation
• Middleware background
• Our performance prediction approach
• Experimental evaluation
• Related work
• Conclusions


Problem Statement

Given:
• Parallel data processing application
• Execution time breakdown (profile)
• Configurations of available computing resources
• Dataset replicas in repositories of different sizes

Predict application execution time in order to select the right dataset replica and resource configuration
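The selection step implied by the problem statement can be sketched as picking the candidate with the smallest predicted time. This is only an illustration: the candidate tuples, the `select_configuration` helper, and the toy predictor are all hypothetical and stand in for whatever prediction model is actually used.

```python
def select_configuration(candidates, predict_time):
    """Return the candidate with the smallest predicted execution time."""
    return min(candidates, key=predict_time)


# Hypothetical candidates: (replica name, data nodes, compute nodes).
candidates = [
    ("replica-A", 2, 4),
    ("replica-B", 4, 4),
    ("replica-B", 4, 16),
]


def toy_predict(cfg):
    """Toy stand-in predictor (an assumption for illustration only):
    retrieval cost shrinks with data nodes, processing cost with compute nodes."""
    _, data_nodes, compute_nodes = cfg
    return 100.0 / data_nodes + 400.0 / compute_nodes


best = select_configuration(candidates, toy_predict)
print(best)
```

Any predictor with the same call shape can be swapped in for `toy_predict` without changing the selection logic.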


FREERIDE-G Design


FREERIDE-G Processing

Key observation: most data mining algorithms follow a canonical loop

Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator

While ( ) {
    forall (data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    ...
}
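The canonical loop can be illustrated with a small sketch of one pass: each node folds its data into a reduction object, and the per-node objects are then merged. This is not the FREERIDE-G API; the function names and the 1-D k-means style example are assumptions chosen only to show the local and global reduction pattern.

```python
from collections import defaultdict


def local_reduction(chunk, process, op, init):
    """One pass over a node's chunk: R[I] = op(R[I], d) with I = process(d)."""
    R = defaultdict(init)
    for d in chunk:
        i = process(d)
        R[i] = op(R[i], d)
    return R


def global_reduction(partials, combine, init):
    """Merge the per-node reduction objects into a single one."""
    merged = defaultdict(init)
    for R in partials:
        for i, value in R.items():
            merged[i] = combine(merged[i], value)
    return merged


# Hypothetical example: one k-means step on 1-D points with two centroids.
centroids = [0.0, 10.0]


def nearest(d):
    return min(range(len(centroids)), key=lambda k: abs(d - centroids[k]))


def accumulate(acc, d):          # reduction object entry is (sum, count)
    return (acc[0] + d, acc[1] + 1)


def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])


def init():
    return (0.0, 0)


chunks = [[1.0, 2.0, 9.0], [11.0, 3.0]]   # data split across two nodes
partials = [local_reduction(c, nearest, accumulate, init) for c in chunks]
merged = global_reduction(partials, combine, init)
new_centroids = [merged[k][0] / merged[k][1] for k in range(len(centroids))]
print(new_centroids)
```

An outer loop would repeat this pass with the updated centroids until some convergence test holds; that test is left out here because the slide does not specify it.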


Performance Prediction Approach

• 3 phases of execution:
  – Retrieval at data server
  – Data delivery to compute node
  – Parallel processing at compute node
• Special processing structure:
  – Generalized reduction

$T_{exec} = T_{disk} + T_{network} + T_{compute}$


Needed profile information

• Number of storage nodes (n) and compute nodes (c)
• Available bandwidth between them (b) in the profile configuration
• Execution time breakdown: data retrieval (td), network communication (tn), and data processing (tc) components
• Dataset size (s)
• Reduction object information: maximum size, communication time
• Global reduction time


Data Retrieval and Communication Time

Data Retrieval:

Dataset size (s) and number of data hosts (n) for base profile and predicted configuration (s’ and n’).

Used to scale td.

Data Communication:

Also need dataset size and number of data hosts, as well as bandwidth (b and b’).

Used to scale tn.

$T_{network} = \frac{s'}{s} \cdot \frac{n}{n'} \cdot \frac{b}{b'} \cdot t_n$
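The scaling described on this slide can be sketched in a few lines. The sketch assumes both times grow linearly with dataset size and shrink with the number of data hosts, and that network time also shrinks with bandwidth; the function names, parameter names, and the sample profile numbers are hypothetical.

```python
def predict_disk_time(t_d, s, n, s_new, n_new):
    """Scale profiled retrieval time t_d by dataset size and data-host count."""
    return (s_new / s) * (n / n_new) * t_d


def predict_network_time(t_n, s, n, b, s_new, n_new, b_new):
    """Scale profiled network time t_n by size, data-host count, and bandwidth."""
    return (s_new / s) * (n / n_new) * (b / b_new) * t_n


# Hypothetical base profile: 1 GB on 1 data host at 500 Kbps,
# with 100 s of retrieval and 40 s of network time.
t_disk = predict_disk_time(100.0, s=1.0, n=1, s_new=2.0, n_new=4)
t_net = predict_network_time(40.0, s=1.0, n=1, b=500.0,
                             s_new=2.0, n_new=4, b_new=250.0)
print(t_disk, t_net)
```

With twice the data, four times the data hosts, and half the bandwidth, retrieval time halves while network time is unchanged, because the bandwidth loss cancels the gain from the extra hosts.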


Initial Data Processing Time Prediction

Dataset size (s) and number of compute nodes (c):

• Base profile: (s, c)
• Predicted profile: (s’, c’)

Used to scale up tc:

$T_{compute} = \frac{s'}{s} \cdot \frac{c}{c'} \cdot t_c$

Limitations (not modeled):
• Inter-processor communication time
• Global reduction time
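As a quick illustration of this initial estimate, the sketch below scales the profiled processing time linearly with dataset size and inversely with the number of compute nodes. The function name and the numbers are hypothetical, and the estimate deliberately ignores the communication and global reduction costs listed as limitations.

```python
def predict_compute_time_initial(t_c, s, c, s_new, c_new):
    """Initial estimate: all of t_c is assumed to parallelize perfectly."""
    return (s_new / s) * (c / c_new) * t_c


# Hypothetical base profile: 80 s of processing for 1 GB on 1 compute node.
estimate = predict_compute_time_initial(80.0, s=1.0, c=1, s_new=4.0, c_new=8)
print(estimate)
```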


Modeling Interprocessor Communication

• Parallel computation involves communication of the reduction object
• Communication time ($T_{ro}$)
• Reduction object size (r)
• Interprocessor bandwidth (w)
• Latency (l)
• Reduction object size either remains constant or scales linearly

$T'_c = t_c - T_{ro}$

$T_{ro} = w \cdot r + l$

$T_{compute} = \frac{s'}{s} \cdot \frac{c}{c'} \cdot T'_c + \hat{T}_{ro}$
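A minimal sketch of this refinement follows: the reduction-object communication time is taken out of the profiled processing time, only the remainder is scaled, and a communication term for the target configuration is added back. The names are hypothetical. The communication term is written as w times r plus l, so w is treated here as time per unit of data, which is an assumption about units; the caller supplies the target reduction object size, since it may stay constant or grow.

```python
def reduction_comm_time(r, w, l):
    """Communication time for a reduction object of size r (form: w * r + l)."""
    return w * r + l


def predict_compute_time(t_c, s, c, s_new, c_new, r, r_new, w, l):
    """Scale only the parallel part of t_c, then add the predicted communication."""
    t_ro = reduction_comm_time(r, w, l)          # cost in the base profile
    t_c_parallel = t_c - t_ro                    # part that scales with s and c
    t_ro_new = reduction_comm_time(r_new, w, l)  # cost in the target setup
    return (s_new / s) * (c / c_new) * t_c_parallel + t_ro_new


# Hypothetical numbers: 80 s processing, reduction object of 4 units,
# 0.5 s per unit, 1 s latency, object size unchanged in the target setup.
estimate = predict_compute_time(80.0, s=1.0, c=1, s_new=2.0, c_new=4,
                                r=4.0, r_new=4.0, w=0.5, l=1.0)
print(estimate)
```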


Modeling Global Reduction

• Global reduction time (Tg) is also serialized

• Depending on application, global reduction time:

– Scales linearly with number of nodes but is constant independent of size

– Stays constant independent of number of nodes, but scales linearly with data size

$T''_c = t_c - T_{ro} - T_g$

$T_{compute} = \frac{s'}{s} \cdot \frac{c}{c'} \cdot T''_c + \hat{T}_{ro} + \hat{T}_g$
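The two global reduction cases can be sketched as follows. The serialized parts (reduction-object communication and global reduction) are subtracted from the profiled processing time, the remainder is scaled, and predicted serialized costs are added back. The function names, the `scales_with` switch, and the numbers are hypothetical.

```python
def scale_global_reduction(t_g, s, c, s_new, c_new, scales_with):
    """Predict global reduction time for one of the two cases on the slide."""
    if scales_with == "nodes":   # linear in node count, independent of size
        return t_g * (c_new / c)
    if scales_with == "data":    # linear in data size, independent of nodes
        return t_g * (s_new / s)
    raise ValueError("scales_with must be 'nodes' or 'data'")


def predict_compute_time(t_c, t_ro, t_g, s, c, s_new, c_new,
                         t_ro_new, scales_with):
    """Scale the parallel part of t_c and add the serialized components."""
    t_c_parallel = t_c - t_ro - t_g
    t_g_new = scale_global_reduction(t_g, s, c, s_new, c_new, scales_with)
    return (s_new / s) * (c / c_new) * t_c_parallel + t_ro_new + t_g_new


# Hypothetical profile: 80 s processing, of which 3 s is reduction-object
# communication and 2 s is global reduction, measured on 2 compute nodes.
by_nodes = predict_compute_time(80.0, 3.0, 2.0, s=1.0, c=2, s_new=2.0, c_new=8,
                                t_ro_new=3.0, scales_with="nodes")
by_data = predict_compute_time(80.0, 3.0, 2.0, s=1.0, c=2, s_new=2.0, c_new=8,
                               t_ro_new=3.0, scales_with="data")
print(by_nodes, by_data)
```

Which case applies depends on the application, so the choice is passed in rather than inferred.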


Modeling Across Heterogeneous Clusters

Need scaling factors for all 3 stages of computation (from a set of representative applications).

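$s_d = \frac{1}{3}\left(\frac{T^{(1)}_{disk,B}}{T^{(1)}_{disk,A}} + \frac{T^{(2)}_{disk,B}}{T^{(2)}_{disk,A}} + \frac{T^{(3)}_{disk,B}}{T^{(3)}_{disk,A}}\right)$

$\hat{T}_{exec,B} = s_d \cdot \hat{T}_{disk,A} + s_n \cdot \hat{T}_{network,A} + s_c \cdot \hat{T}_{compute,A}$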
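The cross-cluster step can be sketched as averaging per-application time ratios between the two clusters and applying one factor per stage. The sketch assumes each factor is the mean of cluster-B time over cluster-A time for the representative applications; the function names and timings are hypothetical.

```python
def scaling_factor(times_a, times_b):
    """Mean ratio of cluster-B time to cluster-A time over representative apps."""
    ratios = [tb / ta for ta, tb in zip(times_a, times_b)]
    return sum(ratios) / len(ratios)


def predict_exec_on_b(t_disk_a, t_network_a, t_compute_a, s_d, s_n, s_c):
    """Apply one scaling factor per stage to the cluster-A predictions."""
    return s_d * t_disk_a + s_n * t_network_a + s_c * t_compute_a


# Hypothetical disk times for three representative applications.
s_d = scaling_factor([10.0, 20.0, 40.0], [5.0, 10.0, 20.0])

# Hypothetical network and compute factors, and cluster-A stage predictions.
t_exec_b = predict_exec_on_b(100.0, 40.0, 80.0, s_d=s_d, s_n=1.0, s_c=0.25)
print(s_d, t_exec_b)
```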


FREERIDE-G Applications

Data mining:
• K-means clustering
• KNN search
• EM clustering

Scientific data processing:
• Vortex extraction (right)
• Molecular defect detection and categorization


Experimental Setup

Base: 700 MHz Pentiums connected through Myrinet LANai 7.0

Heterogeneous prediction: 2.4 GHz Opteron 250s connected through InfiniBand (1 Gb)

Goal: to correctly model changes in:
1. Parallel configuration
2. Dataset size
3. Network bandwidth
4. Underlying resources

$Error = \frac{|T_{exact} - T_{predicted}|}{T_{exact}}$
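The error metric is the absolute difference between measured and predicted time, relative to the measured time. A one-function sketch, with a hypothetical function name and sample values:

```python
def relative_error(t_exact, t_predicted):
    """Relative prediction error: |T_exact - T_predicted| / T_exact."""
    return abs(t_exact - t_predicted) / t_exact


# Hypothetical run: measured 200 s, predicted 190 s.
err = relative_error(200.0, 190.0)
print(f"{err:.2%}")
```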


Modeling Parallel Performance

Errors for 3 approaches for:

1. Vortex detection, base:
   • 1-1 configuration
   • 710 MB dataset
2. Defect detection, base:
   • 1-1 configuration
   • 130 MB dataset

Results:
• modeling reduction pays off
• accurate predictions

[Figure: Vortex Detection (base: 1-1 configuration, 710 MB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: no communication, reduction communication, global reduction.]

[Figure: Molecular Defect Detection (base: 1-1 configuration, 130 MB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: no communication, reduction communication, global reduction.]


Modeling Dataset Size

[Figure: EM clustering (base: 1-1 configuration, 350 MB dataset; predicted: 1.4 GB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]

[Figure: Molecular Defect Detection (base: 1-1 configuration, 130 MB dataset; predicting: 1.8 GB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]

Errors for 1 (best) approach for:
1. EM clustering (1.4 GB), base:
   • 1-1 configuration
   • 350 MB dataset
2. Defect detection (1.8 GB), base:
   • 1-1 configuration
   • 130 MB dataset

Results:
• biggest error when the number of data nodes is the same as the number of compute nodes
• accurate predictions


Impact of Network Bandwidth

Errors for 1 (best) approach for:
1. EM clustering (250 Kbps), base:
   • 1-1 configuration
   • 500 Kbps
2. Defect detection (250 Kbps), base:
   • 1-1 configuration
   • 500 Kbps

Results:
• biggest error when the number of data nodes is the same as the number of compute nodes
• modeling reduction is most accurate

[Figure: EM clustering (base: 4-4 configuration, 1.4 GB dataset; predicting: 130 MB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]

[Figure: Molecular Defect Detection (base: 4-4 configuration, 1.8 GB dataset; predicting: 350 dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]


Predictions for different type of cluster

Errors for 1 (best) approach for:
1. Defect detection (1.8 GB), base:
   • 1-1 configuration
   • 710 MB dataset
2. EM clustering (700 MB), base:
   • 8-8 configuration
   • 350 MB dataset

Results:
• Scaling factors different
• Largest error when the predicted configuration has the same number of compute nodes as the base

[Figure: Molecular Defect Detection (base: 4-4 configuration, 130 MB dataset; prediction: 1.8 GB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]

[Figure: EM clustering (base: 8-8 configuration, 350 MB dataset; prediction: 700 MB dataset). Relative prediction error (%) by number of compute nodes (cn), grouped by number of data nodes (1, 2, 4, 8). Series: global reduction.]


Existing Work

3 broad categories for resource allocation:
• Heuristic approach to mapping
• Prediction through modeling:
  – Statistical estimation/prediction
  – Analytical modeling of parallel application
• Simulation-based performance prediction


Summary

• Performance prediction approach
• Exploits similarities in application processing structure to come up with very accurate results
• Approach accurately models changes in:
  – Computing configuration
  – Dataset size
  – Network bandwidth
  – Underlying compute resources