Post on 01-Oct-2020


Bandwidth-Aware Scheduling for Clustered Multi-Core Systems

Panayiotis Petrides (2), Frederico Pratas (1), Pedro Trancoso (2) and Leonel Sousa (1)

(1) SiPS Group, IST, Portugal
(2) CASPER Group, UCY, Cyprus

technology from seed

Motivation

Outline

• Target Architectures

• Overall Scheduling/Mapping Method

• Bandwidth and Execution Time Profiling

• Bandwidth-Aware Scheduler (BAS)

• Experimental Results

• Conclusions and Future Work

Target Architectures: Clustered Multi-Core

• Pros:
  – Integration: reduces memory latency
  – Multiple controllers: larger memory bandwidth
• Cons:
  – Die area
  – Difficult memory access management
• Bandwidth-Aware Scheduler: balance memory requests among the different controllers
• Existing architectures:
  – Intel Single-chip Cloud Computer
  – Intel Nehalem (approximately the same problem)

Target Architectures: hardware and tools

• Binary instrumentation
  – PIN + cache simulation extension
  – Single-core execution
• Bandwidth
  – The bandwidth is calculated independently for each application using:
    » Type of memory accesses
    » Number of memory accesses
    » Average execution time
• Architectural model

Characteristics          SCC-like architecture
Core type                Intel Core2Duo @ 2.83 GHz
L1 cache                 4-way, 32 KB data, 32 KB instructions
L2 cache                 8-way, 2 MB unified
Cache policies           Write-back, write-allocate, no support for coherency
Cluster configurations   32 (8x4), 64 (16x4), 128 (32x4)
# Controllers            4
Main memory              DDR3-800 (6.4 GB/s)
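The bandwidth calculation listed above (type of accesses, number of accesses, execution time) can be sketched as follows. This is only an illustration: the slides do not give the formula, so the sketch assumes that every counted access reaches main memory and transfers one cache line, and the 64-byte line size, the function name and the counts are all hypothetical.

```python
# Assumption: each counted access goes off-chip and moves one cache line.
# The 64-byte line size is illustrative and does not come from the slides.
CACHE_LINE_BYTES = 64


def estimate_bandwidth(reads, writebacks, exec_time_s, line_bytes=CACHE_LINE_BYTES):
    """Return the average off-chip bandwidth of one application in bytes/s."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    # Both access types (line fills and write-backs) consume memory bandwidth.
    return (reads + writebacks) * line_bytes / exec_time_s


# Hypothetical counts from one single-core instrumented run.
bw = estimate_bandwidth(reads=40_000_000, writebacks=10_000_000, exec_time_s=2.0)
print(f"{bw / 1e9:.2f} GB/s")
```

The result can be compared directly with the 6.4 GB/s peak of one DDR3-800 controller from the architectural model.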

Overall Scheduling Method

Static

• Memory bandwidth profiling of representative applications from different areas.

• Classification according to the bandwidth requirements

Dynamic

• Dynamic bandwidth sensing

• Use static information to classify applications for run-time scheduling

• Rebalance memory accesses according to the classification

• Distribute (schedule) applications across the multi-core clusters to overcome the “memory wall”
  – Main assumption: all cores are busy

Experimental Setup: Set of Applications

Name       Description                                                             Tests
TPC-H      Decision Support benchmark                                              16 queries
DEX        Graph-based database query application                                  8 queries
MrBayes    Bioinformatics application performing Bayesian inference of phylogeny   17 DNA data sets
Biobench   Benchmark suite containing different bioinformatics algorithms          phylip_protdist, phylip_protpars, fasta_dna, fasta_protein, and hmmer
NAMD       Computer chemistry application for molecular dynamics simulation        single precision, double precision
PARSEC     Benchmark of representative applications from different areas           Blackscholes, streamcluster, freqmine
Total                                                                              51 different workloads

Bandwidth and Execution Time Profiling: Classification

• Dimensions considered for characterizing applications
  – Execution time
  – Bandwidth requirements
• Classification used for each application:

Execution time \ Bandwidth   Low (<0.5*AV)   Medium (≥0.5*AV and <1.5*AV)   High (≥1.5*AV)
Short (<10s)                 Short-Low       Short-Medium                   Short-High
Medium (≥10s and <100s)      Medium-Low      Medium-Medium                  Medium-High
Long (≥100s)                 Long-Low        Long-Medium                    Long-High

• Bandwidth calculation:
  – The execution is chopped into several phases (quanta).
  – The bandwidth is calculated for each phase.
  – The application bandwidth is the average over all of its phases.
  – Phase: the smallest period of time considered between two scheduling actions.
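The classification grid and the per-phase averaging can be written down as a small sketch. The thresholds come from the table; the function names are hypothetical, and the code assumes that AV is the average bandwidth over all profiled applications, which the slides do not state explicitly.

```python
def average_bandwidth(phase_bandwidths):
    """Application bandwidth = mean of the per-phase bandwidths."""
    return sum(phase_bandwidths) / len(phase_bandwidths)


def bandwidth_class(bw, av):
    """Thresholds from the slide; `av` is assumed to be the overall average."""
    if bw < 0.5 * av:
        return "Low"
    if bw < 1.5 * av:
        return "Medium"
    return "High"


def time_class(exec_time_s):
    """Execution-time thresholds from the slide (10 s and 100 s)."""
    if exec_time_s < 10:
        return "Short"
    if exec_time_s < 100:
        return "Medium"
    return "Long"


def classify(exec_time_s, phase_bandwidths, av):
    """Return a label such as 'Medium-High' (time class, then bandwidth class)."""
    bw = average_bandwidth(phase_bandwidths)
    return f"{time_class(exec_time_s)}-{bandwidth_class(bw, av)}"


# Hypothetical application: 42 s long, three phases, bandwidth in GB/s.
print(classify(42.0, [1.0, 3.0, 2.0], av=1.0))
```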

Bandwidth and Execution Time Profiling: Classification (cont.)

• Selection of one representative application per class:
  – Calculate the center of each class
  – Select the application that is nearest to the center
• Nine representative applications were selected.

Short - Low | Short - Medium | Short - High
tpch Q3, tpch Q6, tpch Q7, tpch Q8, tpch Q12, tpch Q13, tpch Q16, dex 3 Q4, dex 3 Q8, dex 3 All, dex 4 Q4, dex 4 Q8, dex 5 Q8, dex 5 All, namd single
tpch Q10, tpch Q14 tpch Q15 tpch Q11

Medium - Low | Medium - Medium | Medium - High
dex 4 All, dex 5 Q4, namd double, freqmine
tpch Q1, tpch Q2 tpch Q9 , streamcluster,
mrbayes 10x5000, 10x20000, 20x5000, 50x1000, 50x1000, 50x5000, 100x1000

Long - Low | Long - Medium | Long - High
phylip protdist, fasta dna, fasta protein hmmer, phylip protpars
mrbayes 10x50000, 20x20000, 20x50000, 50x20000, 100x5000, 100x20000, 100x50000
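The two selection steps (compute the class center, pick the nearest application) can be sketched as below. The slides do not say which distance or normalization was used, so plain Euclidean distance on raw (execution time, bandwidth) pairs is an assumption, and the application names and values are hypothetical.

```python
import math


def class_center(points):
    """Centroid of a class; `points` maps app name -> (exec_time_s, bandwidth)."""
    times = [p[0] for p in points.values()]
    bws = [p[1] for p in points.values()]
    return sum(times) / len(times), sum(bws) / len(bws)


def representative(points):
    """Return the application closest to the class center (Euclidean distance)."""
    ct, cb = class_center(points)
    return min(points, key=lambda name: math.hypot(points[name][0] - ct,
                                                   points[name][1] - cb))


# Hypothetical members of one class: (execution time in s, bandwidth in GB/s).
short_low = {"app_a": (2.0, 0.1), "app_b": (4.0, 0.3), "app_c": (9.0, 0.2)}
print(representative(short_low))
```

Because the two axes have different units, a real implementation would probably normalize each dimension first; that step is left out here to keep the sketch short.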

Scheduler: Policies Evaluated

• Random Static Scheduler
  Agnostic policy representing a common scenario in which applications are mapped to cores according to resource availability.

• Oracle Scheduler
  A policy that uses a priori knowledge of the overall application bandwidth characteristics to define the best static placement of the different applications across the chip.

• Proposed Bandwidth-Aware Scheduler (BAS)
  Proposed policy that takes into account the different demands of the applications at run time, in order to satisfy those demands and utilize the system's bandwidth throughout their execution.

BAS: Distributing Applications to Cores

• Different distribution scenarios
  – Variation of the number of cores per cluster
  – Variation of the distribution of applications:
    » Per cluster
    » Overall
• Only considered:
  – Distribution of applications within the same time category.
  – Reduce the exploration space to some representative distributions

BAS: Considered Inner-cluster Distributions

• Considered distributions inside one cluster:

Bandwidth
Low      100%   50%   50%   50%   33%   25%    0%   25%    0%    0%
Medium     0%   50%   25%    0%   33%   50%  100%   25%   50%    0%
High       0%    0%   25%   50%   33%   25%    0%   50%   50%  100%

[Figure legend: bandwidth class shares written as Low:Medium:High, e.g. 00:00:100, 100:00:00, 50:50:00, 25:25:50]

BAS: Inner-cluster Distributions (cont.)

• Increasing the number of cores per cluster linearly increases the number of extra phases
• The scalability of multi-core processors is highly dependent on the off-chip memory bandwidth

[Figure panels: Short applications, Medium applications, Long applications]

BAS: Overall Distribution

• Overall distributions considered:

Bandwidth
Low       50%   50%   33%    0%
Medium    50%    0%   33%   50%
High       0%   50%   33%   50%

Note: Distributions with 100% were not considered because there are no scheduling opportunities. Distributions with 25% were removed to limit complexity.

• For the random policy all combinations of inner-cluster distributions were considered, for example:

100:00:00, 00:100:00, 00:00:100, 33:33:33    Overall = 33:33:33
50:50:00, 50:50:00, 00:00:100, 33:33:33      Overall = 33:33:33
25:25:50, 25:50:25, 50:25:25, 33:33:33       Overall = 33:33:33
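The three examples can be checked with a few lines of arithmetic. The sketch assumes four clusters of equal size, so the overall distribution is just the per-class mean of the inner-cluster distributions, rounded to whole percent; the function name is hypothetical.

```python
def overall_distribution(inner):
    """Mean Low:Medium:High share over equally sized clusters, in whole percent.

    `inner` is a list of (low, medium, high) percentages, one tuple per cluster.
    """
    n = len(inner)
    return tuple(round(sum(d[i] for d in inner) / n) for i in range(3))


examples = [
    [(100, 0, 0), (0, 100, 0), (0, 0, 100), (33, 33, 33)],
    [(50, 50, 0), (50, 50, 0), (0, 0, 100), (33, 33, 33)],
    [(25, 25, 50), (25, 50, 25), (50, 25, 25), (33, 33, 33)],
]
for inner in examples:
    # Each class averages 133 / 4 = 33.25, which rounds to 33.
    print(overall_distribution(inner))
```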

BAS algorithm

[Flowchart boxes, in order: Application Execution; Apps. Bandwidth Distribution per cluster; Calculate Bandwidth Utilization; decision MAX(UBW) > 1 (T/F); Adaptive Procedure]
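The utilization check in the flowchart can be sketched as follows. The definition of UBW is not spelled out on the slide; this sketch assumes it is the summed bandwidth demand of a cluster divided by the peak bandwidth of the controller serving it, with one controller per cluster. The 6.4 GB/s peak is the DDR3-800 figure from the architectural model; the function names and demand values are hypothetical.

```python
CONTROLLER_BW_GBS = 6.4  # DDR3-800 peak from the architectural model


def bandwidth_utilization(cluster_demands, peak=CONTROLLER_BW_GBS):
    """Return one UBW value per cluster (assumed: total demand / controller peak)."""
    return [sum(apps) / peak for apps in cluster_demands]


def needs_adaptation(ubw):
    """Decision box from the flowchart: MAX(UBW) > 1."""
    return max(ubw) > 1


# Hypothetical per-application demands (GB/s) in four clusters.
clusters = [[3.0, 4.0], [1.0, 0.5], [2.0, 2.0], [0.2, 0.2]]
ubw = bandwidth_utilization(clusters)
print(ubw, needs_adaptation(ubw))
```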

BAS algorithm (cont.)

Adaptive Procedure

[Flowchart boxes, in order: UBWa = MAX(UBWi), UBWb = MIN(UBWi); BWBal(UBWa, UBWb); Get valid new distributions (from the Compatibility Distribution Lookup Table); Calculate UBWai, UBWbi and new BWBal for new distributions; decision "More ai-bi valid distributions?" (T/F); Perform application exchanges from cluster a to b]

Low complexity: O(n), where n is the size of the compatibility lookup table
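A rough sketch of one pass of the adaptive procedure is given below. It is not the authors' implementation: the slides do not define BWBal or the contents of the compatibility lookup table, so the sketch uses the absolute utilization gap between the two clusters as a stand-in for BWBal and replaces the table lookup with an exhaustive search over single application swaps. All names and values are hypothetical.

```python
from itertools import product


def adaptive_step(clusters, peak=6.4):
    """Swap one application between the most and least loaded clusters.

    `clusters` is a list of lists of per-application demands (GB/s) and is
    modified in place. Returns True if a swap that narrows the gap was made.
    """
    loads = [sum(c) / peak for c in clusters]
    a = loads.index(max(loads))  # cluster with UBWa = MAX(UBWi)
    b = loads.index(min(loads))  # cluster with UBWb = MIN(UBWi)
    best_gap = loads[a] - loads[b]  # stand-in for BWBal(UBWa, UBWb)
    best_swap = None
    # Candidate exchanges: every single-application swap between a and b
    # (the slides use a compatibility lookup table here instead).
    for i, j in product(range(len(clusters[a])), range(len(clusters[b]))):
        delta = (clusters[a][i] - clusters[b][j]) / peak
        gap = abs((loads[a] - delta) - (loads[b] + delta))
        if gap < best_gap:
            best_gap, best_swap = gap, (i, j)
    if best_swap is None:
        return False
    i, j = best_swap
    clusters[a][i], clusters[b][j] = clusters[b][j], clusters[a][i]
    return True


clusters = [[3.0, 4.0], [1.0, 0.5]]
print(adaptive_step(clusters), clusters)
```

Unlike the O(n) table lookup claimed on the slide, this search costs the product of the two cluster sizes per pass; it only illustrates the direction of the exchange, from the overloaded cluster a to the underused cluster b.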

BAS: Experimental Results

Short applications

[Figure panels, one per overall distribution: 50:50:00, 50:00:50, 00:50:50, 33:33:33]

• The proposed scheduler shows better results
• Only one case with worse results

BAS: Experimental Results

Medium applications

[Figure panels, one per overall distribution: 50:50:00, 50:00:50, 00:50:50, 33:33:33]

• The proposed scheduler shows better results
• Only two cases show the same performance as the random policy

BAS: Experimental Results

Long applications

[Figure panels, one per overall distribution: 50:50:00, 50:00:50, 00:50:50, 33:33:33]

• The proposed scheduler shows better results in all cases

Applications Execution Speedup Using the Bandwidth-Aware Scheduler

Average Speedups
• Short applications: 1.36x
• Medium applications: 1.48x
• Long applications: 1.46x

[Figure panels: Short applications, Medium applications, Long applications]

Results Analysis

• The majority of the application distributions benefit from the proposed bandwidth-aware scheduler

• The performance of the proposed bandwidth-aware scheduler is very close to that of the Oracle policy

• The performance of the proposed scheduler is stable

• Multi-core scalability can benefit from the use of the proposed bandwidth-aware scheduler

Conclusions

• We have shown:
  – The importance of having a bandwidth-aware scheduling policy in clustered multi-core architectures
  – There are benefits even for short applications
  – Scaling multi-cores is highly correlated with the available bandwidth

• We have proposed:
  – A simple dynamic bandwidth-aware scheduler
  – A set of representative applications with different bandwidth and time requirements
  – Scaling multi-cores with the use of the proposed bandwidth-aware scheduler

Future work

• We are performing experimental work on the SCC (Intel donated an SCC system to us)

• Investigation of different but still simple scheduling algorithms, also with more accurate cost functions

• Can we integrate this work in automatic tools at the compiler and OS levels?
  – Rephrasing the question: can we expect to have automatic scheduling in these types of systems to overcome memory bandwidth limitations?

Thank You!

http://www.sips.inesc-id.pt

http://www.cs.ucy.ac.cy/carch/casper

Questions?
