
Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation

Alin Murarasu ([email protected]), Josef Weidendorfer, Arndt Bode

Informatik X, Technische Universität München


29.08.2011 A. Murarasu @ UCHPC 2011 2

Outline

Introduction and Motivation
Heterogeneous Computing
Workload Balancing Strategies
Sparse Grid Interpolation
Dynamic Load Balancing
Static Load Balancing
Results
Conclusion


Introduction

Accelerators provide more speed at acceptable power
GPUs = the most common accelerators
Many systems are heterogeneous, e.g. CPUs + GPUs

To efficiently handle heterogeneity we need:
− Multi-version functions for CPU and GPU
− Workload balancing

Multi-versioning results from writing functions that exploit specific properties of CPUs / GPUs

Workload balancing is built on top of multi-versioning and is a “must” for performance portability


Motivation

It is challenging to achieve optimal performance on hybrid systems
− There are multiple options for load balancing: static / dynamic
− For dynamic balancing, how does the task (grain) size affect performance?

We study these points for an interpolation method

Our goal: to make interpolation faster by using all the compute resources in a heterogeneous system, i.e. GFlops(hybrid) = GFlops(cpu) + GFlops(gpu)


Heterogeneous Systems

CPUs = latency oriented processors
− Large caches
− Complex logic for out-of-order execution and speculation
− General purpose

GPUs = throughput oriented processors
− Large number of in-order cores (16) with wide SIMD units (32 lanes)
− Interleaved multi-threading, 1536 concurrent threads / core
− Special purpose

Different hardware properties => different optimizations
For CPUs: cache blocking, vectorization
For GPUs: maximizing # concurrent threads, coalescing accesses to global memory, using memories (global, shared, constant, texture) appropriately, minimizing # branches and # bank conflicts


Heterogeneous Computing

For programming GPUs, an offloading approach is typically employed, similar to co-processors

We determine a mapping “function – processor type”
This often leads to idle compute resources

Instead, the CPUs and the GPUs could cooperate for executing a function => full utilization of resources

For this we need multi-versioning and load balancing


Load Balancing Strategies

Dynamic
− Load distribution can be changed after the computation has started
− Adapts to different systems / different inputs / external system load
− A typical dynamic strategy is task based, receiver initiated
− Overhead from task queue management and distribution
− The grain size problem: too small / too big

Static
− Distributes the workload according to the speeds of the processors
− The initial distribution does not change during the execution
− Fewer sources of overhead
− No grain size problem
− But less adaptive
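The trade-off between the two strategies can be explored with a small simulation. The sketch below is not the implementation from the talk: the function names, the constant per-task `overhead` and the worker speeds (items per time unit) are all assumptions made for illustration. It models a receiver-initiated scheme in which an idle worker pulls the next chunk of `grain` items, and compares it with a static split proportional to speed.

```python
import heapq

def simulate_dynamic(n_items, grain, speeds, overhead):
    """Makespan when idle workers pull chunks of `grain` items from a shared queue.

    speeds[i] is a hypothetical processing rate of worker i (items per time unit);
    overhead is an assumed fixed cost paid for every task that is fetched.
    """
    # Min-heap of (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle_at)
    remaining = n_items
    finish = 0.0
    while remaining > 0:
        t, i = heapq.heappop(idle_at)      # next worker to become idle
        chunk = min(grain, remaining)
        remaining -= chunk
        t_done = t + overhead + chunk / speeds[i]
        finish = max(finish, t_done)
        heapq.heappush(idle_at, (t_done, i))
    return finish

def simulate_static(n_items, speeds, overhead):
    """Makespan when each worker gets one chunk proportional to its speed."""
    total = sum(speeds)
    return max(overhead + (n_items * s / total) / s for s in speeds)

if __name__ == "__main__":
    speeds = [1.0, 4.0]  # e.g. a CPU and a GPU that is 4 times faster (illustrative)
    for grain in (1, 10, 100, 1000):
        print(grain, simulate_dynamic(1000, grain, speeds, overhead=1.0))
    print("static", simulate_static(1000, speeds, overhead=1.0))
```

With these made-up numbers, a very small grain is dominated by the per-task overhead, while a grain as large as the whole workload leaves one worker idle, which is the grain size problem named in the list above.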


The Grain Size Problem (1)

The grain size problem:
− If it is too small => a lot of overhead
− If it is too big => no balancing
− If it ignores CPU optimizations => inefficient on CPU
− If it ignores GPU optimizations => inefficient on GPU

Examples:
− Grain size is a multiple of the cache tile size on CPU
− Grain size is a multiple of # active threads on GPU

Solutions:
− Determine the optimal value
− Use static workload balancing
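The two examples above amount to rounding a candidate grain size to a device-friendly multiple. A minimal sketch follows; the helper name and the CPU tile size of 64 are hypothetical, and the GPU unit of 16 × 1536 threads simply reuses the core and thread counts quoted earlier in the talk, assuming every thread slot is active.

```python
def round_up_to_multiple(grain, unit):
    """Smallest multiple of `unit` that is >= `grain` (both positive integers)."""
    return ((grain + unit - 1) // unit) * unit

CPU_TILE = 64                 # hypothetical cache tile size, in work items
GPU_THREADS = 16 * 1536       # cores x concurrent threads per core, assumed all active

if __name__ == "__main__":
    print(round_up_to_multiple(1000, CPU_TILE))      # CPU-friendly grain
    print(round_up_to_multiple(30000, GPU_THREADS))  # GPU-friendly grain
```

A grain size chosen this way keeps CPU tasks aligned with whole cache tiles and GPU tasks aligned with full sets of threads, at the cost of restricting which grain sizes can be tried.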


The Grain Size Problem (2)

The problem gets more complicated when the optimal value of the grain size depends on the inputs

This is the case with our interpolation method
The figures below are obtained for different inputs


Sparse Grid Interpolation

Performance critical for a computational steering application
Simulation data is D-dimensional
− Compressed using the sparse grid technique
− Decompressed using sparse grid interpolation
Compression is lossy
The refinement level L controls the accuracy

[Figure: label “Reynolds”]


Performance Behavior

CI := flops / (bytes transferred to / from main memory)
D (dimensionality): D ++ => CI ++
L (refinement level): L ++ => CI --
N (number of interpolations): N ++ => CI ++
D, L, N influence the performance on CPU / GPU and load balancing
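To make the CI definition concrete, the sketch below computes it from flop and byte counts. The second function is an addition that does not appear in the talk: it applies the standard roofline bound, min(peak, CI × bandwidth), as one common way to see why a higher CI helps. All numbers and names are made up for illustration.

```python
def ci(flops, bytes_moved):
    """CI as defined above: flops per byte transferred to / from main memory."""
    return flops / bytes_moved

def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Upper bound on GFlops under the roofline model (not part of the talk)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

if __name__ == "__main__":
    low = ci(flops=200.0, bytes_moved=800.0)    # 0.25 flops / byte
    high = ci(flops=8000.0, bytes_moved=800.0)  # 10 flops / byte
    # Hypothetical device: 100 GFlops peak, 20 GB/s memory bandwidth.
    print(roofline_bound(low, 100.0, 20.0))     # limited by memory bandwidth
    print(roofline_bound(high, 100.0, 20.0))    # limited by peak compute
```

Under this model, inputs with a low CI are limited by memory bandwidth and inputs with a high CI by peak compute, which is consistent with D, L and N changing the relative performance of CPU and GPU.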

[Figure: plot over refinement level L and dimensionality D]


Dynamic Task Based Load Balancing (1)

The grain size is important for performance
Finding its optimal value can be time-consuming since it depends on D, L, N => reduce the search time

We model Tx(w) as the # tasks executed on compute resource x multiplied by the runtime of each task on x

=> we simplify the best grain size's dependence on N


Dynamic Task Based Load Balancing (2)

We approximate the runtime of a task (tcpu(w) / tgpu(w) from below) as a function of workload for every D, L

We reuse tx(w) for any N
We solve the system (right)
The optimal grain size = argmin(T'cpu(w))

[Figure panels: CPU, GPU]


Static Load Balancing

Central idea: find w that minimizes max(Tcpu(w), Tgpu(N – w))
We approximate Tcpu(w), Tgpu(w) with a linear function “ct0 × w + ct1”
On CPU this is straightforward
On GPU: the runtime has steps caused by the high # cores, # SIMD lanes, # concurrent threads
We build a linear approximation using the value of Tgpu at w = 1 and w = max. # concurrent threads + 1
The result is reused for every N
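The static scheme above can be sketched in a few lines. This is an illustration under assumptions, not the code from the talk: the function names are hypothetical, the timing samples are invented, and the GPU sample point 24577 is 16 × 1536 + 1, taken from the core and thread counts quoted earlier. Because the fitted Tcpu(w) grows with w while Tgpu(N – w) shrinks, the maximum of the two is smallest where the lines cross, so the sketch evaluates the integers around the crossing plus the two extremes.

```python
import math

def fit_line(w_a, t_a, w_b, t_b):
    """Return (ct0, ct1) of the line ct0 * w + ct1 through two measured points."""
    ct0 = (t_b - t_a) / (w_b - w_a)
    ct1 = t_a - ct0 * w_a
    return ct0, ct1

def static_split(n, cpu_line, gpu_line):
    """Integer w in [0, n] minimizing max(Tcpu(w), Tgpu(n - w)) for linear models."""
    a0, a1 = cpu_line
    b0, b1 = gpu_line

    def makespan(w):
        return max(a0 * w + a1, b0 * (n - w) + b1)

    # Crossing point of a0*w + a1 and b0*(n - w) + b1, clamped to [0, n].
    crossing = (b0 * n + b1 - a1) / (a0 + b0)
    crossing = min(max(crossing, 0.0), float(n))
    candidates = {0, n, math.floor(crossing), math.ceil(crossing)}
    best = min(candidates, key=makespan)
    return best, makespan(best)

if __name__ == "__main__":
    # Invented timings (arbitrary units): two samples per device.
    cpu_line = fit_line(1, 5.0, 1001, 4005.0)      # CPU: 4.0 * w + 1.0
    gpu_line = fit_line(1, 12.0, 24577, 24588.0)   # GPU: 1.0 * w + 11.0
    print(static_split(100000, cpu_line, gpu_line))
```

The two fitted lines depend only on the devices and on D, L, so the same pair can be passed to `static_split` for any N, matching the reuse noted in the last point above.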

[Figure panels: CPU, GPU]


Results

The GPU is 4 – 8 times faster than the CPU(s) on our systems
=> we aim at 12.5 – 25% extra speedup from CPU(s) + GPU

Dynamic is implemented using OpenMP + CUDA and StarPU
The best grain size varies from ~44000 to ~7500 for D = 1 ... 20
Static performs the best
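The 12.5 – 25% target follows from the goal GFlops(hybrid) = GFlops(cpu) + GFlops(gpu): if the GPU is k times faster than the CPU(s), adding the CPU(s) raises throughput by a factor of 1 + 1/k over the GPU alone. A short check, with a hypothetical function name:

```python
def extra_speedup_from_cpu(gpu_over_cpu):
    """Ideal gain over GPU-only execution when the CPU(s) also contribute.

    gpu_over_cpu is k = GFlops(gpu) / GFlops(cpu); the hybrid rate is
    GFlops(gpu) * (1 + 1/k), so the extra speedup is 1/k.
    """
    return 1.0 / gpu_over_cpu

if __name__ == "__main__":
    for k in (4, 8):
        print(k, f"{extra_speedup_from_cpu(k):.1%}")
```

This is an upper bound that ignores balancing overhead, so measured gains can only approach it.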

[Figure panels: 8 CPU cores + 1 GPU; 4 CPU cores + 1 GPU]


Conclusion

Heterogeneity is an effective way to increase computation power while keeping the energy consumption at an acceptable level => we are likely to see more of it in the future

Load balancing is a “must” for optimal / portable performance on heterogeneous systems

Load balancing can be achieved dynamically / statically

Grain size is important for dynamic task based balancing => ignoring it can severely reduce the performance

Static balancing provides close to optimal performance for sparse grid interpolation on heterogeneous systems


Thank you for listening!
