asynchronous task creation for task-based parallel ... · asynchronous taskwait •counter of tasks...

26
Asynchronous Task Creation for Task-Based Parallel Programming Runtimes Jaume Bosch ([email protected]), Xubin Tan, Carlos Álvarez, Daniel Jiménez, Xavier Martorell and Eduard Ayguadé Barcelona, Sept. 24, 2018

Upload: others

Post on 25-Nov-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Asynchronous Task Creation for Task-Based Parallel Programming Runtimes

Jaume Bosch ([email protected]), Xubin Tan, Carlos Álvarez, Daniel Jiménez, Xavier Martorell and Eduard Ayguadé

Barcelona, Sept. 24, 2018

Page 2: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

1) Introduction 2) OmpSs@FPGA 3) Implementation and design 4) Conclusion

Page 3: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Task-Based Parallel Programming Models

void cholesky(float *A[NT][NT]) { for (int k=0; k<NT; k++) { #pragma omp task inout(A[k][k]) spotrf( A[k][k] ) ; for (int i=k+1; i<NT; i++) { #pragma omp task in(A[k][k]) inout(A[k][i]) strsm( A[k][k], A[k][i] ); } for (i=k+1; i<NT; i++) { for (int j=k+1; j<i; j++) { #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i]) sgemm( A[k][i], A[k][j], A[j][i] ); } #pragma omp task in(A[k][i]) inout(A[i][i]) ssyrk( A[k][i], A[i][i] ); } } }

3

Page 4: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Task-Based Parallel Programming Models

void cholesky(float *A[NT][NT]) { for (int k=0; k<NT; k++) { #pragma omp task inout(A[k][k]) spotrf( A[k][k] ) ; for (int i=k+1; i<NT; i++) { #pragma omp task in(A[k][k]) inout(A[k][i]) strsm( A[k][k], A[k][i] ); } for (i=k+1; i<NT; i++) { for (int j=k+1; j<i; j++) { #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i]) sgemm( A[k][i], A[k][j], A[j][i] ); } #pragma omp task in(A[k][i]) inout(A[i][i]) ssyrk( A[k][i], A[i][i] ); } } }

4

Page 5: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Runtime operational flow of a task

5

Ru

nti

me

Runtime

TDG Ready Task Pool

Task states:

Instantiated Ready Active Completed

Page 6: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Current target model

6

Ru

nti

me

Page 7: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Next? target model

Ru

nti

me

Ru

nti

me

7

Page 8: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

1) Introduction

2) OmpSs@FPGA 3) Implementation and design 4) Conclusion

Page 9: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs: Forerunner of the OpenMP tasking

+ Task prototyping

+ Task dependences

+ Task priorities + Taskloop prototyping

+ Task reductions + Taskwait deps + OMPT impl. + Multideps + Commutative + Data affinity

+ Taskloop dependences

Today

9

Page 10: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs@FPGA

• Easily offloading tasks to FPGA devices • Automatic generation

of FPGA bitstream from C/C++ tasks

• Automatic data movements and task synchronization between the host and the FPGA

• Support for HW instrumentation integrated in Extrae

10

Page 11: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs@FPGA | Source code example

void dotProductBlock(float *v1, float *v2, float *result) { int resultLocal = result[0]; for (size_t i = 0; i < BSIZE; ++i) { resultLocal += v1[i]*v2[i]; } result[0] = resultLocal; } int main() { ... for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); } #pragma omp taskwait ... }

11

#pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result)

#pragma omp target device(fpga) num_instances(2)

Page 12: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs@FPGA | Compilation process

(fpgacc, fpgacxx)

Native compiler

autoVivado FPGA specific vendor tools

Linker

12

Page 13: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs@FPGA | Execution

Runtime

TDG Ready Task Pool

Task Manager

ReadyQ FinishQ

13

Page 14: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

OmpSs@FPGA | Execution

Runtime

TDG Ready Task Pool

Task Manager

ReadyQ FinishQ

14

Page 15: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

1) Introduction 2) OmpSs@FPGA

3) Implementation and design 4) Conclusion

Page 16: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Task creation inside a fpga task

#pragma omp target device(fpga) num_instances(2) #pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result) void dotProductBlock(float *v1, float *v2, float *result) { ... } #pragma omp target device(fpga) #pragma omp task in([VSIZE]v1, [VSIZE]v2) inout([1]result) void dotProduct(float *v1, float *v2, float *result) {

... for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); }

... } int main() { ... dotProduct(v1, v2, &result); ... }

16

Page 17: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Task creation from FPGA

Runtime

TDG Ready Task Pool

Task Manager

ReadyQ FinishQ

17

NewQ

Page 18: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Task creation from FPGA with Picos

Runtime

TDG Ready Task Pool

TM

18

RQ FQ NQ

Page 19: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Asynchronous task creation

• Shared memory region between the accelerator and the handler • Coherent and consistent

• Tasks must be added and handled in the same order • Ensure sequential order execution

• Handling must cannot be parallelized for the same parent task

• Shared memory can be splited into sub-regions, one for each accelerator

19

Page 20: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Taskwait inside a fpga task

#pragma omp target device(fpga) num_instances(2) #pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result) void dotProductBlock(float *v1, float *v2, float *result) { ... } #pragma omp target device(fpga) #pragma omp task in([VSIZE]v1, [VSIZE]v2) inout([1]result) void dotProduct(float *v1, float *v2, float *result) { for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); } #pragma omp taskwait } int main() { ... dotProduct(v1, v2, &result); ... }

20

Page 21: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Taskwait inside a fpga task

Runtime

TDG Ready Task Pool

Task Manager

RQ FQ

21

NQ TW Taskwait Manager

Parent Task Id Count

1 1 99

1

-1 0

Page 22: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Asynchronous taskwait

• Counter of tasks executed in each context (parent task id) • Children of SMP tasks not considered, do not have a parent fpga task id

• Accelerator blocks until receives a signal from the Taskwait Manager • The signal is sent when the count becomes 0

• The entry for a context is deleted when the task finalizes • Garbage collector

22

Page 23: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

1) Introduction 2) OmpSs@FPGA 3) Implementation and design

4) Conclusion

Page 24: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Execution trace Matrix Multiply 1024*1024, 3 accelerators at 333Mhz 128*128

• With current implementation, we can save around 50% the time of 1 thread

• With the Picos implementation, we can save an additional thread (fpga helper thread)

24

Page 25: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

Conclusions

• Design of a general mechanism for asynchronous task creation

• Suitable for accelerators

• Design successfully implemented over OmpSs@FPGA on an Axiom Board

• Similar performance to SMP task creation version despite the round trip

• Next steps

• Integration with Picos HW manager

• Performance evaluation for iterative solvers

25

Page 26: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,

THANK YOU!

www.bsc.es

FOR MORE INFO:

pm.bsc.es/ompss-at-fpga [email protected]