dependence-based automatic parallelization … automatic parallelization using cnc bo zhao, ali...

19
Dependence-Based Automatic Parallelization using CnC Bo Zhao, Ali Janessari Technische Universit¨ at Darmstadt [email protected], [email protected] September 8, 2015 Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 1 / 19

Upload: doanthuy

Post on 30-May-2018

225 views

Category:

Documents


0 download

TRANSCRIPT

Dependence-Based Automatic Parallelization using CnC

Bo Zhao, Ali Janessari

Technische Universitat Darmstadt

[email protected], [email protected]

September 8, 2015

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 1 / 19

Overview

1 IntroductionMotivationObjectives

2 ApproachOverview FrameworkProgram AnalysisTask parallelism ExtractionCode Generation

3 Conclusion

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 2 / 19

Introduction Motivation

Motivation

Multicore and architecture has become popular as a result of thestagnating single core performance

Many software products are implemented sequentially

fail to tap potential of the parallel hardware

Problem : the gap between parallel hardware and sequential software

take advantage of new hardware featurespreserve the current software investmentsave human resource

Solution: automatically (semi-automatically) transform sequentialcode into parallel code

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 3 / 19

Introduction Objectives

Objectives

Discover potential parallelism

Loop parallelismIrregular task parallelism

Detect data and control dependencies

Generate parallel code using Concurrent Collections

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 4 / 19

Approach Overview Framework

Overview Workflow

Static Code Analysis

Dynamic Code Analysis

Code instrumentation

Front End

Task Graph Generator

Seq IR

Parallel Execution

Unit T

estin

g

Sequential

Source

Code

LLVM JIT Compiler

DiscoPoP

Correctness Feedback

CU

Graph

Co

mp

ile T

ime

Ru

ntim

e

Phase1: Program AnalysisPhase2: Coarse-Grained Task

Extraction & IR2IR TransPhase3: Code Generation

CnC-Par IRIR-to-IR transformation

Ctrl Info

Task

Graph

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 5 / 19

Approach Program Analysis

DiscoPoP (Discovery of Potential Parallelism)

Phase 1:Static and dynamic analysesInstruments the target program and identifies control and datadependencies

Phase 2 & 3:Post-mortem analysis for parallelism discoveryBuilds Computational Units (CUs) for the target programRanking

Phase 3Phase 2Phase 1

SourceCode

Conv

ersio

n to

IR

Memory Access& Control-flow

Instrumentation

Static Control-flowAnalysis

Data Dependency Analysis

CUGraph

ControlRegion

InformationPa

ralle

lism

Disc

over

y &

Para

llel P

atte

rnDe

tect

ion

RankedParallel

Opportunities

exec

utio

n

static dynamic

Rank

ing

Dynamic Control-flow Analysis

Variable Lifetime Analysis

Runtime Dependency MergingComputational

Unit Analysis

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 6 / 19

Approach Program Analysis

Dependence Profiling

Control dependence<FileID:LineID <Contr.ID> <Label> <Exec.Time>

1:60 BGN loop void1:74 END loop 1200

Data dependence<FileID:LineID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName>

1:63 NOM void RAW 1:59|temp11:70 NOM void WAR 1:67|temp2

Data dependence (multi-threaded)<FileID:LineID|ThreadID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName|ThreadID>

4:59|2 NOM void WAR 4:71|2|z real

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 7 / 19

Approach Task parallelism Extraction

Computation Unit (CU)

A collection of instructions (LLVM-IR instruction)Follows the read-compute-write pattern

A program state is first read from memory, the new state is computed,and finally written back

A small piece of code containing no parallelism or only ILPBuilding blocks for forming parallel tasksCU graph

Dependences are mapped to CUsExposes tightly-connected CUs

1 x = 32 y = 43 a = x + rand() / x4 b = x - rand() / x5 x = a + b6 a = y + rand() / y7 b = y - rand() / y8 y = a + b

x = 3

a = x + rand() / x

b = x - rand() / x

CUx

INIT

x = a + b

y = 4

a = y + rand() / y

b = y - rand() / y

y = a + b

CUy

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 8 / 19

Approach Task parallelism Extraction

CU Graph

CU-19 [146,152,153,154]

*in = args->in_img;for(all_pixels){

R = *in++;G = *in++;B = *in++;

}

CU-32 [152,153,154,156]

for(all_pixels){R = *in++;G = *in++;B = *in++;

Y= round(c1*R+c2*G+c3*B);}

CU-33 [152,153,154,157]

for(all_pixels){R = *in++;G = *in++;B = *in++;

U = round(c4*R+c5*G+c6*B);}

CU-34 [152,153,154,158]

for(all_pixels){R = *in++;G = *in++;B = *in++;

V = round(c7*R+c8*G+c9*B);}

CU-21 [147,156,160]

*pY = args->pY;for(all_pixels){

Y = round(c7*R+c8*G+c9*B);*pY++ = Y;

}

CU-24 [148,157,161]

*pU = args->pU;for(all_pixels){

U = round(c7*R+c8*G+c9*B);*pU++ = U;

}

CU-27 [149,158,162]

*pV = args->pV;for(all_pixels){

V = round(c7*R+c8*G+c9*B);*pV++ = V;

}

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 9 / 19

Approach Task parallelism Extraction

Program Execution tree

A call tree combined withloop information and basicblocks

CU graph is mapped on tothe execution tree

Program 1 - 377

Basic Block11 - 19

Loop21 - 36

Basic Block22 - 27

Tree node

CU

DataDependency

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 10 / 19

Approach Task parallelism Extraction

Task Extraction

Merge CUs contained in strongly connected components (SCCs) or inchains

A

B

C

D

E

F

G

H

I

A

B

C

D

E I

FGH

A

B

I

FGHCDE

1 2

SCCSCCchain

SCCFGH and chainCDE are two tasks

Hide complex dependences inside SSCs, exposing parallelizationopportunities outside

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 11 / 19

Approach Task parallelism Extraction

Task Extraction

Two CUs can share common instructions

53

54

57

56

59 58

55

2

1

2

7

3

6

4

7

1 5

4

1

2

3

No. of Common Instructions

No. of Dependences

(a) CU graph with CUsas vertices and RAWdependences and commoninstructions as edges

53

54

57

56

59 58

55

0.20

0.28

0.17

0.35

0.10

0.18

0.43

0.20

0.05

Affinity

(b) CU graph withaffinities between the CUs

53

54

57

56

59 58

55

0.20

0.28

0.17

0.35

0.10

0.18

0.43

0.20

0.05

Min Cut

(c) CU graph with aminimum cut

53

54

57

56

59 58

55

0.20

0.28 0.35

0.18

0.43

0.20

0.05

(d) CU graphpartitioned to identifytasks

Figure : Demonstration of a CU graph and graph partitioning to form tasks.

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 12 / 19

Approach Task parallelism Extraction

Task Graph

Task Extraction

Not limited to predefined language constructsCovers independent tasks and dependent tasks (coarse-grained tasks)

function: 365 - 381Parallelizable: true loop: 372 - 380

Parallelizable: falseINIT370

CU374 - 379

INIT666

if-else: 667 - 678Parallelizable: false

loop: 682 - 709Parallelizable: true

if-else: 719Parallelizable: false

RAWCU

ControlRegion

Blue

Yellow

Grey

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 13 / 19

Approach Code Generation

Code Generation

On going work

Map the task graph to CnC graph

CnC defines two scheduling constraints in parallel execution

producer/consumer relationshipscontroller/controllee relationships

A task (coarse-grained CU) is similar to a step collection

Data dependency among tasks are known form the task graph

Detected control information is not sufficient

Users specify the controller/controllee relationships

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 14 / 19

Approach Code Generation

Code Generation

Propose CnC-specific IR template

Transform the original IR to Cnc specific IR using task graph andusers’ control information

Generate binary code form CnC speceific IR

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 15 / 19

Approach Code Generation

Code Generation

previous code transformation results

Source-to-source transformation using Intel TBB flow graph(semi-automatic)FaceDetection (CnC sample application)

(a) Logic of FaceDetection (b) Flow graph

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 16 / 19

Approach Code Generation

Code Generation

Speedups on 2x8-core Intel Xeon E5-2650 2 GHz

1 2 4 8 16 320

5

10

15

20

thread

spee

dup

Official Manual CnC ParallelizationSemi-automatic TBB Parallelization

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 17 / 19

Conclusion

Conclusion

Profile data and control dependencies

DiscoPoPUsers’ specification

Extract coarse-grained task parallelism

CU graphProgram execution treeTask graph

Generate parallel code using CnC

Define CnC-specific IRCode transformation at IR levelEmploy CnC runtime library

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 18 / 19

Conclusion

Thanks!Q & A

Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 19 / 19