Parallel Programming Models, Languages and Compilers
Module 6
Points to be covered
• Parallel Programming Models - Shared-Variable Model, Message-Passing Model, Data-Parallel Model, Object Oriented Model, Functional and Logic Models
• Parallel Languages and Role of Compilers - Language Features for Parallelism, Parallel Language Constructs, Optimizing Compilers for Parallelism
• Dependence Analysis of Data Arrays - Iteration Space and Dependence Analysis, Subscript Separability and Partitioning, Categorized Dependence Tests
• Code Optimization and Scheduling - Scalar Optimization within Basic Blocks, Local and Global Optimizations, Vectorization and Parallelization Methods, Code Generation and Scheduling, Trace Scheduling Compilation
Parallel Programming Model
• A programming model is a simplified and transparent view of the computer hardware/software system.
• Parallel programming models are specifically designed for multiprocessors, multicomputers, or vector/SIMD computers.
Classification
• We have five programming models:
1) Shared-Variable Model
2) Message-Passing Model
3) Data-Parallel Model
4) Object Oriented Model
5) Functional and Logic Models
Shared Variable Model
• In all programming systems, processors are active resources while memory and I/O devices are passive resources.
• A program is a collection of processes.
• Parallelism depends on how IPC (Interprocess Communication) is implemented.
• The process address space is shared.
• To ensure orderly IPC, a mutual exclusion property requires that a shared object be accessed by only one process at a time.
Shared Variable Model (Subpoints)
• Shared variable communication
• Critical section
• Protected access
• Partitioning and replication
• Scheduling and synchronization
• Cache coherence problem
Shared Variable communication
• Used in multiprocessor programming.
• Shared-variable IPC demands the use of shared memory and mutual exclusion among multiple processes accessing the same set of variables.
[Figure: Processes A, B and C communicating through a shared variable in common memory]
Critical Section
• A Critical Section (CS) is a code segment accessing shared variables, which must be executed by only one process at a time and which, once started, must be completed without interruption.
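As an illustration (not from the slides), here is a minimal Python sketch of a critical section protected by a lock; the shared counter, the number of threads, and the function names are assumptions made for this example.

import threading

counter = 0                     # shared variable in common memory
lock = threading.Lock()         # guards the critical section

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # only one thread executes the CS at a time
            counter += 1        # read-modify-write on the shared variable

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 40000: mutual exclusion keeps the updates orderly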
Critical Section Requirements
• It should satisfy the following requirements:
Mutual Exclusion - at most one process executes the CS at a time.
No deadlock in waiting - no circular wait by two or more processes.
No preemption - no interrupt until completion.
Eventual entry - once a process has entered the CS, it must leave after completion, so that waiting processes can eventually enter.
Protected Access
• The granularity of the CS affects performance.
• If the CS is too large, it may limit parallelism due to excessive waiting by processes.
• If the CS is too small, it may add unnecessary code complexity/software overhead.
4 operational Modes
• Multiprogramming• Multiprocessing• Multitasking• Multithreading
Multiprogramming
• Multiple independent programs run on a single processor or multiprocessor through time-shared use of system resources.
• When a program enters I/O mode, the processor switches to another program.
Multiprocessing
• When multiprogramming is implemented at the process level on a multiprocessor, it is called multiprocessing.
• Two types of multiprocessing:
If interprocessor communication is handled at the instruction level, the multiprocessor operates in MIMD mode.
If interprocessor communication is handled at the program, subroutine, or procedure level, the multiprocessor operates in MPMD mode.
Multitasking
• A single program can be partitioned into multiple interrelated tasks concurrently executed on a multiprocessor.
• Thus multitasking provides parallel execution of two or more parts of a single program.
Multithreading
• The traditional UNIX/OS has a single-threaded kernel in which only one process can receive OS kernel service at a time.
• In a multiprocessor we extend the single kernel to be multithreaded.
• The purpose is to allow multiple threads (lightweight processes) to share the same address space.
Partitioning and Replication
• Goal of parallel processing is to exploit parallelism as much as possible with lowest overhead.
• Program partitioning is a technique for decomposing a large program and data set into many small pieces for parallel execution by multiple processors.
• Program partitioning involves both programmers and compilers.
• Program replication refers to duplication of same program code for parallel execution on multiple processors over different data sets.
Scheduling and Synchronization
• Scheduling is further classified as:
Static Scheduling
• Conducted at post-compile time.
• Its advantage is low overhead; its shortcoming is a possible mismatch with the run-time profile of each task.
Dynamic Scheduling
• Catches the run-time conditions.
• Requires fast context switching, preemption, and much more OS support.
• Its advantage is better resource utilization, at the expense of higher scheduling overhead.
• Atomic memory operations such as Test&Set and Fetch&Add can be used to achieve synchronization, as sketched below.
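As an illustrative sketch (not from the slides), the following Python code shows how a spinlock can be built on a test-and-set style primitive; since user-level Python has no atomic exchange instruction, the primitive is emulated with a lock, so treat the whole class as an assumption rather than a real hardware interface.

import threading

class TestAndSetLock:
    """Spinlock built on a test-and-set style atomic exchange (emulated here)."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()    # stands in for hardware atomicity

    def _test_and_set(self):
        # Atomically read the old value and set the flag to True.
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while self._test_and_set():       # spin while the lock is already held
            pass

    def release(self):
        self._flag = False

A Fetch&Add-based counter or barrier could be emulated in the same manner.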
Cache Coherence & Protection
• The multicache coherence problem demands an invalidation or update after each write operation.
Message Passing Model
• Two processes D and E residing at different processor nodes may communicate with each other by passing messages through a direct network.
• The messages may be instructions, data, synchronization, or interrupt signals, etc.
• Multicomputers are considered loosely coupled multiprocessors.
IPC using Message Passing
[Figure: Process D and Process E exchange messages via send/receive]
Synchronous Message Passing
• No shared memory.
• No mutual exclusion.
• Synchronization of the sender and receiver processes, just like a telephone call.
• No buffer is used.
• If one process is ready to communicate and the other is not, the one that is ready must be blocked.
Asynchronous Message Passing
• Does not require that message sending and receiving be synchronized in time or space.
• Arbitrary communication delay may be experienced, because the sender may not know if and when the message has been received until an acknowledgement is received from the receiver.
• This scheme is like a postal service using a mailbox, with no synchronization between senders and receivers.
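A small illustrative sketch (an assumed example, not from the slides) of asynchronous, mailbox-style message passing between two processes using Python's multiprocessing queue; the process names D and E follow the slides, and the messages are arbitrary.

from multiprocessing import Process, Queue

def process_d(mailbox):
    # Sender: deposits messages and continues without waiting for the receiver.
    for msg in ["data", "sync", "done"]:
        mailbox.put(msg)

def process_e(mailbox):
    # Receiver: picks messages up from the mailbox whenever it is ready.
    while True:
        msg = mailbox.get()       # blocks only if the mailbox is empty
        if msg == "done":
            break
        print("E received:", msg)

if __name__ == "__main__":
    mailbox = Queue()             # buffered mailbox: no sender/receiver synchronization
    d = Process(target=process_d, args=(mailbox,))
    e = Process(target=process_e, args=(mailbox,))
    d.start(); e.start()
    d.join(); e.join()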
Data Parallel Model
• Used in SIMD computers.
• Parallelism is handled by hardware synchronization and flow control.
• Fortran 90 is a data-parallel language.
• Requires predistributed data sets.
Data Parallelism
• This technique is used in array processors (SIMD).
• Issue: matching the problem size with the machine size.
• Recently the Connection Machine was used with 16384 PEs working in a single configuration.
• No mutual exclusion problem.
Array Language Extensions
• Various data-parallel languages are used.
• They are represented by high-level data types.
• Examples: CFD for the Illiac IV, DAP Fortran for the Distributed Array Processor, C* for the Connection Machine.
• The target is to make the number of PEs match the problem size.
Compiler Support
• The compiler can be used for optimization purposes.
• It decides the dimensioning of the arrays.
Object Oriented Model
• Objects are dynamically created and manipulated.
• Processing is performed by sending and receiving messages among objects.
Concurrent OOP
• The need for OOP arises from the abstraction and reusability concepts.
• Objects are program entities which encapsulate data and operations in a single unit.
• OOP supports concurrent manipulation of objects.
Actor Model
• This is a framework for concurrent OOP.
• Actors are independent components.
• They communicate via asynchronous message passing.
• Three primitives: create, send-to, and become (sketched below).
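A minimal, illustrative sketch of the three actor primitives (create, send-to, become) using a thread with a mailbox queue; the Actor class and its method names are assumptions for this example, not a standard library API.

import queue
import threading
import time

class Actor:
    def __init__(self, behavior):            # "create": spawn an actor with an initial behavior
        self._mailbox = queue.Queue()
        self._behavior = behavior
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):                      # "send-to": asynchronous, never blocks the sender
        self._mailbox.put(msg)

    def become(self, new_behavior):           # "become": replace the behavior for future messages
        self._behavior = new_behavior

    def _run(self):
        while True:
            self._behavior(self, self._mailbox.get())

def greeter(actor, msg):
    print("greeting:", msg)
    actor.become(lambda a, m: print("ignoring:", m))   # change behavior after the first message

a = Actor(greeter)
a.send("hello")
a.send("hello again")
time.sleep(0.2)                               # give the daemon thread time to drain the mailbox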
Parallelism in COOP
• Three common patterns for parallelism:
1) Pipeline concurrency
2) Divide and conquer
3) Cooperative problem solving
Functional and Logic Models
• Functional programming languages: Lisp, Sisal, and Strand 88.
• Logic programming languages: Concurrent Prolog and Parlog.
Functional Programming Model
• Should not produce any side effects.
• No concept of storage, assignment, or branching.
• Single-assignment and dataflow languages are functional in nature.
Logic Programming Models
• Used for knowledge processing from large databases.
• Supports an implicit search strategy.
• AND-parallel execution and OR-parallel reduction techniques are used.
• Used in artificial intelligence.
Parallel Language and Compilers
• A programming environment is a collection of software tools and system support.
• A parallel software programming environment is needed.
• Users are still forced to focus on hardware details rather than exploit parallelism through high-level abstraction.
Language Features For Parallelism
• Optimization Features
• Availability Features
• Synchronization/Communication Features
• Control of Parallelism
• Data Parallelism Features
• Process Management Features
Optimization Features
• Theme: conversion of a sequential program into a parallel program.
• The purpose is to match software parallelism with hardware parallelism.
• Software in practice:
1) Automated parallelizers - the Express C automated parallelizer and the Alliant FX Fortran compiler.
2) Semi-automated parallelizers - need compiler directives or programmer interaction.
3) Interactive restructuring support - static analyzer, run-time statistics, dataflow graph, and code translator for restructuring Fortran code.
Availability Features
• Theme: enhance user-friendliness, make the language portable across a large number of parallel computers, and expand the applicability of software libraries.
1) Scalability - the language should be scalable to the number of processors and independent of hardware topology.
2) Compatibility - compatible with sequential languages.
3) Portability - the language should be portable to shared-memory multiprocessors, message-passing multicomputers, or both.
Synchronization/Communication Features
• Shared variables (locks) for IPC
• Remote procedure calls
• Dataflow languages
• Mailboxes, semaphores, monitors
Control Of Parallelism
• Coarse, medium, and fine grain
• Explicit vs implicit parallelism
• Global parallelism
• Loop parallelism
• Task parallelism
• Divide-and-conquer parallelism
Data Parallelism Features
Theme: how data is accessed and distributed in either SIMD or MIMD computers.
1) Run-time automatic decomposition - data is automatically distributed with no user interaction.
2) Mapping specification - the user specifies patterns, and input data is mapped to hardware.
3) Virtual processor support - the compiler maps virtual processors statically onto physical processors.
4) Direct access to shared data - shared data is directly accessed by the operating system.
Process Management Features
Theme: support efficient creation of parallel processes, multithreading/multitasking, program partitioning and replication, and dynamic load balancing at run time.
1) Dynamic process creation at run time
2) Creation of lightweight processes
3) Replication techniques
4) Partitioned networks
5) Automatic load balancing
Optimizing Compilers for Parallelism
• The role of the compiler is to remove the burden of code optimization and code generation from the programmer.
• Three phases:
1) Flow analysis
2) Optimization
3) Code generation
Flow Analysis
• Reveals the program flow patterns needed to determine data and control dependences.
• Flow analysis is carried out at various execution levels:
1) Instruction level -> VLIW or superscalar processors
2) Loop level -> SIMD and systolic computers
3) Task level -> multiprocessors/multicomputers
Program Optimization
• Transformation of the user program to explore the hardware capabilities.
• Explores better performance.
• Goals: to maximize the speed of code execution and to minimize code length.
• Local and global optimizations.
• Machine-dependent transformations.
Parallel Code Generation
• Compiler directives can be used to generate parallel code.
• Two optimizing compilers:
1) Parafrase and Parafrase 2
2) PFC and ParaScope
Parafrase and Parafrase 2
• Transforms sequential Fortran 77 programs into parallel programs.
• Parafrase consists of about 100 program passes that are encoded and applied according to a pass list.
• The pass list identifies dependences and converts the program into a concurrent program.
• Parafrase 2 handles C and Pascal in addition to Fortran.
PFC and ParaScope
• Translates Fortran 77 into Fortran 90 code.
• The PFC package was extended to PFC+ for parallel code generation on shared-memory multiprocessors.
• PFC performs analysis in the following steps:
1) Interprocedural flow analysis
2) Transformation
3) Dependence analysis
4) Vector code generation
Dependence Analysis of Data Arrays
• Iteration Space and Dependence Analysis
• Subscript Separability and Partitioning
• Categorized Dependence Tests
Iteration Space And Dependence Analysis
• Data and control dependences
• Scalar data dependences - true-, anti-, and output-dependences
Data and Control Dependences
• Data dependence: data is produced and consumed in the correct order.
• Control dependence: a dependence that arises as a result of control flow.

Data dependence example:
S1 PI = 3.14
S2 R = 5.0
S3 AREA = PI * R ** 2

Control dependence example:
S1 IF (T.EQ.0.0) GOTO S3
S2 A = A / T
S3 CONTINUE
Data Dependence Classification
• Dependence relations are similar to hardware hazards (which may cause pipeline stalls):
– True dependences (also called flow dependences), denoted S1 δ S2, are the same as RAW (Read After Write) hazards
– Anti dependences, denoted S1 δ⁻¹ S2, are the same as WAR (Write After Read) hazards
– Output dependences, denoted S1 δ° S2, are the same as WAW (Write After Write) hazards

True (flow) dependence S1 δ S2:
S1 X = …
S2 … = X

Anti dependence S1 δ⁻¹ S2:
S1 … = X
S2 X = …

Output dependence S1 δ° S2:
S1 X = …
S2 X = …
• When the data object is a data array indexed by multidimensional subscripts, dependence is more difficult to determine at compile time.
• Dependence analysis is the process of detecting all data dependences in a program.
Dependence Testing
• Data dependence testing for arrays is difficult: we must determine whether any two array references may access the same memory location.
• We consider subscripted references here.
Example:
DO i1 = L1, U1
 DO i2 = L2, U2
  ...
  DO in = Ln, Un
S1   A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2   ... = A(g1(i1,...,in),...,gm(i1,...,in))
  ENDDO
  ...
 ENDDO
ENDDO
Iteration Space
• The n-dimensional Cartesian space for n loops is called the iteration space.
• Example (next slide).
Iteration Space Example
• The iteration space of the statement at S1 is {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)}
• At iteration i = (2,1) the value of A(1,2) is assigned to A(2,1)

DO I = 1, 3
 DO J = 1, I
S1  A(I,J) = A(J,I)
 ENDDO
ENDDO

[Figure: the triangular iteration space plotted on the I and J axes, with points (1,1), (2,1), (2,2), (3,1), (3,2), (3,3)]
Dependence Equation (flow dependence)

DO I = 1, N
S1 A(I+1) = A(I)
ENDDO

is an instance of

DO I = 1, N
S1 A(f(I)) = A(g(I))
ENDDO

To prove flow dependence: for which values of α < β is f(α) = g(β)?
Here α + 1 = β has a solution (α − β = −1), so a flow dependence exists.
Dependence Equation (anti dependence)

DO I = 1, N
S1 A(I+1) = A(I)
ENDDO

is an instance of

DO I = 1, N
S1 A(f(I)) = A(g(I))
ENDDO

To prove anti dependence: for which values of α > β is f(α) = g(β)?
Here α + 1 = β has no solution with α > β, so there is no anti dependence.
Dependence Direction Vector
• Definition: suppose that there is a dependence from S1 on iteration i and S2 on iteration j; then the dependence direction vector D(i, j) is defined as
  D(i, j)k = "<" if d(i, j)k > 0
  D(i, j)k = "=" if d(i, j)k = 0
  D(i, j)k = ">" if d(i, j)k < 0
Distance Vector
• The distance vector d(i, j) is defined in terms of the iteration vectors: d(i, j)k = jk − ik.
Example:
DO I = 1, 3
 DO J = 1, I
S1  A(I+1,J) = A(I,J)
 ENDDO
ENDDO
The direction vector is (<, =) and the distance vector is (1, 0), as the sketch below also shows.
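An illustrative sketch (assumed helper names, not from the slides) that enumerates the iteration space of the example above and reports the flow-dependence distance and direction vectors; it prints distance (1, 0) and direction (<, =) for each dependence.

writes = {}                           # array element -> iteration that writes it
for I in range(1, 4):
    for J in range(1, I + 1):
        writes[(I + 1, J)] = (I, J)   # S1 writes A(I+1, J) at iteration (I, J)

def direction(d):
    return "<" if d > 0 else ("=" if d == 0 else ">")

for I in range(1, 4):                 # now consider the read A(I, J) in each iteration
    for J in range(1, I + 1):
        src = writes.get((I, J))
        if src is not None and src < (I, J):     # an earlier write reaches this read
            dist = (I - src[0], J - src[1])      # sink iteration minus source iteration
            dirs = tuple(direction(d) for d in dist)
            print(src, "->", (I, J), "distance", dist, "direction", dirs)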
Subscript Separability and Partitioning
• A subscript is a subscript position in an array reference. Subscript pairs fall into three categories:
1) ZIV (Zero Index Variable)
2) SIV (Single Index Variable)
3) MIV (Multiple Index Variable)
ZIV, SIV, and MIV
• ZIV (zero index variable) subscript pairs: neither subscript contains a loop index.
• SIV (single index variable) subscript pairs: the subscripts contain exactly one loop index.
• MIV (multiple index variable) subscript pairs: the subscripts contain more than one loop index.

DO I = …
 DO J = …
  DO K = …
S1   A(5,I+1,J) = A(N,I,K)
  ENDDO
 ENDDO
ENDDO

Here (5, N) is a ZIV pair, (I+1, I) is an SIV pair, and (J, K) is an MIV pair.
Subscript Separability
• A subscript is separable if its indices do not occur in any other subscript.
• If two different subscripts contain the same index, we say they are coupled.
Separability and Coupled Subscripts
• When testing multidimensional arrays, a subscript is separable if its indices do not occur in other subscripts.
• ZIV and SIV subscript pairs are separable.
• MIV pairs may give rise to coupled subscript groups.

DO I = …
 DO J = …
  DO K = …
S1   A(I,J,J) = A(I,J,K)
  ENDDO
 ENDDO
ENDDO

The SIV pair (I, I) is separable; the MIV pair (J, K) gives coupled indices, because J also appears in the second subscript.
Dependence Testing Overview
1. Partition the subscripts into separable and coupled groups.
2. Classify each subscript as ZIV, SIV, or MIV.
3. For each separable subscript, apply the applicable single-subscript test (ZIV, SIV, or MIV) to prove independence or produce direction vectors for dependence.
4. For each coupled group, apply a multiple-subscript test.
5. If any test yields independence, no further testing is needed.
6. Otherwise, merge all dependence vectors into one set.
ZIV Test
• The ZIV test compares two loop-invariant subscripts.
• If the expressions are proved unequal, no dependence can exist.

DO I = 1, N
S1 A(5) = A(6)
ENDDO

K = 10
DO I = 1, N
S1 A(5) = A(K)
ENDDO

K = 10
DO I = 1, N
 DO J = 1, N
S1  A(I,5) = A(I,K)
 ENDDO
ENDDO

In each case the ZIV pair proves independence, since 5 ≠ 6 and 5 ≠ K = 10.
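A trivial illustrative sketch of the ZIV test (the function name and inputs are assumptions): two loop-invariant subscript values are compared, and provably unequal values imply independence.

def ziv_test(e1, e2):
    """ZIV test on two loop-invariant (constant) subscript values."""
    return "independent" if e1 != e2 else "possible dependence"

print(ziv_test(5, 6))     # A(5) vs A(6)          -> independent
print(ziv_test(5, 10))    # A(5) vs A(K), K = 10  -> independent
print(ziv_test(5, 5))     # equal values          -> possible dependence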
Strong SIV Test
• Requires subscript pairs of the form aI+c1 and aI'+c2 for the def and use, respectively.
• The dependence equation aI+c1 = aI'+c2 has a solution if the dependence distance d = (c1 − c2)/a is an integer and |d| ≤ U − L, with L and U the loop bounds.

DO I = 1, N
S1 A(I+1) = A(I)
ENDDO

DO I = 1, N
S1 A(2*I+2) = A(2*I)
ENDDO

DO I = 1, N
S1 A(I+N) = A(I)
ENDDO
Strong SIV test, formally: when
• f(…) = a·ik + c1 and g(…) = a·ik + c2
• Plug in α, β and solve for dependence: β − α = (c1 − c2)/a
• A dependence exists from S1 to S2 if:
  – β − α is an integer
  – |β − α| ≤ Uk − Lk
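An illustrative sketch of the strong SIV test following the formula above (the function and variable names are assumptions); it returns the dependence distance when a dependence may exist, and None when independence is proved.

def strong_siv_test(a, c1, c2, L, U):
    """Strong SIV: def subscript a*I + c1, use subscript a*I' + c2.
    Dependence iff d = (c1 - c2)/a is an integer and |d| <= U - L."""
    if (c1 - c2) % a != 0:
        return None                          # non-integer distance: independent
    d = (c1 - c2) // a                       # dependence distance
    return d if abs(d) <= U - L else None

# DO I = 1, 100: A(I+1)   = A(I)    -> a=1, c1=1, c2=0: distance 1
print(strong_siv_test(1, 1, 0, 1, 100))
# DO I = 1, 100: A(2*I+2) = A(2*I)  -> a=2, c1=2, c2=0: distance 1
print(strong_siv_test(2, 2, 0, 1, 100))
# DO I = 1, 100: A(I+N) = A(I) with N=100 -> distance exceeds the loop range: independent
print(strong_siv_test(1, 100, 0, 1, 100))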
Weak-Zero SIV Test
• Requires subscript pairs of the form a1I+c1 and a2I'+c2 with either a1 = 0 or a2 = 0.
• If a2 = 0, the dependence equation I = (c2 − c1)/a1 has a solution if (c2 − c1)/a1 is an integer and L ≤ (c2 − c1)/a1 ≤ U.
• If a1 = 0, the case is similar.

DO I = 1, N
S1 A(I) = A(1)
ENDDO

DO I = 1, N
S1 A(2*I+2) = A(2)
ENDDO

DO I = 1, N
S1 A(1) = A(I) + A(1)
ENDDO
Weak-Zero SIV test, formally: when
• f(…) = a·ik + c1 and g(…) = c2
• Plug in α, β and solve for dependence: α = (c2 − c1)/a
• A dependence exists from S1 to S2 if:
  – α is an integer
  – Lk ≤ α ≤ Uk
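Similarly, an illustrative weak-zero SIV sketch for the case where the use has a constant subscript (a2 = 0); the names are assumptions.

def weak_zero_siv_test(a1, c1, c2, L, U):
    """Weak-zero SIV: def subscript a1*I + c1, use subscript c2 (a2 = 0).
    Dependence iff I = (c2 - c1)/a1 is an integer with L <= I <= U."""
    if (c2 - c1) % a1 != 0:
        return None                          # no iteration can touch the constant location
    i = (c2 - c1) // a1                      # the single conflicting iteration
    return i if L <= i <= U else None

# DO I = 1, 100: A(I) = A(1)       -> conflict only at I = 1
print(weak_zero_siv_test(1, 0, 1, 1, 100))
# DO I = 1, 100: A(2*I+2) = A(2)   -> candidate I = 0 lies outside the loop bounds
print(weak_zero_siv_test(2, 2, 2, 1, 100))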
MIV Tests
• Similar to the SIV tests, except that the two subscripts involve distinct loop indices (e.g., I and J).
Code Optimization and scheduling
• A compiler is software that translates source code into object code.
• Optimization demands effort from both the programmer and the compiler.
Scalar Optimization within Basic Blocks
• Two types of scheduling:
Static scheduling - the compiler performs the dependence analysis.
Dynamic scheduling - hardware or the operating system performs the dependence analysis at run time.
• Code scheduling methods ensure that control dependences, data dependences, and resource dependences are properly handled during concurrent execution.
Precedence Constraints
• A hardware scheduler is needed to direct the flow of instructions after a branch.
• A software scheduler is used to handle conditional execution statements.
• Program profiling can be used.
• If a flow dependence is detected, the write must precede the read operation.
• Care should be taken with output dependence problems.
Basic Blocks
• A basic block is a sequence of statements with the properties that:
(a) The flow of control can only enter the basic block through the first instruction in the block; that is, there are no jumps into the middle of the block.
(b) Control will leave the block without halting or branching, except possibly at the last instruction in the block.
Partitioning instructions into basic blocks
• INPUT: A sequence of three-address instructions.
• OUTPUT: A list of the basic blocks for that sequence in which each instruction is assigned to exactly one basic block
• METHOD: First, we determine those instructions in the intermediate code that are leaders, that is, the first instructions in some basic block.
• The rules for finding leaders are:
1. The first three-address instruction in the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a leader.
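An illustrative sketch of the leader rules applied to a list of three-address instructions; the string representation and the explicit jump-target list are assumptions made so the example stays self-contained.

def find_leaders(instrs, jump_targets):
    """instrs: list of instruction strings (0-indexed here).
    jump_targets[i]: index that instruction i may jump to, or None."""
    leaders = {0}                                  # rule 1: the first instruction
    for i, target in enumerate(jump_targets):
        if target is not None:
            leaders.add(target)                    # rule 2: any jump target
            if i + 1 < len(instrs):
                leaders.add(i + 1)                 # rule 3: instruction after a jump
    return sorted(leaders)

def basic_blocks(instrs, jump_targets):
    bounds = find_leaders(instrs, jump_targets) + [len(instrs)]
    return [instrs[bounds[k]:bounds[k + 1]] for k in range(len(bounds) - 1)]

# Tiny example: instruction 2 conditionally jumps to 6, instruction 5 jumps back to 2.
instrs  = ["sum=0", "i=1", "if i>n goto 6", "t1=...", "sum=...", "goto 2", "..."]
targets = [None,    None,  6,               None,     None,      2,        None]
for block in basic_blocks(instrs, targets):
    print(block)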
Example
   sum = 0
   do 10 i = 1, n
10 sum = sum + a[i]*a[i]
Three Address Code
1.  sum = 0              initialize sum
2.  i = 1                initialize loop counter
3.  if i > n goto 15     loop test, check for limit
4.  t1 = addr(a) – 4
5.  t2 = i * 4           a[i]
6.  t3 = t1[t2]
7.  t4 = addr(a) – 4
8.  t5 = i * 4           a[i]
9.  t6 = t4[t5]
10. t7 = t3 * t6         a[i]*a[i]
11. t8 = sum + t7
12. sum = t8             increment sum
13. i = i + 1            increment loop counter
14. goto 3
15. …
Control Flow Graph (CFG)

Block 1:
1. sum = 0
2. i = 1

Block 2:
3. if i > n goto 15   (T: go to block 4; F: fall through to block 3)

Block 3:
4.  t1 = addr(a) – 4
5.  t2 = i*4
6.  t3 = t1[t2]
7.  t4 = addr(a) – 4
8.  t5 = i*4
9.  t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3

Block 4:
15. …
Common Subexpression Elimination

Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) – 4
5.  t2 = i*4
6.  t3 = t1[t2]
7.  t4 = addr(a) – 4
8.  t5 = i*4
9.  t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …

After (t4, t5, t6 duplicate t1, t2, t3, so 7–10 are replaced by 10a):
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) – 4
5.  t2 = i*4
6.  t3 = t1[t2]
10a. t7 = t3*t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
Copy Propagation

Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) – 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …

After (the copy sum = t8 is replaced by 11a. sum = sum + t7; 11 and 12 become dead code):
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11. t8 = sum + t7
11a. sum = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
Invariant Code Motion

Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) – 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …

After (the loop-invariant computation of t1 is hoisted out of the loop as 2a):
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
3.  if i > n goto 15
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
Strength Reduction

Before:
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) – 4
3.  if i > n goto 15
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …

After (the multiplication at 5 is replaced by the initialization 2b and the additive update 11b):
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
3.  if i > n goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
13. i = i + 1
14. goto 3
15. …
Constant Propagation and Dead Code Elimination

Before:
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) – 4
2b. t2 = i * 4
2c. t9 = n * 4
3a. if t2 > t9 goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …

After (the constant i = 1 is propagated into 2b, giving 2d. t2 = 4; 2 and 2b become dead and are eliminated):
1.  sum = 0
2a. t1 = addr(a) - 4
2c. t9 = n * 4
2d. t2 = 4
3a. if t2 > t9 goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …
New Control Flow Graph

Block 1:
1. sum = 0
2. t1 = addr(a) - 4
3. t9 = n * 4
4. t2 = 4

Block 2:
5. if t2 > t9 goto 11   (T: go to block 4; F: fall through to block 3)

Block 3:
6.  t3 = t1[t2]
7.  t7 = t3 * t3
8.  sum = sum + t7
9.  t2 = t2 + 4
10. goto 5

Block 4:
11. …
Analysis and Optimizing Transformations
• Local optimizations – performed by local analysis of a basic block
• Global optimizations – requires analysis of statements outside a basic block
• Local optimizations are performed first, followed by global optimizations
Local Optimizations --- Optimizing Transformations of Basic Blocks
• Local common subexpression elimination
• Dead code elimination
• Copy propagation
• Renaming of compiler-generated temporaries to share storage
Example 1: Local Common Subexpression Elimination
1.  t1 = 4 * i
2.  t2 = a [ t1 ]
3.  t3 = 4 * i
4.  t4 = b [ t3 ]
5.  t5 = t2 * t4
6.  t6 = prod * t5
7.  prod = t6
8.  t7 = i + 1
9.  i = t7
10. if i <= 20 goto 1
Example 2: Local Dead Code Elimination

Before:
1. a = y + 2
2. z = x + w
3. x = y + 2
4. z = b + c
5. b = y + 2

After:
1'. a = y + 2
2'. x = a
3'. z = b + c
4'. b = a
Example 3: Local Constant Propagation

Before:
1. t1 = 1
2. a = t1
3. t2 = 1 + a
4. k = t2
5. t3 = cvttoreal(k)
6. t4 = 6.2 + t3
7. t3 = t4

After (assuming a, k, t3, and t4 are used beyond the basic block):
1'. a = 1
2'. k = 2
3'. t4 = 8.2
4'. t3 = 8.2
Global Optimizations --- Require Analysis Outside of Basic Blocks
• Global common subexpression elimination
• Dead code elimination
• Constant propagation
• Loop optimizations
  – Loop invariant code motion
  – Strength reduction
  – Induction variable elimination
Vectorization and Parallelization
• Vectorization: the process of converting scalar looping operations into equivalent vector instructions.
• Parallelization: converting sequential code into parallel code.
• Vectorizing compiler: a compiler that performs vectorization automatically.
• Vector hardware speeds up vector operations.
• Multiprocessors are used to speed up parallel code.
Vectorization Methods
• Use of temporary storage
• Loop interchange
• Loop distribution
• Vector reduction
• Node splitting
Vectorization Example:
Consider a scalar loop for addition:
   DO 20 I = 8, 120, 2
20 A(I) = B(I+3) + C(I+1)
After vectorization:
   A(8:120:2) = B(11:123:2) + C(9:121:2)
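For comparison, a sketch of the same transformation written with NumPy slices (this is an assumption for illustration; the slides use Fortran 90 array syntax). The arrays are sized so the 1-based Fortran indices carry over directly.

import numpy as np

A  = np.zeros(124)            # index 0 unused so Fortran-style 1-based indices line up
A2 = np.zeros(124)
B  = np.arange(124.0)
C  = np.arange(124.0)

# Scalar loop:  DO 20 I = 8, 120, 2 ; A(I) = B(I+3) + C(I+1)
for I in range(8, 121, 2):
    A[I] = B[I + 3] + C[I + 1]

# Vectorized form, mirroring A(8:120:2) = B(11:123:2) + C(9:121:2)
A2[8:121:2] = B[11:124:2] + C[9:122:2]

print(np.allclose(A, A2))     # True: both forms compute the same elements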
Use of temporary Storage
• Theme->In order to have pipelined execution, use temporary array to produce vector code.
Scalar loop:
   DO 20 I = 1, N
      A(I) = B(I) + C(I)
20    B(I) = 2*A(I+1)

Sequential execution order:
   A(1) = B(1) + C(1)
   B(1) = 2*A(2)
   A(2) = B(2) + C(2)
   B(2) = 2*A(3)
   …

By use of the temporary storage method we obtain:
   TEMP(1:N) = A(2:N+1)
   A(1:N) = B(1:N) + C(1:N)
   B(1:N) = 2*TEMP(1:N)
Loop Interchange
• Theme: exchange the inner loop with the outer loop to increase profitability (improvement in execution time).
Example:
   DO 20 I = 2, N
      DO 10 J = 2, N
S1       A(I,J) = (A(I,J-1) + A(I,J+1))/2
10    CONTINUE
20 CONTINUE
After Loop Interchange
   DO 20 J = 2, N
      DO 20 I = 2, N
S1       A(I,J) = (A(I,J-1) + A(I,J+1))/2
20 CONTINUE

Now the inner loop can be vectorized as follows:
   DO 20 J = 2, N
      A(2:N,J) = (A(2:N,J-1) + A(2:N,J+1))/2
20 CONTINUE
Loop Distribution
• Theme: distribute the loop and vectorize it.
Example:
   DO I = 1, N
S1    A(I+1) = B(I) + C
S2    D(I) = A(I) + E
   ENDDO

• transformed to:
   DO I = 1, N
S1    A(I+1) = B(I) + C
   ENDDO
   DO I = 1, N
S2    D(I) = A(I) + E
   ENDDO

• leads to:
S1 A(2:N+1) = B(1:N) + C
S2 D(1:N) = A(1:N) + E
Vector Reduction
• Theme: produces a scalar value from one or more data arrays.
• Examples: the sum, product, maximum, or minimum of all elements in a single array.
Example:
   DO 40 I = 1, N
      A(I) = B(I) + C(I)
      S = S + A(I)
      AMAX = MAX(AMAX, A(I))
40 CONTINUE

After Vector Reduction
   A(1:N) = B(1:N) + C(1:N)
   S = S + SUM(A(1:N))
   AMAX = MAX(AMAX, MAXVAL(A(1:N)))
where SUM and MAXVAL are vector (reduction) operations.
Node Splitting
• The data dependence cycle can sometimes be broken by node splitting.
Example - original loop:
   DO 50 I = 2, N
S1:   T(I) = A(I-1) + A(I+1)
S2:   A(I) = B(I) + C(I)
50 CONTINUE

After node splitting:
   DO 50 I = 2, N
S1a:  X(I) = A(I+1)
S2:   A(I) = B(I) + C(I)
S1b:  T(I) = A(I-1) + X(I)
50 CONTINUE

Thus the new loop structure can be vectorized as follows:
S1a: X(2:N) = A(3:N+1)
S2:  A(2:N) = B(2:N) + C(2:N)
S1b: T(2:N) = A(1:N-1) + X(2:N)
Vectorization Inhibitors
• Conditional statements which depend on runtime conditions
• Multiple loop entries
• Function or subroutine calls
• Recurrences and their variations
Code Parallelization
• A single program is divided into many threads for parallel execution in a multiprocessor environment.
• Each thread executes on a single processor.
• Parallelization is performed on the outer loop, provided there is no dependence across its iterations.
Example:
   DOALL 20 I = 2, N
      DO 10 J = 2, N
S1       A(I,J) = (A(I,J-1) + A(I,J+1))/2
      ENDDO
   ENDALL
Parallelization Inhibitors
• Loop dependences
• Function calls
• Multiple entries
Code Generation and Scheduling
• Directed Acyclic Graph (DAG)
• Register allocation
• List scheduling
• Cycle scheduling
• Trace scheduling compilation
DAG
• Constructed from a basic block.
• A basic block has no backtracking (no branches into its middle), so we can construct a DAG for it.
Example of Constructing the DAG

(1) t1 := 4 * i
   Step (1): create leaf nodes 4 and i0
   Step (2): create an interior node *
   Step (3): attach identifier t1 to the * node

(2) t2 := a[t1]
   Step (1): create nodes labeled [] and a0
   Step (2): find the previously created node for t1
   Step (3): attach label t2 to the [] node

(3) t3 := 4 * i
   Here we determine that node (4), node (i), and node (*) were already created, so we just attach t3 to the existing * node.

[Figure: the resulting DAG - node * (labeled t1, t3) over leaves 4 and i0; node [] (labeled t2) over a0 and the * node]
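A compact, illustrative sketch of DAG construction for a basic block, using value numbering so that a repeated expression such as 4 * i reuses the node that was already built; the quadruple representation and the names are assumptions.

def build_dag(block):
    """block: list of (dest, op, arg1, arg2) quadruples from one basic block."""
    nodes = {}      # (op, left_node, right_node) -> node id
    current = {}    # variable name -> node id holding its current value
    labels = {}     # node id -> identifiers attached to that node

    def operand(x):
        if x in current:                       # use the node that defines x
            return current[x]
        key = ("leaf", x, None)
        if key not in nodes:                   # otherwise create (or reuse) a leaf
            nodes[key] = len(nodes)
        return nodes[key]

    for dest, op, a1, a2 in block:
        key = (op, operand(a1), operand(a2))
        if key not in nodes:                   # reuse an identical existing node
            nodes[key] = len(nodes)
        n = nodes[key]
        current[dest] = n
        labels.setdefault(n, []).append(dest)
    return labels

block = [("t1", "*", "4", "i"),
         ("t2", "[]", "a", "t1"),
         ("t3", "*", "4", "i")]                # t3 attaches to the node built for t1
print(build_dag(block))                        # {2: ['t1', 't3'], 4: ['t2']}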
Compiler Design
[Figure: compiler structure - the source program passes through the front end to an intermediate language, and the back end performs scheduling and register allocation]
Why Register Allocation?
• Storing and accessing variables in registers is much faster than accessing data in memory.
• This is the way operations are performed in load/store (RISC) processors.
• Therefore, in the interests of performance, if not by necessity, variables ought to be stored in registers.
• For performance reasons, it is useful to keep variables in registers as long as possible, once they are loaded.
The Goal
• Primarily to assign registers to variables.
• However, the allocator runs out of registers quite often.
• It must decide which variables to "flush" out of registers to free them up, so that other variables can be brought in.
• This important indirect consequence of allocation is referred to as spilling.
Scheduling
• List scheduling
• Cycle scheduling
• Trace scheduling
Trace Scheduling
• Code compaction
• Compensation code
Thank you. All the best for exams!