
J.M. Zain et al. (Eds.): ICSECS 2011, Part II, CCIS 180, pp. 605–617, 2011. © Springer-Verlag Berlin Heidelberg 2011

The UTLEA: Uniformization of Non-uniform Iteration Spaces in Three-Level Perfect Nested Loops Using an Evolutionary Algorithm

Shabnam Mahjoub (1) and Shahriar Lotfi (2)

(1) Islamic Azad University-Shabestar Branch

(2) Computer Science Department, University of Tabriz

Abstract. The goal of uniformization, which is based on the concept of vector decomposition, is to find a basic dependence vector set such that any dependence vector in the iteration space can be expressed as a non-negative integer combination of these vectors. To approach an optimal solution, an approximate algorithm can be used. This paper presents the UTLEA, an evolutionary method for the uniformization of three-level perfect nested loops that minimizes both the number of basic vectors and the dependence cone size. Most available approaches cannot be used in three levels, and those that can be generalized to three levels have problems. The proposed approach tries to solve these problems, and according to the executed tests the achieved results are close to the optimal result.

Keywords: Uniform and Non-uniform Iteration Space, Vector Decomposition, Uniformization, Loop Parallelization, Evolutionary Algorithm.

1 Introduction

A challenging problem for parallelizing compilers is to detect maximum parallelism [8]. According to studies [12], most of the execution time of computational programs is spent in loops, so parallelizing compilers have focused on loop parallelism. To exploit a parallel architecture, a parallelizing compiler generates parallel code that preserves the dependence constraints of the program. The iterations of the loop can then be spread across processors, with different processors executing different iterations simultaneously. One of the simplest approaches to parallelism is the wavefront method, in which all iterations in the same wavefront are independent of each other and depend only on iterations in the previous wavefront [13], [18]. This shows the importance of uniformization. Dependence constraints between loop iterations are known as cross-iteration dependences. A loop with no cross-iteration dependences is known as a doall loop; its iterations can execute in any order, so it is simple to parallelize. A loop with cross-iteration dependences is known as a doacross loop and is much harder to parallelize [19], [20].


There are several methods to deal with nested loops. We might break the dependences and transform the loop into another loop that has no cross-iteration dependences. If that is not possible, the loop can still be executed in parallel, provided that proper synchronization is added to enforce the cross-iteration dependences. If all techniques fail, the doacross loop must be executed serially [4].

There are three major difficulties in parallelizing nested loops [21]. First, to insert correct synchronization, compilers or programmers have to find all cross-iteration dependences. So far, however, no dependence analysis method can efficiently identify all cross-iteration dependences unless the dependence pattern is uniform for all iterations. Second, even when all cross-iteration dependences can be identified, it is difficult to arrange synchronization primitives systematically, especially when the dependence pattern is irregular. Finally, the synchronization overhead, which can significantly degrade performance, should be minimized.

This paper presents a new evolutionary method, called the UTLEA, for the uniformization of the non-uniform iteration space of a three-level nested loop, and analyzes it after execution on different dependences. The rest of the paper is organized as follows: section 2 explains the problem, section 3 explains the basic concepts, section 4 reviews related work, section 5 presents the proposed method, and section 6 presents the evaluation and experimental results.

2 The Problem

In general, loops with cross-iteration dependences are divided into two groups. The first group is loops with static regular dependences, which can be analyzed at compile time; the second group is loops with dynamic irregular dependences. Loops of the second group cannot be parallelized at compile time for lack of sufficient information, so runtime support must be provided to execute them efficiently in parallel. The major job of parallelizing compilers is to parallelize loops of the first group. These loops are divided into two subgroups: loops with uniform dependences and loops with non-uniform dependences. The dependences are uniform when the patterns of the dependence vectors are uniform, in other words, when the dependence vectors can be expressed by constants or distance vectors. If the dependence vectors have irregular patterns and cannot be expressed by distance vectors, they are known as non-uniform dependences [10].

Parallelizing nested loops involves several stages: data dependence analysis, loop tiling [16], loop generation and loop scheduling [1], [6], [7], [17], [18]. Uniformization is performed in the data dependence analysis stage. The result of this stage is the set of dependence vectors between loop iterations, which may have a non-uniform pattern. To facilitate generating parallel code with basic vectors, this non-uniform space is changed into a uniform space. The goal is to reduce the size of the basic dependence vector set (BDVS) in the new space while keeping the dependence cone size (DCS) of the basic vectors minimal, that is, to find a small and simple set of uniform dependence vectors that covers all of the non-uniform dependences in the nested loop. The set of basic dependences is then added to every iteration to replace all original dependence vectors.


3 Background

This section presents the necessary basic concepts.

3.1 Dependence Analysis

The common way to compute data dependences is to solve a set of equations and inequalities subject to constraints given by the iteration boundaries. This yields two sets of solutions, each describing a dependence convex hull (DCH), and both are valid. In simple cases, one of these sets of solutions is sufficient. In complex cases, in which the dependence vectors are very irregular, both sets of solutions have to be used. The two DCHs are represented in one DCH, called the complete DCH (CDCH), and it has been proved that the CDCH includes complete information about the dependences [10].

3.2 Dependence Cone and Dependence Cone Size

For the dependence vector set D = {d_1, ..., d_m}, the dependence cone C(D) is defined as the set [4], [5]:

$$C(D) = \{\, x \in \mathbb{R}^n : x = \lambda_1 d_1 + \dots + \lambda_m d_m,\ \lambda_1, \dots, \lambda_m \ge 0 \,\} \qquad (1)$$

The DCS is defined as the area of the intersection of $d_1^2 + d_2^2 + \dots + d_n^2 = 1$ with the dependence cone C(D). In fact, the dependence cone is the smallest cone that includes all of the dependence vectors of the loop. In three-level space, the DCS is proportional to the volume enclosed between the basic dependence vectors and the sphere with r = 1.
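As an illustration of relation 1, the following minimal sketch checks whether a vector lies in the cone spanned by three linearly independent vectors in three-level space. It solves for the coefficients with Cramer's rule and exact fractions; the function names `triple`, `cone_coefficients` and `in_cone` are hypothetical and not part of the original method.

```python
from fractions import Fraction


def triple(a, b, c):
    """Scalar triple product a . (b x c): determinant of the matrix with columns a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def cone_coefficients(x, d1, d2, d3):
    """Return (l1, l2, l3) with x = l1*d1 + l2*d2 + l3*d3 (Cramer's rule, integer inputs)."""
    det = triple(d1, d2, d3)
    if det == 0:
        raise ValueError("d1, d2, d3 must be linearly independent")
    return (Fraction(triple(x, d2, d3), det),
            Fraction(triple(d1, x, d3), det),
            Fraction(triple(d1, d2, x), det))


def in_cone(x, d1, d2, d3):
    """True if x is a non-negative combination of d1, d2, d3, i.e. x lies in C(D)."""
    return all(lam >= 0 for lam in cone_coefficients(x, d1, d2, d3))


# (1, 2, 3) = 1*(1, 2, 1) + 0*(3, 2, 1) + 2*(0, 0, 1), so it lies in the cone.
print(cone_coefficients((1, 2, 3), (1, 2, 1), (3, 2, 1), (0, 0, 1)))
```

For more than three generating vectors this direct solve no longer applies, and a linear-programming or search-based test would be needed instead.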

3.3 Evolutionary Algorithm Overview

Darwin's theory of gradual evolution has been the source of inspiration for evolutionary algorithms. These algorithms are divided into five branches, of which the genetic algorithm is one. The use of the genetic algorithm [9] for optimization was proposed by Holland in 1975. As an optimization algorithm, it simulates natural evolution and rests on a solid mathematical theory. The optimization process is based on random changes to the individuals of a population and on selecting the best ones. The genetic algorithm searches different areas of the solution space efficiently by considering a set of points of the solution space in each computational iteration. Because many regions of the solution space are searched, this method, unlike single-direction methods, has little chance of converging to a local optimum. Another advantage of this algorithm is that it only needs the objective value at different points and does not use other information such as derivatives. Therefore, this algorithm can be used for various kinds of problems, such as linear, nonlinear, continuous and discrete problems.

In this algorithm, each chromosome represents a point in the solution space. In each iteration, all available chromosomes are decoded and their objective function values are obtained. Based on these values, a fitness is assigned to each chromosome. The fitness determines the selection probability of each chromosome; with this selection probability, a collection of chromosomes is selected, and new chromosomes are generated by applying genetic operators to them. These new chromosomes replace the previously generated chromosomes. Executing this algorithm requires four parameters: the number of generations, the initial population size, the crossover rate and the mutation rate.

4 Related Works

The first method is called naive decomposition [4]. Although it is a simple and clear method, a contradiction might exist in the parallel execution of iterations.

Tzen and Ni [21] proposed the dependence uniformization technique based on solving a system of Diophantine equations and a system of inequalities. In this method, the maximum and minimum dependence slopes are computed according to the dependence pattern for two-level iteration spaces. Then, by applying the idea of vector decomposition, a set of basic dependences is chosen to replace all original dependence constraints in every iteration so that the dependence pattern becomes uniform. But since one of the vectors (0, 1) or (0, -1) must be in the BDVS, the DCS remains large.

In the first method of Chen and Yew [3], the maximum number of basic vectors is three. In this method there are several unrestricted choices for the BDVS, and each choice has a different efficiency. Thus, this method needs a selection strategy that chooses a set which decreases the synchronization overhead and increases the parallelism. Their second method [4] therefore tries to improve the first one so that the DCS is close to the original DCS of the non-uniform dependences; in fact, this method was improved to minimize the DCS in two levels.

Chen and Shang [5] proposed three methods on the basis of three possible measures, where the goal of each method is to maximize its measure. These methods can be used for three-level spaces, but the direction of the dependences is not considered. Furthermore, the optimization of the DCS, which greatly affects parallelism, has not been studied.

In the method based on an evolutionary approach, a genetic algorithm for two-level spaces has been used to solve the problem [15]. Although the obtained results are not guaranteed to be optimal, they are very close to the optimal solution for two-level spaces.

In general, few studies have been done in the area of uniformization in recent years. Therefore, better optimization is still required.

5 The UTLEA Method

In this paper, the genetic algorithm is used as a search method for finding the basic vectors in the uniformization of non-uniform three-level iteration spaces. The different stages of the proposed method are presented below.


5.1 Coding

In the proposed method, every chromosome represents a BDVS. Since the nested loop is three-level, chromosomes are represented as an array in which every gene holds three components x, y and z that together indicate one vector in three-level space. An example of a chromosome is shown in Fig. 1. U1, U2 and U3 are the upper bounds of the three-level nested loop. Because the loop index variables are integers, the values of xi, yi and zi are integers. Also, yi is chosen between –s1 and s1 and zi between –s2 and s2 so that the value ranges are balanced with respect to loop execution.

x1 x2 x3
y1 y2 y3
z1 z2 z3

Fig. 1. Coding in the UTLEA method
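The coding can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the gene bounds 0 ≤ x ≤ U1, |y| ≤ ⌊U1/2⌋ and |z| ≤ ⌊U2/2⌋ as they appear to read in the paper's constraints, and an initial chromosome length of 5; the function names are hypothetical.

```python
import random


def gene_bounds(u1, u2):
    """Inclusive (low, high) ranges for the x, y and z components of a gene (assumed bounds)."""
    s1 = u1 // 2
    s2 = u2 // 2
    return (0, u1), (-s1, s1), (-s2, s2)


def random_gene(u1, u2, rng=random):
    """Draw one integer vector (x, y, z) inside the assumed bounds."""
    return tuple(rng.randint(low, high) for low, high in gene_bounds(u1, u2))


def random_chromosome(u1, u2, length=5, rng=random):
    """A chromosome is a list of genes; each gene is one candidate basic vector."""
    return [random_gene(u1, u2, rng) for _ in range(length)]


print(random_chromosome(15, 15, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps runs reproducible, which helps when comparing parameter settings.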

5.2 Objective and Fitness Function

Because this is a minimization problem while fitness is maximized, the fitness function is the inverse of the objective function. In general, three factors play a role in determining the fitness of every chromosome, and they are considered one by one.

Length of chromosomes. In this paper, the length of every chromosome is given by the function L(i), in which i is the chromosome number. Because the goal is to minimize the length of the BDVS, the term 1/L(i) is added to the fitness function. Since the length of a chromosome is never zero, L(i) is never zero and 1/L(i) is always defined. At the beginning of the algorithm, the length of all chromosomes is equal to 5. But because the optimal length is between n and 2n-1, chromosomes have variable lengths between 3 and 5 during the execution of the algorithm.

Computing DCS(i) of chromosomes. In this paper, the DCS is given by DCS(i), in which i is the chromosome number. Like the length term, a term 1/DCS(i) is added to the fitness function. Since DCS(i) can be zero, 1/DCS(i) could be undefined. To solve this problem, the term 1/(DCS(i)+1) is used instead of 1/DCS(i). Fig. 2 shows the dependence cone in three-level space.

$$0 \le x_i \le U_1$$

$$-s_1 \le y_i \le s_1, \quad s_1 = \lfloor U_1 / 2 \rfloor$$

$$-s_2 \le z_i \le s_2, \quad s_2 = \lfloor U_2 / 2 \rfloor$$

Fig. 2. DCS in three-level spaces (axes x, y and z; basic vectors v1, v2 and v3)

Clearly, changing the coordinate system does not change the DCS. Therefore, to compute this volume, the coordinate system can be changed so that one of the vectors coincides with a main axis. To do this, the vectors are rotated so that one of them coincides with the z axis. The rotation matrix used is the clockwise trigonometric matrix shown in relation 2. After the rotation, the volume can be computed using a triple integral. It is better to use spherical coordinates instead of Cartesian coordinates to determine the upper and lower bounds of this integral. After the rotation, ϕ for the vector that coincides with the z axis is equal to zero. Also, ρ can be taken as one for each of these vectors. Therefore, it is sufficient to calculate ϕ and θ for the two other vectors after the rotation. Finally, relation 3 gives the DCS for chromosomes with L(i)=3.

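$$R = \begin{bmatrix} \cos\alpha \cos\beta & -\sin\alpha & \cos\alpha \sin\beta \\ \sin\alpha \cos\beta & \cos\alpha & \sin\alpha \sin\beta \\ -\sin\beta & 0 & \cos\beta \end{bmatrix} \qquad (2)$$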

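$$\begin{cases} DCS(i) = \displaystyle\int_{\theta_1}^{\theta_2} \int_{0}^{f(\theta)} \int_{0}^{1} \rho^2 \sin\varphi \; d\rho \, d\varphi \, d\theta \\[2ex] f(\theta) = \dfrac{\Delta\varphi}{\Delta\theta} (\theta - \theta_1) + \varphi_1 \end{cases} \qquad (3)$$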

In relation 2, α is the angle between the projection of the vector onto the x-y plane and the positive direction of the x axis, and β is the angle between the vector and the positive direction of the z axis. In relation 3, f(θ) is a linear interpolation function. Although more complex functions [14] can be used to calculate the volume, this linear function is sufficient.

The above method can compute the DCS only for basic vector sets with L(i)=3. For longer chromosomes, another method can be used, or this method can be generalized to lengths greater than three. If all vectors lie in the convex space of three vectors in three-level space, this method can still be used. In general, however, it cannot be assumed in three-level space that all vectors lie in the convex space of three vectors. Suppose that every gene of a chromosome is given a numeric weight and that the result of the three-vector method is written as the function DCS3(a, b, c), in which a, b and c are the weights of the vectors. Then the following method can be used to compute the DCS for chromosomes with length 4 or 5.

Fig. 3. Computing DCS for chromosomes with L(i)>3

$$DCS(i) = DCS_3(1, 2, 3) + DCS_3(1, 3, 4), \quad \text{if } L(i) = 4 \qquad (4)$$

$$DCS(i) = DCS_3(1, 2, 3) + DCS_3(1, 3, 4) + DCS_3(1, 4, 5), \quad \text{if } L(i) = 5 \qquad (5)$$

To use the above method for lengths greater than three, all vectors in a chromosome must be sorted correctly. For this reason, all of the vectors are rotated until one vector coincides with the z axis. This vector is number 1. Then, for the other vectors, the angle between their projection onto the x-y plane and the negative direction of the x axis is computed and used as a weight to sort the vectors. The weight of each vector lies between 0 and 360 so that different vectors do not receive equal weights: for a vector with a negative y component, the absolute value of the computed angle is subtracted from 360.
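As a cross-check on the DCS computation, the sketch below computes the volume enclosed by three vectors and the unit sphere exactly. It does not use the paper's rotation and linear interpolation; it swaps in the Van Oosterom and Strackee solid-angle formula and divides the solid angle by 3, which is the volume of the corresponding unit-sphere sector. The fan sum follows relations 4 and 5 and assumes the vectors are already sorted as described above. The function names are hypothetical.

```python
import math


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def dcs3(a, b, c):
    """Volume between the cone of a, b, c and the unit sphere (solid angle / 3)."""
    na, nb, nc = _norm(a), _norm(b), _norm(c)
    numerator = abs(_dot(a, _cross(b, c)))
    denominator = (na * nb * nc + _dot(a, b) * nc
                   + _dot(a, c) * nb + _dot(b, c) * na)
    solid_angle = 2.0 * math.atan2(numerator, denominator)
    return solid_angle / 3.0


def dcs(sorted_vectors):
    """Fan sum DCS3(1, 2, 3) + DCS3(1, 3, 4) + ... over already sorted vectors."""
    first = sorted_vectors[0]
    return sum(dcs3(first, sorted_vectors[k], sorted_vectors[k + 1])
               for k in range(1, len(sorted_vectors) - 1))


# The positive octant covers one eighth of the unit ball: pi / 6.
print(dcs([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))
```

For the three coordinate axes this gives π/6, which agrees with the value 0.523 quoted for {(1, 0, 0), (0, 1, 0), (0, 0, 1)} in section 6.2. Values obtained with the linear interpolation of relation 3 may differ from this exact volume.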

Computing M(i) of chromosomes. In this paper, the number of dependence vectors that can be constructed from the BDVS of chromosome i is given by the function M(i). The value of M(i) for chromosome i is obtained by solving the Diophantine equation [2], [10], [11] in relation 6.

$$\alpha_1 (x_1, y_1, z_1) + \dots + \alpha_n (x_n, y_n, z_n) = (x, y, z) \qquad (6)$$


In relation 6, each α_j is an unknown of the equation that must be a non-negative integer, so that the basic vectors cause no problem in the execution of the loop. In fact, relation 6 shows whether or not the dependence vector (x, y, z) can be decomposed into the basic vectors of chromosome i. The fitness function is therefore computed as follows:

$$f(i) = \frac{w_1}{L(i)} + \frac{w_2}{DCS(i) + 1} + w_3 \times M(i) \qquad (7)$$

The weights w_j express the importance of each term of the function. Here w1=1, w2=3 and w3=2.
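Relations 6 and 7 can be sketched together as follows. This is a minimal illustration under stated assumptions: the decomposition test is a bounded brute-force search rather than a proper Diophantine solver, and the bound `max_coeff` and all function names are hypothetical.

```python
from itertools import product

W1, W2, W3 = 1, 3, 2  # weights given in the paper


def is_decomposable(target, basis, max_coeff=10):
    """Relation 6: search non-negative integer coefficients up to max_coeff (assumed bound)."""
    for coeffs in product(range(max_coeff + 1), repeat=len(basis)):
        combo = tuple(sum(a * v[k] for a, v in zip(coeffs, basis)) for k in range(3))
        if combo == tuple(target):
            return True
    return False


def count_decomposable(dependences, basis, max_coeff=10):
    """M(i): number of dependence vectors that the basic vectors can construct."""
    return sum(is_decomposable(d, basis, max_coeff) for d in dependences)


def fitness(length, dcs_value, m):
    """Relation 7: f(i) = w1/L(i) + w2/(DCS(i)+1) + w3*M(i)."""
    return W1 / length + W2 / (dcs_value + 1) + W3 * m


basis = [(2, 3, -3), (1, 2, 1), (1, 2, 0)]
dependences = [(t, 2 * t, 1) for t in range(1, 8)]
m = count_decomposable(dependences, basis)
print(m, fitness(len(basis), 0.0087, m))
```

The exhaustive search is only practical because chromosomes hold at most five vectors; a larger problem would call for a dedicated solver for linear Diophantine systems.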

5.3 Selection Operator

The selection operator applied in this minimization problem is the tournament selection operator.
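Tournament selection can be sketched as below. The sketch assumes that k distinct chromosomes are drawn uniformly at random and the fittest of them wins; the default k=4 follows the selection parameter listed in Table 2, and the function name is hypothetical.

```python
import random


def tournament_select(population, fitnesses, k=4, rng=random):
    """Pick k distinct candidates at random and return the one with the highest fitness."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda idx: fitnesses[idx])
    return population[best]


population = ["c1", "c2", "c3", "c4", "c5", "c6"]
fitnesses = [1.0, 5.0, 2.0, 3.0, 4.5, 0.5]
print(tournament_select(population, fitnesses, k=4, rng=random.Random(1)))
```

A larger k raises the selection pressure, because weak chromosomes are less likely to win a tournament.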

5.4 Crossover Operator

The crossover operator applied in this problem is a 2-point crossover with different cutting points. These points are shown in Fig. 4 as cpoint1 and cpoint2 for chromosome 2i-1 and as cpoint3 and cpoint4 for chromosome 2i. If the length of a chromosome after the crossover operation is greater than 5, the surplus genes of the chromosome are eliminated.

Fig. 4. Two-point crossover operation with different cutting points
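The crossover can be sketched as follows. This is an illustration under assumptions the paper does not spell out: the segments between the cutting points are exchanged, and surplus genes are dropped from the end of an offspring. The lower length limit of 3 is not enforced here, and the function name is hypothetical.

```python
def two_point_crossover(parent1, parent2, cuts1, cuts2, max_len=5):
    """Swap parent1[cpoint1:cpoint2] with parent2[cpoint3:cpoint4], then trim to max_len."""
    cpoint1, cpoint2 = cuts1
    cpoint3, cpoint4 = cuts2
    child1 = parent1[:cpoint1] + parent2[cpoint3:cpoint4] + parent1[cpoint2:]
    child2 = parent2[:cpoint3] + parent1[cpoint1:cpoint2] + parent2[cpoint4:]
    # Assumed rule: genes beyond position max_len are the ones eliminated.
    return child1[:max_len], child2[:max_len]


p1 = ["a1", "a2", "a3", "a4", "a5"]
p2 = ["b1", "b2", "b3", "b4", "b5"]
print(two_point_crossover(p1, p2, (1, 2), (1, 4)))
```

Because the two segments may have different lengths, the offspring lengths can differ from those of the parents, which is how chromosome lengths come to vary between 3 and 5.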

5.5 Mutation Operator

In the mutation operator applied in this problem, one gene of the chromosome is selected randomly and its value is replaced with another possible value. An example of the mutation operation is shown in Fig. 5.


Fig. 5. Mutation operation
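The mutation operator can be sketched as below. The sketch assumes that the replacement gene is drawn from the same bounds as in the coding (0 ≤ x ≤ U1, |y| ≤ ⌊U1/2⌋, |z| ≤ ⌊U2/2⌋, as the constraints appear to read in the paper); the function name is hypothetical.

```python
import random


def mutate(chromosome, u1, u2, rng=random):
    """Replace one randomly chosen gene with a new vector inside the assumed bounds."""
    s1, s2 = u1 // 2, u2 // 2
    index = rng.randrange(len(chromosome))
    new_gene = (rng.randint(0, u1), rng.randint(-s1, s1), rng.randint(-s2, s2))
    mutated = list(chromosome)  # copy so the parent chromosome stays unchanged
    mutated[index] = new_gene
    return mutated


original = [(1, 2, 1), (3, 2, 1), (0, 0, 1)]
print(mutate(original, 15, 15, rng=random.Random(2)))
```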

Applying these operators to the chromosomes may generate infeasible chromosomes. For this reason, a penalty technique is used.

$$f(i) = f(i) - 0.5 \, f(i), \quad \text{if } M(i) = 0 \qquad (8)$$

$$f(i) = f(i) - 0.4 \, f(i), \quad \text{if there is an invalid gene such as } (-1, 0, 0), \text{ etc.} \qquad (9)$$
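The two penalties can be sketched as follows. The paper gives (-1, 0, 0) only as an example of an invalid gene; this sketch assumes that a gene is invalid when it is not lexicographically positive, and that both penalties are applied one after the other when both conditions hold. Both points, and the function names, are assumptions.

```python
def is_invalid_gene(gene):
    """Assumed rule: a gene is invalid unless its first non-zero component is positive."""
    for component in gene:
        if component != 0:
            return component < 0
    return True  # the zero vector carries no dependence direction


def apply_penalties(f, m, genes):
    """Relations 8 and 9: reduce fitness by 50% if M(i)=0 and by 40% for an invalid gene."""
    if m == 0:
        f = f - 0.5 * f
    if any(is_invalid_gene(g) for g in genes):
        f = f - 0.4 * f
    return f


print(apply_penalties(10.0, 0, [(1, 2, 1), (-1, 0, 0)]))
```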

Also, an elitism operator is applied before the selection operator: in every generation, the best chromosome is selected and transferred directly to the intermediate generation.

6 Evaluation and Experimental Results

The UTLEA was implemented in VB6, and this section presents some results of its performance on various dependences.

6.1 Experimental Results

To verify the proposed method, many tests have been executed. In this section, the results of three tests are summarized in Table 1.

Table 1. The results of the UTLEA

Dependence vectors | DCS | Fitness | Result
Uniform iteration space with vectors (1, 2, 1), (1, 2, 3) and (3, 2, 1) | 0.114 | 21.025 | {(1, 2, 1), (3, 2, 1), (0, 0, 1)}
Uniform iteration space with vectors (1, 2, -2), (1, -2, 2) and (0, 1, 1) | 0.741 | 26.055 | {(1, 2, -2), (1, -2, 2), (0, 1, 1)}
Uniform iteration space with vectors (0, 1, -1), (0, 0, 1) and (1, 0, 0) or combinations of them | 0.718 | 82.088 | {(0, 1, -1), (0, 0, 1), (1, 0, 0)}

The stability of these tests is shown in fig. 6.

Fig. 6. Stability of the UTLEA (fitness versus execution number over 15 executions; series: test 1, test 2 and test 3)

To show the convergence of the UTLEA, these tests were executed again with the parameters summarized in Table 2, in which k is the selection parameter, pc is the crossover rate, pm is the mutation rate, init_pop is the initial population size and num_gene is the number of generations.

Table 2. The parameters of the tests

Tests | k | pc | pm | init_pop | num_gene | Result
Test 1 | 4 | 0.7 | 0.01 | 1000 | 1000 | {(1, 2, 1), (3, 2, 1), (0, 0, 1)}
Test 2 | 4 | 0.7 | 0.01 | 500 | 500 | {(1, 2, -2), (1, -2, 2), (0, 1, 1)}
Test 3 | 4 | 0.7 | 0.01 | 500 | 500 | {(0, 1, -1), (0, 0, 1), (1, 0, 0)}

The convergence of these tests is shown in fig. 7, 8 and 9 respectively.

Fig. 7. Convergence of test 1 (average fitness and best fitness versus population number)

Fig. 8. Convergence of test 2 (average fitness and best fitness versus population number)

Fig. 9. Convergence of test 3 (average fitness and best fitness versus population number)

In addition, other tests have been executed on a larger scale. For example, the algorithm was executed for 500 dependences of a uniform space with vectors {(1, 2, 1), (2, -1, 1), (2, 1, -1)}. The best result for this test is {(1, 2, 1), (2, -1, 1), (2, 1, -1)} with DCS=0.267 and fitness=1000.699.

6.2 Comparison with Other Methods

The comparison with other methods is summarized in Table 3. As an example, the last test (test 4), whose code is given in this section, was executed to compare the proposed method with other methods. For this test, Chen and Shang's method [5] gives {(1, 0, 0), (0, 1, 0), (0, 0, 1)} with DCS=0.523 according to basic area 1, {(1, 2, 0), (-2, -4, 1)} with DCS=0 according to basic area 2, where the x component is negative, and {(1, 2, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)} with DCS=0.523 according to basic area 3. The UTLEA obtains {(2, 3, -3), (1, 2, 1), (1, 2, 0)} with DCS=0.0087. The result of the UTLEA for this test is also better than that of Chen and Yew's method [3].


For i=1 To 15
  For j=1 To 15
    For k=1 To 15
      A(3i+j-1, 4i+3j-3, k+1)=…
      …=A(i+1, j+1, k)
    End For
  End For
End For
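The dependence pattern of test 4 can be made visible with a short enumeration. The sketch below assumes that the written element A(3i+j-1, 4i+3j-3, k+1) and the read element A(i'+1, j'+1, k') refer to the same array location when their indices are equal, and it collects the distance vectors (i'-i, j'-j, k'-k) inside the 15 × 15 × 15 iteration space. The function name is hypothetical.

```python
def dependence_distances(n=15):
    """Distinct distance vectors between a write and the later read of the same element."""
    distances = set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            i2 = 3 * i + j - 2      # from 3i + j - 1 = i2 + 1
            j2 = 4 * i + 3 * j - 4  # from 4i + 3j - 3 = j2 + 1
            if 1 <= i2 <= n and 1 <= j2 <= n:
                # k2 = k + 1 holds for every k from 1 to n - 1, so the k distance is 1.
                distances.add((i2 - i, j2 - j, 1))
    return sorted(distances)


print(dependence_distances())
```

Every distance has the form (t, 2t, 1) with t = 2i + j - 2, so the pattern is non-uniform. Each of these vectors equals (1, 2, 1) + (t - 1)(1, 2, 0), which is consistent with the basic vectors (1, 2, 1) and (1, 2, 0) in the UTLEA result reported above.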

Table 3. Comparison of the UTLEA method with other methods

Uniformization methods | The number of basic vectors | Considering the direction vectors | Other descriptions
Chen and Shang | Unknown (large in most cases) | × | The DCS remains large
Tzen and Ni | Unusable in three-level spaces | × | -
Chen and Yew | Large most of the time in three-level spaces (5) | | There is at least one main vector in three-level spaces
The method based on evolutionary approach | Unusable in three-level spaces | | -
The UTLEA | Between 3 and 5 | | The DCS is small and the number of basic vectors is optimal

7 Conclusion and Future Works

This paper presented the UTLEA, a dependence uniformization method that uses an evolutionary approach for three-level non-uniform iteration spaces. Most of the previous methods can be used only in two-level spaces. Some of the methods used in three levels pay no attention to the direction of the vectors; in other words, the resulting basic vectors are not acceptable with respect to the execution of the loops. Also, the uniformization algorithm based on an evolutionary approach that was presented for two-level spaces is not applicable in three-level spaces, since the dependence cone size in three-level spaces differs from that in two-level spaces. Therefore, this paper has tried to solve the defects of the previous methods and to perform a correct uniformization for three-level spaces.

As future work, a method based on an evolutionary approach for the uniformization of two and three levels together is suggested, and the genetic parameters used in the proposed method can be improved.


References

1. Andronikos, T., Kalathas, M., Ciorba, F.M., Theodoropoulos, P., Papakonstantinou, G.: An Efficient Scheduling of Uniform Dependence Loops. Department of Electrical and Computer Engineering National Technical University of Athens (2003)

2. Banerjee, U.: Dependence Analysis for Supercomputing. 101 Philip Drive, Assinippi Park, Norwell, 02061. Kluwer Academic, Massachusetts (1988)

3. Chen, D., Yew, P.: A Scheme for Effective Execution of Irregular DOACROSS Loops. In: Int’l Conference on Parallel Processing (1993)

4. Chen, D.K., Yew, P.C.: On Effective Execution of Non-uniform DOACROSS Loops. IEEE Trans. On Parallel and Distributed Systems (1995)

5. Chen, Z., Shang, W., Hodzic, E.: On Uniformization of Affine Dependence Algorithms. In: 4th IEEE Conference on Parallel and Distributed Processing (1996)

6. Darte, A., Robert, Y.: Affine-By-Statement Scheduling of Uniform Loop Nests Over Parametric Domains. J. Parallel and Distributed Computing (1995)

7. Darte, A., Robert, Y.: Constructive Methods for Scheduling Uniform Loop Nests. IEEE Trans. Parallel Distribut. System (1994)

8. Engelmann, R., Hoeflinger, J.: Parallelizing and Vectorizing Compilers. Proceedings of the IEEE (2002)

9. Goldberg, D.E.: Genetic Algorithm in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)

10. Ju, J., Chaudhary, V.: Unique Sets Oriented Parallelization of Loops with Non-uniform Dependence. In: Proceedings of International Conference on Parallel Processing (1997)

11. Kryryi, S.L.: Algorithms for Solving of Linear Diophantine Equations in Integer Domains. Cybernetics and Systems Analysis, 3–17 (2006)

12. Kuck, D., Sameh, A., Cytron, R., Polychronopoulos, A., Lee, G., McDaniel, T., Leasure, B., Beckman, C., Davies, J., Kruskal, C.: The Effects of Program Restructuring, Algorithm Change and Architecture Choice on Program Performance. In: Proceedings of the 1984 International Conference on Parallel Processing (1984)

13. Lamport, L.: The Parallel Execution of DO Loops. Comm. ACM 17(2), 83–93 (1974)

14. Murphy, J., Ridout, D., Mcshane, B.: Numerical Analysis Algorithms and Computation. Ellis Horwood, New York (1995)

15. Nobahari, S.: Uniformization of Non-uniform Iteration Spaces in Loop Parallelization Using an Evolutionary Approach. M. Sc. Thesis, Department of Computer Engineering (2009) (in Persian)

16. Parsa, S.: A New Genetic Algorithm for Loop Tiling. The Journal of Supercomputing, 249–269 (2006)

17. Parsa, S., Lotfi, S.: Parallel Loop Generation and Scheduling. The Journal of Supercomputing (2009)

18. Parsa, S., Lotfi, S.: Wave-front Parallelization and Scheduling. In: 4th IEEE International Conference on Parallel Processing, pp. 382–386 (2007)

19. Sawaya, R.: A Study of Loop Nest Structures and Locality in Scientific Programs. Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto (1998)

20. Tzen, T.H.: Advance Loop Parallelization: Dependence Uniformization and Trapezoid Self Scheduling. Ph. D. thesis, Michigan State University (1992)

21. Tzen, T.H., Ni, L.: Dependence Uniformization: A loop Parallelization Technique. IEEE Trans. on Parallel and Distributed Systems. 4(5), 547–558 (1993)