
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-34, NO. 10, OCTOBER 1985 949

On the Effective Bandwidth of Interleaved Memories in Vector Processor Systems

WILFRIED OED AND OTTO LANGE

Abstract—Memory interleaving and multiple access ports are the key to a high memory bandwidth in vector processor systems. Each of the active ports supports an independent access stream to memory among which access conflicts may arise. Such conflicts lead to a decrease in memory bandwidth.

We present some analytical results for the calculation of the resulting effective bandwidth for one and two access streams to a memory system in a vector processor. In particular, conditions for conflict-free access are given together with some conflicting cases that should be avoided. Finally, examples of measurements on a Cray X-MP and corresponding simulations are presented.

Index Terms—Barrier-situation, conflict-free access, interleaved memories, linked conflict, memory access in vector mode.

I. INTRODUCTION

A variety of analytical models concerning the access to parallel memories has been developed in the past (see, for example, [1]-[5]). Very little, however, is known about interleaved memory systems in vector processors. Of special interest is an analysis of the effective bandwidth that can be expected in systems like the Cray X-MP [6] or the Fujitsu VP-100/VP-200 [7], which, because of multiple ports to memory, allow the concurrent operation of multiple access streams.

Some insight into the memory system of the Cray X-MP was gained by simulations undertaken by Cheung and Smith [8]. We have done similar simulations along with corresponding measurements on the 2-processor, 16-bank Cray X-MP installed at the Central Institute for Applied Mathematics at the Nuclear Research Center Juelich, Federal Republic of Germany. Furthermore, we obtained some analytical results for one and two access streams. The conditions for achieving the maximum bandwidth, i.e., conflict-freeness, and some conflicting situations are presented.

Manuscript received February 1, 1985; revised May 30, 1985. A preliminary version of this paper was presented at the IEEE 1985 International Conference on Parallel Processing, St. Charles, IL, Aug. 1985.

W. Oed was with the Zentralinstitut für Angewandte Mathematik, Kernforschungsanlage Jülich GmbH, 5170 Jülich, West Germany. He is now with Cray Research GmbH, Perkhamerstr. 31, 8000 München 21, West Germany.

O. Lange is with the Allgemeine Elektrotechnik und Datenverarbeitungssysteme, Rheinisch-Westfälische Technische Hochschule Aachen, 5100 Aachen, West Germany.

II. CHARACTERISTICS OF A MEMORY SYSTEM IN A VECTOR PROCESSOR

Usually, a memory system of a vector processor is $m$-way interleaved, where the addresses are cyclically distributed over the $m$ banks, $j = (i \bmod m)$, where $j = 0, 1, \dots, m - 1$ is the address of the bank, and $i = 0, 1, 2, \dots$ the address of the storage cell being referenced. In this context we are only interested in the address $j$ of the bank since no further reference to any storage cell residing in a just-referenced bank is possible for the next $n_c$ clock periods. By $n_c$ we denote that multiple of the clock period $\tau$ which accounts for the bank cycle time $t_c$, i.e., $t_c = n_c \tau$. A bank is said to be active while servicing a request.

The memory may be accessed via $p$ concurrently operating ports, each requesting access to a memory location every clock period. A port also has the capability of delaying an access request if it cannot be serviced because of some kind of conflict. We are assuming dynamic conflict resolution, i.e., access requests that could not be satisfied will be delayed for one clock period within their corresponding ports, along with all subsequent access requests of that port. In the next clock period all active ports compete again for access.

Furthermore, the memory may be divided into $s$ sections where $s \le m$. The purpose of sections is to reduce the number of access paths to memory. If a request to memory is granted, the corresponding path is occupied for one clock period. We assume $s \mid m$ and a cyclical distribution of the banks over the sections, i.e., $k = (j \bmod s)$ where $k$ is the address of the section and $j$ is the address of the bank.
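The two cyclic mappings described above can be sketched in a few lines of Python. The function names `bank_of` and `section_of` are illustrative and do not come from the paper; the example values are those of the four-bank, two-section layout of Fig. 1.

```python
def bank_of(i: int, m: int) -> int:
    """Bank address j = i mod m of storage cell i in an m-way interleaved memory."""
    return i % m


def section_of(j: int, s: int) -> int:
    """Section address k = j mod s of bank j when the banks are spread over s sections."""
    return j % s


if __name__ == "__main__":
    m, s = 4, 2  # layout of Fig. 1; s is assumed to divide m
    assert m % s == 0
    for i in range(8):
        j = bank_of(i, m)
        print(f"cell {i} -> bank {j} -> section {section_of(j, s)}")
```

With these values, banks 0 and 2 fall into section 0 and banks 1 and 3 into section 1.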

Fig. 1 depicts a four-way interleaved memory system with two sections and two access paths from each of two CPU's. For example, the memory system of the Cray X-MP [6] is basically designed in this way.

When such a memory is accessed the following three types of access conflicts may be encountered.

1) A bank conflict occurs if access to an active bank is requested; the access request is postponed.

2) A simultaneous bank conflict occurs if two or more ports using different access paths request access to the same inactive bank; a priority rule determines which port will be able to proceed and which ports must wait.

3) A section conflict occurs if two or more ports request access to inactive banks within the same section and would have to use the same access path; a priority rule determines which port will be able to proceed and which ones must wait.

0018-9340/85/1000-0949$01.00 © 1985 IEEE


[Fig. 1 (block diagram): the registers of CPU 0 feed ports 1 and 2 and the registers of CPU 1 feed ports 3 and 4, each pair through the interconnection network of its CPU. From each CPU, path 0 leads to section 0 (banks 0 and 2) and path 1 leads to section 1 (banks 1 and 3).]

Fig. 1. Four-way interleaved memory system with two sections and two access paths from each of two CPU's.

In the memory system depicted in Fig. 1 a simultaneous bank conflict can only occur among ports of different CPU's, while a section conflict can only occur among ports within a CPU.
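As a rough illustration of the three conflict types and of the remark above, the following Python sketch classifies what happens to a second request when a first request has priority in the same clock period. The encoding of a request as a `(cpu, bank)` pair, the function name `classify`, and the rule that a path is identified by `(cpu, bank mod s)` are assumptions made for this sketch, modeled on Fig. 1.

```python
from typing import Optional, Set, Tuple

Request = Tuple[int, int]  # (cpu, bank): hypothetical encoding of one port's request


def classify(first: Request, second: Request, active: Set[int], s: int) -> Optional[str]:
    """Conflict met by `second` when `first` has priority in the same clock period.

    Assumption of this sketch: each CPU owns exactly one access path per section,
    as in Fig. 1, so a path is identified by (cpu, bank mod s).
    """
    cpu_a, bank_a = first
    cpu_b, bank_b = second
    if bank_b in active:
        return "bank conflict"
    if bank_a in active:
        return None  # `first` is postponed itself and does not block `second`
    if cpu_a == cpu_b and bank_a % s == bank_b % s:
        return "section conflict"  # same CPU, same section: the path is shared
    if bank_a == bank_b:
        return "simultaneous bank conflict"  # different paths, same inactive bank
    return None


if __name__ == "__main__":
    print(classify((0, 0), (1, 0), set(), 2))  # ports of different CPUs, same bank
    print(classify((0, 0), (0, 2), set(), 2))  # ports of one CPU, same section
```

Under these assumptions a simultaneous bank conflict is only ever reported for requests from different CPUs, and a section conflict only for requests from the same CPU.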

The maximum bandwidth $b_w$ of a memory system is given by the number of ports available, i.e., $b_w = p$. The effective bandwidth $b_{\mathrm{eff}}$ is the average number of data transferred per clock period, with $b_{\mathrm{eff}} \le b_w$; $b_{\mathrm{eff}}$ is equal to $b_w$ if there are no access conflicts and all ports are busy.

III. ANALYTIC MODELING OF ACCESS STREAMS

In our analysis we only consider access requests in vector mode, i.e., due to a single vector memory instruction (load or store) a port is activated in order to transfer $n$ data, whose memory addresses are equally spaced. Equal spacing means that the $i$th port issues its first request to some start bank $B_i$ with address $b_i \in \{0, 1, \dots, m - 1\}$, and subsequently steps through memory with some distance with modulus $d_i$, $d_i \in \{0, 1, 2, \dots, m - 1\}$. The $(k + 1)$th access request of the $i$th port is to the bank with address $(b_i + k d_i) \bmod m$. If there are no conflicts, an active port supports an access stream, which makes a new request to memory every clock period.
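The bank sequence of an equally spaced access stream can be generated directly from this definition. The sketch below is only an illustration; the name `bank_sequence` and the example values (16 banks, start bank 3, distance 6) are chosen here and do not appear in the paper.

```python
from itertools import islice
from typing import Iterator


def bank_sequence(b: int, d: int, m: int) -> Iterator[int]:
    """Yield the bank addresses (b + k*d) mod m for k = 0, 1, 2, ..."""
    k = 0
    while True:
        yield (b + k * d) % m
        k += 1


if __name__ == "__main__":
    # Hypothetical example: 16 banks, start bank 3, distance 6.
    print(list(islice(bank_sequence(3, 6, 16), 10)))
    # -> [3, 9, 15, 5, 11, 1, 7, 13, 3, 9]
```

In this example the stream is back at its start bank after eight accesses.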

In order to keep our analytical model simple, we make the following two assumptions.

1) An access stream is infinitely long.

2) All active access streams begin simultaneously.

The first assumption is made because the possible memory states are finite, and some cyclic state will be reached. Neglecting startup times, we compute the effective bandwidth for the cyclic state.

The second assumption is not a severe restriction because a relative position in time can be transformed to a relative position in space, i.e., the relative position of start banks.

We will characterize an access stream by the address $b_i$ of its start bank $B_i$, its distance $d_i$, its return number $r_i$, and its access set $Z_i$.

Theorem 1: The return number is the number of accesses that are made before an access to the same bank is requested again and is given by

$r_i = m/\gcd(m, d_i)$. (1)

Proof: Since there are only $m$ distinct integers modulo $m$, there are repetitions in the sequence $(b_i + j d_i)$, $j = 0, 1, 2, \dots$, i.e.,

$(b_i + j d_i) = (b_i + k d_i) \bmod m$, $j \ne k$. (2)

Thus, it follows that

$(j - k) d_i = 0 = hm \bmod m$. (3)

The minimum difference $j - k$ satisfying (3) is the return number $r_i$. Let $d_i' = d_i/\gcd(m, d_i)$ and $m' = m/\gcd(m, d_i)$; then $d_i'$ and $m'$ are relatively prime, and there is an $h'$ for which (3) becomes

$(j - k) = h' m'/d_i'$. (4)

The value of $h'$ in (4) must be chosen such that $(j - k)$ becomes minimal, which is the case for $h' = d_i'$, and we obtain $r_i = m' = m/\gcd(m, d_i)$. ∎

The access set $Z_i$ contains the $r_i$ different addresses of the banks that the $i$th access stream visits in the course of its activity.

A. One Access Stream

If there is just one active access stream, the analysis is fairly simple because neither simultaneous bank conflicts nor section conflicts can occur. The only possible type of conflict is the bank conflict, which, if it should occur, will always occur at the start bank $B_1$.

The first $r_1$ access requests are to $r_1$ different banks and are conflict free. The $(r_1 + 1)$th access request is again to the start bank.

If $r_1 < n_c$, a bank conflict occurs, and all remaining access requests of the access stream are delayed until the bank becomes available; this occurs after $n_c - r_1$ clock periods. There are again $r_1$ conflict-free access requests because all remaining access requests have already been delayed. However, on return to the start bank the initial situation is encountered again.

Thus, the effective bandwidth for one access stream is $b_{\mathrm{eff}} = b_w = 1$ for $r_1 \ge n_c$ and $b_{\mathrm{eff}} = r_1/n_c < 1$ for $r_1 < n_c$, since $r_1$ access requests are serviced within $r_1 + (n_c - r_1) = n_c$ clock periods in that case.
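The single-stream result can be cross-checked numerically. The following Python sketch computes the return number of (1), the access set, and the closed-form effective bandwidth, and compares the latter with a simple clock-by-clock count. All function names are illustrative, and the simulation loop is a minimal model written for this check (a granted bank stays active for $n_c$ clock periods), not the simulator used by the authors.

```python
from math import gcd


def return_number(m: int, d: int) -> int:
    """r = m / gcd(m, d), Eq. (1)."""
    return m // gcd(m, d)


def access_set(m: int, b: int, d: int) -> set:
    """The r distinct banks visited by a stream with start bank b and distance d."""
    return {(b + k * d) % m for k in range(return_number(m, d))}


def b_eff_one_stream(m: int, d: int, n_c: int) -> float:
    """Closed form from the text: 1 if r >= n_c, else r / n_c."""
    r = return_number(m, d)
    return 1.0 if r >= n_c else r / n_c


def simulate_one_stream(m: int, b: int, d: int, n_c: int, clocks: int) -> float:
    """Granted requests per clock period for a single stream."""
    free_at = [0] * m  # first clock period in which each bank is inactive again
    granted = 0
    for t in range(clocks):
        bank = (b + granted * d) % m
        if free_at[bank] <= t:  # otherwise: bank conflict, the request is repeated
            free_at[bank] = t + n_c
            granted += 1
    return granted / clocks


if __name__ == "__main__":
    m, n_c = 16, 4  # hypothetical configuration
    for d in (1, 6, 8):
        print(d, return_number(m, d), b_eff_one_stream(m, d, n_c),
              simulate_one_stream(m, 0, d, n_c, 400))
```

For the hypothetical configuration above, a distance of 8 gives a return number of 2 and therefore half the maximum bandwidth, while distances 1 and 6 run conflict free.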

B. Two Access Streams

We will separately analyze the case where two access streams come from two different CPU's and the case where both access streams come from the same CPU (Fig. 1). In the first case no section conflicts can occur because access paths are not a bottleneck, i.e., $s = m$. In the second case no simultaneous bank conflicts are possible because the bank concerned is within a section to which just one access path exists, i.e., $s < m$. That case will be treated as a section conflict.

In the following we assume that no access stream will encounter self-conflicts, i.e., $r_1, r_2 \ge n_c$.

Equal Number of Sections and Banks: The most interesting of all possible pairs of access streams certainly are the ones that do not encounter any conflicts; these yield an effective bandwidth of $b_{\mathrm{eff}} = b_w = 2$.

An obvious conflict-free case results if $Z_1 \cap Z_2 = \emptyset$, i.e., the two access sets are disjoint.

Theorem 2: Disjoint access sets can be achieved if and only if

$\gcd(m, d_1, d_2) > 1$. (5)

Proof: None of the access streams visits all banks, thus $r_1, r_2 < m$. Let $f = \gcd(m, d_1, d_2) > 1$ and let $f_1 = \gcd(m, d_1) \ge f$ and $f_2 = \gcd(m, d_2) \ge f$; then $r_1 = m/f_1 < m$ and $r_2 = m/f_2 < m$. Furthermore, the following two sets are disjoint:

$\{\, i \mid \exists k\ (k = 0, 1, \dots, m/f) : i = fk \bmod m \,\}$

$\{\, j \mid \exists k\ (k = 0, 1, \dots, m/f) : j = fk + 1 \bmod m \,\}$.

If the addresses of the start banks of the two access streams are consecutive, i.e., $b_2 = b_1 + 1$, then the two access sets are subsets of the two sets above and also disjoint because $f \mid d_1$ and $f \mid d_2$.

We now have to show that if $\gcd(m, d_1, d_2) = 1$, disjoint access sets cannot be found. Since we assume disjoint access sets, we are only interested in the elements contained in $Z_1$ and $Z_2$. The distances $d_1' = \gcd(m, d_1)$ and $d_2' = \gcd(m, d_2)$ produce the same elements as $d_1$ and $d_2$. Since $\gcd(m, d_1, d_2) = 1$, $d_1'$ and $d_2'$ are relatively prime. With the Euclidean algorithm [9] we then may find integers $k_1$ and $k_2$ such that

$1 = k_1 d_1 - k_2 d_2 \bmod m$. (6)

Setting $b_1 = 0$ and $b_2 \ne 0$, we multiply (6) by $b_2$:

$b_2 k_1 d_1 = b_2 + b_2 k_2 d_2 \bmod m$. (7)

The left-hand side in (7) is an element of $Z_1$ while the right-hand side is an element of $Z_2$, and thus the two access sets are not disjoint. ∎
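Theorem 2 lends itself to an exhaustive check on small memories. The Python sketch below searches all relative start banks for a pair of disjoint access sets and compares the outcome with condition (5). The function names and the range of $m$ that is searched are choices made for this illustration.

```python
from math import gcd


def access_set(m: int, b: int, d: int) -> set:
    """Banks visited by a stream with start bank b and distance d."""
    r = m // gcd(m, d)  # return number, Eq. (1)
    return {(b + k * d) % m for k in range(r)}


def disjoint_start_exists(m: int, d1: int, d2: int) -> bool:
    """True if some start bank b2 (relative to b1 = 0) gives disjoint access sets."""
    z1 = access_set(m, 0, d1)
    return any(z1.isdisjoint(access_set(m, b2, d2)) for b2 in range(m))


def theorem2(m: int, d1: int, d2: int) -> bool:
    """Condition (5): gcd(m, d1, d2) > 1."""
    return gcd(gcd(m, d1), d2) > 1


if __name__ == "__main__":
    mismatches = [(m, d1, d2)
                  for m in range(2, 13)
                  for d1 in range(m)
                  for d2 in range(m)
                  if disjoint_start_exists(m, d1, d2) != theorem2(m, d1, d2)]
    print("mismatches:", mismatches)
```

Only the relative start position matters, which is why the search fixes $b_1 = 0$.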

We now show under what conditions two access streams with nondisjoint access sets are conflict free. Since an accessed bank is occupied for $n_c$ clock periods, the access streams are conflict free if

$\forall k: \{b_1 + k d_1,\ b_1 + (k + 1) d_1,\ \dots,\ b_1 + (k + n_c - 1) d_1\}^* \cap \{b_2 + k d_2,\ b_2 + (k + 1) d_2,\ \dots,\ b_2 + (k + n_c - 1) d_2\}^* = \emptyset$ (8)

where $\{\}^*$ denotes that the elements of the sets are calculated modulo $m$.
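Condition (8) can be evaluated directly by sliding a window of $n_c$ consecutive accesses along both streams. The following Python sketch does so; the function name `conflict_free` is illustrative, and the example parameters ($m = 12$, $n_c = 3$, distances 1 and 7, start banks three apart) are picked here for demonstration.

```python
def conflict_free(m: int, n_c: int, b1: int, d1: int, b2: int, d2: int) -> bool:
    """Check Eq. (8): for every k the two windows of n_c accesses are disjoint."""
    for k in range(m):  # both bank sequences repeat with a period dividing m
        w1 = {(b1 + (k + i) * d1) % m for i in range(n_c)}
        w2 = {(b2 + (k + i) * d2) % m for i in range(n_c)}
        if w1 & w2:
            return False
    return True


if __name__ == "__main__":
    print(conflict_free(12, 3, 0, 1, 3, 7))  # start banks three apart
    print(conflict_free(12, 3, 0, 1, 0, 7))  # both streams start at bank 0
    print(any(conflict_free(12, 3, 0, 1, b2, 2) for b2 in range(12)))
```

The check treats both streams as undelayed, which is exactly the situation (8) describes; it says nothing about how delayed streams evolve.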

However, only the relative, rather than the absolute, positions of the access streams are important. Then, by arbitrarily setting $b_1 = 0$ and subtracting $k d_1 \bmod m$ from (8), we obtain

$\forall k: \{0,\ d_1,\ \dots,\ (n_c - 1) d_1\}^* \cap \{b_2 + k(d_2 - d_1),\ b_2 + d_2 + k(d_2 - d_1),\ \dots,\ b_2 + (n_c - 1) d_2 + k(d_2 - d_1)\}^* = \emptyset$. (9)

If it is possible to find two conflict-free access streams, then $b_2 = n_c d_1 \bmod m$ is a possible choice for $b_2$ relative to $b_1 = 0$. In this case the two access streams will definitely meet at $b_2$, with access stream "1" arriving at $b_2$ just at the time when $b_2$ becomes available again. With this observation, (9) may be written as

$\forall k: \{0,\ d_1,\ \dots,\ (n_c - 1) d_1\}^* \cap \{n_c d_1 + k(d_2 - d_1),\ (n_c + 1) d_1 + (k + 1)(d_2 - d_1),\ \dots,\ (2n_c - 1) d_1 + (k + n_c - 1)(d_2 - d_1)\}^* = \emptyset$. (10)

Since (10) has to be satisfied for all $k$, it also can be formulated as

$\forall k: \{0,\ d_1,\ \dots,\ (n_c - 1) d_1\}^* \cap \{n_c d_1 + k(d_2 - d_1),\ (n_c + 1) d_1 + k(d_2 - d_1),\ \dots,\ (2n_c - 1) d_1 + k(d_2 - d_1)\}^* = \emptyset$. (11)

With these preliminary observations, we now can state the following theorem.

Theorem 3: Let $f = \gcd(m, d_1, d_2)$. There exist start banks with addresses $b_1$ and $b_2$ such that two access streams with nondisjoint access sets are conflict free if and only if

$\gcd(m/f, (d_2 - d_1)/f) \ge 2n_c$. (12)

Proof: The division by $f$ changes the difference of $d_2$ and $d_1$ but not the return numbers of the two access streams. As a common factor to $d_1$, $d_2$, and $m$ it only "pushes" the relevant banks apart. In the following we assume $f = 1$, and $d_1 \mid m$; other values of $d_1$ are isomorphic (see the Appendix) to that case.

The corresponding pairwise differences of the sequences

$0,\ d_1,\ 2d_1,\ 3d_1,\ \dots \bmod m$

$0,\ d_2,\ 2d_2,\ 3d_2,\ \dots \bmod m$

are given by $k(d_2 - d_1)$, $k = 0, 1, 2, \dots$. With the Euclidean algorithm we find the smallest positive value for these differences to be $g = \gcd(m, d_2 - d_1)$. When $k = 0$, (11) results in

$\{0,\ d_1,\ \dots,\ (n_c - 1) d_1\}^* \cap \{n_c d_1,\ (n_c + 1) d_1,\ \dots,\ (2n_c - 1) d_1\}^* = \emptyset$. (13)

Equation (13) can also be interpreted such that all possible differences $j d_1$, $1 \le j \le 2n_c - 1$, between elements of the two sets are multiples of $d_1$. In order to guarantee conflict-free access streams, (13) must remain valid if, for any $k$, $k(d_2 - d_1) \bmod m$ is added to the second set. As for the differences, this is equivalent to

$\forall k: j d_1 + k(d_2 - d_1) \ne 0 \bmod m$, (14)

which, according to the above, is equivalent to

$\forall k: j d_1 \ne k g \bmod m$, (15)

$1 \le j \le 2n_c - 1$, and $g = \gcd(m, d_2 - d_1)$. Since $d_1 \mid m$ and $d_1$, $d_2$, and $m$ are relatively prime, it follows that $d_1$ and $g$ are relatively prime; thus, $g d_1 \le m$. Therefore, $g$ is the smallest value for $j$ that satisfies

$j d_1 = k g \bmod m$, (16)

and $k = d_1$, resulting in $m \ge g d_1 > (2n_c - 1) d_1$, respectively, $m/d_1 \ge g \ge 2n_c$. ∎

Note that $\gcd(m, 0) = m$, i.e., access streams with $d_1 = d_2$ are conflict free if $r_1 = r_2 \ge 2n_c$.

Another important aspect is "synchronization," meaning that two access streams will definitely fall into a conflict-free cycle, irrespective of the relative starting positions, if (12) is satisfied. If, due to an improper relative starting position, one access stream should request access to a bank before it is cleared, this access stream will be delayed until this bank becomes available, thus yielding a situation described by (10).

Fig. 2 depicts a 12-way interleaved memory system with $n_c = 3$. Two access streams with $d_1 = 1$ and $d_2 = 7$ do not encounter any conflicts, i.e., $b_{\mathrm{eff}} = 2$.
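Condition (12) is a one-line test once $f$ is known. The Python sketch below evaluates it for the parameters of Fig. 2 and for two further distance pairs chosen here for comparison; the function name `theorem3_holds` is illustrative.

```python
from math import gcd


def theorem3_holds(m: int, n_c: int, d1: int, d2: int) -> bool:
    """Eq. (12): gcd(m/f, (d2 - d1)/f) >= 2*n_c with f = gcd(m, d1, d2)."""
    f = gcd(gcd(m, d1), d2)
    return gcd(m // f, abs(d2 - d1) // f) >= 2 * n_c


if __name__ == "__main__":
    print(theorem3_holds(12, 3, 1, 7))  # Fig. 2: gcd(12, 6) = 6 >= 6
    print(theorem3_holds(12, 3, 1, 1))  # equal distances: gcd(12, 0) = 12 >= 6
    print(theorem3_holds(12, 3, 1, 2))  # gcd(12, 1) = 1 < 6
```

The second call reflects the remark that $\gcd(m, 0) = m$, so equal distances satisfy (12) whenever the common return number is at least $2n_c$.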

All other access streams whose access sets are not disjoint and do not satisfy (12) will fall into a conflicting cycle resulting in an effective bandwidth $b_{\mathrm{eff}} < 2$.

A very special conflicting case is given where one of the two access streams runs conflict free, while the other one regularly becomes delayed. We will call such a case a "barrier-situation" since the conflict-free access stream forms a "barrier" for the delayed access stream.

In Fig. 3 a barrier-situation is shown where an access stream with $d_2 = 6$ is constantly delayed ("<" depicts a delay) by another stream with $d_1 = 1$ in a 13-way interleaved memory system with $n_c = 6$.

Theorem 4: Let $f = \gcd(m, d_1, d_2)$; $r_1 \ge 2n_c$; $r_2 > n_c$; $Z_1 \cap Z_2 \ne \emptyset$; $d_1 \mid m$; $d_2 > d_1$. There exist start banks with addresses $b_1$ and $b_2$ such that a barrier-situation occurs if

$((d_2 \bmod m/d_1) - d_1)/f < n_c$. (17)

Proof: Let $m' = m/f$, $d_1' = d_1/f$, and $d_2' = d_2/f$. Since $d_1 \mid m$ it follows that $d_1' \mid m'$ and that $d_1'$ and $d_2'$ are relatively prime.

Since the access sets are not disjoint, there is at least one common bank whose address arbitrarily is taken to be 0. If $b_1 = 0$, $b_2 = 0$ and access stream "2" is delayed we obtain the following situation:

$\begin{array}{llll} n_c d_1' & (n_c + 1) d_1' & \cdots & (n_c + d_1') d_1' \bmod m' \\ 0 & d_2' & \cdots & d_1' d_2' \bmod m'. \end{array}$

Since $d_1'$ and $d_2'$ are relatively prime, the first common address after 0 is the address $d_1' d_2' \bmod m'$. If this address appears within the $n_c - 1$ previous clock periods of the first access stream a barrier-situation is encountered, i.e.,

$d_1' d_2' \bmod m' \in \{(n_c + d_1' - 1) d_1',\ (n_c + d_1' - 2) d_1',\ \dots,\ (n_c + d_1' - (n_c - 1)) d_1'\} \bmod m'$. (18)

Since $d_1'$ and $d_2'$ are relatively prime, (18) can be written as

$d_2' \in \{n_c + d_1' - 1,\ n_c + d_1' - 2,\ \dots,\ n_c + d_1' - (n_c - 1)\} \bmod m''$ (19)

with $m'' = m'/d_1'$. Equation (19) can be written as

$d_2' \in \{d_1' + 1,\ d_1' + 2,\ \dots,\ d_1' + n_c - 1\} \bmod m''$. (20)

Thus, we obtain for the barrier-situation

$d_2' = d_1' + c + k m''$ (21)

with $1 \le c < n_c$ and any arbitrary integer $k$. ∎

[Fig. 2 (timing diagram): banks 0 to 11 plotted against clock periods; each access of stream "1" or "2" occupies its bank for three clock periods and no delays occur.]

Fig. 2. Conflict-free access.

[Fig. 3 (timing diagram): banks 0 to 12 plotted against clock periods; accesses of stream "1" occupy a bank for six clock periods and stream "2" is repeatedly delayed ("<").]

Fig. 3. Barrier-situation.

However, the conditions of Theorem 4 alone do not guarantee a barrier-situation. Fig. 4 shows that the access streams of the example depicted in Fig. 3 reach a cyclic state where they delay each other in case $b_2 = 1$. ("<" depicts a delay of "2" by "1"; ">" depicts a delay of "1" by "2".) We will call such a case a "double conflict," i.e., there are clock periods where mutual delays appear.

Theorem 5: Let $r_1 \ge 2n_c$; $r_2 > n_c$; $Z_1 \cap Z_2 \ne \emptyset$; $d_1 \mid m$; $d_2 > d_1$. Then a double conflict is never encountered if

$(n_c - 1)(d_2 + d_1) < m$. (22)

Proof: Let $f = \gcd(m, d_1, d_2)$, $m' = m/f$, $d_1' = d_1/f$, and $d_2' = d_2/f$. Since $d_1 \mid m$, it follows that $d_1' \mid m'$, and $d_1'$ and $d_2'$ are relatively prime.

Let the first conflict be on a bank with address $k$. We arbitrarily assume that access stream "1" becomes delayed anywhere from 1 up to $n_c$ clock periods. In order to avoid a double conflict, the continuing access stream "2" may not hit any of the addresses $k - i d_1'$, $i = 1, 2, \dots, n_c - 1$, since these banks still may be active due to accesses of stream "1." This situation can be illustrated as follows:

[Illustration (table, columns headed "No. of Delays: $n_c$" and "1"): for both delay lengths it lists the addresses $k - (n_c - 1) d_2', \dots, k - d_2', k, k + d_2', \dots, k + (n_c - 1) d_2'$ visited by stream "2" next to the addresses $k - (n_c - 1) d_1', \dots, k - d_1', k$ of stream "1".]

The maximum distance is achieved for a conflict duration of one clock period. Therefore, the minimum number of banks required in order to avoid a double conflict is given by

$(n_c - 1)(d_2' + d_1') < m'$, (23)

which, multiplied by $f$, yields (22). ∎

[Fig. 4 (timing diagram): banks 0 to 12 plotted against clock periods; streams "1" and "2" delay each other ("<" and ">").]

Fig. 4. Double conflict: barrier-situation is not reached.

[Fig. 5 (timing diagram): banks 0 to 12 plotted against clock periods; accesses occupy a bank for four clock periods and stream "2" is repeatedly delayed ("<").]

Fig. 5. Barrier-situation.

Fig. 5 shows another example of a barrier-situation satisfying both (17) and (22), with $m = 13$, $n_c = 4$, $d_1 = 1$, $d_2 = 3$, $b_1 = 0$, and $b_2 = 7$.

However, as Fig. 6 shows, this barrier-situation is not unique if $b_2 = 1$ is chosen as the start bank. The barrier-situation is now inverted, i.e., access stream "2" delays (">") access stream "1."

Since in general the relative starting positions cannot be predicted, we are mainly interested in a "unique barrier-situation," i.e., access stream "2" is delayed by access stream "1" irrespective of the relative starting positions.

Theorem 6: Let $r_1 \ge 2n_c$; $r_2 > n_c$; $Z_1 \cap Z_2 \ne \emptyset$; $d_1 \mid m$; $d_2 > d_1$. If (17) holds, then a unique barrier-situation is reached if

$(2n_c - 1) d_2 \le m$. (24)

Proof: Let $f = \gcd(m, d_1, d_2)$, $m' = m/f$, $d_1' = d_1/f$, and $d_2' = d_2/f$. Since $d_1 \mid m$, it follows that $d_1' \mid m'$, and $d_1'$ and $d_2'$ are relatively prime.

Since $d_2' > d_1'$, it follows that $(2n_c - 1) d_2' > (n_c - 1)(d_2' + d_1')$, and thus a double conflict cannot occur. Therefore, as soon as access stream "2" becomes delayed a unique barrier-situation is reached.

Therefore, we only have to consider the case that access stream "1" becomes delayed, which can be illustrated as ("*" depicts a delay)

$\begin{array}{llll} n_c d_2' & (n_c + 1) d_2' & \cdots & (2n_c - 1) d_2' \\ 0 & d_1' & \cdots & (n_c - 1) d_1'. \end{array}$

If (24) holds, there have been sufficient, i.e., at least $n_c$, accesses by access stream "1" before access stream "2" exceeds $m'$. Then, due to modulo $m'$, access stream "2" will subsequently hit an active bank due to accesses of stream "1." ∎

[Fig. 6 (timing diagram): banks 0 to 12 plotted against clock periods; stream "1" is repeatedly delayed (">") by stream "2".]

Fig. 6. Inverted barrier-situation.

Equation (24) gives an upper bound to the minimum number of banks required in order to definitely reach a unique barrier-situation. Depending on the distances $d_1$ and $d_2$, there are cases where a unique barrier-situation is reached for a smaller number of banks.
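The conditions collected so far can be evaluated for the examples of Figs. 3 and 5. The Python sketch below is a hedged illustration: the barrier condition is coded in the form (21) taken from the proof of Theorem 4, which this sketch assumes to be the intended reading of (17), and all function names are made up here. It also assumes $d_1 \mid m$, as the theorems do.

```python
from math import gcd


def reduced(m: int, d1: int, d2: int):
    """Return (m', d1', d2') after division by f = gcd(m, d1, d2)."""
    f = gcd(gcd(m, d1), d2)
    return m // f, d1 // f, d2 // f


def barrier_possible(m: int, n_c: int, d1: int, d2: int) -> bool:
    """Eq. (21): d2' = d1' + c + k*m'' with 1 <= c < n_c and m'' = m'/d1'."""
    m_r, d1_r, d2_r = reduced(m, d1, d2)
    c = (d2_r - d1_r) % (m_r // d1_r)
    return 1 <= c < n_c


def no_double_conflict(m: int, n_c: int, d1: int, d2: int) -> bool:
    """Eq. (22): (n_c - 1)(d2 + d1) < m."""
    return (n_c - 1) * (d2 + d1) < m


def unique_barrier_bound(m: int, n_c: int, d2: int) -> bool:
    """Eq. (24): (2*n_c - 1)*d2 <= m."""
    return (2 * n_c - 1) * d2 <= m


if __name__ == "__main__":
    for name, (m, n_c, d1, d2) in {"Fig. 3": (13, 6, 1, 6),
                                   "Fig. 5": (13, 4, 1, 3)}.items():
        print(name, barrier_possible(m, n_c, d1, d2),
              no_double_conflict(m, n_c, d1, d2),
              unique_barrier_bound(m, n_c, d2))
```

For Fig. 3 the barrier condition holds but (22) does not, which is consistent with the double conflict of Fig. 4. For Fig. 5 both the barrier condition and (22) hold while (24) does not, which is consistent with the inverted barrier-situation of Fig. 6.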

Theorem 7: Let $r_1 \ge 2n_c$; $r_2 > n_c$; $Z_1 \cap Z_2 \ne \emptyset$; $d_1 \mid m$; $d_2 > d_1$. If (17) and (22) are satisfied, but not (24), then a unique barrier-situation is reached if

$k d_2 < (k - n_c) d_1 \bmod m$ (25)

with $k = \lceil m/(d_1 d_2) \rceil d_1 < 2n_c$.

Proof: Let $f = \gcd(m, d_1, d_2)$, $m' = m/f$, $d_1' = d_1/f$, and $d_2' = d_2/f$. Since $d_1 \mid m$, it follows that $d_1' \mid m'$, and $d_1'$ and $d_2'$ are relatively prime.

We have to investigate the case that access stream "1" first is delayed by access stream "2." Then, with $k' d_2' > m'$,

$\begin{array}{lllll} \cdots & n_c d_2' & (n_c + 1) d_2' & \cdots & k' d_2' \bmod m' \\ * & 0 & d_1' & \cdots & (k' - n_c) d_1' \bmod m'. \end{array}$

If $k' d_2' \bmod m'$ is within 0 to $(k' - n_c - 1) d_1'$, a unique barrier-situation is reached, i.e.,

$k' d_2' < (k' - n_c) d_1' \bmod m'$. (26)

We are only interested in bank addresses common to both access streams. With $d_1' \mid m'$ it follows that common addresses are multiples of $d_1' d_2' \bmod m'$, and the first $k'$ we actually care for is

$k' = \lceil m'/(d_1' d_2') \rceil d_1'$. (27)

By replacing (27) in (26) and multiplying by $f$ we obtain (25). ∎

In case access stream "1" has higher priority over access stream "2" (either because of a fixed priority rule or because of a cyclic priority rule), (25) also holds for

$k d_2 = (k - n_c) d_1 \bmod m$ (28)

with $k = \lceil m/(d_1 d_2) \rceil d_1 < 2n_c$. Then a simultaneous bank conflict will occur, delaying access stream "2," and the unique barrier-situation is reached.

In case of a unique barrier-situation, $d_1/f - 1$ accesses of access stream "2" are conflict free until the next conflict occurs. The subsequent delay lasts $(d_2 - d_1)/f$ clock periods. Thus, in $d_1/f$ clock periods, $2d_1/f - 1$ accesses from both access streams take place. In the next $(d_2 - d_1)/f$ clock periods, $(d_2 - d_1)/f$ accesses of the conflict-free access stream and 1 access of the delayed access stream take place. In total, $2d_1/f - 1 + (d_2 - d_1)/f + 1 = (d_2 + d_1)/f$ access requests are granted within $(d_2 - d_1)/f + d_1/f = d_2/f$ clock periods, and we obtain the effective bandwidth for a unique barrier-situation as

$b_{\mathrm{eff}} = 1 + d_1/d_2 < 2$. (29)
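The count leading to (29) can be compared with a small clock-by-clock simulation of two streams coming from different CPU's, so that only bank conflicts and simultaneous bank conflicts arise. The Python sketch below is a minimal model written for this comparison, not the authors' simulator. It assumes a fixed priority in which the stream listed first wins a simultaneous bank conflict, and the name `simulate` is illustrative.

```python
from typing import List, Tuple


def simulate(m: int, n_c: int, streams: List[Tuple[int, int]], clocks: int) -> List[int]:
    """Granted requests per stream when no section conflicts can occur (s = m).

    streams: list of (start bank, distance). Earlier entries have priority,
    which is an assumption of this sketch; the text only requires some priority rule.
    """
    free_at = [0] * m  # first clock period in which a bank is inactive again
    granted = [0] * len(streams)
    for t in range(clocks):
        for i, (b, d) in enumerate(streams):
            bank = (b + granted[i] * d) % m
            if free_at[bank] <= t:  # inactive and not taken earlier in this period
                free_at[bank] = t + n_c
                granted[i] += 1
            # otherwise this request and all later ones of the port wait one period
    return granted


if __name__ == "__main__":
    clocks = 300
    g = simulate(13, 4, [(0, 1), (7, 3)], clocks)  # parameters of Fig. 5
    print(g, sum(g) / clocks, 1 + 1 / 3)
    print(simulate(12, 3, [(0, 1), (3, 7)], 120))  # parameters of Fig. 2
```

For the Fig. 5 parameters the first stream is never delayed in this model and the measured bandwidth approaches $1 + d_1/d_2 = 4/3$; the small excess over a finite run comes from the two undelayed accesses of stream "2" at startup.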

Fewer Sections than Banks: It is quite obvious that in order to have a chance to achieve the maximum bandwidth there must be at least as many sections as there are ports. Thus, in our case 2 ^ s < m where we assume s \ m and that each section contains the same number of banks.

Again, we are interested in finding the conditions under which the two access streams are conflict free. Above we have found that this definitely is the case if the access sets are disjoint, and it is clear that this may be extended to the case when the section sets are disjoint (the section set of an access stream contains all the addresses of the sections being ac­cessed by that stream).

Theorem 8: If the access sets are disjoint but the section sets are not, then conflict-free access streams can only be achieved if

$$\gcd(s, d_2 - d_1) \ge 2. \tag{30}$$

Proof: This follows directly from (12) if $m$ is replaced by $s$ and $n_c$ is set to 1 since the "cycle time" for a path is taken to be 1. □

The case of nondisjoint access sets is more complicated because this implies that the section sets are also not disjoint.

Theorem 9: If (12) is satisfied, then two access streams are conflict free if, for any $k = 1, 2, \ldots$,

$$n_c d_1 \ne ks. \tag{31}$$

Proof: We have seen that if (12) holds, $n_c d_1$ is a possible choice for $b_2$ relative to $b_1 = 0$. For the pairwise differences of the following sequences

$$n_c d_1,\; n_c d_1 + d_2,\; n_c d_1 + 2 d_2,\; \ldots \bmod m$$

$$0,\; d_1,\; 2 d_1,\; \ldots \bmod m$$

we obtain $n_c d_1 + j(d_2 - d_1) \bmod m$, $j = 0, 1, 2, \ldots$. As has been shown in the proof of (12), the minimal difference or multiples thereof is the relative position between simultaneous access requests. Since $n_c d_1$ also is a minimal difference, the simultaneous access requests always occur to different sections if $n_c d_1$ and $s$ are relatively prime. □

If a value $k$ can be found such that $n_c d_1 = ks$, then $(n_c + 1) d_1 \ne ks$. Therefore, conflict-free access streams are still possible if, with $f = \gcd(m, d_1, d_2)$,

$$\gcd(m/f, (d_2 - d_1)/f) \ge 2(n_c + 1) \tag{32}$$

holds and the start banks are relatively positioned by $(n_c + 1) d_1$.

Equation (32) may be interpreted such that an extra clock period is needed in order to avoid a section conflict in case (31) cannot be satisfied.

Fig. 7 depicts a 12-way interleaved memory system with two sections and $n_c = 2$. Two access streams with $d_1 = 1$ and $d_2 = 1$ are conflict free for the relative starting positions of $(n_c + 1) d_1 = 3$. Notice that this case does not fulfill (31) but (32).
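The statement about Fig. 7 can be verified by evaluating conditions (31) and (32) directly. The Python sketch below is a minimal illustration; the helper names `satisfies_31` and `satisfies_32` are hypothetical, and the code checks only the two inequalities, not the full access pattern.

```python
from math import gcd


def satisfies_31(n_c, d1, s):
    """Condition (31): n_c * d1 must not equal k * s for any k = 1, 2, ..."""
    return (n_c * d1) % s != 0


def satisfies_32(m, n_c, d1, d2):
    """Condition (32): gcd(m/f, (d2 - d1)/f) >= 2 * (n_c + 1)."""
    f = gcd(gcd(m, d1), d2)
    return gcd(m // f, abs(d2 - d1) // f) >= 2 * (n_c + 1)


# Parameters of Fig. 7: 12 banks, 2 sections, n_c = 2, d1 = d2 = 1.
m, s, n_c, d1, d2 = 12, 2, 2, 1, 1
print(satisfies_31(n_c, d1, s))      # False: n_c * d1 = 2 = 1 * s
print(satisfies_32(m, n_c, d1, d2))  # True: gcd(12, 0) = 12 >= 6
print((n_c + 1) * d1)                # 3, the required relative start position
```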

Finally, we briefly discuss the problem of the "linked conflict" [8], where two access streams fall into a cycle with the alternating encounter of a bank conflict and a section conflict; although (32) is satisfied, the required relative starting positions are not met. In this case "synchronization" does not necessarily occur if the access streams conflict due to inappropriate starting positions.

For example, Fig. 8(a) illustrates a situation where two access streams, with $d_1 = d_2 = 1$ on a 12-way interleaved memory system with three sections and $n_c = 3$, run into a linked conflict ("*" depicts a section conflict; "<" depicts a delay of "1" by "2"). The first access stream ("1") always has priority over the second one ("2"). Although both access streams begin simultaneously, the first one encounters two bank conflicts, which puts it into a relative position of $n_c = s$ to the second one. This clearly does not satisfy the requirement of (31), and the linked conflict builds up.

If a cyclic priority rule instead of the fixed priority rule is chosen, a linked conflict can be resolved as is shown in Fig. 8(b), if (32) holds.

Cheung and Smith [8] proposed combining $m/s$ consecutive banks into a section instead of cyclically distributing banks over sections in order to prevent a linked conflict (Fig. 9).
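The two ways of assigning banks to sections can be contrasted with a short Python sketch. It assumes that "cyclically distributing" means section = bank mod $s$ and that "combining $m/s$ consecutive banks" means section = bank div $(m/s)$; both function names are hypothetical.

```python
def section_cyclic(bank, s):
    """Banks are dealt out to the s sections in round-robin order."""
    return bank % s


def section_consecutive(bank, m, s):
    """Each section holds m/s consecutive banks (assumes s divides m)."""
    return bank // (m // s)


m, s = 12, 3  # the configuration used in Figs. 8 and 9
print([section_cyclic(b, s) for b in range(m)])
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
print([section_consecutive(b, m, s) for b in range(m)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

With the consecutive mapping, two streams of distance 1 whose positions differ by $n_c = 3$ banks usually fall into different sections only near a section boundary, which changes when section conflicts can arise.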

IV. MEASUREMENTS AND SIMULATIONS

In order to demonstrate the practical effects of our analytical results we have done execution time (CPU time) measurements of simple vector loops in dedicated environments on a 2-processor, 16-bank Cray X-MP with bipolar memory chips, i.e., $n_c = 4$. By corresponding simulations we obtained the number and the type of memory conflicts that were encountered. The simulator is implemented in Fortran 77 and closely models the characteristics of the Cray X-MP architecture.

Fig. 7. Conflict-free access to a 12-way interleaved memory system with two sections.

Fig. 8. (a) Linked conflict not resolved by a fixed priority; $b_{\mathrm{eff}} = 3/2$. (b) Linked conflict resolved by a cyclic priority; $b_{\mathrm{eff}} = 2$.

Fig. 9. Linked conflict resolved by combining $m/s$ consecutive banks into a section; $b_{\mathrm{eff}} = 2$ [8].

We will discuss one specific experiment; further experiments and their results are described in [10]. One of the most important basic operations that can be performed in vector mode is the so-called triad:

      DO 1 I = 1, N * INC, INC
    1 A(I) = B(I) + C(I) * D(I)

By INC we denote the increment (stride) of a Fortran loop. In general, the resulting distance $d$ for accessing the $(k + 1)$th dimension of an array is

$$\mathrm{INC} \cdot \prod_{i=0}^{k} J_i \bmod m = d \tag{33}$$

where $J_i$ is the size of the $i$th dimension with $J_0 = 1$. By N * INC we indicate that the vector length is $n$ independent of the increment. This triad is executed for $1 \le \mathrm{INC} \le 16$ with a vector length of $n = 1024$ on one CPU; the other CPU executes a program that is tailored so that the memory is constantly accessed by all three ports with a distance of 1. In order to fix the relative position of the arrays in memory, a COMMON-block is used:

      COMMON // A(IDIM), B(IDIM), C(IDIM), D(IDIM),

and by setting IDIM = 16 * 1024 + 1, the respective first elements of the arrays are one bank apart from each other.
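The effect of this choice of IDIM can be checked with a few lines of Python. The sketch assumes that the arrays are stored contiguously in the COMMON block in the order A, B, C, D and that the block starts at bank 0 (only the relative positions matter). The helper `distance` is hypothetical; it computes the bank distance for stepping with increment INC along the $(k+1)$th dimension of a column-major (Fortran) array, i.e., INC times the product of the first $k$ dimension sizes, modulo $m$.

```python
from math import prod

M_BANKS = 16
IDIM = 16 * 1024 + 1

# Word offsets of the first elements of A, B, C, D in the COMMON block.
offsets = {name: i * IDIM for i, name in enumerate("ABCD")}
start_banks = {name: off % M_BANKS for name, off in offsets.items()}
print(start_banks)  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}


def distance(inc, dims, k, m=M_BANKS):
    """Bank distance when stepping along dimension k+1 of a column-major array.

    dims holds the sizes J_1, J_2, ...; k = 0 means the first dimension.
    """
    return (inc * prod(dims[:k])) % m


print(distance(1, [64], 1))  # 0: row access with leading dimension 64 hits one bank
print(distance(1, [65], 1))  # 1: leading dimension 65 is relatively prime to 16
```

The last two lines illustrate why a leading dimension that is relatively prime to the number of banks is a safe choice for row accesses.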

Fig. 10(a) shows the required execution times of the triad over the increment INC; the "CPU times" of the corresponding simulations coincide very well with those of the measurements (differences are less than 5 percent) and therefore are not shown. In order to estimate the influence of the other CPU, we show in Fig. 10(b) the required CPU time for the same triad except that the other CPU is now shut off so no accesses to memory can occur from there. The bank conflicts [Fig. 10(c)], section conflicts [Fig. 10(d)], and simultaneous conflicts [Fig. 10(e)] encountered by the triad are obtained by the simulator.

We observe the best performance for the increments 1, 6, and 11. The good performance for INC = 1 is not unexpected since all access streams have the same distance. Our analytic model shows that such a case is a candidate for conflict-free access. Naturally, in our experiment we do not have the ideal conditions that would guarantee conflict-free access under all circumstances. Furthermore, with all ports active, there are up to six ports simultaneously requesting access to memory. Then, access conflicts are bound to occur since $6 n_c = 24 > 16$, i.e., 16 banks are not sufficient to support all access requests in parallel. This also explains why the performance for INC = 9 is not as good as that for INC = 1, although this case is also theoretically conflict free (Theorem 3).

As for INC = 6 and INC = 11 in the environment of INC = 1 we find that these cases are isomorphic to $2 \oplus 3$ and $1 \oplus 3$. Thus, we have a barrier-situation where the access requests of the triad are fairly undisturbed while the access requests of the other CPU are greatly delayed.
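The isomorphism of distances defined in the Appendix (multiplying both distances by a factor $k$ with $\gcd(k, m) = 1$ and reducing modulo $m$) can be enumerated mechanically. The Python sketch below is an illustration only; the function name `isomorphic_pairs` is hypothetical.

```python
from math import gcd


def isomorphic_pairs(d1, d2, m):
    """All pairs (k*d1 mod m, k*d2 mod m) for k relatively prime to m."""
    return {((k * d1) % m, (k * d2) % m)
            for k in range(1, m) if gcd(k, m) == 1}


m = 16
# Triad with INC = 11 against the other CPU with distance 1:
print((11, 1) in isomorphic_pairs(1, 3, m))  # True (k = 11)
# Triad with INC = 6 against the other CPU with distance 1:
print((6, 1) in isomorphic_pairs(2, 3, m))   # True (k = 11)
```

In both cases the same factor $k = 11$ maps the canonical pair onto the measured one, since $11 \cdot 3 \bmod 16 = 1$.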

Also, we find a barrier-situation for INC = 2 and INC = 3. In that case, however, the access requests of the triad are delayed by the competing accesses of the other CPU. The severe increases in the execution times of roughly 50 percent (INC = 2) and 100 percent (INC = 3), in contrast to the optimal case, impressively demonstrate the consequences of such a situation.

Fig. 10. (a) Execution times of the triad; other CPU accesses with $d = 1$. (b) Execution times of the triad; no accesses from the other CPU. (c) Encountered bank conflicts for the triad. (d) Encountered section conflicts for the triad. (e) Encountered simultaneous conflicts for the triad.

V. CONCLUSION

Conditions under which a single access stream and two concurrent access streams can access an interleaved memory system in a vector processor have been presented. For the programmer it is important to identify the distances which the required access streams will have. In case of one-dimensional arrays it is simply the stride modulo $m$ of the DO loop. In case of higher-dimensional arrays care must be taken when rows (in case of Fortran) or diagonals are to be accessed. A safe method is to choose the dimensions of arrays so that they are relatively prime to the number of banks.

However, all efforts may be in vain in case of multivector-processor systems like the Cray X-MP where barrier-situations may easily be encountered. The barrier-situation is a problem of the access environment and cannot be alleviated by architectural means. In order to build an environment with uniform access streams it may be worthwhile to consider the multitasking option (Cray X-MP) or the application of skewing schemes (e.g., [1], [4], [11], [12]).

APPENDIX

ISOMORPHISM OF DISTANCES

By "$d_1 \oplus d_2$" we denote that an access stream with distance $d_1$ competes for access with another access stream with distance $d_2$. For $d_1$ we only need to consider values with $d_1 \mid m$ since that way all possible return numbers are covered. Other combinations are isomorphic, i.e., with $\gcd(k, m) = 1$ the original case can be obtained by appropriately renumbering the bank addresses

$$d_1 \oplus d_2 \cong k d_1 \oplus k d_2 \bmod m.$$

Example: $m = 16$.

$$1 \oplus 3 \cong 5 \oplus 15 \cong 11 \oplus 1 \bmod 16$$

$$2 \oplus 3 \cong 6 \oplus 9 \cong 6 \oplus 1 \bmod 16.$$

ACKNOWLEDGMENT

The authors would like to thank F. Hossfeld, C. D. Feustel, W. Meyer, and W. Nagel for their support and valuable hints in this research.

REFERENCES

[1] P. Budnik and D. J. Kuck, "The organization and use of parallel memories," IEEE Trans. Comput., vol. C-20, pp. 1566-1569, Dec. 1971.

[2] C. V. Ravi, "On the bandwidth and interference in interleaved memory systems," IEEE Trans. Comput., vol. C-21, pp. 899-901, Aug. 1972.

[3] D. P. Bhandarkar, "Analysis of memory interference in multiprocessors," IEEE Trans. Comput., vol. C-24, pp. 897-908, Sept. 1975.

[4] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.

[5] D. Y. Chang, D. J. Kuck, and D. H. Lawrie, "On the effective bandwidth of parallel memories," IEEE Trans. Comput., vol. C-26, pp. 480-489, May 1977.

[6] "CRAY X-MP computer systems," in CRAY X-MP Series Mainframe Reference Manual, Cray Research Inc., HR-0032, 1982.

[7] K. Miura and K. Uchida, "FACOM vector processor VP-100/VP-200," in High Speed Computation (NATO ASI Series, vol. F7), J. Kowalik, Ed. New York: Springer-Verlag, 1984, pp. 127-138.

[8] T. Cheung and J. E. Smith, "An analysis of the CRAY X-MP memory system," in Proc. 1984 Int. Conf. Parallel Processing, pp. 499-505.

[9] G. Birkhoff and S. MacLane, A Survey of Modern Algebra. New York: Macmillan, 1968.

[10] W. Oed and O. Lange, "Modelling, measurement, and simulation of memory interference in the CRAY X-MP," to be published.

[11] H. D. Shapiro, "Theoretical limitations on the efficient use of parallel memories," IEEE Trans. Comput., vol. C-27, pp. 421-428, May 1978.

[12] J. van Leeuwen and H. A. G. Wijshoff, "Data mappings in large parallel computers," in Proc. GI-13. Jahrestagung, Informatik-Fachberichte. Berlin: Springer-Verlag, 1983, pp. 8-20.

Wilfried Oed was born in Bad Mergentheim, Germany, on October 20, 1952. He received the Diplom and Doktor degrees in electrical engineering from the Rheinisch-Westfälische Technische Hochschule (RWTH), Aachen, Germany, in 1979 and [year illegible], respectively, and he spent one year as a postgraduate student at Keio University, Tokyo, Japan, where he conducted research under the supervision of Prof. H. Aiso.

In 1980 he joined the Central Institute for Applied Mathematics at the Kernforschungsanlage Jülich where he became involved in the user support, algorithm design, and related research for vector processors. In July 1985 he joined Cray Research, Inc., Germany, as a System Engineer.

Otto Lange was born in Stolberg, Germany, on February 12, 1935. He received the Diplom and Doktor degrees in electrical engineering and the Diplom degree in mathematics from the Rheinisch-Westfälische Technische Hochschule (RWTH), Aachen, Germany, in 1960, 1968, and 1970, respectively.

In 1960 he joined Philips Research Laboratories, Eindhoven, The Netherlands, where he performed research in the design of digital computers, microprogramming, and computer arithmetic. From 1966 to 1970 he was with the Institute of Electrical Engineering and Data Processing Systems of RWTH as its Chief Engineer where his work included teaching and research in digital differential analyzers and signal processing. From 1971 to 1975 he worked as Chief System Engineer on database systems for IBM, Germany. Since 1976 as a Lecturer and since 1978 as a Professor of Electrical Engineering and Computer Science at RWTH, he has taught courses and supervised research on data structures, fault-tolerant computation, computer arithmetic, and parallel processing. In 1981 and 1984 he was a Visiting Professor in the Department of Computer Science, Colorado State University, Fort Collins.