

ECE 511, Computer Architecture
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
M. Frank

Final Exam
December 14, 2006
1:30-4:30pm

You may use class notes, textbooks, web resources, computer simulations or any other reference material you desire. No interactions with others allowed. By turning in this exam you will have attested that you have neither received nor given inappropriate aid on this quiz (i.e., interacted with any person other than the instructor).

NAME: ____________________________________

NOTE: There are 6 problems worth 100 pts. Problems have different weights.


1. (20 pts) We are given a pipelined, in-order, single-issue processor. It sends each instruction through at least 6 stages: fetch, decode, interlock, register-read, execute and (eventually) retire. There is no renaming. The execution unit is fully pipelined. Branch instructions take 1 cycle to complete in the execution unit, add instructions take 2 cycles and multiply instructions take 4 cycles. Instructions are statically scheduled and interlocked. Multiple independent instructions can be in flight simultaneously. If the instruction at the interlock stage depends on an instruction still in execution, then that instruction (and the fetch and decode stages) is stalled until the required result is available on the bypass bus. (I.e., a multiply takes 4 cycles, so if a multiply is immediately followed by a dependent instruction there will be 3 cycles of stalling.)

Your lab partner has just written the main loop for your semester project in numerical computing, and now you want to generate code that is optimally scheduled. The loop produces two important values in registers $v and $w. Unroll the loop by a factor of two and generate an optimal schedule. You may assume that Cmax is an even number so that you don’t need to worry about testing $c twice.

You may rename as many registers as you like, but you should try to use a minimal number of registers. (I.e., you will lose points for using more registers than I think are necessary. By my count the current loop uses $a, $b, $i, $t, $v, $w for a total of just 6 registers.) Renamed registers should be named something like $a3, $b2, etc., so that I can tell which register from the original loop you are renaming. You may assume that the branch at instruction 6 is “easy” to predict, so don’t worry about mispredictions.

The loop

1: top: $t <- mul $a, $i
2:      $v <- $v + $t
3:      $t <- mul $b, $i
4:      $w <- $w + $t
5:      $i <- $i + 1
6:      if $i < Cmax goto top
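The stall rule described above can be modeled with a few lines of C. This is a minimal sketch, not part of the exam: the `Instr` record, the register enum and both function names are hypothetical. It assumes an instruction leaves the interlock stage at the first cycle in which all of its source registers are available on the bypass bus, and it ignores every other hazard.

```c
#include <assert.h>
#include <stddef.h>

/* Register indices for the six registers named in the problem (hypothetical encoding). */
enum { RA, RB, RI, RT, RV, RW, NUM_REGS };
enum { NONE = -1 };

/* Hypothetical instruction record: destination, two sources, execute latency. */
typedef struct {
    int dst, src1, src2, latency;
} Instr;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Fills issue[k] with the cycle at which instruction k leaves the interlock
 * stage, assuming single issue, in-order, and a stall until every source
 * value is on the bypass bus. Returns the earliest cycle at which the next
 * instruction (e.g. the first of the next iteration) could issue. */
int issue_cycles(const Instr *code, size_t n, int *issue)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register value is bypassable */
    int next = 0;               /* earliest issue slot for the next instruction */
    for (size_t k = 0; k < n; k++) {
        const Instr *in = &code[k];
        int t = next;
        assert(in->dst < NUM_REGS && in->src1 < NUM_REGS && in->src2 < NUM_REGS);
        if (in->src1 != NONE) t = max_int(t, ready[in->src1]);
        if (in->src2 != NONE) t = max_int(t, ready[in->src2]);
        issue[k] = t;
        if (in->dst != NONE) ready[in->dst] = t + in->latency;
        next = t + 1;
    }
    return next;
}

/* The original (not unrolled) loop body: mul = 4 cycles, add = 2, branch = 1. */
int original_loop_cycles(int issue[6])
{
    static const Instr loop[6] = {
        { RT,   RA, RI,   4 },  /* 1: $t <- mul $a, $i */
        { RV,   RV, RT,   2 },  /* 2: $v <- $v + $t    */
        { RT,   RB, RI,   4 },  /* 3: $t <- mul $b, $i */
        { RW,   RW, RT,   2 },  /* 4: $w <- $w + $t    */
        { RI,   RI, NONE, 2 },  /* 5: $i <- $i + 1     */
        { NONE, RI, NONE, 1 },  /* 6: if $i < Cmax goto top */
    };
    return issue_cycles(loop, 6, issue);
}
```

Under these assumptions the original loop issues its six instructions at cycles 0, 4, 5, 9, 10 and 12, so one iteration occupies 13 issue slots. The same function can be used to check a candidate unrolled schedule by swapping in a different instruction table.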


2. (15 pts) The new Illin™ 511 processor is a statically scheduled in-order machine with a RISC instruction set. The 511 has two new instructions: “checkpoint” and “assert.” Checkpoint saves a “snapshot” of the machine’s register state along with a program counter to jump to if an assert fails. Assert tests a condition (like a branch would) and, if the condition fails, rolls the machine state back to the state saved at the most recent checkpoint instruction, and jumps to the recovery code. After profiling you have identified an important loop in your program and, using the checkpoint and assert instructions, you have unrolled and completely renamed it as shown below.

A) Identify all of the dead instructions in the unrolled code. (An instruction is dead if the register it writes cannot be read by any instruction in the future.)

B) Identify all of the redundant instructions in the unrolled code. (An instruction is redundant if it produces a value that is already available in a different register.)

 1 top: Checkpoint (recover at original_loop)
 2      Rt0 <- Ri % 1001
 3      Assert Rt0 != 0
 4      Rb0 <- Ra + Rz
 5      Rc0 <- Rc + Rb0
 6      Rd0 <- Rc0 + Rw
 7      Ri0 <- Ri + 1
 8      Assert Ri0 < Rn
 9      Rt1 <- Ri % 1001
10      Assert Rt1 != 0
11      Rb1 <- Ra + Rz
12      Rc1 <- Rc0 + Rb1
13      Rd1 <- Rc1 + Rw
14      Ri1 <- Ri0 + 1
15      Assert Ri1 < Rn
16      Rt2 <- Ri % 1001
17      Assert Rt2 != 0
18      Rb2 <- Ra + Rz
19      Rc <- Rc1 + Rb2
20      Rd <- Rc + Rw
21      Ri <- Ri1 + 1
22      if (Ri < Rn) goto top
23      else goto exit
24 original_loop:
25      Rt <- Ri % 1001
26      if (Rt != 0) goto normal
27      Ra <- Rd
28 normal:
29      Rb <- Ra + Rz
30      Rc <- Rc + Rb
31      Rd <- Rc + Rw
32      Ri <- Ri + 1
33      if (Ri < Rn) goto top
34 exit:
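The checkpoint and assert semantics described in the problem statement can be sketched in C as follows. This is only an illustration under assumptions: the `Machine` struct, the register count and the function names are hypothetical, and the real 511 presumably does this in hardware rather than by copying an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { NUM_REGS = 8 };  /* arbitrary size chosen for this sketch */

/* Hypothetical model of the architectural state touched by checkpoint/assert. */
typedef struct {
    long regs[NUM_REGS];   /* live register file */
    long saved[NUM_REGS];  /* snapshot taken by the most recent checkpoint */
    int  recovery_pc;      /* where to jump if an assert fails */
    int  pc;               /* current program counter */
} Machine;

/* Checkpoint: save a snapshot of the registers plus the recovery address. */
void checkpoint(Machine *m, int recovery_pc)
{
    assert(m != NULL);
    memcpy(m->saved, m->regs, sizeof m->regs);
    m->recovery_pc = recovery_pc;
}

/* Assert: if the condition holds, do nothing and return true. Otherwise roll
 * the registers back to the last snapshot, redirect the pc to the recovery
 * code, and return false. */
bool assert_cond(Machine *m, bool cond)
{
    assert(m != NULL);
    if (cond)
        return true;
    memcpy(m->regs, m->saved, sizeof m->regs);
    m->pc = m->recovery_pc;
    return false;
}
```

The point the sketch makes is that every register write between a checkpoint and a failing assert is discarded, which matters when deciding whether a written value can ever be read later.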


3. (20 pts) You are profiling the newly released SPEC 2007 benchmark suite for your Indel Core Quadro (the Core Quadro is a 4-processor multiprocessor) based workstation and discover that one of the benchmarks spends 80% of its execution time in a loop that calls functions “f” and “g”, both of which you know to be pure functions (i.e., they modify no state).

int j;
int sum = 0;
int lc = 0;
for (j = 0; j < length; j++) {
    lc = f(lc, j);
    sum = sum + g(lc);
}

Fortunately, you also know that the function “g” takes about 1000 times longer than function “f”. Rewrite the code so that you can parallelize the calls to g.


4. (15 pts) Consider the following C code running on two processors of a sequentially consistent shared memory multiprocessor:

Processor A:

for (i=0; i < 5; i++)
    x++;

Processor B:

for (j=0; j < 5; j++)
    x++;

A. Assume x is initialized to 0. What are the possible values x can take after both processors are done? Explain your answer. Note: your compiler compiles the C statement x++ as the following three instructions:

Rx <- load [address_x]
Rx <- Rx + 1
store Rx -> [address_x]

B. If the multiprocessor uses the Illinois (MESI) protocol over a bus for cache coherence, what are the minimum and maximum numbers of cache read misses and cache write misses incurred by processor A? (Assume that i and j are stored in registers local to their processor and that the cache line containing location x starts out in state empty in both processor caches.)
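Because x++ is compiled to a separate load, add and store, the two processors' instructions can interleave between the load and the store. The sketch below replays one chosen interleaving on a single thread; the function name and the schedule-string encoding are assumptions made for illustration, and it models only sequentially consistent interleavings, not caches.

```c
#include <assert.h>

/* Replays one sequentially consistent interleaving of the load/add/store
 * sequence. Each character of `order` ('A' or 'B') lets that processor
 * execute its next instruction; each processor performs `iters` increments.
 * Characters for a processor that has already finished are ignored.
 * Returns the final value of x (initially 0). */
int run_interleaving(const char *order, int iters)
{
    int x = 0;             /* the shared memory location */
    int reg[2]  = {0, 0};  /* each processor's private Rx */
    int step[2] = {0, 0};  /* instructions completed by each processor */

    for (const char *p = order; *p != '\0'; p++) {
        assert(*p == 'A' || *p == 'B');
        int id = (*p == 'A') ? 0 : 1;
        if (step[id] >= 3 * iters)
            continue;
        switch (step[id] % 3) {
        case 0: reg[id] = x;      break;  /* Rx <- load [address_x]  */
        case 1: reg[id] += 1;     break;  /* Rx <- Rx + 1            */
        case 2: x = reg[id];      break;  /* store Rx -> [address_x] */
        }
        step[id]++;
    }
    return x;
}
```

With one increment per processor, `run_interleaving("AAABBB", 1)` returns 2, while `run_interleaving("ABABAB", 1)` returns 1 because both processors load 0 before either stores. Longer schedule strings with `iters` set to 5 can be used to explore the case asked about in part A.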


5. (15 pts) Oh no! Ben Bitdiddle has done it again. You’ve just shipped the new Illin 511 sequentially consistent shared memory multiprocessor and Ben forgot to implement the test and set instruction! Luckily the Illin 511 comes in only one configuration, and that with just two processors. Show how to implement a barrier operation with just two memory locations (call them “a” and “b”). A barrier operation is a synchronization operation such that all the processors in the system must call it before any of them can continue. (It’s sort of like everyone has to say “I’m ready” before anyone can go on to the next step. It’s useful for doing synchronization in “mostly” parallel code: every processor does the parallel step, then they enter a barrier, when the barrier is done the sequential code can rely on the fact that all the parallel work is already done).

Hint: This should take much less than 5 lines of code per processor.

int a = 0;
int b = 0;

void processorA_barrier() {

}

void processorB_barrier() {

}


6. (15 pts)

[Figure: switch detail, showing two input FIFOs feeding a mux. Ring topology labels: Processor p, Switch j, Memory 0, Switch k, Processor q, Switch l, Memory 1, Switch m.]

Ben Bitdiddle has spent the entire semester building a flow-controlled, packet-switched, unidirectional ring network with switches as shown in the figure. The switch is made up of two FIFO queues, which are used to hold packets arriving from the neighboring switch, and one multiplexer. Each FIFO input queue can hold a maximum of b packets. On each cycle the switch can simultaneously latch an arriving packet in each input queue that is not already full, and can use the multiplexers to send packets to neighboring input queues that are not already full. (If the neighboring input queue is full then the output mux sends no message.)
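The flow-control rule in this paragraph (a packet moves only when the receiving input queue has space) can be sketched as a bounded FIFO in C. The `Fifo` type, the `MAX_B` bound and the function names are hypothetical, and the sketch models a single queue-to-queue hop rather than the whole switch or ring.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_B = 8 };  /* upper bound on b chosen for this sketch */

/* Hypothetical bounded input queue holding at most `cap` (= b) packets. */
typedef struct {
    int pkts[MAX_B];
    int head;   /* index of the oldest packet */
    int count;  /* packets currently queued */
    int cap;    /* capacity b */
} Fifo;

void fifo_init(Fifo *f, int cap)
{
    assert(cap >= 1 && cap <= MAX_B);
    f->head = 0;
    f->count = 0;
    f->cap = cap;
}

bool fifo_full(const Fifo *f) { return f->count == f->cap; }

/* Latch an arriving packet; refuses (returns false) when the queue is full. */
bool fifo_push(Fifo *f, int pkt)
{
    if (fifo_full(f))
        return false;
    f->pkts[(f->head + f->count) % MAX_B] = pkt;
    f->count++;
    return true;
}

/* One hop: move the oldest packet from src to dst, but only if dst has
 * space. When dst is full nothing is sent and src keeps its packet. */
bool forward(Fifo *src, Fifo *dst)
{
    if (src->count == 0 || fifo_full(dst))
        return false;
    int pkt = src->pkts[src->head];
    src->head = (src->head + 1) % MAX_B;
    src->count--;
    return fifo_push(dst, pkt);
}
```

With b = 1, a packet sitting in the downstream queue blocks `forward` until that queue drains, which is the back-pressure behavior the rest of the problem relies on.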

Ben uses four of these switches, along with two processors and two memory modules, to construct a shared memory multiprocessor. The system works as follows. When a processor makes a memory read request, the request is packaged into a packet that contains all required information about which memory module should service the request and the processor that sent the request. The processor then injects the request packet into the ring. Request packets travel around the ring to the indicated memory module. When the packet arrives at the memory module the indicated address is read from memory and the corresponding data is placed in a “data packet.” The memory unit then blocks until space is available in its output queue, (i.e., it refuses to accept any more messages until the switch it is attached to has enough queue space to receive the data packet). The data packet then travels back through the network to the original requesting processor. Each processor is allowed to have up to n read requests outstanding simultaneously, and contains n internal buffers so that it can always drain a data packet for an outstanding request that it made.

Ben then arranges two memory modules, two processors and 4 switches in the ring topology shown below.

A. Given that each input fifo can hold a maximum of b = 1 packet, give a sequence of read requests from processors p and q that will cause the system to deadlock.

B. What is the maximum number, n, of outstanding read requests that each processor should be allowed in a system where the input fifos can hold a maximum of b packets?
