May 12, 2021 10:55 am. EE457 Final - Spring 2021. © Copyright 2021 Gandhi Puvvada

EE457 Final Exam (~33.5%) Spring 2021

Instructor: Gandhi Puvvada
Final Exam: Wednesday, May 12, 2021 03:15-06:15 (with overhead 03:00-06:35 PM PST) Zoom

Viterbi School of Engineering

University of Southern California

Ques#  Topic                           Page#   Time   Points   Score
1      Pipelined Array Processor       2-3
2      Tomasulo                        4-7
3      Miscellaneous advanced topics   8-10
4      WBFF (in Lab 6 Part 4)          11
Total                                  Cover+10+1 = 12
Perfect Score

I have previously read the Viterbi Code of Integrity and other related material at the sites (a) https://viterbischool.usc.edu/academic-integrity/ (b) https://sjacs.usc.edu/students/academic-integrity/ and I will abide by these rules of conduct. I will neither seek help from others nor offer help to others in my exams.

_____________________________ <== Student’s signature, DEN D2L username: @usc.edu

http://www-classes.usc.edu/engr/ee-s/457/ee457_Fall2020_exams/EE457_Online_Proctored_Exams_rules_and_procedure.pdf

1 ( points) min. Pipelined Array Processor

You have gone through this 3-stage pipelined array processor. The adder needed an extra clock, so our standard 1-clock stall-generating circuit with a single FF was added to generate this stall.

1.1 A bare-bones copy of the above figure is shown on the next page for you to focus on stalls. Let us assume that each stage has a reason to call for its own stall. Let us assume that the A[I] and B[IPP] are memory locations, which are cached. They each can incur cache misses occasionally and individually. DFCM stands for Data Fetch Cache Miss and DDCM stands for Data Deposit Cache Miss. You stall the DF stage as long as DFCM and R (Run) are active. You stall DD stage (and the two stages behind it) as long as DDCM and RPP are active. DFCM can be ignored when R (Run) is inactive. Similarly, DDCM can be ignored when RPP is inactive.

assign STALL_DD = DDCM & RPP;
assign STALL_DP = STALL_DD | STALL_DPL;
assign STALL_DF = STALL_DP | (DFCM & R);

To reduce clutter, I have not drawn the above combinational logic on the next page. I have used STALL_DPL (L for local) to represent a local stall arising from the adder stage.
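The three stall equations above can be checked quickly with a small software model. The following is only a sketch in Python, not part of the exam's Verilog; the function name `stall_signals` and the choice to model each signal as a 0/1 integer are assumptions made for illustration.

```python
def stall_signals(ddcm, rpp, stall_dpl, dfcm, r):
    """Evaluate the three stall equations for one clock; every input is 0 or 1."""
    stall_dd = ddcm & rpp               # DD stalls only on a deposit miss with RPP active
    stall_dp = stall_dd | stall_dpl     # DP stalls locally or because DD ahead of it stalls
    stall_df = stall_dp | (dfcm & r)    # DF stalls behind DP or on its own fetch miss
    return stall_dd, stall_dp, stall_df


# A deposit miss with RPP active stalls all three stages.
print(stall_signals(ddcm=1, rpp=1, stall_dpl=0, dfcm=0, r=1))  # (1, 1, 1)
# A fetch miss alone stalls only the DF stage.
print(stall_signals(ddcm=0, rpp=1, stall_dpl=0, dfcm=1, r=1))  # (0, 0, 1)
# A deposit miss is ignored when RPP is inactive.
print(stall_signals(ddcm=1, rpp=0, stall_dpl=0, dfcm=0, r=1))  # (0, 0, 0)
```

The model shows the chaining: a stall in a later stage always propagates to the stages behind it, while a stall in an earlier stage does not hold up the stages ahead of it.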

The STALL_DP_HW signal (a new name, with HW for the hardware solution, given to differentiate it; it is produced in the above figure using the Second_Clk FF and the 2-state state diagram) is not so good for the current problem because it assumes that there are no stalling reasons, such as STALL_DD, ahead of it. Suppose we use

assign STALL_DP = STALL_DD | STALL_DP_HW;

and suppose that, when STALL_DP_HW goes active, DDCM also goes active and stays active for 2.7 clocks. Then we expect that STALL_DP would also be active for 2.7 clocks (providing 3 clocks instead of 2 clocks for the adder). But we will end up providing 4 clocks, as STALL_DP_HW would go inactive in the second clock, go active again in the third clock, and go inactive again in the fourth clock.

[Figure] Fig. 13: Data Stationary Method of Control with STALL and Bubble Injection added to it.
Labels recoverable from the figure: stages DF (Data Fetch stage), DP (Data Processing stage), and DD (Data Deposit stage); the counter I (like PC = Program Counter) with IP and IPP (WA = Write Address); R, RP, and RPP (WR = Write); the Adder output (WD = Write Data); ~RESET on the registers; STALL_DP_HW ("We renamed this signal") and the Second_Clk FF.
The "not so good" 2-state state diagram: state S0 (Second_Clk = 0), entered on ~RESET, with "if (RP) STALL_DP_HW = 1; else STALL_DP_HW = 0;"; state S1 (Second_Clk = 1) with "STALL_DP_HW = 0;"; transitions labeled with RP. The blocking assignment symbolically represents a combinational OFL output.

Q1P2 Non-grading page, Do Not Submit

Fix this problem in the figure below by completing the state diagram. You do not have to implement the state machine using gates. Complete the three enables and two bubble injections marked in the figure.
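The 4-clock behavior described for STALL_DP_HW can be reproduced with a short simulation of the original 2-state state diagram. This is a sketch in Python under stated assumptions, not the exam's intended fix: the function name `clocks_given_to_adder` is hypothetical, DDCM is sampled once at the end of each clock (so "active for 2.7 clocks" is modeled as active at the end of clocks 1 and 2 only), the Second_Clk FF is assumed to toggle regardless of STALL_DD, and RP and RPP are assumed to stay at 1 while their stages are held.

```python
def clocks_given_to_adder(ddcm_at_clock_end):
    """Count the clocks one instruction spends in DP with the original STALL_DP_HW FSM.

    ddcm_at_clock_end: list of 0/1 samples of DDCM, one per clock (assumed model).
    Returns (number of clocks in DP, per-clock trace of STALL_DP_HW).
    """
    second_clk = 0   # state S0 after reset
    rp = 1           # assumption: a valid instruction sits in DP throughout
    rpp = 1          # assumption: DD also holds a valid instruction
    trace = []
    clocks = 0
    for ddcm in ddcm_at_clock_end:
        clocks += 1
        # Output logic of the given diagram: S0 asserts the stall if RP, S1 never does.
        stall_dp_hw = 1 if (second_clk == 0 and rp) else 0
        stall_dd = ddcm & rpp
        stall_dp = stall_dd | stall_dp_hw
        trace.append(stall_dp_hw)
        if not stall_dp:
            break    # the instruction leaves DP at this clock edge
        # Next state: S0 goes to S1 on RP, S1 always returns to S0.
        second_clk = 1 if (second_clk == 0 and rp) else 0
    return clocks, trace


print(clocks_given_to_adder([0, 0, 0]))        # no deposit miss: (2, [1, 0])
print(clocks_given_to_adder([1, 1, 0, 0, 0]))  # DDCM held for 2.7 clocks: (4, [1, 0, 1, 0])
```

Under these assumptions the adder gets 2 clocks when nothing stalls ahead of it, but 4 clocks (rather than the expected 3) when DDCM overlaps the local stall, because the state machine keeps alternating while the stage is held.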

In your design above, if, at the beginning of a clock, STALL_DPL and STALL_DD go active together, complete the table below to indicate how long which signals are active.

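[Figure to complete] Bare-bones copy of the pipeline for focusing on stalls. Labels recoverable from the figure: stages DF and DD; the counters I, IP, and IPP; R, RP, and RPP; ~RESET on the registers; A[I]; numbered marks 1 to 5.

Convenient format: Partly in schematic, partly in Verilog, and partly in state diagram.

assign STALL_DD = DDCM & RPP;
assign STALL_DP = STALL_DD | STALL_DPL;
assign STALL_DF = STALL_DP | (DFCM & R);

State diagram to complete: state S0 (Second_Clk = 0), entered on ~RESET, with "if (C1) STALL_DPL = 1; else STALL_DPL = 0;"; state S1 (Second_Clk = 1); transitions labeled C2 and C3.

C1 = ____________________
C2 = ____________________
C3 = ____________________

Blank area

Q1P3 page total = pts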


2 ( points) min. Tomasulo

You know the above two designs very well. The top is the OoC design and bottom is the IoC design.

[Figure, top] IoI-OoE-OoC with RST. Blocks recoverable from the figure: I-Cache, Instruc. Queue, Register Status Table, TAG FIFO, Reg. File, Dispatch, Int. Queue, L/S Queue, Div Queue, Mult. Queue, Issue Unit, Integer / Branch, Div, Mul, Load Buffer, D-Cache, CDB.

[Figure, bottom] IoI-OoE-IoC_with_ROB. Blocks recoverable from the figure: I-Cache, Br. Pred. Buffer, Instruction Prefetch Queue, ROB (example entries: 1 mult, 2 Completed, 3 lw, 4 Completed; Current Head, Current Tail, WP), Reg. File, Dispatch, Int. Queue, L/S Queue, Div Queue, Mult. Queue, Issue Unit, Integer / Branch, Div, Mul, Addr. Adder, L/S Buffer, Store Addr. Buffer, D-Cache, CDB.
Notes in the figure: No store buffer for EE457. Stores hit in cache always for EE457.

Q2P4 Non-grading page, Do Not Submit


2.1 Reproduced below is the solution to Question 1.1 from the Fall 2017 Final exam.

Read the four lines in the Box A above. The first three lines are a little vague and they seem to imply that the delay in $2’s availability stopped the instr. #1 from executing allowing #4 to go ahead of #1. Argue that #4 could not have gone ahead of #1 because of $2’s unavailability. Hint: Consider Memory Disambiguation rules. But consider cache misses allowing the sequence to happen.

[BOX A]

[BOX B]



____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

Now consider the four lines in Box B. As instructions #4, #1, #5, and #2 graduate in that order, which of them are allowed by dispatch to write to $8 and which are prevented from writing to $8? How does it "prevent"? You may like to use words like "associative search", finding, not finding, etc.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

Can the 6 instructions (#4, #1, #5, #2, #6, #3) possibly appear on CDB in that order in the IoC design? __________ (Yes / No).

Among (#4, #1, #5, #2), which are allowed to write to $8, and in what order, in the IoC design? ____________________________________________________
____________________________________________________________________________
____________________________________________________________________________

2.2 CDB in IoC: Mr. Bruin removed the CDB register, performed behavioral simulation, and proved that his design took fewer clocks. Any advice?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

2.3 Does RST have a valid bit in its entries? ______ (Y / N). Explain. ______________________ ____________________________________________________________________________ ____________________________________________________________________________

2.4 In the performance chapter, we talked about different implementations of a given ISA.
Can two implementations of the same ISA differ in the size of the RST in OoC? _____ (Y/N)
Can two implementations of the same ISA differ in the size of the ROB in IoC? _____ (Y/N)

4 pts



2.5 Two students were debugging their IoC CPUs. They clear the WP, the RP, and the ROB contents at start. Before dispatching the first instruction, S#1’s WP was incrementing once unnecessarily. Before dispatching the first instruction, S#2’s RP was incrementing once unnecessarily. Both experienced deadlock after some time in simulation. Explain why deadlock occurs, who gets to dispatch more instructions before the deadlock occurs, and why/how.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

2.6 Is it true that, for the same code running on your Lab 6 (IoI-IoE-IoC) vs. on your Tomasulo part 2 (IoI-OoE-IoC), there are more branch-related flushes in the latter, yet the latter performs better? ________ (Y/N). Explain.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

2.7 In a revised IoC design the LS buffer is changed to L Buffer and a separate path to CDB is provided for the store words. This reduced congestion in LSQ and improved performance.

Why do you need L buffer but not S buffer? ___________________________________________ _______________________________________________________________________________ _______________________________________________________________________________

2.8 Instruction Queues ___________ (are / aren’t) FIFOs. Explain _________________ ____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________

2.9 Is it true that in the OoC design, since we do not predict branches, we do not flush any instructions in the back end? ________ (Y/N).
We need to flush the instructions gathered in the IFQ of the OoC design in the case of (circle all applicable)
(i) taken branches
(ii) not taken branches
(iii) jumps
(iv) jal
(v) indirect jumps (jr)



3 ( points) min. Miscellaneous advanced topics

3.1 Topic: Common to all advanced topics

3.1.1 If you want one job (which cannot be subdivided into processes or tasks) (circle the best choice)
(i) a 4-core single-threaded IoE (In order Executing) processor
(ii) a 4-core single-threaded OoE (Out of order Executing) processor
(iii) a single-core 4-threaded IoE (In order Executing) processor
(iv) a single-core 4-threaded OoE (Out of order Executing) processor, preferably doing SMT
(v) a single-core single-threaded IoE (In order Executing) processor
(vi) a single-core single-threaded OoE (Out of order Executing) processor
(vii) a single-core single-threaded IoE (In order Executing) 4-way superscalar processor
(viii) a single-core single-threaded OoE (Out of order Executing) 4-way superscalar processor

Among the 8 choices above, a non-blocking L1 cache is useless in _________________________ (list them by roman numerals)

3.1.2 Miss rate Per Instruction (MPI) and Clocks Per Instruction (CPI): __________ (MPI/CPI) is specified for each level of the cache. In the Dynamic Instruction Trace of a benchmark program of 1 million instructions, 0.1 million (100,000) are memory instructions. If MPI is stated as 0.5%, should we infer that _________ (A/B) instructions incurred a cache miss? Here, A = 0.5% of the 1 million instructions = 5000 and B = 0.5% of the 100,000 memory instructions = 500.
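The two candidate counts A and B quoted above follow from simple arithmetic. The sketch below, in Python, only reproduces those two numbers and does not indicate which one is the intended answer; the helper name `misses_at_half_percent` is hypothetical.

```python
def misses_at_half_percent(instruction_count):
    """Return 0.5% of a count, using 5 per 1000 so the arithmetic stays in exact integers."""
    return instruction_count * 5 // 1000


total_instructions = 1_000_000    # whole dynamic instruction trace
memory_instructions = 100_000     # memory instructions within that trace

count_a = misses_at_half_percent(total_instructions)    # candidate A
count_b = misses_at_half_percent(memory_instructions)   # candidate B
print(count_a, count_b)  # 5000 500
```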

3.2 Topic: Branch Prediction

3.2.1 ____________ (BTB/BPB/both/neither) resemble(s) a cache .

3.2.2 _____________ (JAL Call_Addr / JR $31 / both / neither) is predicted by __________ (BTB/BPB/RAS).

3.2.3 _______ (Taken/Untaken) branches are usually more frequent than _______ (taken/untaken) branches.

3.2.4 Aliasing is OK in _________ (BTB/BPB) but it is not OK in _________ (BTB/BPB).

3.3 Topic: CMP (Multi-core Single threaded processors) and Cache Coherency:

3.3.1 These CMPs are also called ___________ (tightly / loosely) coupled __________________________ (shared memory / message passing) processors.

3.3.2 Barrier counter incrementation can be done lock-lessly using (circle all applicable)
(i) LL and SC in MIPS
(ii) LW and SW in MIPS
(iii) Test-n-Set in some CISC
(iv) Compare-and-Swap in some CISC

3.3.3 While we use (for simplicity) a bus to interconnect various cores and the memory, it is common to use a MIN for this. MIN stands for ___________________________________________________



3.4 Topic: CMT

3.4.1 A processor with 8 cores, each core running 2 threads: For each one of the following items, state how many there are (state a number) and whether it is a per-core or a per-thread resource.

(i) L1 private cache (do not count the Instruction Cache and the Data Cache as 2 items). State whether they are blocking or non-blocking caches and also the number of MSHRs. MSHR stands for ________________________________________ . _________ (CCU/SCU) leaves info. in the MSHR for _________ (CCU/SCU) to act on it.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

(ii) TLB (do not count the Instruction TLB and the Data TLB as 2 items).
____________________________________________________________________________
____________________________________________________________________________

(iii) MMU and PTBR
____________________________________________________________________________
____________________________________________________________________________

(iv) LL_Array (besides the number of LL_Arrays, state the number of rows in an LL_Array).
____________________________________________________________________________
____________________________________________________________________________

(v) PCs (program counters) and Register files
____________________________________________________________________________
____________________________________________________________________________

3.4.2 Between the two, CGMT (Coarse Grain Multi-Threading) and FGMT (Fine Grain Multi-Threading), ___________ (CGMT/FGMT) has a slight advantage in reducing RAW dependency stalls related to juniors of a lw instruction. The CGMT and FGMT names are used if the implementation is an ___________ (IoE / OoE / any / neither).

3.5 Topic: Mutual Exclusion, LL and SC: Two statements are given below along with code snippets. Agree or Disagree with each statement. If you disagree, explain why.

Statement #1: We need to repeat polling using LL until we find a zero in the lock. Agree / Disagree

LL_Repeat: LL $2, lock;
           BNE $2, $0, LL_Repeat;

Statement #2: We need to repeat SC until we succeed writing a one in the lock. Agree / Disagree

SC_Repeat: SC $1, lock;
           BEQ $1, $0, SC_Repeat;

____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________



3.6 Topic: Cache Coherency

3.6.1 MOESI: The figure on the left depicts permitted and forbidden combinations of a ___________ (word/block/page) in two L1 caches. Complete similar figures for the MSI and MOOESI protocols. Od = O-dirty, Oc = O-clean.

3.6.2 MOOESI: Replacements are __________ (from/to) I (Invalid State) __________ (from/to) any other state. Why are some of these transitions marked with R/FMM whereas others are marked with R/--? What is the difference?
____________________________________________________________________________
____________________________________________________________________________

Important difference between the E to M transition and the 3 other transitions to M (S to M, Odirty to M, Oclean to M):
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________

3.6.3 Cache coherency protocols reduce congestion on the bus while polling for the lock to be released. ________ (T / F)
Say the three cores have 8 threads each, and all 24 threads are competing for the SDBL. If Thread0 of Core0 got the lock, how and when are the remaining 23 threads eliminated from the competition?
____________________________________________________________________________
____________________________________________________________________________

[Figure] Grid of permitted and forbidden state combinations, with L1 Cache #1 on one axis and L1 Cache #2 on the other.

[Figure] MOOESI state transition diagram with the states M, O (Dirty), O (Clean), E, S, and I. Arcs are labeled with processor events (PrRd, PrRd(S), PrWr), bus events (BusRd, BusRdX, BusUpgr), and the resulting actions (Flush or --), for example PrWr/BusUpgr, PrRd/--, BusRd/Flush, BusRdX/Flush, BusUpgr/--. A second copy of the diagram adds the replacement arcs, labeled R/FMM or R/--.

R/FMM stands for Replacement/FlushToMM. R/-- stands for just Replacement.



4 ( points) min. WBFF

On the right is a solution to one of the Lab 6 Part 4 questions. There are two Wrist-Band Flip-Flops. WBFF#1 is ___________ (Set/Reset) and WBFF#2 is ___________ (Set/Reset) on power-on (under reset). This solution assumes _______ (0/1/2) branch delay slots.

In the figure below, we have 5 IF stages and 5 WBFFs with different power-on initializations as stated in the figure. Complete the design assuming one branch delay slot.

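[Figure, right] Solution to one of the Lab 6 Part 4 questions. Labels recoverable from the figure: stages IF1, IF2, and ID; BR1; PC control; WBFF#1 and WBFF#2, each with RESET.

[Figure to complete] Stages IF1, IF2, IF3, IF4, and IF5, each with RESET, and WBFF#1 through WBFF#5. Power-on labels in the figure: Set on power-on, Reset on power-on, Set on power-on, Reset on power-on, Set on power-on.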

Blank area



Blank page: Please write your name and email. Tear it off and use it for rough work. Do not submit.
Student’s Last Name: ____________________ email: __________________

We enjoyed teaching EE457. Hope to see some of you in EE560 next week!
Gandhi, TA: Kartik, Mentors: Gengyu and Jize, Graders: Arvind, Medha, Yunfei, Sanket, Fangqing, and Lin
