May 12, 2021 10:55 am EE457 Final - Spring 2021 1 / 12 C Copyright 2021 Gandhi Puvvada
EE457 Final Exam (~33.5%) Spring 2021
Instructor: Gandhi Puvvada
Final Exam: Wednesday, May 12, 2021 03:15-06:15 (with overhead 03:00-06:35 PM PST) Zoom
Viterbi School of Engineering
University of Southern California
Ques#  Topic                           Page#              Time  Points  Score
1      Pipelined Array Processor       2-3
2      Tomasulo                        4-7
3      Miscellaneous advanced topics   8-10
4      WBFF (in Lab 6 Part 4)          11
Total                                  Cover+10+1 = 12
Perfect Score
I have previously read the Viterbi Code of Integrity and other related material at the sites (a) https://viterbischool.usc.edu/academic-integrity/ (b) https://sjacs.usc.edu/students/academic-integrity/ and I will abide by these rules of conduct. I will neither seek help from others nor offer help to others in my exams.
_____________________________ <== Student’s signature, DEN D2L username: @usc.edu
1 ( points) min. Pipelined Array Processor
You have gone through this 3-stage pipelined array processor. The adder needed an extra clock, so our standard 1-clock stall-generating circuit with a single FF was added to generate this stall.
1.1 A bare-bones copy of the above figure is shown on the next page for you to focus on stalls. Let us assume that each stage has a reason to call for its own stall. Let us assume that the A[I] and B[IPP] are memory locations, which are cached. They each can incur cache misses occasionally and individually. DFCM stands for Data Fetch Cache Miss and DDCM stands for Data Deposit Cache Miss. You stall the DF stage as long as DFCM and R (Run) are active. You stall DD stage (and the two stages behind it) as long as DDCM and RPP are active. DFCM can be ignored when R (Run) is inactive. Similarly, DDCM can be ignored when RPP is inactive.
assign STALL_DD = DDCM & RPP;
assign STALL_DP = STALL_DD | STALL_DPL;
assign STALL_DF = STALL_DP | (DFCM & R);
To reduce clutter, I have not drawn the above combinational logic on the next page. I have used STALL_DPL (L for local) to represent a local stall arising from the adder stage.
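The stall chain described above can be tried out with a small behavioral model. This is only a sketch in Python, not part of the exam: the function name `stalls` and the 0/1 integer encoding of the signals are assumptions made for illustration, and the three expressions simply mirror the three `assign` statements.

```python
def stalls(ddcm, rpp, stall_dpl, dfcm, r):
    """Mirror the three assign statements; every signal is 0 or 1."""
    # DDCM is ignored when RPP is inactive
    stall_dd = ddcm & rpp
    # DP stalls for its own local reason or because DD ahead of it stalls
    stall_dp = stall_dd | stall_dpl
    # DF stalls because DP stalls or because of its own cache miss while R is active
    stall_df = stall_dp | (dfcm & r)
    return stall_dd, stall_dp, stall_df


# A data-deposit miss with a valid item in DD stalls all three stages.
print(stalls(ddcm=1, rpp=1, stall_dpl=0, dfcm=0, r=0))
# Misses are ignored when the corresponding run bits are inactive.
print(stalls(ddcm=1, rpp=0, stall_dpl=0, dfcm=1, r=0))
# A data-fetch miss alone stalls only the DF stage.
print(stalls(ddcm=0, rpp=0, stall_dpl=0, dfcm=1, r=1))
```

The model makes the priority visible: a stall raised by a stage always propagates to the stages behind it, never to the stages ahead of it.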
Fig. 13: Data Stationary Method of Control with STALL and Bubble Injection added to it.
[Figure not reproduced. Recoverable labels: a Counter (like PC = Program Counter) producing I, followed by IP and IPP (WA = Write Address); the stages DF (Data Fetch stage), DP (Data Processing stage), and DD (Data Deposit stage); the run bits R, RP, and RPP (WR = Write), cleared by ~RESET; A[I], the Adder with output Y, T1, and T2 (WD = Write Data); and the Second_Clk FF producing STALL_DP_HW.]
State diagram in the figure (~RESET leads to S0; the transitions are labeled with RP):
S0 (Second_Clk = 0): if (RP) STALL_DP_HW = 1; else STALL_DP_HW = 0;
S1 (Second_Clk = 1): STALL_DP_HW = 0;
Figure notes: The blocking assignment symbolically represents a combinational OFL output. We renamed this signal STALL_DP_HW (not so good).
(Non-grading page Q1P2, Do Not Submit.)

The STALL_DP_HW signal (a new name given to differentiate it; HW for HW solution), produced in the above figure using the Second_Clk FF and the 2-state state diagram, is not so good for the current problem because it assumes that there are no stalling reasons, like STALL_DD, ahead of it. Suppose that, when STALL_DP_HW goes active, DDCM also goes active and stays active for 2.7 clocks, with
assign STALL_DP = STALL_DD | STALL_DP_HW;
Then we expect that STALL_DP would also be active for 2.7 clocks (providing 3 clocks instead of 2 clocks for the adder). But we will end up providing 4 clocks, as STALL_DP_HW would go inactive in the second clock and would go active again in the 3rd clock. It would again go inactive in the fourth clock. Fix this problem in the figure below by completing the state diagram below. You do not have to implement the state machine using gates. Complete the three enables and two bubble injections marked as "#" (numbered 1 through 5) in the figure.
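The 4-clock behavior described above can be reproduced with a short cycle-by-cycle model. This is a sketch under stated assumptions, not the intended solution: the function name `clocks_given_to_adder` is hypothetical, RP is assumed to stay 1 the whole time, STALL_DD is represented by its value at each clock edge, and the state machine is assumed to return from S1 to S0 unconditionally, which matches the toggling the text describes.

```python
def clocks_given_to_adder(stall_dd_at_edge):
    """Count the clocks one operand pair spends in DP under the STALL_DP_HW FSM.

    stall_dd_at_edge[k] is STALL_DD (0 or 1) as seen at the end of clock k+1.
    """
    second_clk = 0  # state FF: 0 = S0, 1 = S1
    rp = 1          # assumption: a valid item sits in DP throughout
    clocks = 0
    for stall_dd in stall_dd_at_edge:
        clocks += 1
        # Combinational output of the FSM: active only in S0 when RP is set
        stall_dp_hw = 1 if (second_clk == 0 and rp) else 0
        stall_dp = stall_dd | stall_dp_hw
        # The FSM changes state every clock, unaware of STALL_DD
        second_clk = 1 if (second_clk == 0 and rp) else 0
        if not stall_dp:
            return clocks  # DP advances at the end of this clock
    return None  # still stalled when the trace ended


# No data-deposit miss: the adder gets its usual 2 clocks.
print(clocks_given_to_adder([0, 0, 0, 0]))
# DDCM active for 2.7 clocks: seen at the first two clock edges only.
print(clocks_given_to_adder([1, 1, 0, 0, 0]))
```

With the miss present, the model holds the operands for 4 clocks rather than the 3 one would hope for, because the FSM re-enters S0 while the stage is still held by STALL_DD.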
In your design above, if, at the beginning of a clock, STALL_DPL and STALL_DD go active together, complete the table below to indicate how long which signals are active.
[Figure to complete, not reproduced. It is the bare-bones copy of Fig. 13 with the labels I, IP, IPP; DF and DD; R, RP, RPP (each with ~RESET); and I, A[I], T1, T2. The items to complete are marked # and numbered 1 to 5.]
Convenient format: Partly in schematic, partly in Verilog, and partly in state diagram.
assign STALL_DD = DDCM & RPP;
assign STALL_DP = STALL_DD | STALL_DPL;
assign STALL_DF = STALL_DP | (DFCM & R);
State diagram to complete (~RESET leads to S0; the transition labels are C2 and C3):
S0 (Second_Clk = 0): if (C1) STALL_DPL = 1; else STALL_DPL = 0;
S1 (Second_Clk = 1): ?
C1 = ______________
C2 = ______________
C3 = ______________
Blank area
2 ( points) min. Tomasulo
You know the above two designs very well. The top is the OoC design and bottom is the IoC design.
[Figures not reproduced. Recoverable labels are listed below.]
Top figure, IoI-OoE-OoC with RST: I-Cache, Instruc. Queue, Register Status Table, Reg. File, TAG FIFO, Dispatch, Int. Queue, L/S Queue, Div Queue, Mult. Queue, Issue Unit, Integer / Branch, Div, Mul, Load Buffer, D-Cache, CDB.
Bottom figure, IoI-OoE-IoC_with_ROB: I-Cache, Br. Pred. Buffer, Instruction Prefetch Queue, Reg. File, Dispatch, ROB (entries: 1 mult, 2 Completed, 3 lw, 4 Completed; pointers: Current Head, Current Tail, WP), Int. Queue, L/S Queue, Div Queue, Mult. Queue, Issue Unit, Integer / Branch, Div, Mul, Addr. Adder, L/S Buffer, Store Addr. Buffer, D-Cache, CDB.
Figure notes: No store buffer for EE457. Stores hit in cache always for EE457.
(Non-grading page Q2P4, Do Not Submit.)
2.1 Reproduced below is the solution to Question 1.1 from the Fall 2017 Final exam.
[Figure not reproduced: the Fall 2017 solution, with two regions marked BOX A and BOX B.]
Read the four lines in Box A above. The first three lines are a little vague, and they seem to imply that the delay in $2’s availability stopped instr. #1 from executing, allowing #4 to go ahead of #1. Argue that #4 could not have gone ahead of #1 because of $2’s unavailability. Hint: Consider Memory Disambiguation rules. But consider cache misses allowing the sequence to happen.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Now consider the four lines in Box B. As instructions #4, #1, #5, and #2 graduate in that order, which of them are allowed by dispatch to write to $8, and which are prevented from writing to $8? How does it "prevent"? You may like to use words like "associative search", finding, not finding, etc.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
Can the 6 instructions (#4, #1, #5, #2, #6, #3) possibly appear on CDB in that order in the IoC design? __________ (Yes / No).
Among (#4, #1, #5, #2), which ones are allowed to write to $8, and in what order, in the IoC design?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.2 CDB in IoC: Mr. Bruin removed the CDB register, performed behavioral simulation, and proved that his design took fewer clocks. Any advice?
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.3 Does RST have a valid bit in its entries? ______ (Y / N). Explain.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.4 In the performance chapter, we talked about different implementations of a given ISA.
Can two implementations of the same ISA differ in the size of the RST in OoC? _____ (Y/N)
Can two implementations of the same ISA differ in the size of the ROB in IoC? _____ (Y/N)
4 pts
2.5 Two students were debugging their IoC CPUs. They clear the WP, the RP, and the ROB contents at start. Before dispatching the first instruction, S#1’s WP was incrementing once unnecessarily. Before dispatching the first instruction, S#2’s RP was incrementing once unnecessarily. Both experienced deadlock after some time in simulation. Explain why deadlock occurs, and who gets to dispatch more instructions before the deadlock occurs and why/how.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.6 Is it true that, for the same code running on your lab 6 (IoI-IoE-IoC) vs. on your Tomasulo part 2 (IoI-OoE-IoC), there are more branch-related flushes in the latter, yet the latter performs better? ________ (Y/N). Explain.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
2.7 In a revised IoC design the LS buffer is changed to L Buffer and a separate path to CDB is provided for the store words. This reduced congestion in LSQ and improved performance.
Why do you need L buffer but not S buffer? ___________________________________________ _______________________________________________________________________________ _______________________________________________________________________________
2.8 Instruction Queues ___________ (are / aren’t) FIFOs. Explain _________________ ____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________
2.9 Is it true that in the OoC design, since we do not predict branches, we do not flush any instructions in the back end? ________ (Y/N).
We need to flush the instructions gathered in the IFQ of the OoC design in the case of (circle all applicable)
(i) taken branches
(ii) not taken branches
(iii) jumps
(iv) jal
(v) indirect jumps (jr)
3 ( points) min. Miscellaneous advanced topics
3.1 Topic: Common to all advanced topics
3.1.1 If you want one job (which cannot be subdivided into processes or tasks) (circle the best choice)
(i) a 4-core single-threaded IoE (In order Executing) processor
(ii) a 4-core single-threaded OoE (Out of order Executing) processor
(iii) a single-core 4-threaded IoE (In order Executing) processor
(iv) a single-core 4-threaded OoE (Out of order Executing) processor preferably doing SMT
(v) a single-core single-threaded IoE (In order Executing) processor
(vi) a single-core single-threaded OoE (Out of order Executing) processor
(vii) a single-core single-threaded IoE (In order Executing) 4-way superscalar processor
(viii) a single-core single-threaded OoE (Out of order Executing) 4-way superscalar processor
Among the 8 choices above, a non-blocking L1 cache is useless in _________________________ (list them by roman numerals)
3.1.2 Miss rate Per Instruction (MPI) and Clocks per Instruction (CPI): __________ (MPI/CPI) is specified for each level of the cache. In the Dynamic Instruction Trace of a benchmark program of 1 million instructions, 0.1 million (100,000) are memory instructions. If MPI is stated as 0.5%, should we infer that _________ (A/B) instructions incurred a cache miss? Here, A = 0.5% of the 1 million = 5000 and B = 0.5% of the 100,000 memory instructions = 500.
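The two candidate counts A and B quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `misses` and the choice to express 0.5% as 5 per thousand (to stay in integer arithmetic) are assumptions for illustration, and the code deliberately does not say which of A or B is the right inference.

```python
def misses(instruction_count, rate_per_thousand=5):
    """Apply a 0.5% rate (5 per thousand) to an instruction count."""
    return instruction_count * rate_per_thousand // 1000


total_instructions = 1_000_000   # whole dynamic instruction trace
memory_instructions = 100_000    # memory instructions within that trace

a = misses(total_instructions)    # candidate A: rate applied to all instructions
b = misses(memory_instructions)   # candidate B: rate applied to memory instructions only
print(a, b)
```

The factor of ten between the two results comes entirely from the choice of denominator, which is what the question asks the reader to decide.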
3.2 Topic: Branch Prediction
3.2.1 ____________ (BTB/BPB/both/neither) resemble(s) a cache.
3.2.2 _____________ (JAL Call_Addr / JR $31 / both / neither) is predicted by __________ (BTB/BPB/RAS).
3.2.3 _______ (Taken/Untaken) branches are usually more numerous than _______ (taken/untaken) branches.
3.2.4 Aliasing is OK in _________ (BTB/BPB) but it is not OK in _________ (BTB/BPB).
3.3 Topic: CMP (Multi-core Single threaded processors) and Cache Coherency:
3.3.1 These CMPs are also called ___________ (tightly / loosely) coupled __________________________ (shared memory / message passing) processors.
3.3.2 Barrier counter incrementation can be done lock-lessly using (circle all applicable)
(i) LL and SC in MIPS
(ii) LW and SW in MIPS
(iii) Test-n-Set in some CISC
(iv) Compare-and-Swap in some CISC
3.3.3 While we use (for simplicity) a bus to interconnect various cores and the memory, it is common to use a MIN for this. MIN stands for ___________________________________________________
3.4 Topic: CMT
3.4.1 A processor with 8 cores, each core running 2 threads: For each one of the following items, state how many there are (state a number) and whether it is a per-core or per-thread resource.
(i) L1 private cache (do not count the Instruction Cache and the Data Cache as 2 items). State whether they are blocking or non-blocking caches and also the number of MSHRs. MSHR stands for ________________________________________. _________ (CCU/SCU) leaves info. in the MSHR for _________ (CCU/SCU) to act on it.
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
(ii) TLB (do not count the Instruction TLB and the Data TLB as 2 items).
____________________________________________________________________________
____________________________________________________________________________
(iii) MMU and PTBR
____________________________________________________________________________
____________________________________________________________________________
(iv) LL_Array (besides the number of LL_Arrays, state the number of rows in an LL_Array).
____________________________________________________________________________
____________________________________________________________________________
(v) PCs (program counters) and Register files
____________________________________________________________________________
____________________________________________________________________________
3.4.2 Between the two, CGMT (Coarse Grain Multi-Threading) and FGMT (Fine Grain Multi-Threading), ___________ (CGMT/FGMT) has a slight advantage in reducing RAW dependency stalls related to juniors of a lw instruction. The CGMT and FGMT names are used if the implementation is ___________ (IoE / OoE / any / neither).
3.5 Topic: Mutual Exclusion, LL and SC: Two statements are given below along with code snippets. Agree or Disagree with each statement. If you disagree, explain why.

Statement #1
LL_Repeat: LL $2, lock;
           BNE $2, $0, LL_Repeat;
We need to repeat polling using LL until we find a zero in the lock.
Agree / Disagree

Statement #2
SC_Repeat: SC $1, lock;
           BEQ $1, $0, SC_Repeat;
We need to repeat SC until we succeed writing a one in the lock.
Agree / Disagree

____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
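For readers who want the LL and SC terms made concrete, here is a toy model of a single lock location. It is a sketch under loud assumptions, not the course's hardware: the class name `LinkedMemory`, the per-core link set, and the rule that a successful SC clears every link are all illustrative simplifications, and the model takes no position on either statement above.

```python
class LinkedMemory:
    """Toy model of one memory word accessed with LL and SC."""

    def __init__(self):
        self.value = 0
        self.links = set()  # cores whose link to this word is still intact

    def ll(self, core):
        """Load-linked: return the value and remember that this core is linked."""
        self.links.add(core)
        return self.value

    def sc(self, core, new_value):
        """Store-conditional: write only if the core's link is intact.

        Returns 1 on success and 0 on failure, like the flag SC leaves
        in its register.
        """
        if core not in self.links:
            return 0
        self.value = new_value
        self.links.clear()  # assumption: a successful write breaks all links
        return 1


lock = LinkedMemory()
print(lock.ll(0), lock.ll(1))   # both cores read the lock and get linked
print(lock.sc(0, 1))            # core 0 stores first
print(lock.sc(1, 1))            # core 1's link was broken by core 0's store
print(lock.ll(1))               # core 1 reads the lock again
```

The usage lines show the two return values of SC and how a store by one core affects the link held by another.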
3.6 Topic: Cache Coherency
3.6.1 MOESI: The figure on the left depicts permitted and forbidden combinations of a ___________ (word/block/page) in two L1 caches. Complete similar figures for MSI and MOOESI protocols. Od = O-dirty, Oc = O-clean.
3.6.2 MOOESI: Replacements are __________ (from/to) I (Invalid State) __________ (from/to) any other state. Why are some of these transitions marked with R/FMM whereas others are marked with R/--? What is the difference?
____________________________________________________________________________
____________________________________________________________________________
Important difference between the E to M transition and the 3 other transitions to M (S to M, Odirty to M, Oclean to M):
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
3.6.3 Cache coherency protocols reduce congestion on the bus while polling for the lock to be released. ________ (T / F)
Say the three cores have 8 threads each, and all 24 threads are competing for the SDBL. If Thread0 of Core0 got the lock, how and when are the remaining 23 threads eliminated from the competition?
____________________________________________________________________________
____________________________________________________________________________
[Figures not reproduced. The first is the MOESI compatibility figure, with axes L1 Cache #1 and L1 Cache #2. The second is the MOOESI state diagram, which appears twice in the extraction, with states M, O (Dirty), O (Clean), E, S, and I. Its transition labels are built from PrRd, PrRd(S), PrWr, BusRd, BusRdX, BusUpgr, and Flush, and its replacement arcs are marked R/-- or R/FMM.]
R/FMM stands for Replacement/FlushToMM. R/-- stands for just Replacement.
4 ( points) min. WBFF
On the right is a solution to one of the Lab 6 Part 4 questions. There are two Wrist-Band Flip-Flops. WBFF#1 is ___________ (Set/Reset) and WBFF#2 is ___________ (Set/Reset) on power-on (under reset). This solution assumes _______ (0/1/2) branch delay slots.
In the figure below, we have 5 IF stages and 5 WBFFs with different power-on initializations as stated in the figure. Complete the design assuming one branch delay slot.
[Figures not reproduced. The solution figure shows WBFF#1 and WBFF#2 with the stages IF1, IF2, and ID, the signal BR1, the PC control, and RESET inputs. The figure to complete adds the stages IF3, IF4, and IF5 (each with RESET) and shows WBFF#1 through WBFF#5; the power-on initializations stated in it read, in extraction order: Set on power-on, Reset on power-on, Set on power-on, Reset on power-on, Set on power-on.]
Blank area
Blank page: Please write your name and email. Tear it off and use it for rough work. Do not submit.
Student’s Last Name:____________________ email: __________________
We enjoyed teaching EE457. Hope to see some of you in EE560 next week!
Gandhi, TA: Kartik, Mentors: Gengyu and Jize, Graders: Arvind, Medha, Yunfei, Sanket, Fangqing, and Lin