1 ( 90 points) 65 min. - University of Southern California · Spring 2014 EE457 Instructor: Gandhi...

ee457_Final_Sp2014.fm, May 9, 2014 12:15 am. EE457 Final Exam - Spring 2014. Copyright 2014 Gandhi Puvvada

Spring 2014 EE457 Instructor: Gandhi Puvvada
Final Exam (30%) Date: 5/12/2014, Monday
Closed Book, Closed Notes, no cheat sheet, no Calculator
Time: 04:30-07:20PM in THH102/THH301

Esperan Verilog Guide is allowed but not required
Total points: 261 Perfect score: 250 / 261
Name:

1 ( 90 points) 65 min.

Topic: pipeline design (refer to the block diagrams on pages 5 and 6 and the two OoO designs below)

1.1 HDU (in the ID stage) is overridden and told not to stall in some cases to avoid conflict with the operation of the branch instruction. If this applies, write YES otherwise write NO in the cells below.

1.1.1 _____________ (HDU_Br / HDU /both HDU and HDU_Br) ______ (is / are) called the guardian angel in the case of _________________________ (an early branch / a medium-delay branch / a late branch / multiple of these).

1.1.2 FU_Br in the early branch design checks to see if the senior instruction in the MEM stage is (a) a register-writing instruction (b) a register-writing R-Type instruction but not a lw instruction (c) neither of the above.

1.2 Compared to the late branch, an early branch has __________ (less / more) branch penalty due to flushing and this difference is seen in ________________________________ (successful branches / unsuccessful branches / both / neither). Compared to the late branch, an early branch causes __________ (less / more) RAW dependency stalls and this difference is seen in _____________________________ (successful branches / unsuccessful branches / both / neither).

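[Figure: block diagrams of the two OoO designs, the IoI - OoE - OoC design and the IoI - OoE - IoC design; a block labeled ROB appears in the figure.]

Margin points on this page, in order: 3+2, 2+1, 3, 4+2 pts.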


1.3 Assembly language code can be written deliberately
(i) to make a late branch look better than an early branch. True / False
(ii) to make a late branch look better than a medium-delay branch. True / False
(iii) to make a medium-delay branch look better than an early branch. True / False

1.4 There are four 5-bit comparison units in the ID stage of the lab 6 Part 5 design as shown below. This number 4 __________________ (includes / does not include) the two comparators needed for internal forwarding in the register file. What would this number be if it were the 7-stage pipeline of lab 6 part 4? _________ . This part 5 design provides a timing advantage in _________ (ID stage / EX stage / neither) and a timing disadvantage in _________ (ID stage / EX stage / neither).

1.5 Assuming that both early branch and late branch designs are working at the same frequency, the control unit of the early branch has to finish decoding the code a little more quickly compared to the control unit of the late branch. True / False

1.5.1 We can be more confident to move the 5-bit wide multiplexer on the side from the current EX stage to the ID stage in the case of the ________ (early / late) branch design. This move saves in the ID/EX stage register 1 bit for the RegDst control signal. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the current lab 6 design. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the lab 6 Part 5 design. This 5-bit multiplexer is in a time-critical path. True / False

1.6 A late branch cannot be too late. The WB stage in the 5-stage pipeline is too late for the late branch execution. True / False
The branch outcome is announced from the CDB in our IoI-OoE-OoC design on page 1. Mr. Bruin says that the CDB is like the WB stage, hence branch execution is too late and our design is wrong. Please explain. ______________________________________________________________________________

1.7 The stage register IF/ID of the in-order 5-stage pipeline is replaced by what in the IoI-OoE-OoC design on page 1? _____________________________________________________________________________________________________________________________________________

1.8 add is stalled for _____ (1 / 2) clock(s) in an in-order 5-stage pipeline and _____ (1 / 2) clock(s) in an in-order 7-stage pipeline. If lw incurs cache miss ______ (more / less) clocks are lost. This loss impacts CPI ________ (more / less / the same) in the IoI-OoE-OoC design on page 1.

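Margin points on this page, in order: 3+2, 4+2, 2, 4+2, 5, 3, 4 pts.

[Figure for 1.4: ID stage of the lab 6 Part 5 design, showing the register file (Reg), instruction and data paths, HDU, HDU_Br, FU, FU_Br, the branch logic (BRANCH, Zero), PC control, and six 5-bit equality comparators (EQ).]

Code for 1.8:
lw $2, 2000($0);
add $3, $2, $1;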


1.9 If we use 64 tokens or TAGs in the IoI-OoE-OoC design, the TAG FIFO will have ____ locations each of ____ bits per location. Since we compared this FIFO to ___________________ ________________________________ (paper tokens forming a virtual queue / pile of tokens on the cashier’s table in State Bank of India) it _____________________ (matters / does not matter) in which order the tokens 0 to 63 are placed initially in the token FIFO. The TAG FIFO should have a valid bit to indicate whether the location has a valid token or if it is empty. True / False

1.9.1 The RST (Register Status Table) is a look-up table to allow the dispatch unit to look up ________ ___________________________________ (entries for each of the source registers / entry for the destination register) of the instruction being dispatched (say add $3, $2, $1). This look-up is ____________________ (a search / an indexing) operation. The destination register gets renamed to a token drawn from the token FIFO. This register renaming causes _____________ (reading from / writing to) the Token FIFO and _____________ (reading from / writing to) the RST. The RST should have a valid bit besides the TOKEN field in each entry. True / False

1.9.2 Suppose a register-writing instruction comes on the CDB and says, "it is LION calling (or some 6-bit token in place of LION), and I am going to write the value 2000". Let us try to understand what the dispatch unit should do now. Circle all correct statements:
(a) the dispatch unit does not do anything
(b) it indexes the RST to find LION in the RST
(c) it performs a parallel search to find LION in the RST
(d) if it finds LION across $2 in the RST, it takes 2000 and deposits it in the register file in $2
(e) if it finds LION across $2 in the RST, and if it is currently dispatching an instruction with source register $2, it gives the value 2000 for its source register value. This is like the internal forwarding in the register file.
(f) if it finds LION across $2 in the RST, it erases LION’s name across $2 and invalidates that entry
(g) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the RP
(h) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the WP
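For readers who want a concrete picture of the token FIFO that 1.9 and 1.9.2 refer to, here is a minimal Python sketch of a generic circular FIFO of free tags. It is only an illustration under stated assumptions: the class and method names (`TokenFifo`, `draw`, `give_back`) are hypothetical, the occupancy counter is a software convenience, and nothing here is taken from the course's RTL or meant as the answer key.

```python
def token_bits(num_tokens):
    """Bits needed to name tokens 0 .. num_tokens-1."""
    return (num_tokens - 1).bit_length()


class TokenFifo:
    """Generic circular FIFO of free tags (illustrative, not the lab design)."""

    def __init__(self, num_tokens=64):
        self.size = num_tokens
        self.slots = list(range(num_tokens))  # starts full: tokens 0..num_tokens-1
        self.rp = 0                 # read pointer: next token to hand out
        self.wp = 0                 # write pointer: next slot for a returned token
        self.count = num_tokens     # occupancy (assumption: tracked with a counter)

    def draw(self):
        """Hand out the token at the read pointer (done at dispatch)."""
        if self.count == 0:
            raise RuntimeError("no free token: dispatch would have to stall")
        token = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return token

    def give_back(self, token):
        """Store a reclaimed token at the write pointer."""
        if self.count == self.size:
            raise RuntimeError("FIFO already full")
        self.slots[self.wp] = token
        self.wp = (self.wp + 1) % self.size
        self.count += 1


fifo = TokenFifo(64)
first, second = fifo.draw(), fifo.draw()
fifo.give_back(second)  # tokens may come back in a different order than issued
```

In this sketch a returned token simply queues up behind the tokens that have not been handed out yet, so it is reused only after the read pointer wraps around to it.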

1.10 Consider the four designs, (A) in-order 5-stage late branch, (B) in-order 5-stage early branch, (C) IoI-OoE-OoC, and (D) IoI-OoE-IoC.

1.10.1 A RAW problem for a memory location (i.e. lw dependent on a senior sw instruction with matching address) is naturally taken care of in _____________ (A/B/C/D/if multiple, then state them). Among the ones in which the problem needs to be taken care of by explicit logic, it takes less logic in ________ (A/B/C/D) compared to _________ (A/B/C/D).

1.10.2 WAW and WAR problems for memory locations do not exist in _______________ (A/B/C/D/if multiple, state them). Two store word instructions can leave LSQ in any order even if their addresses match in __________________________________ (C/D/neither C nor D/either C or D).

1.11 Every register writing instruction such as add $1, $2, $3 will end up writing into the destination register in the register file in __________________________ (IoI-OoE-OoC / IoI-OoE-IoC / both / neither) designs.

1.12 IFQ (Instruction PreFetch Queue) gets flushed ________________ (less often / more often / equally often) in the IoI-OoE-IoC design with branch prediction as compared to the IoI-OoE-OoC design with no branch prediction. In the IoI-OoE-IoC design, you end up flushing IFQ for every mis-prediction. T / F You flush IFQ when you predict a branch as ___________ (Taken/Not Taken/either/neither). Which is true in the IoI-OoE-OoC? ____ [(a) / (b)] (a) we just opted not to have branch prediction (b) Since it is OoC, branch prediction is not possible.

Margin points on this page, in order: 5+2, 5, 8+2, 4, 4, 4, 6+2 pts.


[Figure: Pipelined CPU (Late Branch, from 1st Ed.) for the EE457 class Lab #6, dated 3/26/2000. Original drawing provided by Prof. Dubois. Five stages (IF, ID, EX, MEM, WB) separated by the stage registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Blocks: PC, PC adder (+4), instruction memory, registers, control, sign extension, shift left 2, branch adder, ALU, ALU control, data memory, forwarding unit, hazard detection unit. Signals: ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, RegWrite, Branch, Z, ALU_result, Store_data, MEM_data, REG_data, IF.Flush, ID.Flush, EX.Flush.]


[Figure: Detailed implementation of Early Branch suggested in 3rd Ed., dated 10/18/06. Designed by Gandhi Puvvada; drawn by Wei-jen Hsu. Five stages (IF, ID, EX, MEM, WB) separated by the stage registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Blocks: PC, PC adder (+4), instruction memory, registers, control, sign extension, shift left 2, branch adder, equality comparator (=, producing Zero) in the ID stage, ALU, ALU control, data memory, hazard detection unit, Forwarding Unit, FU_Br, HDU_Br. Signals: ALUSrc, ALUOp, RegDst, RegWrite_EX, MemRead_EX, MemRead_MEM, WriteRegister_EX, WriteRegister_MEM, FW_RS, FW_RT, FW_RS_MEM, FW_RS_WB, FW_RT_MEM, FW_RT_WB, forwarding_mux_control, STALL, STALL_LW, STALL_BEQ, Branch, IF.Flush, MemRead, MemWrite, MemtoReg, RegWrite, ALU_result, Store_data, MEM_data, REG_data.]


2 ( 77 points) 45 min. Advanced topics Miscellaneous

2.1 Exceptions are taken in _____________________________ (program order / temporal order). An illegal instruction exception is often _____________________________ (a precise exception / an exception leading to abortion of the program) so as to support software emulation of unimplemented instructions.

2.2 In the two CMP (Chip Multi Processors) organizations shown below, the shared L2 cache is shown "banked" on the left but not on the right. Explain. ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.2.1 If there are 8 banks of L2 cache in the left side organization, is it true that there can be 8 copies of a block in the 8 banks besides 8 copies in the 8 L1 caches? True / Not true
Explain: ______________________________________________________________________________

2.3 To avoid the RWM race, you should be able to perform an atomic operation. What does RWM stand for? ____________________________________________________________________________

2.4 Compared to MSI protocol, MOESI protocol _______________ (reduces / increases) L2 to L1 transactions mainly because of _________ (O-state / E-state). Explain: _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.4.1 We proposed to split the O-state into two states: O-Dirty and O-Clean. You arrive in O-Dirty state from _________ (M/O-Clean/E/S/I) and arrive in O-Clean state from _________ (M/O-Dirty/E/S/I).

2.5 Branch Prediction from ID stage: In the diagram on the side, the (30-K) bit field of the PC is ______________________ (ignored / used as a TAG). Is there any connection between the frequency of aliasing and the BPB’s (Branch Prediction Buffer’s) depth (= 2^K) or width (1-bit predictor vs. 2-bit predictor)? ______________________________________________________________________________
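As a small illustration of the indexing described in 2.5, the Python sketch below splits a 32-bit, word-aligned PC into the two low-order bits, a K-bit field, and the remaining (30-K)-bit field. The function names and the two sample PC values are hypothetical; K = 8 is chosen only so that the K-bit field equals the 8-bit pattern 01010011 that appears in the diagram. The sketch shows what an index collision looks like and takes no position on how the upper field is treated in the exam's design.

```python
def bpb_index(pc, k):
    """K-bit field of the PC just above the two low-order bits."""
    return (pc >> 2) & ((1 << k) - 1)


def upper_field(pc, k):
    """Remaining (30-K)-bit field above the K-bit field."""
    return pc >> (2 + k)


K = 8                      # assumption: 2^8 = 256 BPB entries
pc_a = 0x0040014C          # hypothetical branch address
pc_b = 0x0080014C          # hypothetical branch address with the same low bits

same_entry = bpb_index(pc_a, K) == bpb_index(pc_b, K)
different_upper = upper_field(pc_a, K) != upper_field(pc_b, K)
```

Here both addresses select the same BPB entry even though their upper fields differ, which is the situation usually called aliasing.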

2.6 RAS stands for ________________________. It is usually ____________________ (4 or 8 locations / 1K to 2K locations) deep. The contents of RAS change during the execution of ______ _____ (jal / jr $31/both). RAS helps in the execution of ____________ (jal / jr $31/both) and such help __________ (is / isn’t) considered as a prediction that can go wrong.

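Margin points on this page, in order: 4, 4, 4, 2, 4, 4, 5, 6 pts.

[Figure for 2.2: two CMP organizations. Left: processors P0, P1, ..., P7, each with its own L1$, connected through a Memory Interconnection Network to a shared (banked) L2 cache drawn as several L2$ banks. Right: processors P0, P1, ..., P7, each with its own L1$, connected to a shared L2 cache (no banks).]

[Figure for 2.5: the PC is divided into a (30-K)-bit field, a K-bit field (example pattern 01010011), and the two low-order bits 00; the K-bit field selects an entry of a BPB with 2^K entries.]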


2.7 Our USC multi-threaded multi-core processor currently has 4 cores each with 4 threads and the core resources are well utilized by the 4 threads. Due to process improvements we have more silicon in the next generation processors. We are considering the two choices: (i) 8 cores each with 4 threads (ii) 4 cores each with 8 threads. You recommend to go for ______ (i / ii). Explain: ______________________________________________________________________________
Choice ______ (i / ii) is expected to require _________ (more / less) silicon compared to choice ______ (i / ii). Explain: ______________________________________________________________________________

2.8 Two challenges in compiler design for the current super-scalar super-pipelined processors: (i) avoid pairs of dependent instructions, (ii) schedule instructions into longer delay slots for load-word and branch instructions. The first is important because of the ____________ (super-scalar / super-pipelined) aspect of our processor and we said "pairs" because we assumed that ______________________________________________________________________________

2.9 MPI (Miss Penalty per Instruction) is used to estimate the impact of cache misses on CPI. Assume that the CPI without cache misses is 1.35. In a system with L1 and L2 caches, the overall CPI was calculated as 1.35 + (0.04 * 25) + (0.01 * 200) = 4.35. What are the numbers 0.04, 25, 0.01, and 200? ______________________________________________________________________________
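The arithmetic in 2.9 can be reproduced with a few lines of Python. The sketch assumes one common reading of such a formula, namely that each product is a per-instruction miss count multiplied by a penalty in clocks; the function name and the (rate, penalty) pairing are illustrative assumptions rather than the exam's official interpretation.

```python
def cpi_with_miss_terms(base_cpi, miss_terms):
    """Base CPI plus the sum of (events per instruction * clocks per event)."""
    return base_cpi + sum(rate * penalty for rate, penalty in miss_terms)


# Values copied from the question; their meaning is what the question asks for.
overall_cpi = cpi_with_miss_terms(1.35, [(0.04, 25), (0.01, 200)])
stall_share = (overall_cpi - 1.35) / overall_cpi  # fraction of CPI added by the two terms
```

The two added terms contribute 1.0 and 2.0 clocks per instruction, so together they account for more than two thirds of the overall CPI of 4.35.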

2.10 Intel HTT stands for ___________________________ and it is the same as ___________________________________ (fine-grain / coarse-grain / simultaneous) multi-threading. It uses the ________ (ILP/TLP) advantage of its out-of-order execution engine together with ________ (ILP/TLP). _______ (Like / Unlike) other multi-threading techniques, here they do not need to roll back a thread on a cache miss because ______________________________________________________________________________

2.11 Mr. Bruin joined USC and was appointed as EE457 lab grader. He was grading Lab 7 Part 3 Subpart 4 (RTL coding of the pipeline with EX1 and EX2 merged into EX12). Several students made errors in the stall logic coding. He was puzzled because their _________________ (Reg. file contents / TimeSpace.txt) came out correct but their ________________ (Reg. file contents / TimeSpace.txt) came out wrong. He expected both to be wrong. Explain: ______________________________________________________________________________

2.11.1 In our RTL coding, we produced two STALL signals (declared on the side). Which one of the two produces a waveform that is easier to understand, and why? Is it necessary to declare STALL_combinational as a wire? ______________________________________________________________________________

2.11.2 In RTL coding, the output of a forwarding mux in an "if" statement in a clocked always block is usually assigned using the _______________ (blocking / non-blocking) procedural assignment operator because ______________________________________________________________________________

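Margin points on this page, in order: 8, 3, 6, 8, 8, 6, 5 pts.

Declarations on the side for 2.11.1:
reg STALL;
wire STALL_combinational;

[Figure on the side: datapath fragment labeled X1_mux and SUB 3, with signals A and A-3.]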


3 ( 14 + 18 = 32 points) 20 min. Topic: Lab 7 Part 3 Subpart 2

3.1 Given on the side is the incomplete stall logic. Label the three points as Stalled, Add1, and Stall as appropriate. The Stall output is a __________ (Mealy / Moore) output. This is a state machine implementation with an encoded state assignment and has _____ (1/2/3/4) states. Complete the state diagram below. Write state transition conditions and write, "if ADD 1, then STALL" in one of the two states. And then try to implement a 1-hot coded state machine for the same state diagram. Produce the STALL output.

3.2 A similar state diagram is shown below in nearly completed form to stall for 1 clock for an M (MULT) instruction, to stall for 2 clocks for a D (DIVIDE) instruction, and not to stall for other instructions. The state transition conditions are completed. Conditional STALL output generation statements need to be added in one or two or three state circles as appropriate. Some suggestions: (i) if (M | D), then STALL, (ii) if (M), then STALL, and (iii) if (D), then STALL.

Implement the one-hot state machine and generate the STALL signal (i.e., do the OFL). Let us use A (ADD) and S (SUB) for other instructions which do not need stalls. If A is in the multi-functional EX stage in the current clock and the above state machine is in the IDLE state, write the sequence of states the above machine goes through for the D M S M A instructions.

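[Figure for 3.1 (14 pts): incomplete stall logic built around a D flip-flop (D, Q, CLK, CLR, RESET_B); a state diagram with the states IDLE and STALLED, entered at IDLE on RESET_B; and a one-hot template with two D flip-flops, one using PRESET (output IDLE) and one using CLR (output STALLED), both driven by CLK and RESET_B.]

[Figure for 3.2 (18 pts): a state diagram with the states IDLE, STALLED_1, and STALLED_2, entered at IDLE on RESET_B; transition labels as extracted: M | D, M & D, D, D, 1 (any overbars were lost in extraction); and a one-hot template with three D flip-flops, one using PRESET (output IDLE) and two using CLR (outputs STALLED_1 and STALLED_2), all driven by CLK and RESET_B.]

State sequence: IDLE, STALLED_1, STALLED_2, ______________________________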


4 ( 27 points) 20 min. adder, subtracter, incrementer, decrementer

4.1 We know we do (A - B) by doing (A + B’ + 1). To perform (X - 2Y), we do ______ (i / ii / iii / iv / v)
(i) [X + (Y’ || 1'b0) + 1]
(ii) [X + (Y’ || 1'b0) + 2]
(iii) [X + (Y’ || 1'b1) + 1]
(iv) [X + (Y’ || 1'b1) + 2]
(v) none of these, we need to do ______________________________ .
Note: || means concatenate

4.1.1 Say the numbers are all 4-bit numbers and we have a 4-bit adder/subtracter to perform (X - 2Y) by simply discarding (dropping) Y3’. In which cases does discarding Y3’ not change the value of (-2Y)? Answer separately for unsigned numbers and for signed numbers.
Unsigned numbers: ______________________________________________________________________________
Signed numbers: ______________________________________________________________________________
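One way to explore the candidate expressions in 4.1 is to evaluate them exhaustively for small operands. The Python sketch below is only an illustration under stated assumptions: Y is taken to be 4 bits wide, X and the result are kept to 5 bits (modulo 2^5), and the helper names are hypothetical. It reports which of options (i) to (iv) agree with X - 2Y for every operand pair under those assumptions.

```python
Y_BITS = 4                            # assumed width of Y
RESULT_MASK = (1 << (Y_BITS + 1)) - 1  # results kept to Y_BITS + 1 bits


def candidate_results(x, y):
    """Evaluate options (i)-(iv), where (Y' || b) is Y inverted with bit b appended."""
    y_inverted = ~y & ((1 << Y_BITS) - 1)
    concat_0 = (y_inverted << 1) | 0   # Y' || 1'b0
    concat_1 = (y_inverted << 1) | 1   # Y' || 1'b1
    return {
        "i": (x + concat_0 + 1) & RESULT_MASK,
        "ii": (x + concat_0 + 2) & RESULT_MASK,
        "iii": (x + concat_1 + 1) & RESULT_MASK,
        "iv": (x + concat_1 + 2) & RESULT_MASK,
    }


def options_matching_x_minus_2y():
    """Keep only the options that equal X - 2Y (mod 2^5) for all X and Y."""
    survivors = {"i", "ii", "iii", "iv"}
    for x in range(1 << (Y_BITS + 1)):
        for y in range(1 << Y_BITS):
            expected = (x - 2 * y) & RESULT_MASK
            results = candidate_results(x, y)
            survivors = {name for name in survivors if results[name] == expected}
    return sorted(survivors)


matching = options_matching_x_minus_2y()
```

Since (Y’ || 1'b1) is exactly one more than (Y’ || 1'b0), any two options whose appended bit and added constant sum to the same value produce identical results; the exhaustive check makes that visible.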

4.1.2 Let us play safe and use a 5-bit adder and convert it to a subtracter to perform (X-2Y). Complete the two designs by using inverters, XOR gates, etc. as needed. Produce USO (unsigned subtraction overflow) and SSO (signed subtraction overflow). You do not have to simplify the FULL-ADDER building blocks.

4.2 On the left, we have a chart for delays in gates for the incrementer/decrementer design based on CLA design assuming 4 as the blocking factor (4 CLL carry look-ahead logic boxes require one next-level CLL). Complete the table on the right for a new blocking factor of 3.

For the left-side design, how many CLL boxes are needed for a 1024-bit incrementer? ________
Think of a formula to arrive at this number rather than using a brute-force method!
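The counting argument behind this question can be checked with a short Python sketch. It assumes that each first-level CLL box covers as many bits as the blocking factor and that each group of that many boxes needs one box at the next level, continuing until a single box remains. The function name is hypothetical, and the count applies only under these assumptions, which may differ from the chart used in the exam.

```python
def cll_box_count(bits, blocking_factor):
    """Count CLL boxes level by level until one box covers everything."""
    total = 0
    groups = bits
    while groups > 1:
        groups = -(-groups // blocking_factor)  # ceiling division
        total += groups
    return total


boxes_1024_by_4 = cll_box_count(1024, 4)   # 256 + 64 + 16 + 4 + 1
closed_form = (1024 - 1) // (4 - 1)        # geometric-series sum when bits is a power of 4
```

When the bit count is an exact power of the blocking factor b, the level-by-level sum is a geometric series equal to (bits - 1) / (b - 1), which is the kind of formula the question hints at.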

Are the CLL boxes identical in the case of incrementer as well as decrementer? Yes / No

Do they both contain 1-level logic or 2-level logic? _______________________

In ___________________ (an incrementer / a decrementer), C3 = g2 + g1 + g0 + C0 because every cell agrees to _________ (generate / propagate).

In ___________________ (an incrementer / a decrementer), C3 = p2.p1.p0.C0 because no cell _________ (generates / propagates) a carry.

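Margin points on this page, in order: 4, 4, 6, 13 pts.

[Figure for 4.1.2: two rows of five full-adder blocks each (inputs a, b, cin; outputs s, cout), with carry-in C0 at the least significant block and result bits R4 R3 R2 R1 R0. One row is labeled UNSIGNED SUBTRACTOR with the note "Do not forget to produce USO."; the other is labeled SIGNED SUBTRACTER.]

[Tables for 4.2: a gate-delay chart for blocking factor = 4 (given) and a blank table for blocking factor = 3 (to be completed); the table contents did not survive extraction.]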


5 ( 35 points) 25 min. Virtual memory

5.1 Given ________ (VPN / VA), we get ________ (PPFN / PA) from _____________________ (the TLB / the PT / either the TLB or the PT if TLB does not have) if the page is ___________ (present / absent) in the MM.

5.2 A TLB of 17 entries (17 is a prime number, so it has no factors other than 1 and itself, and is certainly not a power of 2) is possible if the TLB uses a _______________ (fully-associative / set-associative / direct) mapping. Usually either a fully-associative mapping or a set-associative mapping with liberal associativity is used for the TLB because (talk about both cost and performance) ______________________________________________________________________________
The same thing ______ (can / cannot) be said about cache because cache is much ___________ (bigger / smaller) and cost will be much __________ (higher / smaller) and the penalty of a miss is relatively _____________ (smaller / larger).

5.3 In a way both the TLB and the page table are look-up tables, but so far as indexing or searching is concerned, ______________________________________________________________________________
A direct mapped cache is ____________________________ (indexed / searched in parallel), a fully associative cache is ____________________________ (indexed / searched in parallel).

5.4 Two processes, one with process id 2 and another with process id 3, can both use the location with 32-bit address 00002000 as in lw $2, 2000($0). Is this 32-bit address 00002000 a virtual address (VA) or a physical address (PA)? ______ (VA / PA). Assuming a fully associative TLB, is the page number 00002 (upper 20 bits of the 32-bit address 00002000) in the VPN field or the PPFN field of an entry in the TLB? __________ (VPN / PPFN) field.
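To make the address split in 5.4 concrete, the Python sketch below separates the 32-bit address 00002000 into its upper 20-bit page number and lower 12-bit offset, and then shows two processes mapping the same page number to different frames through separate per-process tables. The frame numbers 0x12345 and 0x0ABCD, the dictionary layout, and the function names are made up for illustration and are not part of the exam.

```python
PAGE_OFFSET_BITS = 12  # 32-bit address minus the 20-bit page number from the question


def split_address(address):
    """Return (page number, offset within the page)."""
    page_number = address >> PAGE_OFFSET_BITS
    offset = address & ((1 << PAGE_OFFSET_BITS) - 1)
    return page_number, offset


# Hypothetical per-process mappings: page number -> frame number.
mappings = {
    2: {0x00002: 0x12345},
    3: {0x00002: 0x0ABCD},
}


def translate(process_id, address):
    """Look up the frame for this process and re-attach the offset."""
    page_number, offset = split_address(address)
    frame = mappings[process_id][page_number]
    return (frame << PAGE_OFFSET_BITS) | offset


addr_for_p2 = translate(2, 0x00002000)
addr_for_p3 = translate(3, 0x00002000)
```

With these made-up mappings the same 32-bit address used by process 2 and process 3 resolves to two different locations.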

5.5 In a multi-level page table, the _______ (VPN/PPFN) is present in the entries of ____________ (every level / the first A level / the last D level / other ..). In the example on the side the ______ (VA/VPN/PA/PPFN) is shown divided into 4 fields: A, B, C, D. It takes ___________ (minimum/maximum/always) ____ (state a number like 2) accesses to the example table on the side before you declare a page table hit. You may be able to declare a page fault after 1 or 2 or 3 or 4 accesses to the PT. ______ (T / F). There is only one "A" table. ______ (T / F). The number of "B" tables is less than or equal to the number of "C" tables. ______ (T / F). _________ (Page Table / TLB / Both / Neither) is flushed on context switch. Explain: ______________________________________________________________________________

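Margin points on this page, in order: 4, 9, 6, 3, 13 pts.

[Figure for 5.5: a 13-bit pattern 1010111100011 divided into the fields A, B, C, D. The PTBR points to the A table (16 entries), which leads to B tables (16 entries each), C tables (8 entries each), and D tables (4 entries each); the D-level entry supplies the PPFN.]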
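The field split in the 5.5 example can be traced with a short Python sketch. The field widths follow from the table sizes in the figure (16, 16, 8, and 4 entries, hence 4, 4, 3, and 2 bits), and the 13-bit pattern 1010111100011 is the one shown there. The nested-dictionary table, the frame value 0x2A, and the function names are hypothetical, and the walk is a generic sketch rather than the exam's reference solution.

```python
FIELD_BITS = (4, 4, 3, 2)  # A, B, C, D: log2 of 16, 16, 8, and 4 entries


def split_fields(pattern):
    """Split the 13-bit pattern into its (A, B, C, D) index fields."""
    fields = []
    shift = sum(FIELD_BITS)
    for bits in FIELD_BITS:
        shift -= bits
        fields.append((pattern >> shift) & ((1 << bits) - 1))
    return tuple(fields)


def walk(a_table, pattern):
    """Follow one entry per level; return (result, number of table accesses)."""
    node = a_table
    accesses = 0
    for index in split_fields(pattern):
        accesses += 1
        node = node.get(index)
        if node is None:            # missing entry: the walk stops at this level
            return None, accesses
    return node, accesses


# Hypothetical table holding a single mapping for the pattern in the figure.
a_table = {0b1010: {0b1111: {0b000: {0b11: 0x2A}}}}

found = walk(a_table, 0b1010111100011)
missing = walk(a_table, 0b1010000000000)
```

In this sketch a complete walk touches one table per level, while a walk that meets a missing entry stops at the level where the entry is absent.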

We enjoyed teaching this course! Hope you liked it! Hope to see some of you in EE560. Grades will be out in a week. Enjoy your semester break! Happy Summer Holidays!!! - Gandhi and TA Yue, Mentors: Manasa, Vibha, Lai, Binal, Graders: Atit, Guan (Crystal), Guan (Wade), Mukhdeep, Ruozhi, and Zhe Happy Semester Break!
