Wednesday, December 4, 2013

Ex 5.23 & 5.31 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual



Q.5.23: As presented in this chapter, load bypassing is a technique for enhancing memory data flow. With load bypassing, load instructions are allowed to jump ahead of earlier store instructions. Once address generation is done, a store instruction can be completed architecturally and can then enter the store buffer to await an available bus cycle for writing to memory. Trailing loads are allowed to bypass these stores in the store buffer if there is no address aliasing.
In this problem you are to simulate such load bypassing (there is no load forwarding). You are given a sequence of load/store instructions and their addresses (symbolic). The number to the left of each instruction indicates the cycle in which that instruction is dispatched to the reservation station; it can begin execution in that same cycle. Each store instruction will have an additional number to its right, indicating the cycle in which it is ready to retire, i.e., exit the store buffer and write to the memory.
Assumptions:
•All operands needed for address calculation are available at dispatch.
•One load and one store can have their addresses calculated per cycle.
•One load OR store can be executed, i.e., allowed to access the cache, per cycle.
•The reservation station entry is deallocated the cycle after address calculation and issue.
•The store buffer entry is deallocated when the cache is accessed.
•A store instruction can access the cache the cycle after it is ready to retire.
•Instructions are issued in order from the reservation stations.
•Assume 100% cache hits.
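The assumptions above can be turned into a small cycle-level sketch. Python is used here by choice, and the tie-break rules (a retiring store wins the single cache port over a ready load; a load checks aliasing only against stores already in the store buffer) are assumptions not spelled out in the problem text:

```python
# Minimal cycle-level sketch of load bypassing (no load forwarding).
# Instruction tuples: ("load", addr, dispatch_cycle) or
# ("store", addr, dispatch_cycle, retire_cycle); addresses are symbolic.

def simulate(instrs):
    loads  = [i for i in instrs if i[0] == "load"]    # in-order load RS
    stores = [i for i in instrs if i[0] == "store"]   # in-order store RS
    store_buf = []   # (addr, retire_cycle) entries awaiting a cache write
    done = {}        # (kind, addr) -> cycle of cache access
    cycle = 0
    while loads or stores or store_buf:
        cycle += 1
        # One store address calculation per cycle, in dispatch order;
        # the store then enters the store buffer.
        if stores and stores[0][2] <= cycle:
            _, addr, _, retire = stores.pop(0)
            store_buf.append((addr, retire))
        # One cache access per cycle: a store past its retire cycle goes
        # first (assumption), else the oldest ready load may access the
        # cache if its address aliases no buffered store.
        retirable = [s for s in store_buf if s[1] < cycle]
        if retirable:
            store_buf.remove(retirable[0])
            done[("store", retirable[0][0])] = cycle
        elif loads and loads[0][2] <= cycle:
            _, addr, _ = loads[0]
            if all(s[0] != addr for s in store_buf):
                loads.pop(0)
                done[("load", addr)] = cycle
    return done

# Example: the load of B bypasses the buffered store to A, but the load
# of A aliases it and must wait until the store writes the cache.
print(simulate([("store", "A", 1, 3), ("load", "B", 1), ("load", "A", 2)]))
# {('load', 'B'): 1, ('store', 'A'): 4, ('load', 'A'): 5}
```

Tracing the example: the store to A enters the store buffer in cycle 1 and retires at the end of cycle 3, so it writes the cache in cycle 4; the load of A then accesses the cache in cycle 5, while the non-aliasing load of B bypassed it back in cycle 1.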





Sol:





Q.5.31: A victim cache is used to augment a direct-mapped cache to reduce conflict misses. For additional background on this problem, read Jouppi’s paper on victim caches [Jouppi, 1990]. Please fill in the following table to reflect the state of each cache line in a 4-entry direct-mapped cache and a 2-entry fully associative victim cache following each memory reference shown. Also, record whether the reference was a cache hit or cache miss. The reference addresses are shown in hexadecimal format. Assume the direct-mapped cache is indexed with the low-order bits above the 16-byte line offset (e.g. address 40 maps to set 0, address 50 maps to set 1, etc.). Use ‘-’ to indicate an invalid line and the address of the line to indicate a valid line. Assume an LRU policy for the victim cache and mark the LRU line as such in the table.

Sol: 





Next Topic:
Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected register file) read and write ports are needed to sustain peak throughput? Given instruction mixes in Table 5-2, also compute average ports needed for each benchmark. Explain why you would not just build for the average case. Given the actual number of read and write ports specified, how likely is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?
Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all four streams.
Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all eight streams.

Previous Topic:
Q.5.21: Simulate the execution of the following code snippet using Tomasulo’s algorithm. Show the contents of the reservation station entries, register file busy, tag (the tag is the RS ID number), and data fields for each cycle (make a copy of the table below for each cycle that you simulate). Indicate which instruction is executing in each functional unit in each cycle. Also indicate any result forwarding across a common data bus by circling the producer and consumer and connecting them with an arrow.
i: R4  <-  R0  +  R8
j: R2  <-  R0  *  R4
k: R4  <-  R4  +  R8
l: R8   <-  R4  *  R2
Assume dual dispatch and dual CDB (common data bus). Add latency is two cycles, and multiply latency is three cycles. An instruction can begin execution in the same cycle that it is dispatched, assuming all dependencies are satisfied.
Q.5.22: Determine whether or not the code executes at the dataflow limit for Problem 1 (i.e., the code snippet of Q.5.21). Explain why or why not. Show your work.
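For Q.5.22, the dataflow limit of the snippet can be checked with a short script. This is a sketch under the stated latencies (add two cycles, multiply three), assuming an instruction starts the cycle after its last producer finishes and ignoring all resource limits:

```python
# Dataflow (critical-path) limit for the Q.5.21 snippet: true
# dependences only, unlimited functional units and buses.
latency = {"add": 2, "mul": 3}

# (name, op, destination, source registers), in program order
prog = [
    ("i", "add", "R4", ["R0", "R8"]),
    ("j", "mul", "R2", ["R0", "R4"]),
    ("k", "add", "R4", ["R4", "R8"]),
    ("l", "mul", "R8", ["R4", "R2"]),
]

finish = {}   # register -> cycle its latest value is produced
limit = 0
for name, op, dest, srcs in prog:
    # Start the cycle after every source value is available
    # (registers never written are available from cycle 0).
    start = 1 + max(finish.get(r, 0) for r in srcs)
    end = start + latency[op] - 1
    finish[dest] = end
    limit = max(limit, end)

print(limit)   # 8: the i -> j -> l chain takes 2 + 3 + 3 cycles
```

The critical path is i (cycles 1-2), then j (3-5), then l (6-8), so the dataflow limit for this snippet works out to 8 cycles under these assumptions.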
SOLUTION
