Monday, November 18, 2013

Ex. 3.16, 3.17, 3.18 and 3.19 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual

Q.3.16: Assume a two-level cache hierarchy with a private level one instruction cache (L1I), a private level one data cache (L1D), and a shared level two data cache (L2). Given local miss rates of 4% for L1I, 7.5% for L1D, and 35% for L2, compute the global miss rate for the L2 cache.

Sol:
1) Assume each instruction also makes one data access (so references split evenly between Ifetch and Dref). L2 global miss rate
       = (.35 L2 miss/L2 ref) x (.04 L2 ref/Ifetch + .075 L2 ref/Dref) / 2
       = .35 x (.04 + .075) / 2 = .04025 / 2 = .020125
       = 2.0125% L2 misses per global reference

2) Assume 0.4 data accesses per instruction. L2 global miss rate
        = L2 local miss rate x (L1I local miss rate x 1 + L1D local miss rate x 0.4) / (1 + 0.4)
        = .35 x (.04 + .075 x 0.4) / 1.4 = .35 x .07 / 1.4 = .0175 = 1.75%

In general: L2 global miss rate = L2 local miss rate x L1 global miss rate.
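The same arithmetic as a minimal Python sketch, using the 0.4-data-references-per-instruction assumption of case 2 (the variable names are ours, not the book's):

    # Local miss rates from the problem statement
    l1i_local = 0.04   # L1I local miss rate
    l1d_local = 0.075  # L1D local miss rate
    l2_local  = 0.35   # L2 local miss rate

    ifetch_per_instr = 1.0   # instruction fetches per instruction
    dref_per_instr   = 0.4   # data references per instruction

    # L1 global miss rate: fraction of all references that miss L1 and reach L2
    l1_global = (l1i_local * ifetch_per_instr + l1d_local * dref_per_instr) \
                / (ifetch_per_instr + dref_per_instr)

    l2_global = l2_local * l1_global
    print(l2_global)   # 0.0175, i.e. 1.75%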



Q.3.17: Assuming 1 L1I access per instruction and 0.4 data accesses per instruction, compute the misses per instruction for the L1I, L1D, and L2 caches of Problem 16.

Sol: L1I misses per instruction = .04 miss/ref x 1 ref/instr = .04 miss/instr

     L1D misses per instruction = .075 miss/ref x 0.4 ref/instr = .03 miss/instr

     L2 misses per instruction = .35 miss/ref x (.04 ref/instr + .03 ref/instr)
                               = .35 x .07 = .0245 miss/instr
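
As a short Python check of the per-level misses per instruction (names are ours):

    l1i_mpi = 0.04 * 1.0                   # L1I: .04 miss/ref x 1 ref/instr
    l1d_mpi = 0.075 * 0.4                  # L1D: .075 miss/ref x 0.4 ref/instr
    l2_mpi  = 0.35 * (l1i_mpi + l1d_mpi)   # L2 only sees references that missed L1
    print(l1i_mpi, l1d_mpi, l2_mpi)        # 0.04 0.03 0.0245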



Q.3.18: Given the miss rates of Problem 16, and assuming that accesses to the L1I and L1D caches take one cycle, accesses to the L2 take 12 cycles, accesses to main memory take 75 cycles, and a clock rate of 1 GHz, compute the average memory reference latency for this cache hierarchy.

Sol: Avg Ifetch mem lat = (1 - .04) x 1 + .04 x (1 - .35) x 12 + .04 x .35 x 75
                        = .96 + .312 + 1.05 = 2.322 cycles

     Avg data ref mem lat = (1 - .075) x 1 + .075 x (1 - .35) x 12 + .075 x .35 x 75
                          = .925 + .585 + 1.96875 = 3.47875 cycles

NOTE: to determine overall average memory latency, you must know the ratio of data
and instruction references and take the weighted average of the two types of references.
As in Problem 17, assume 0.4 data references per instruction reference: 
Avg mem lat = (2.322 cycles/Ifetch x 1 Ifetch/instr + 3.47875 cycles/Dref x 0.4 Dref/instr) / (1.4 ref/instr)
            = 2.6525 cycles
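
A small Python sketch of the latency model (a reference hits in L1, else hits in L2, else goes to main memory); the helper name is ours:

    def avg_latency(l1_miss, l2_miss, t_l1=1, t_l2=12, t_mem=75):
        # Weighted sum over the three places a reference can be satisfied
        return ((1 - l1_miss) * t_l1
                + l1_miss * (1 - l2_miss) * t_l2
                + l1_miss * l2_miss * t_mem)

    ifetch_lat = avg_latency(0.04, 0.35)    # 2.322 cycles
    dref_lat   = avg_latency(0.075, 0.35)   # 3.47875 cycles

    # Weight by 1 ifetch and 0.4 drefs per instruction (1 cycle = 1 ns at 1 GHz)
    overall = (ifetch_lat * 1.0 + dref_lat * 0.4) / 1.4
    print(overall)                          # 2.6525 cycles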




Q.3.19: Assuming a perfect cache CPI (cycles per instruction) for a pipelined processor equal to 1.15 CPI, compute the MCPI and overall CPI for a pipelined processor with the memory hierarchy described in Problem 18 and the miss rates and access rates specified in Problem 16 and Problem 17.


Sol: From the solution to Problem 17, the misses per instruction are .04 (L1I), .03 (L1D), and .0245 (L2); hence:
        MCPI = .04 x (12 - 1) + .03 x (12 - 1) + .0245 x (75 - 12) = 2.3135
        CPI  = 1.15 + MCPI = 1.15 + 2.3135 = 3.4635
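
The same calculation in Python; note the penalties are incremental (11 extra cycles beyond the L1 hit time for an L2 hit, 63 extra cycles beyond L2 for a main-memory access):

    l2_penalty  = 12 - 1    # extra cycles for an L1 miss that hits in L2
    mem_penalty = 75 - 12   # extra cycles for an L2 miss serviced by memory

    mcpi = 0.04 * l2_penalty + 0.03 * l2_penalty + 0.0245 * mem_penalty
    cpi  = 1.15 + mcpi      # perfect-cache CPI plus memory CPI
    print(mcpi, cpi)        # 2.3135 3.4635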






Two-Level Cache Memory System:

In discussions of performance, low-level programming, and algorithmic behavior in computer architecture, the term "memory hierarchy" comes up frequently. In a computer system, the memory hierarchy distinguishes each level by its response time, and its structure reflects the many trade-offs involved in designing for high performance. The various memory components can thus be viewed as forming a hierarchy of memories (m1, m2, ..., mn) in which each member mi is, in a sense, subordinate to the next-higher member m(i-1) of the hierarchy. To limit waiting by higher levels, a lower level responds by filling a buffer and then signaling to activate the transfer.

In an exclusive two-level hierarchy, the caches together can hold as many unique blocks as fit in L1 and L2 combined, which makes the best use of on-chip cache real estate. In an exclusive hierarchy, the L2 acts as a victim cache for the L1: when both caches miss, the new block is filled into L1 only; when a block is evicted from L1, it moves into L2; and a block is promoted back to L1 when an access hits in L2.
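
A minimal Python sketch of that block movement, assuming fully associative caches with arbitrary (not LRU) replacement; the class and method names are hypothetical:

    class ExclusiveHierarchy:
        def __init__(self, l1_size, l2_size):
            self.l1, self.l2 = set(), set()        # blocks currently held
            self.l1_size, self.l2_size = l1_size, l2_size

        def access(self, block):
            if block in self.l1:
                return "L1 hit"
            if block in self.l2:
                self.l2.remove(block)              # promote on L2 hit...
                self._fill_l1(block)               # ...back into L1
                return "L2 hit, promoted to L1"
            self._fill_l1(block)                   # miss in both: fill L1 only
            return "miss, filled into L1"

        def _fill_l1(self, block):
            if len(self.l1) >= self.l1_size:
                victim = self.l1.pop()             # evict an arbitrary L1 block
                if len(self.l2) >= self.l2_size:
                    self.l2.pop()                  # make room in L2 if needed
                self.l2.add(victim)                # L2 plays the victim-cache role
            self.l1.add(block)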

Compared with an inclusive cache hierarchy using similar L1 and L2 parameters, an exclusive organization can yield significant performance advantages for some benchmarks. The differences show up in the L2 cache miss and execution time metrics: the largest improvement reported is a 16% reduction in execution time, with an average reduction of 8% for the smallest cache configuration tested. With an equally sized victim buffer (exclusive) and victim cache (inclusive), some benchmarks actually run longer under the exclusive organization, because a victim cache can significantly reduce conflict misses while a victim buffer can introduce worst-case penalties.




Next Topic
Q.3.28: Assume a synchronous front-side processor-memory bus that operates at 100 MHz and has an 8-byte data bus. Arbitration for the bus takes one bus cycle (10 ns), issuing a cache line read command for 64 bytes of data takes one cycle, memory controller latency (including DRAM access) is 60 ns, after which data double words are returned in back-to-back cycles. Further assume the bus is blocking or circuit-switched. Compute the latency to fill a single 64-byte cache line. Then compute the peak read bandwidth for this processor-memory bus, assuming the processor arbitrates for the bus for a new read in the bus cycle following completion of the last read.

Q.3.31: Consider finite DRAM bandwidth at a memory controller, as follows. Assume double-data-rate DRAM operating at 100 MHz in a parallel non-interleaved organization, with an 8-byte interface to the DRAM chips. Further assume that each cache line read results in a DRAM row miss, requiring a precharge and RAS cycle, followed by row-hit CAS cycles for each of the double words in the cache line. Assuming memory controller overhead of one cycle (10 ns) to initiate a read operation, and one cycle latency to transfer data from the DRAM data bus to the processor-memory bus, compute the latency for reading one 64-byte cache block. Now compute the peak data bandwidth for the memory interface, ignoring DRAM refresh cycles.

Previous Topic:
Q.3.13: Consider a processor with 32-bit virtual addresses, 4KB pages and 36-bit physical addresses. Assume memory is byte-addressable (i.e. the 32-bit VA specifies a byte in memory).
L1 instruction cache: 64 Kbytes, 128 byte blocks, 4-way set associative, indexed and tagged with virtual address.
L1 data cache: 32 Kbytes, 64 byte blocks, 2-way set associative, indexed and tagged with physical address, write-back.
4-way set associative TLB with 128 entries in all. Assume the TLB keeps a dirty bit, a reference bit, and 3 permission bits (read, write, execute) for each entry.
Specify the number of offset, index, and tag bits for each of these structures in the table below. Also, compute the total size in number of bit cells for each of the tag and data arrays.
