Two Cents on Computer Architecture Research - 102
Preamble: This blog is a continuation of my previous blog, Two Cents on Computer Architecture Research - 101 [1]. Based on the comments and suggestions I received, students want to know more about the research problems they can delve into at an early stage of their UG studies. This blog is an attempt to understand research ideas that try to solve a fundamental problem called “The memory wall: the grandmother of all the walls.” I have provided links to tools and some papers of interest. The goal of this blog is not to explain all the details but rather to provide pointers for curious minds.
Let’s begin: What is the main reason behind the memory wall problem? Well, it is mostly because the processor waits for data from memory (say, the memory hierarchy, to make it generic). The processor wants to READ/WRITE (LOAD/STORE) data from/into memory, and there is no free lunch associated with it. The penalty is the memory access time, ranging from a few cycles to hundreds of processor cycles.
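To get a feel for the numbers, here is a back-of-the-envelope sketch (the latencies, miss rates, and LOAD fraction are made up for illustration) of how even a small fraction of DRAM-bound LOADs can dominate the average cycles per instruction (CPI):

```cpp
#include <cstdio>

// Back-of-the-envelope illustration of the memory wall (all numbers are made up):
// a small fraction of LOADs that go all the way to DRAM can dominate the CPI.
int main() {
    double base_cpi       = 1.0;   // ideal CPI if every access hit in the L1
    double loads_per_inst = 0.3;   // fraction of instructions that are LOADs
    double dram_miss_rate = 0.02;  // fraction of LOADs that go all the way to DRAM
    double dram_latency   = 300;   // DRAM round trip in processor cycles

    double stall_cpi = loads_per_inst * dram_miss_rate * dram_latency;
    std::printf("CPI = %.2f (base) + %.2f (memory stalls) = %.2f\n",
                base_cpi, stall_cpi, base_cpi + stall_cpi);
    // With these numbers, memory stalls alone add ~1.8 cycles per instruction.
    return 0;
}
```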
Research ideas that help break the memory wall [the list is not exhaustive]:
(i) Out-of-order (O3) processor with a large instruction window: Remember the five-stage processor pipeline from your famous textbook, drawn as five boxes (implying each stage takes one cycle). In reality, the memory stage can take 100s to 1000s of cycles. So an in-order processor cannot move forward with instruction execution until the outstanding LOAD (memory READ) is done with its memory stage. An O3 processor breaks this waiting time by allowing multiple concurrent but independent instructions to execute at a given time. For that, the processor needs a large instruction window from which it can pick and execute the independent ones. “A simple example of an O3 processor is Indian road traffic :) ”
[image source: scroll.in]
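Here is a toy sketch of the core idea behind (i); it is not a cycle-accurate model, and the register numbers and readiness flags are made up. Instructions sit in a window, and any instruction whose sources are ready can issue, even while an older LOAD is still waiting on memory:

```cpp
#include <cstdio>
#include <vector>

// Toy illustration (not a real microarchitecture model): an O3 core keeps many
// instructions in a window and issues any whose source operands are ready,
// instead of stalling behind an in-flight LOAD the way an in-order pipeline would.
struct Instr {
    const char* text;
    std::vector<int> srcs;   // source register numbers
};

int main() {
    std::vector<bool> reg_ready(8, true);
    reg_ready[1] = false;    // r1 is produced by a LOAD that is still waiting on DRAM

    std::vector<Instr> window = {
        {"r2 = r1 + r3", {1, 3}},   // depends on the LOAD -> must wait
        {"r4 = r5 + r6", {5, 6}},   // independent          -> can issue right away
    };

    for (const Instr& in : window) {
        bool ok = true;
        for (int s : in.srcs) ok = ok && reg_ready[s];
        std::printf("%-14s -> %s\n", in.text,
                    ok ? "issues this cycle" : "waits for the LOAD");
    }
    return 0;
}
```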
(ii) LOAD Value Prediction: Value prediction attacks the memory wall problem by predicting the values (data) that LOADs return, which helps break data dependencies and improve instruction-level parallelism (ILP). More precisely and more technically: given a LOAD/STORE to a memory address, what will be the value associated with that address? If your predictor is highly accurate, then the processor won’t even go and probe the caches (forget about the memory accesses). Ideally, all memory accesses would be handled at the processor itself, which means almost zero-cycle latency. Are you kidding me? Of course not. To know more about the subtle issues related to value prediction techniques, please refer to: https://www.microarch.org/cvp1/index.html
[Image credit: CVP1]
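To make the idea concrete, here is a minimal sketch of a “last value” predictor; real CVP designs track confidence, strides, contexts, and much more, and the class and field names below are illustrative only:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Minimal "last value" predictor sketch (far simpler than real CVP designs):
// remember the last value a LOAD PC produced and predict it again next time.
// If the prediction turns out wrong, the pipeline must squash and replay the
// dependent instructions, so accuracy and confidence tracking matter a lot.
class LastValuePredictor {
    std::unordered_map<uint64_t, uint64_t> table;   // load PC -> last seen value
public:
    bool predict(uint64_t pc, uint64_t& value) const {
        auto it = table.find(pc);
        if (it == table.end()) return false;        // no prediction yet
        value = it->second;
        return true;
    }
    void train(uint64_t pc, uint64_t actual) { table[pc] = actual; }
};

int main() {
    LastValuePredictor lvp;
    lvp.train(0x400123, 42);          // the LOAD at PC 0x400123 returned 42
    uint64_t v;
    if (lvp.predict(0x400123, v))     // next time, predict 42 without waiting on memory
        std::printf("predicted value: %llu\n", (unsigned long long)v);
    return 0;
}
```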
(iii) Efficient Cache Hierarchies and Cache Replacement Policies: As we do not have a 100% accurate value predictor, memory accesses will go to the caches for sure. Now, the question is how to make sure these accesses do not go to DRAM (a.k.a. main memory). Well, one of the solutions is to design cache hierarchies in a way that reduces conflict and capacity misses (assuming we can’t hide the compulsory misses; maybe we can :)). Effectively, the idea is to make sure that only a few accesses go to DRAM and the rest hit at the different levels of caches, thanks to spatial and temporal locality. Another way to achieve the same is to propose intelligent cache replacement policies (heard about the LRU replacement policy? It is no longer used as-is, though :)). The goal of these policies is to keep the data of interest (cache blocks of interest) and evict the data that will not be reused (that the processor will not demand in the future).
[Image credit: RRIP, ISCA ’10]
Modern replacement policies use variants of re-reference interval prediction [RRIP, ISCA ’10], where cache blocks are grouped into four buckets based on their predicted re-reference interval. More details: https://crc2.ece.tamu.edu/
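As a flavor of how such a policy looks in code, here is a simplified SRRIP-style victim selection for one cache set, in the spirit of RRIP [ISCA ’10]; the structure and field names are illustrative and not taken from any real simulator:

```cpp
#include <cstdint>
#include <vector>

// Sketch of SRRIP-style victim selection for one cache set (simplified).
// Each block carries a 2-bit re-reference prediction value (RRPV):
// 0 = expected to be reused soon, 3 = reuse expected in the distant future.
struct Block { uint64_t tag = 0; bool valid = false; uint8_t rrpv = 3; };

int find_victim(std::vector<Block>& set) {
    for (;;) {
        for (size_t w = 0; w < set.size(); ++w)
            if (!set[w].valid || set[w].rrpv == 3) return (int)w;  // distant reuse -> evict
        for (Block& b : set) ++b.rrpv;       // nobody at 3: age everyone and retry
    }
}

void on_hit(Block& b) { b.rrpv = 0; }        // promoted: expected near-term reuse
void on_insert(Block& b, uint64_t tag) {
    b.tag = tag; b.valid = true; b.rrpv = 2; // inserted with a long (not distant) interval
}
```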
(iv) Data and Instruction Prefetching: Prefetching techniques fetch data and instructions into the caches before the processor demands them. Prefetching is speculative: it predicts future accesses based on past accesses. From the memory wall point of view, it is a latency-hiding technique that hides the off-chip DRAM access latency. More details: https://dpc3.compas.cs.stonybrook.edu/
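As a concrete (and deliberately simple) example, here is a sketch of a PC-based stride prefetcher; the table layout and the degree parameter are assumptions for illustration, and competition-grade prefetchers (see the DPC link above) are far more sophisticated:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of a simple PC-based stride prefetcher: if a LOAD PC keeps touching
// addresses A, A+s, A+2s, ..., start fetching the next few strided addresses
// before the processor asks for them.
struct Entry { uint64_t last_addr = 0; int64_t stride = 0; bool seen = false; };

class StridePrefetcher {
    std::unordered_map<uint64_t, Entry> table;   // indexed by load PC
public:
    // Returns the addresses to prefetch for this demand access.
    std::vector<uint64_t> on_access(uint64_t pc, uint64_t addr, int degree = 2) {
        std::vector<uint64_t> prefetches;
        Entry& e = table[pc];
        int64_t stride = (int64_t)addr - (int64_t)e.last_addr;
        if (e.seen && stride != 0 && stride == e.stride)
            for (int d = 1; d <= degree; ++d)
                prefetches.push_back(addr + d * stride);  // issue ahead of demand
        e.stride = stride; e.last_addr = addr; e.seen = true;
        return prefetches;
    }
};
```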
(v) Cache and Bandwidth Compression: Zipping a file comes to mind whenever we hear the word “compress.” How can we exploit the same idea in computer architecture, and what happens if we can? Well, imagine a 4MB cache storing 16MB of data (solving the memory wall!!). Caches store data, but most of the time, the data we use is redundant within a cache block and across cache blocks, because of value locality.
Similar ideas can be applied to the communication between the last-level cache (LLC) and DRAM, again exploiting data redundancy. So, in one go, more data can be pumped from DRAM into the LLC, and future LLC misses may become hits.
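As a toy illustration in the spirit of base+delta compression schemes (heavily simplified; real designs handle multiple bases and several delta sizes), here is a check for whether a 64-byte block of eight 8-byte values can be stored as one 8-byte base plus eight 1-byte deltas:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Toy compression check: a 64-byte block of eight 8-byte values is
// "compressible" here if every value is within +/-127 of the first one,
// so it fits in one 8-byte base plus eight 1-byte deltas (~16 bytes).
bool compressible_base_delta(const std::array<uint64_t, 8>& block) {
    uint64_t base = block[0];
    for (uint64_t v : block) {
        int64_t delta = (int64_t)(v - base);
        if (delta < -128 || delta > 127) return false;   // delta does not fit in 1 byte
    }
    return true;
}

int main() {
    // Nearby pointers (a common case of value locality) compress nicely.
    std::array<uint64_t, 8> pointers = {0x7f00, 0x7f08, 0x7f10, 0x7f18,
                                        0x7f20, 0x7f28, 0x7f30, 0x7f38};
    std::printf("compressible: %s (64 bytes -> ~16 bytes)\n",
                compressible_base_delta(pointers) ? "yes" : "no");
    return 0;
}
```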
(vi) DRAM Scheduling: The DRAM controller is almost the final frontier for solving the memory wall problem. It converts LOADs/STOREs into a sequence of DRAM commands and schedules them to improve performance. DRAM address mapping and open/close row-buffer policies play an important role too.
For a quick overview: http://www.cs.utah.edu/~rajeev/jwac12/
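As a flavor of what a scheduler does, here is a sketch of the classic FR-FCFS (first-ready, first-come-first-served) policy, a common baseline: prefer the oldest request that hits the currently open row, otherwise fall back to the oldest request. The request fields are simplified for illustration:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Sketch of FR-FCFS scheduling for one bank: a row-buffer hit needs only a
// column command, so it is preferred over an older request that would first
// need the row buffer to be closed and a new row opened.
struct Request { uint64_t row; uint64_t col; uint64_t arrival; };

std::optional<Request> pick_next(std::deque<Request>& queue, uint64_t open_row) {
    if (queue.empty()) return std::nullopt;
    // First pass: oldest request that hits the currently open row.
    for (auto it = queue.begin(); it != queue.end(); ++it) {
        if (it->row == open_row) { Request r = *it; queue.erase(it); return r; }
    }
    // No row hit: oldest request overall (pays the row open/close penalty).
    Request r = queue.front();
    queue.pop_front();
    return r;
}
```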
(vii) Processing in Memory (PIM): A new way of looking at memory systems where compute stays close to the DRAM and operations are performed (at a 10K-feet view) in memory or near memory, bridging the memory wall gap.
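A back-of-the-envelope way to see the appeal (the numbers are made up): summing a large array on the host moves every byte across the memory bus, while a near-memory unit could add in place and send back only the final result.

```cpp
#include <cstdio>

// Illustration of why PIM helps (made-up numbers): reducing 1 GB of data on
// the host moves ~1 GB over the bus; doing it near memory returns only 8 bytes.
int main() {
    const double array_bytes  = 1.0e9;        // 1 GB of data to reduce
    const double host_traffic = array_bytes;  // every byte crosses the memory bus
    const double pim_traffic  = 8;            // only the 8-byte sum comes back
    std::printf("host: %.0f bytes moved, PIM: %.0f bytes moved\n",
                host_traffic, pim_traffic);
    return 0;
}
```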
[1] https://medium.com/@__biswa/two-cents-on-computer-architecture-research-101-4f00957c312a