From Cricket to Winning the Data Prefetching Championship at ISCA 2019
If Sachin Tendulkar alone could win cricket matches for India, why have Rahul Dravid, Virender Sehwag, and V.V.S. Laxman in the team*? Simple answer: “No Single player plays well in all conditions and against all opposition.” Can we apply the same rationale in designing hardware prefetchers? Why not because there is “no Prefetcher that performs well across all kinds of applications’’. Well, we use the same analogy to design hardware prefetchers in the recently concluded Data Prefetching Championship (DPC), co-located with the prestigious International Symposium on Computer Architecture (ISCA) 2019.
At 10K feet view, a hardware prefetcher sits beside a cache and observes the requests (a.k.a demand accesses) that are coming from the processor. The Figure below shows different levels of caches and DRAM with their access latencies. Prefetchers (PFs in the Figure) speculate about the future accesses and bring the data into caches before the processor demands so that in future, the processor will get cache hits instead of misses.
For DPC, we started with an idealistic dream of making the L1 hit rate 100% :) as this will solve one of the fundamental problems in the field of Computer Architecture named “memory wall problem.” Just imagine a modern system in which no demand requests will go to the physical memory (a.k.a. DRAM). Now, all the memory operations can be done in one cycle instead of 100s or 1000s of cycles. Wow!! On a slightly lighter note, it means the emerging applications will run 100 to 1000 times faster. Big deal. The dream was shattered when we found the applications of interest show an L1 hit rate of 88% and L2 hit rate of 23%. Intending to achieve a 100% hit rate, we propose “a bouquet of prefetchers.” Lightly speaking, each prefetcher is like a cricket player who performs well against certain opponents or in certain conditions. We proposed a bouquet of four classes of prefetchers that can be explained with the following examples:
(i) Constant Stride Class (say Sachin Tendulkar): Tuned for accesses that have some form of strides in terms of demand accesses: For example, X, X+2, X+4, … , here the stride is two.
(ii) Complex Stride Class (say Rahul Dravid): Tuned for irregular accesses like accesses from nested loops that follow some pattern. For example, X, X+3,X+5,X+1,X+4,X+6,X+2…
Here the strides are 3, 2, -4, 3, 2, -4 and a prefetcher should prefetch with stride -4 whenever it sees strides of 3 and 2.
(iii) Global Stream Class (Say Virender Sehwag): Tuned for streaming applications where the accesses are mostly in one direction from an address/access that triggers the stream. For example, X, X+5, X+6, X+8, X+9, X+17,…. One of the best ways to prefetch is to prefetch all the addresses in one direction, and aggressively.
(iv) Next-line Class (V.V.S. Laxman): This is the simplest prefetcher that we make it an adaptive one. It prefetches the next cache line address only (X+1 whenever it sees X) when the miss rate of an application is not that high.
In case an access pattern satisfies multiple classes of prefetchers, we use tiebreaker with the following order of priority Global Stream, Constant Stride, Complex Stride, and then Next-line.
We implement and classify accesses into these four classes and create a bouquet at the L1 prefetcher. Next, we communicate this classification to the L2 prefetcher and the L2 prefetcher prefetches based on the information communicated by the L1 prefetcher. “This communication is like the front order batsmen giving tips to the tailenders.” We were able to achieve this performance improvement with a hardware overhead of ~16KB.
What happened to the dream? Well, we improved the L1 hit rate from 88% to 92% and L2 hit rate from 23% to 55%, and this results in a performance improvement of ~44%. Still, miles to go!! Oh, and yes, we won the championship. These championships are happening from the last three decades, and this is the first time an Indian submission won it.
Many thanks to my remote mentee (Samuel Pakalapati from BITS Pilani). The journey “from cricket to winning the data prefetching championship” would have been impossible without him.
* Keeping in mind the fresh memories of Cricket World Cup 2019, we have not used the names of current players that are part of the team.