Analysis of the impact of cache size on CPU performance

Posted Date: 2024-01-24

CPU Cache is short for CPU cache memory, referred to in this article simply as "cache" or "Cache".

This article first introduces the computer's performance bottleneck from the perspective of the storage hierarchy, then uses Intel processors as an example to review the development of the cache, then explains the locality principle by which the cache improves CPU performance, and finally analyzes the impact of cache size on CPU performance.

1. Computer performance bottlenecks

Under the von Neumann architecture, computer storage is hierarchical. The hierarchy is shown in the figure below, in the shape of a pyramid. From top to bottom are the registers, L1 cache, L2 cache, L3 cache, main memory (RAM), hard disk, and so on.

The closer the memory is to the CPU, the faster the access speed, the smaller the capacity, and the more expensive the cost per byte.

For example, on a CPU with a clock frequency of 3.0 GHz, the registers are the fastest and can be accessed within one clock cycle. One clock cycle (the basic unit of time in the CPU) is about 0.3 nanoseconds. A memory access takes about 120 nanoseconds, a solid-state drive access takes about 50-150 microseconds, and a mechanical hard drive access takes about 1-10 milliseconds.
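As a rough illustration, the latencies above can be converted into clock cycles for a hypothetical 3.0 GHz CPU (the figures are the order-of-magnitude values quoted in the text, not measurements of any particular machine):

```python
# Illustrative latencies from the text, converted to clock cycles
# for a hypothetical 3.0 GHz CPU.
CLOCK_HZ = 3.0e9
cycle_ns = 1e9 / CLOCK_HZ  # ~0.33 ns per cycle

latencies_ns = {
    "register": cycle_ns,                 # ~1 cycle
    "main memory": 120,                   # ~120 ns
    "SSD (low end of range)": 50_000,     # ~50 us
    "HDD (low end of range)": 1_000_000,  # ~1 ms
}

for name, ns in latencies_ns.items():
    print(f"{name}: {ns:>12.1f} ns = {ns / cycle_ns:>12.0f} cycles")
```

A 120 ns memory access thus costs roughly 360 cycles, which makes the gap between register speed and memory speed concrete.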

When electronic computers first appeared, the CPU actually had no cache at all. CPU frequency was very low at the time, not even as high as that of the memory, and the CPU read and wrote memory directly. With the passage of time and technological progress, starting in the 1980s, the gap began to widen rapidly: CPU speed came to far exceed memory speed, and under the von Neumann architecture, the speed at which the CPU accesses memory became the bottleneck of computer performance!

Image source: How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips

To make up for the performance gap between the CPU and memory, that is, to speed up CPU access to memory, the CPU Cache was introduced. Cache speed is second only to the registers, and the cache acts as an intermediary between the CPU and memory.

2. Caching and its development history

The CPU Cache is built from SRAM (Static Random-Access Memory). As long as power is supplied the data remains; once power is cut off, the data is lost.

The CPU Cache is usually divided into three levels of different sizes, namely L1 Cache, L2 Cache and L3 Cache. A typical Cache layout diagram is shown below:

Here we take the Intel series as an example to review the history of Cache development.

Before the 80286, there was no cache. CPU frequency at the time was very low, not even as high as that of the memory, and the CPU read and wrote memory directly.

Starting with the 80386, the mismatch between CPU speed and memory speed began to emerge, and the gap widened rapidly. Slow memory became the computer's bottleneck and prevented the CPU's performance from being fully utilized. To solve this problem, Intel motherboards began to support an external Cache to work with the 80386.

The 80486 placed the L1 Cache (8KB) inside the CPU and supported an external L2 Cache (128KB to 256KB), without distinguishing between instructions and data. Although the L1 Cache was only 8KB, it was actually enough for the CPUs of the time. Consider the following graph of cache hit rate versus L1 and L2 size:

Image source: How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips.

From the figure above we can see that, for the CPU, the benefit of enlarging the L1 cache is not obvious: the hit rate does not improve significantly while the cost rises, so the price/performance ratio is poor. As the L2 cache grows, however, the total hit rate rises sharply, so the larger, slower, cheaper L2 is the better place to spend capacity.

Then came the Pentium (80586), the Pentium series we are familiar with. Because the Pentium adopted a dual-issue superscalar design with two parallel integer pipelines, data and instructions had to be accessed in parallel. To keep these accesses from interfering with each other, the L1 Cache was split in two: an instruction Cache and a data Cache (each 8KB). [A structure that accesses data and instructions separately is called the Harvard architecture, as opposed to the von Neumann architecture, which mixes them together.] At this time the L2 Cache was still on the motherboard. Later, Intel launched the Pentium Pro (80686), and to further improve performance the L2 Cache was moved inside the CPU.

Outside the CPU, the DRAM is still a single memory, and the hard disk is likewise a single store that does not distinguish between instructions and data. It can therefore be said that x86 CPUs adopt the Harvard architecture internally and the von Neumann architecture externally. In fact, apart from a few microcontrollers, DSPs and similar devices, almost no design distinguishes data from instructions in the outermost storage. This internal-Harvard, external-von-Neumann approach has effectively become an industry consensus.

Later came the era of multi-core CPUs. In Intel's Pentium D and Pentium E series, each core had its own L1 and L2 Cache; the caches were not shared, and synchronizing cached data relied on the bus. Finally, with the Core Duo series, the L2 Cache became shared among cores, using Intel's "Smart Cache" technology. At this point the basic pattern of the modern cache hierarchy was established.

Nowadays, the CPU Cache is usually divided into three levels of different sizes, namely L1 Cache, L2 Cache and L3 Cache. The L3 cache is shared by multiple CPU cores, while each core has its own L1 and L2. In addition, some CPUs already have an L4 Cache, and there may be more levels in the future.

3. How does cache make up for the performance difference between CPU and memory?

The cache mainly exploits the locality principle to improve overall computer performance. Because cache performance is second only to the registers, and the gap between the CPU and memory is mainly the order-of-magnitude difference in access speed between the two, letting the CPU access the cache as much as possible while reducing the number of direct accesses to main memory naturally improves the computer's performance greatly.

The so-called locality principle is mainly divided into spatial locality and temporal locality:

Temporal locality: a memory location that is referenced once is likely to be referenced again many times in the near future (typically in a loop).

Spatial locality: if a memory location is referenced, nearby locations are likely to be referenced in the near future as well.

The cache temporarily stores the instructions and data the CPU has recently fetched from main memory, because, according to the locality principle, these instructions and data are likely to be used many times within a short interval. In addition, when these data are fetched from main memory, the contents of adjacent memory locations are fetched and cached as well, because the memory near an instruction or datum is also likely to be accessed multiple times within a short interval.
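The effect of spatial locality can be made concrete with a toy direct-mapped cache simulator. The cache below (64 sets of 64-byte lines, 4 KB total) and the two access patterns are illustrative assumptions, not a model of any real CPU:

```python
class DirectMappedCache:
    """Toy direct-mapped cache: 64 sets of 64-byte lines (4 KB total)."""
    LINE = 64
    SETS = 64

    def __init__(self):
        self.tags = [None] * self.SETS
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.LINE        # which 64-byte line this byte is in
        index = line % self.SETS        # which set the line maps to
        tag = line // self.SETS         # remaining bits identify the line
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag      # fill the line on a miss

def hit_rate(addresses):
    c = DirectMappedCache()
    for a in addresses:
        c.access(a)
    return c.hits / (c.hits + c.misses)

# Sequential byte accesses (good spatial locality): one miss per 64-byte line.
seq = list(range(16 * 1024))
# One byte per line, striding over lines (no spatial locality): every access misses.
strided = list(range(0, 16 * 1024 * 64, 64))

print(f"sequential: {hit_rate(seq):.3f}")   # ~0.984
print(f"strided:    {hit_rate(strided):.3f}")  # 0.000
```

Sequential accesses hit 63 times out of every 64, because each miss pulls in a whole line of neighboring bytes; the strided pattern never reuses a fetched line and misses every time.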

When the CPU accesses instructions or data, it first checks the L1 Cache. On a hit, the data is fetched directly from the cache, avoiding a trip to main memory. On a miss, it checks the L2 Cache, and so on; if the data is not in the L3 Cache either, it is fetched from memory.
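This level-by-level lookup can be sketched in a few lines. The per-level latencies and the `lookup` helper are illustrative assumptions, not the behavior of any particular CPU:

```python
# Hypothetical lookup walk: check each level in order, accumulating latency.
# The latencies (in cycles) are illustrative, not taken from a specific CPU.
LEVELS = [("L1", 4), ("L2", 10), ("L3", 40), ("RAM", 200)]

def lookup(addr, contents):
    """contents maps a level name to the set of addresses it holds;
    RAM is assumed to hold everything."""
    total = 0
    for name, latency in LEVELS:
        total += latency
        if name == "RAM" or addr in contents.get(name, set()):
            return name, total

contents = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x10, 0x20, 0x30}}
print(lookup(0x10, contents))  # ('L1', 4)
print(lookup(0x20, contents))  # ('L2', 14)
print(lookup(0x99, contents))  # ('RAM', 254)
```

The cost of each miss compounds: an access that falls all the way through to RAM pays the latency of every level it checked along the way.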

4. Is the bigger the L1 cache, the better?

Increasing L1 will increase its hit rate, but is a larger L1 cache always better?

4.1 The impact of increasing L1 on access latency

A practical example: starting with Intel Sunny Cove (10th-generation Core), the L1 cache changed from 32KB (instructions) + 32KB (data) to 32KB (instructions) + 48KB (data). The consequence is that L1 access latency grew from 4 cycles to 5 cycles. Enlarging L1 raises the hit rate but also increases the latency. What effect does this trade-off have on the average memory access time (AMAT)?

A simple example illustrates the impact of L1 access time on AMAT. Suppose we have a three-level memory hierarchy (L1, L2, off-chip RAM), where an L2 access takes 10 cycles and an off-chip RAM access takes 200 cycles. Assume that with a 32KB L1-D roughly 90% of accesses are served by L1 and 9% by L2, and that with Sunny Cove's 48KB L1-D these fractions become 95% and 4% respectively. Then:

The average access time on Sunny Cove is approximately

0.95*5 + 0.04*10 + 0.01*200 = 7.15 cycles.

The average access time on Sunny Cove's predecessor microarchitecture is approximately

0.90*4 + 0.09*10 + 0.01*200 = 6.5 cycles.
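The two estimates can be reproduced with a few lines of Python. The hit fractions and latencies are the assumed figures from this example, not measured values:

```python
def amat(hit_fracs, latencies):
    """Weighted-average access time over the fraction of accesses
    served by each level (the simplified model used in the text)."""
    return sum(f * t for f, t in zip(hit_fracs, latencies))

# Sunny Cove (48KB L1-D, 5-cycle L1): 95% L1, 4% L2, 1% off-chip RAM
sunny = amat([0.95, 0.04, 0.01], [5, 10, 200])
# Predecessor (32KB L1-D, 4-cycle L1): 90% L1, 9% L2, 1% off-chip RAM
prev = amat([0.90, 0.09, 0.01], [4, 10, 200])

print(round(sunny, 2), round(prev, 2))       # 7.15 6.5
print(f"slowdown: {sunny / prev - 1:.1%}")   # slowdown: 10.0%
```

Under these assumed numbers, the larger L1 comes out about 10% slower on average, exactly the result discussed below.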

This simple model shows that even though enlarging L1 improves the L1 hit rate, the increase in L1 access latency still raises the average memory access time by about 10%. This is why the L1 cache size changed so little for so long.

In summary, L1 size directly affects L1 access time, and L1 access time directly affects the average memory access time (AMAT), which in turn has a large impact on overall CPU performance.

4.2 What limits L1 access latency?

During system startup, the CPU is in real mode only in the very earliest phase. Once paging is enabled, the addresses carried by load/store instructions are all virtual addresses, so accessing the L1 cache requires translating the virtual address (VA) to a physical address (PA). Modern CPUs usually use a virtually indexed, physically tagged (VIPT) L1. A great benefit of this structure is that the TLB can be accessed in parallel with indexing the cache set (hiding part of the L1 access latency). A schematic of L1 access under x64 is shown below.

TLB: Translation Lookaside Buffer, i.e. the cache of page-table translations.

With a 4KB page there are 12 page-offset bits, of which the lower 6 are the cache-line offset (64B cache line size), leaving 6 bits as L1 index bits. This limits L1 to 64 cache sets. For a 32KB L1-D, each set therefore holds 8 ways (32KB / 64 / 64 = 8); Sunny Cove's 48KB L1-D corresponds to 12 ways per set. As the figure above shows, the critical path of L1 access is the TLB lookup followed by TAG matching in the selected L1 cache set after a TLB hit. Since set selection is a simple lookup, we can assume that TAG matching can begin as soon as the TLB query completes and the PPN is available. (The TLB is itself a small set-associative cache and also requires TAG matching, so its access takes longer than the L1 set lookup.) Thus the TLB lookup time and the associativity of L1 (i.e., the TAG matching in the figure) determine L1 access latency.
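The arithmetic above can be checked directly; the line size, page size, and the VIPT set limit are the figures from the text:

```python
LINE = 64        # 64 B cache line
PAGE = 4 * 1024  # 4 KB page on x86

line_offset_bits = (LINE - 1).bit_length()    # lower 6 bits select a byte in the line
index_bits = (PAGE // LINE - 1).bit_length()  # 12 - 6 = 6 bits left for the set index
sets = 1 << index_bits                        # at most 64 sets for a VIPT L1

def ways(cache_bytes, sets=sets, line=LINE):
    """Associativity forced by a fixed set count and line size."""
    return cache_bytes // (sets * line)

print(ways(32 * 1024))  # 8  (8-way, pre-Sunny-Cove 32 KB L1-D)
print(ways(48 * 1024))  # 12 (12-way, Sunny Cove 48 KB L1-D)
```

Because the set count is pinned at 64 by the page size, the only way to grow a VIPT L1 is to add ways, which lengthens the tag-match path.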

To keep L1 access latency low enough, the associativity of L1 is usually designed to be as small as possible, which is why it was 8-way before Sunny Cove. It is unclear whether Sunny Cove's extra cycle of L1 latency comes from changes to the TLB or from the increase in associativity from 8-way to 12-way. In general, though, we can see why L1 cannot be made large: the constraint comes mainly from the virtually indexed, physically tagged structure.

On the other hand, the 32KB L1 size was chosen so that one way exactly covers a 4KB page (64 sets × 64B = 4KB). This design persisted from the first-generation Core through the tenth. With 8 ways per set, 8 pages can be cached simultaneously. Increasing the capacity not only raises latency, but may also leave a large fraction of the ways idle most of the time, wasting resources.

In general, increasing the cache line size greatly increases the latency of a cache miss, especially since L1 is indexed by virtual address and a miss carries an even larger penalty. Increasing the number of ways or sets, on the other hand, increases the addressing latency, which is not worth the gain.

In addition, L1 cache replacement and prefetch strategies are very complex, so an increase in latency hurts performance more than the extra capacity helps.

4.3 Why is the L1 of Apple's M1 larger than that of x86?

Why can Apple's M1 achieve a 192KB L1-I (and 128KB L1-D) while keeping access latency at 3 cycles?

First, Apple's maximum clock speed is lower than that of Intel's and AMD's desktop/server CPUs, so the timing constraints are more relaxed (mainly on the critical path through the tag-match comparators). The second and most critical point is that macOS on the M1 uses 16KB pages instead of the 4KB pages of x86.

As the figure above shows, enlarging the page (16KB corresponds to a 14-bit offset) adds 2 index bits, so the number of L1 cache sets can be quadrupled (while keeping the associativity unchanged). This is why the M1's L1-I is exactly 4 times that of Sunny Cove (192KB / 48KB = 4), and the M1's L1-D is 128KB (4 × 32KB).
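Under these assumptions, the maximum VIPT cache size scales linearly with page size. The sketch below reproduces the sizes quoted in the text for Sunny Cove and the M1:

```python
LINE = 64  # 64 B cache line, as in the text

def max_vipt_kb(page_bytes, ways):
    """Largest VIPT cache (in KB) whose index bits fit inside the
    page offset: one way spans exactly one page."""
    sets = page_bytes // LINE
    return sets * LINE * ways // 1024

# 4 KB pages (x86) vs 16 KB pages (M1 under macOS), at equal associativity:
print(max_vipt_kb(4 * 1024, 12), max_vipt_kb(16 * 1024, 12))  # 48 192
print(max_vipt_kb(4 * 1024, 8),  max_vipt_kb(16 * 1024, 8))   # 32 128
```

Quadrupling the page size quadruples the set count, so 12 ways yield 48KB on x86 but 192KB on the M1, and 8 ways yield 32KB versus 128KB, matching the L1 sizes discussed above.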

The L1 cache size in question is really a system-level trade-off (larger pages cause memory waste, fragmentation and other problems, but also reduce page-table pressure and increase TLB coverage). Apple, however, ships the M1 only in its own devices and can customize more freely. For x86, historical baggage and compatibility with a huge range of devices make such flexible architectural design hard to achieve.

Review Editor: Huang Fei
