Recently I learned that the GCC compiler intrinsic for prefetching data into the CPU cache, __builtin_prefetch, can actually take two optional extra arguments: rw and locality. This is slightly embarrassing for me, because as far as I can tell, the intrinsic has had the three-argument form from day one, starting with GCC 3.1, released in 2002.
The rw argument can be 0 or 1, hinting whether the anticipated memory access is going to be a read (0, the default) or a write (1). locality can be 0, 1, 2 or 3, with 0 hinting that the memory will be accessed just once, 3 (the default) asking to keep it cached as much as possible, and 1 and 2 falling in between. Both arguments must be compile-time constants.
The next thing I didn't know was what the different argument values compile to on x86_64. With GCC 10 on Godbolt I found:
__builtin_prefetch(ptr); // prefetcht0
__builtin_prefetch(ptr, 1); // prefetchw
__builtin_prefetch(ptr, 0); // prefetcht0
__builtin_prefetch(ptr, 1, 0); // prefetchw
__builtin_prefetch(ptr, 1, 1); // prefetchw
__builtin_prefetch(ptr, 1, 2); // prefetchw
__builtin_prefetch(ptr, 1, 3); // prefetchw
__builtin_prefetch(ptr, 0, 0); // prefetchnta
__builtin_prefetch(ptr, 0, 1); // prefetcht2
__builtin_prefetch(ptr, 0, 2); // prefetcht1
__builtin_prefetch(ptr, 0, 3); // prefetcht0
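To show where these arguments would sit in real code, here is a minimal sketch of a read-only streaming loop. The function name sum_with_prefetch and the distance PREFETCH_AHEAD are made up for illustration, and the distance is an untuned guess. Because every element is read exactly once, the sketch passes rw = 0 and locality = 0, which per the table above should become prefetchnta with GCC on x86_64.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in elements; not tuned for any CPU. */
enum { PREFETCH_AHEAD = 16 };

/* Hypothetical helper: sums an array while hinting at upcoming reads. */
uint64_t sum_with_prefetch(const uint64_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array so the address expression itself is valid. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0 /* read */, 0 /* use once */);
        sum += data[i];
    }
    return sum;
}
```

A sequential scan like this is exactly what hardware prefetchers already handle well, so treat it as a syntax illustration rather than a speedup recipe.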
Next let's check Intel docs on these instructions:
PREFETCHT0: prefetch to all cache levels.
PREFETCHT1: prefetch to L1 and L2. Of course if a particular implementation has an inclusive L3 cache, it will end up there too.
PREFETCHT2: Intel's Software Developer's Manual states: prefetch to L1/L2/L3, "or an implementation-specific choice." Intel's Optimization Reference Manual states that this is "identical to PREFETCHT1," that is, prefetch into L1 and L2, but not necessarily L3, depending on its inclusiveness.
PREFETCHNTA: the SDM tries to abstract from actual hardware a bit by describing this as "prefetch into non-temporal cache close to the CPU, try not to pollute the regular caches." The ORM explains what it means in practice: on non-Xeon, prefetch to L1, bypassing L2, and on Xeon, prefetch to "L3 with fast replacement" - I have no idea what "fast replacement" means here.
PREFETCHW: according to the SDM, prefetch into L1 or L2 and invalidate other instances of this cache line. And according to the ORM, prefetch into all levels and invalidate the same. The invalidation is of course why this is a write-specific prefetch instruction, saving an invalidation later at the actual write time.
Some common properties of all these instructions are as follows. First, prefetch stride. I always assumed it was 64 bytes, that is, the L1 cache line size. The docs added a few wrinkles to that, hopefully all historical: 1) the minimum is 32 bytes; 2) NetBurst Pentium 4 used to prefetch two cache lines; 3) it should be queried through CPUID. In practice everyone seems to agree it's 64 nowadays.
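As a sketch of what that stride means in code, the following hypothetical helper prefetch_range issues one prefetch per cache line overlapped by a buffer. ASSUMED_LINE_SIZE is hard-coded to 64 on the strength of the "everyone agrees" consensus above; it is an assumption, not something the code queries from the CPU. The helper returns the number of prefetches issued, only so that the stride arithmetic can be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, not queried via CPUID. */
#define ASSUMED_LINE_SIZE 64

/* Hypothetical helper: prefetch every line overlapping [ptr, ptr + len). */
size_t prefetch_range(const void *ptr, size_t len)
{
    if (len == 0)
        return 0;
    /* Round the start down to a line boundary. */
    uintptr_t addr = (uintptr_t)ptr & ~(uintptr_t)(ASSUMED_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)ptr + len;
    size_t count = 0;
    for (; addr < end; addr += ASSUMED_LINE_SIZE) {
        __builtin_prefetch((const void *)addr); /* defaults: read, locality 3 */
        count++;
    }
    return count;
}
```

For a 64-byte-aligned buffer of 256 bytes this issues four prefetches; a 64-byte range starting one byte past a line boundary straddles two lines and so issues two.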
Second, instruction scheduling. The Intel docs seem to have been last updated for Pentium 4 (twenty years ago). For that, they state "insert a PREFETCH instruction every 20 to 25 cycles." Finding no newer info in the official sources, I consulted Agner Fog's instruction tables, where the prefetch instructions are listed with a reciprocal throughput of 0.5 to 1 clock cycles per instruction per thread. That is, except for Ivy Bridge, where it's 43 for some reason. Anyway, these days it does not seem to be particularly harmful to issue several prefetches in a row without interleaving them with computational instructions.
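If back-to-back prefetches are indeed cheap, one pattern that becomes reasonable is to prefetch a whole batch of unrelated addresses at once and then do the work. The sketch below is only an illustration under that assumption: gather_sum and BATCH are hypothetical names, the batch size is untuned, and whether this beats the plain loop has to be measured on the target CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed batch size; not tuned for any CPU. */
enum { BATCH = 4 };

/* Hypothetical helper: sums table[idx[i]] for i in [0, n). */
uint64_t gather_sum(const uint64_t *table, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += BATCH) {
        /* Issue the prefetches for the next batch back to back. */
        for (size_t j = i + BATCH; j < i + 2 * BATCH && j < n; j++)
            __builtin_prefetch(&table[idx[j]], 0 /* read */, 3 /* keep cached */);
        /* Then do the actual work on the current batch. */
        for (size_t j = i; j < i + BATCH && j < n; j++)
            sum += table[idx[j]];
    }
    return sum;
}
```

The first batch is processed without any prefetch; a real implementation might warm it up before the loop.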