BOLT, FDO, PGO, and Propeller all tell you specifically that it's frontend stalls that matter
We really want to be careful about frontend stalls, since i-cache stalls prevent the CPU from doing useful work
Frontend is responsible for fetching and decoding instructions
Backend is responsible for executing instructions
Code layout matters for the icache
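A minimal C sketch of the layout idea (handle_request/log_error are made-up names): cold/unlikely hints let the compiler push rarely-executed code out of line so the hot path stays dense in the i-cache.

    /* layout_hint.c: build with gcc/clang -O2 */
    #include <stdio.h>

    /* cold: the compiler moves this function (and the paths calling it)
       away from the hot text, typically into a .text.unlikely section */
    __attribute__((cold, noinline))
    static void log_error(int code) {
        fprintf(stderr, "request failed: %d\n", code);
    }

    int handle_request(int status) {
        /* tell the compiler the error branch is unlikely, so the
           fall-through hot path is laid out contiguously */
        if (__builtin_expect(status != 0, 0)) {
            log_error(status);
            return -1;
        }
        return 0; /* hot path stays small and i-cache resident */
    }

    int main(void) { return handle_request(0); }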
Two types of optimizations really matter
Inlining (frees up the icache)
Indirect call promotion
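Roughly what indirect call promotion does, written out by hand in C (fast_handler/dispatch are invented names): the profile says one target dominates, so a guarded direct call is inserted ahead of the indirect one, and that direct call can then be inlined.

    typedef int (*handler_fn)(int);

    static int fast_handler(int x) { return x + 1; } /* the hot target per the profile */

    /* before promotion: opaque indirect call, hard to predict and impossible to inline */
    static int dispatch_before(handler_fn fn, int x) {
        return fn(x);
    }

    /* after promotion: guard on the hot target and call it directly;
       the direct call is predictable and inlinable, the indirect call stays as fallback */
    static int dispatch_after(handler_fn fn, int x) {
        if (fn == fast_handler)
            return fast_handler(x);
        return fn(x);
    }

    int main(void) {
        return dispatch_after(fast_handler, 1) - dispatch_before(fast_handler, 1);
    }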
CPU needs to fetch instructions from memory, and if the required instruction is not in the i-cache, the frontend must wait for it to be retrieved (hundreds of cycles)
Instruction TLB
An i-TLB miss forces a page walk before the instruction fetch can proceed? But how exactly does this work
Frontend work can't be parallelized away: if you're waiting on d-cache data you can do something else in the meantime, but a frontend stall means fetch is stuck waiting on the i-cache and can't feed uOps
The backend can do out-of-order execution by looking ahead and finding instructions that don't depend on the missing data
How?
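A hand-wavy C sketch of the difference (made-up names): the load of n->value can miss in the d-cache, but the acc2 arithmetic doesn't depend on it, so the out-of-order backend keeps executing while the load is in flight. Nothing similar is possible for an i-cache miss, because the instructions themselves haven't arrived yet.

    struct node { long value; struct node *next; };

    long sum_with_independent_work(const struct node *n, const long *hot, int k) {
        long acc1 = 0, acc2 = 0;
        while (n) {
            acc1 += n->value;       /* may miss in the d-cache */
            acc2 += hot[k & 7] * 3; /* independent work the backend can run
                                       while the missing load is outstanding */
            k++;
            n = n->next;
        }
        return acc1 + acc2;
    }

    int main(void) {
        struct node b = { 2, 0 };
        struct node a = { 1, &b };
        long hot[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        return (int)(sum_with_independent_work(&a, hot, 0) & 0);
    }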
But how does the L1 d-cache vs. L1 i-cache split work?
Swift is i-cache heavy because of indirection?
Harvard architecture
I-tlb vs D-tlb
How would you tell them apart in perf? (see the counters below)
L2 TLB is unified
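What I'd try first in perf (these are the generic event aliases; exact names and availability depend on the CPU and kernel, check perf list):

    perf stat -e L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses ./app
    perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./app

On recent Intel parts, perf stat --topdown (or the TopdownL1 metric group) also reports the frontend-bound vs backend-bound split directly, though support varies by kernel version.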
BOLTing, Propeller, and Lightning BOLT
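The basic BOLT loop, as I remember it from the llvm-bolt README (flag names and the ./app / perf.fdata paths are illustrative, verify against llvm-bolt --help; the binary should be linked with --emit-relocs for full reordering):

    perf record -e cycles:u -j any,u -o perf.data -- ./app    # sample LBR on a representative run
    perf2bolt -p perf.data -o perf.fdata ./app                # aggregate branch samples into a BOLT profile
    llvm-bolt ./app -o ./app.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort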
Zero-cost inlining is not actually zero cost (see memcmp vs memcpy)
Outline
i-cache as bottleneck
why frontend stalls are uniquely bad
FDO was great, but the workflow was annoying
LBR with perf
You can slice every part of the LBR buffer: use the branches in and out of each BB to reconstruct the CFG, and the cycle counts for how long you're spending there (example commands below)
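Example commands for the LBR part (record/report flags I'm fairly confident exist, but double-check perf record --help; LBR cycle counts need a reasonably recent core):

    perf record -b -e cycles:u -- ./app                    # -b / --branch-any samples the LBR stack on each event
    perf record -j any_call,any_ret -e cycles:u -- ./app   # -j filters which branch types get recorded
    perf report --branch-history                           # fold the branch records into the report view
    perf script -F ip,brstack                              # raw from/to(/cycles) tuples per sample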