swe tea
- Weekly paper club/book club/video club
- Need to figure out timings (Tuesday nights?)
- model off of the eBPF reading groups and the distributed systems reading groups
Profiling a Warehouse Scale Computer
- Kanev, Svilen, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. “Profiling a Warehouse-Scale Computer.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 158–69. Portland Oregon: ACM, 2015. https://doi.org/10.1145/2749469.2750392.
- They only used C++, since it made it simpler
- only on Ivy Bridge machines
- No “killer application” to optimize for; large chunks of compute are data-locality bound and CPU-stall bound, which suggests that 2-wide SMT is not sufficient to eliminate the bulk of the overheads
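- What is 2-wide SMT anyway?
- I first assumed it meant 2 instructions at once, but it means two hardware threads sharing one physical core (e.g. Intel Hyper-Threading), so one thread can use execution resources while the other is stalled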
- workload diversity is very real, we’ve gotten a range of compute that’s wide enough for this not to matter
- At the start, 50 hottest binaries account for 80% of execution
- Three years later, top 50 are only 60%
- Coverage decreases by more than 5 percentage points per year over the course of 3 years
- Also does not include public clouds
- Applications, as they grow more diverse and fatter, have gotten more flat profiles themselves
- What would this look like for chatd?
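A quick check of the coverage numbers above, assuming a linear decline between the two data points in these notes (80% at the start, 60% three years later). The function name is just for illustration.

```python
def coverage_drop_per_year(start_pct: float, end_pct: float, years: float) -> float:
    """Average yearly drop in coverage, in percentage points (assumes a linear decline)."""
    return (start_pct - end_pct) / years


# Top-50 binaries: 80% of execution at the start, 60% three years later.
drop = coverage_drop_per_year(80.0, 60.0, 3)
print(f"{drop:.1f} percentage points per year")
```

This comes out above the "more than 5 per year" figure, which only makes sense if the 5 is percentage points rather than a relative 5%.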
- “Data center tax” is very real, large chunks of your machine are going to be devoted to doing logging, rpc, ser/des
- Yacine: top down measurement? never heard of this before
- Yasin, Ahmad. “A Top-Down Method for Performance Analysis and Counters Architecture.” In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. https://doi.org/10.1109/ISPASS.2014.6844459.
- Talks about core front-end and core back-end, what is that?
- Front end:
- instruction fetch
- decode unit
- branch prediction
- uop cache
- loop stream detector (?) - optimizes tight loops
- Back end:
- sched/reservation station
- execution units
- reorder buffer
- register file
- load/store units
- Top down classifies pipeline slots into retiring (useful work), frontend bound, backend bound, and bad speculation
- They believe that instruction cache problems (lots of lukewarm code) are why the frontend is the primary staller
- i.e. binaries with 100s of mb
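A rough sketch of how the four top-level Top-Down categories fall out of raw counters, following the level-1 formulas in the Yasin paper. It assumes a 4-wide pipeline and the Sandy Bridge / Ivy Bridge era event names (listed in the comments). The `Counters` class, the function name, and the sample values are made up for illustration.

```python
from dataclasses import dataclass

PIPELINE_WIDTH = 4  # issue slots per cycle; assumption for Ivy Bridge-class cores


@dataclass
class Counters:
    cycles: int              # CPU_CLK_UNHALTED.THREAD
    uops_not_delivered: int  # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued: int         # UOPS_ISSUED.ANY
    uops_retired_slots: int  # UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles: int     # INT_MISC.RECOVERY_CYCLES


def top_down_level1(c: Counters) -> dict:
    """Split all pipeline slots into the four level-1 Top-Down categories."""
    slots = PIPELINE_WIDTH * c.cycles
    frontend = c.uops_not_delivered / slots
    # Issued-but-never-retired uops plus slots lost while recovering from a mispredict.
    bad_spec = (c.uops_issued - c.uops_retired_slots
                + PIPELINE_WIDTH * c.recovery_cycles) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left over is attributed to the back end.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Made-up sample counts, not numbers from the paper.
sample = Counters(cycles=1000, uops_not_delivered=1000, uops_issued=1400,
                  uops_retired_slots=1200, recovery_cycles=50)
for name, share in top_down_level1(sample).items():
    print(f"{name}: {share:.2f}")
```

The point of the method is that the four fractions always sum to 1, so every slot is blamed on exactly one category.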
- memcpy() and memmove() are 4-5% of datacenter cycles
- as is encryption
- 25% of datacenter tax is compressing and decompressing data
Dhalion
- This paper was actually quite dull
- Interesting bit is the split: metrics -> symptoms -> many to many -> diagnoses -> many to many -> resolvers
- This whole thing is called a “policy”
- Control loop system, explodes in complexity
- Part of the control loop is blacklisting certain actions from occurring that previously didn’t move you towards your desired solution
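A minimal sketch of what one such policy could look like as a control loop, including the blacklist. This is not Dhalion's actual API: every metric name, symptom, diagnosis, and action below is hypothetical, and blacklisting on (diagnosis, action) pairs is my reading of the idea rather than a confirmed detail.

```python
# metrics -> symptoms (hypothetical detectors and thresholds)
def detect_symptoms(metrics: dict) -> set:
    symptoms = set()
    if metrics.get("backpressure_ms", 0) > 0:
        symptoms.add("backpressure")
    if metrics.get("queue_skew", 1.0) > 2.0:
        symptoms.add("queue_skew")
    return symptoms


# symptoms -> diagnoses is many to many: a diagnosis fires when all its symptoms are present.
DIAGNOSERS = {
    "underprovisioned": {"backpressure"},
    "data_skew": {"backpressure", "queue_skew"},
}

# diagnoses -> resolvers is many to many as well.
RESOLVERS = {
    "underprovisioned": ["scale_up"],
    "data_skew": ["repartition", "scale_up"],
}


def diagnose(symptoms: set) -> list:
    return [d for d, needed in DIAGNOSERS.items() if needed <= symptoms]


def choose_action(diagnoses: list, blacklist: set):
    """Pick the first (diagnosis, action) pair that has not been blacklisted."""
    for d in diagnoses:
        for action in RESOLVERS[d]:
            if (d, action) not in blacklist:
                return d, action
    return None


def policy_step(metrics: dict, apply_action, blacklist: set):
    """One loop iteration: detect, diagnose, resolve, then check whether it helped."""
    symptoms = detect_symptoms(metrics)
    choice = choose_action(diagnose(symptoms), blacklist)
    if choice is None:
        return None
    _, action = choice
    metrics_after = apply_action(action, metrics)
    # If every original symptom is still present, the action did not help: blacklist it.
    if detect_symptoms(metrics_after) >= symptoms:
        blacklist.add(choice)
    return action


# Fake environment for the demo: only repartitioning fixes the problem.
def fake_apply(action: str, metrics: dict) -> dict:
    if action == "repartition":
        return {"backpressure_ms": 0, "queue_skew": 1.0}
    return dict(metrics)


blacklist = set()
unhealthy = {"backpressure_ms": 120, "queue_skew": 3.5}
print(policy_step(unhealthy, fake_apply, blacklist))  # tries scale_up, which gets blacklisted
print(policy_step(unhealthy, fake_apply, blacklist))  # falls through to repartition
```

Even with two symptoms and two diagnoses, the number of (diagnosis, action) pairs to track grows quickly, which is where the complexity explosion comes from.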
PGO - Block 1 2026
Li, David Xinliang, Raksit Ashok, and Robert Hundt. “Lightweight Feedback-Directed Cross-Module Optimization.” Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, ACM, April 24, 2010, 53–61. https://doi.org/10.1145/1772954.1772964.
- Two most important IPO passes are function inlining and indirect function call promotion
- Basically:
- We want to have a smarter way of combining things without making fat .o files
- We can use FDO analysis to determine which functions are hot and worth inlining or promoting
- In order to do so, we can use a greedy algorithm to generate families, which help link these together
- We push linking earlier by using the profiled data
- Profile data is augmented FDO data:
- After the training run is done, the in-memory info contains a call graph, and a greedy clustering algorithm runs over it to decide on “friends”
- afterwards, we have standard FDO data (raw counts of how many times each branch was taken), and also module grouping decisions
- In order to make this workable for Buck / Blaze / Bazel, an auxiliary file needs to get shipped so the build system knows which sources to include, even when they’re not strictly dependent, as a form of dynamic dependency injection
- Predecessor to LTO (and ThinLTO)
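A simplified sketch of the greedy module grouping idea, not the paper's exact algorithm. It assumes the profile gives call counts per (caller, callee) edge, walks edges hottest first until a chosen fraction of all calls is covered, and for each hot cross-module edge adds the callee's module as a "friend" of the caller's module. The function name, the `hot_fraction` cutoff, and the sample call graph are all made up.

```python
from collections import defaultdict


def group_modules(call_edges: dict, module_of: dict, hot_fraction: float = 0.95) -> dict:
    """Map each primary module to the auxiliary modules worth compiling with it.

    call_edges: (caller, callee) -> dynamic call count from the training run
    module_of:  function name -> source module
    """
    total = sum(call_edges.values())
    groups = defaultdict(list)
    covered = 0
    # Greedy: visit the hottest edges first, stop once enough of the calls are covered.
    for (caller, callee), count in sorted(call_edges.items(),
                                          key=lambda kv: kv[1], reverse=True):
        if total == 0 or covered >= hot_fraction * total:
            break
        covered += count
        primary, aux = module_of[caller], module_of[callee]
        if primary != aux and aux not in groups[primary]:
            groups[primary].append(aux)
    return dict(groups)


# Made-up call graph from a hypothetical training run.
edges = {
    ("main", "parse"): 900,
    ("parse", "lex"): 80,
    ("main", "log"): 15,
    ("lex", "log"): 5,
}
module_of = {"main": "a.cc", "parse": "b.cc", "lex": "b.cc", "log": "c.cc"}

print(group_modules(edges, module_of))
```

The resulting groups are the kind of information the auxiliary file would hand to the build system: when compiling a.cc, also make b.cc available so the hot callee can be inlined across the module boundary.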