عجفت الغور

swe tea

computers

Profiling a Warehouse Scale Computer

  • Kanev, Svilen, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. “Profiling a Warehouse-Scale Computer.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 158–69. Portland Oregon: ACM, 2015. https://doi.org/10.1145/2749469.2750392.

  • They only used C++, since it made it simpler

  • only on Ivy Bridge machines

  • No “killer application” to optimize for. Large chunks of compute are data-locality bound and CPU-stall bound, which suggests that 2-wide SMT is not sufficient to eliminate the bulk of the overheads

    • What is a 2 wide SMT anyways?
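    • It means two hardware threads share one physical core (e.g. Intel Hyper-Threading), so the core can run one thread while the other is stalled; it is not about issuing 2 instructions at once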
    • workload diversity is very real, we’ve gotten a range of compute that’s wide enough for this not to matter
      • At the start, 50 hottest binaries account for 80% of execution
      • Three years later, top 50 are only 60%
      • Coverage decreases by more than 5 percentage points per year over the course of 3 years
      • Also does not include public clouds
    • Applications, as they grow more diverse and fatter, have gotten more flat profiles themselves
      • What would this look like for chatd?
  • “Data center tax” is very real: large chunks of your machine are going to be devoted to logging, RPC, and ser/des
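
To make the coverage numbers above concrete (top 50 binaries going from 80% to 60% of execution), here is a minimal sketch of how such a figure could be computed from per-binary cycle counts. The function name and the sample data are hypothetical; nothing here comes from the paper's tooling.

```python
def top_n_coverage(cycles_by_binary, n=50):
    """Fraction of all sampled cycles spent in the n hottest binaries."""
    total = sum(cycles_by_binary.values())
    if total == 0:
        return 0.0
    hottest = sorted(cycles_by_binary.values(), reverse=True)[:n]
    return sum(hottest) / total


# Hypothetical sample: three binaries, made-up cycle counts.
sample = {"search": 60, "ads": 30, "logs": 10}
coverage_top2 = top_n_coverage(sample, n=2)
```

A flatter profile shows up as this fraction dropping for a fixed n, which is the trend the notes describe.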

  • Yacine: top down measurement? never heard of this before

    • Yasin, Ahmad. “A Top-Down Method for Performance Analysis and Counters Architecture.” In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. https://doi.org/10.1109/ISPASS.2014.6844459.
    • Talks about core front-end and core back-end, what is that?
      • Front end:
        • instruction fetch
        • decode unit
        • branch prediction
        • uop cache
        • loop stream detector (?) - optimizes tight loops
      • Back end:
        • sched/reservation station
        • execution units
        • reorder buffer
        • register file
        • load/store units
    • Top-down classifies pipeline slots into retiring (useful work), frontend bound, backend bound, and bad speculation
    • They believe that cache problems (lots of lukewarm code) are why the frontend is the primary staller
      • i.e. binaries with 100s of MB
  • memcpy() and memmove() are 4-5% of datacenter cycles

    • as is encryption
  • 25% of datacenter tax is compressing and decompressing data
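
The top-down split into retiring, frontend bound, backend bound, and bad speculation can be sketched as arithmetic over a handful of counter values. This is a sketch under assumptions: the level-1 formulas are as I recall them from Yasin's paper for a 4-wide core, the parameter names are my own shorthand for the raw counters (unhalted clocks, uops not delivered by the frontend, uops issued, retire slots, recovery cycles), and the input numbers are made up.

```python
def top_down_level1(clocks, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles, width=4):
    """Split pipeline slots into the four top-level categories.

    width is the assumed number of issue slots per cycle (4 here).
    All other arguments are raw counter values over the same interval.
    """
    slots = width * clocks
    frontend = uops_not_delivered / slots
    # Uops issued but never retired, plus slots lost while recovering.
    bad_spec = (uops_issued - uops_retired_slots
                + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left is attributed to the backend.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Made-up counter values, only to show the shape of the output.
breakdown = top_down_level1(
    clocks=1000,
    uops_not_delivered=1000,
    uops_issued=1800,
    uops_retired_slots=1600,
    recovery_cycles=50,
)
```

The four fractions sum to 1 by construction, since backend bound is defined as the remainder.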

Dhalion

  • This paper was actually quite dull
  • Interesting bit is the split: metrics -> symptoms -> many to many -> diagnoses -> many to many -> resolvers
    • This whole thing is called a “policy”
  • Control loop system, explodes in complexity
    • Part of the control loop is blacklisting certain actions from occurring that previously didn’t move you towards your desired solution
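
A minimal sketch of one pass through such a policy, including the blacklist, might look like the following. Every name here (the detectors, diagnoses, actions, and metric keys) is hypothetical and chosen for illustration; this is not Dhalion's actual API.

```python
def pick_action(metrics, detectors, diagnosers, resolvers, blacklist):
    """One pass: metrics -> symptoms -> diagnoses -> first allowed action."""
    symptoms = {name for name, detect in detectors.items() if detect(metrics)}
    diagnoses = [name for name, diagnose in diagnosers.items()
                 if diagnose(symptoms)]
    for diagnosis in diagnoses:
        for action in resolvers.get(diagnosis, []):
            if (diagnosis, action) not in blacklist:
                return diagnosis, action
    return None


def record_outcome(blacklist, diagnosis, action, improved):
    """Blacklist a (diagnosis, action) pair that did not help."""
    if not improved:
        blacklist.add((diagnosis, action))


# Hypothetical policy: several symptoms feed each diagnosis,
# and a diagnosis can map to several candidate actions.
detectors = {
    "backpressure": lambda m: m["backpressure_ms"] > 0,
    "skew": lambda m: m["max_rate"] > 2 * m["min_rate"],
}
diagnosers = {
    "underprovisioned": lambda s: "backpressure" in s and "skew" not in s,
    "data_skew": lambda s: "backpressure" in s and "skew" in s,
}
resolvers = {
    "underprovisioned": ["scale_up", "restart_instance"],
    "data_skew": ["repartition"],
}

metrics = {"backpressure_ms": 120, "max_rate": 110, "min_rate": 100}
blacklist = set()

first = pick_action(metrics, detectors, diagnosers, resolvers, blacklist)
record_outcome(blacklist, *first, improved=False)  # pretend it did not help
second = pick_action(metrics, detectors, diagnosers, resolvers, blacklist)
```

The complexity the notes mention is visible even here: each new symptom or diagnosis multiplies the combinations the policy has to reason about.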

PGO - Block 1 2026

Li, David Xinliang, Raksit Ashok, and Robert Hundt. “Lightweight Feedback-Directed Cross-Module Optimization.” Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, ACM, April 24, 2010, 53–61. https://doi.org/10.1145/1772954.1772964.

  • Two most important IPO passes are function inlining and indirect function call promotion
  • Basically:
    1. We want to have a smarter way of combining things without making fat .o files
    2. We can use FDO analysis to determine which functions are hot and worth inlining or promoting
    3. In order to do so, we can use a greedy algorithm to generate families, which help link these together
  • We push linking earlier by using the profiled data
  • Profile data is augmented FDO data:
    • After training run is done, the in-memory info contains a callgraph, using a greedy clustering algorithm to decide on “friends”
    • afterwards, we have standard FDO data (raw counts of how many times each branch was taken), plus the module grouping decisions
  • In order to make this workable for Buck / Blaze / Bazel, an auxiliary file needs to get shipped so the build system knows which sources to include, even when they’re not strictly dependent, as a form of dynamic dependency injection
  • Predecessor to LTO (and ThinLTO)
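
The greedy grouping step can be sketched roughly as below. This is a simplification for intuition, not the paper's exact algorithm: the hotness threshold, the cap on auxiliary modules, and all function and module names are assumptions of mine.

```python
def group_modules(call_edges, func_module, hot_threshold, max_aux=2):
    """Greedily attach the modules of hot callees to each caller's module.

    call_edges: iterable of (caller, callee, call_count) from the profile.
    func_module: maps a function name to the source module defining it.
    Returns {primary_module: [auxiliary_module, ...]}.
    """
    groups = {}
    # Visit the hottest cross-module edges first.
    for caller, callee, count in sorted(call_edges, key=lambda e: e[2],
                                        reverse=True):
        if count < hot_threshold:
            break  # everything after this edge is colder still
        primary, aux = func_module[caller], func_module[callee]
        if primary == aux:
            continue  # same module, nothing to import
        members = groups.setdefault(primary, [])
        if aux not in members and len(members) < max_aux:
            members.append(aux)
    return groups


# Hypothetical profile: function names and counts are made up.
func_module = {"f": "a.cc", "k": "a.cc", "g": "b.cc", "h": "c.cc"}
call_edges = [
    ("f", "g", 100),   # hot, crosses a.cc -> b.cc
    ("f", "h", 5),     # cold, ignored
    ("g", "f", 50),    # hot, crosses b.cc -> a.cc
    ("k", "f", 1000),  # hot but within a.cc
]
module_groups = group_modules(call_edges, func_module, hot_threshold=10)
```

The resulting mapping is the kind of information the auxiliary file would need to carry so the build system can pull in the extra sources when compiling each primary module.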