
Profiling a Warehouse Scale Computer

  • Kanev, Svilen, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. “Profiling a Warehouse-Scale Computer.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 158–69. Portland, Oregon: ACM, 2015. https://doi.org/10.1145/2749469.2750392.

  • They only analyzed C++ code, since it made the analysis simpler

  • only on Ivy Bridge machines

  • No “killer application” to optimize for; large chunks of compute are data-locality bound and CPU-stall bound, which suggests that 2-wide SMT is not sufficient to eliminate the bulk of the overheads

    • What is 2-wide SMT anyways?
    • It means two hardware threads sharing one core’s pipeline (Hyper-Threading style), not two instructions at once; the idea is that one thread can issue while the other is stalled, but two threads still aren’t enough to cover the stalls seen here
    • workload diversity is very real; the range of workloads is wide enough that optimizing any single application doesn’t buy much
      • At the start, the 50 hottest binaries account for 80% of execution (see the coverage sketch after this list)
      • Three years later, the top 50 only cover 60%
      • Coverage decreases by more than 5 percentage points per year over the course of 3 years
      • Also does not include public clouds
    • Applications, as they grow more diverse and fatter, have developed flatter profiles themselves
      • What would this look like for chatd?
  • “Data center tax” is very real, large chunks of your machine are going to be devoted to doing logging, rpc, ser/des

  • Yacine: top down measurement? never heard of this before

    • Yasin, Ahmad. “A Top-Down Method for Performance Analysis and Counters Architecture.” In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. https://doi.org/10.1109/ISPASS.2014.6844459.
    • Talks about core front-end and core back-end, what is that?
      • Front end:
        • instruction fetch
        • decode unit
        • branch prediction
        • uop cache
        • loop stream detector - caches the uops of tight loops and replays them so fetch/decode can idle
      • Back end:
        • sched/reservation station
        • execution units
        • reorder buffer
        • register file
        • load/store units
    • Top-down classifies pipeline slots into retiring (useful work), frontend bound, backend bound, and bad speculation (see the level-1 arithmetic sketch after this list)
    • They believe that instruction-cache problems (lots of lukewarm code) are why the frontend is the primary staller
      • i.e. binaries with 100s of MB of code
  • memcpy() and memmove() are 4-5% of datacenter cycles

    • as is encryption
  • 25% of datacenter tax is compressing and decompressing data
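
A quick worked example of the coverage numbers above, as a minimal Python sketch: sort binaries by cycles consumed and ask what fraction of fleet cycles the 50 hottest cover. The profile data here is invented purely for illustration; the paper's real numbers come from Google-Wide Profiling samples.

    def top_n_coverage(cycles_by_binary, n=50):
        """Fraction of total cycles covered by the n hottest binaries."""
        total = sum(cycles_by_binary.values())
        hottest = sorted(cycles_by_binary.values(), reverse=True)[:n]
        return sum(hottest) / total

    # Hypothetical fleet profile: binary name -> sampled cycles.
    profile = {f"binary_{i}": 1000 // (i + 1) for i in range(500)}
    print(f"top-50 coverage: {top_n_coverage(profile):.1%}")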
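
The top-down split referenced in the notes above is just counter arithmetic over pipeline slots. A minimal sketch, assuming the Ivy-Bridge-era counter names and 4-wide issue from Yasin's formulation; the counter values are made up for illustration.

    ISSUE_WIDTH = 4  # pipeline slots per cycle on a 4-wide core

    def topdown_level1(cycles, uops_issued, uops_retired_slots,
                       idq_uops_not_delivered, recovery_cycles):
        """Classify all pipeline slots into the four level-1 buckets."""
        slots = ISSUE_WIDTH * cycles
        frontend_bound = idq_uops_not_delivered / slots
        bad_speculation = (uops_issued - uops_retired_slots
                           + ISSUE_WIDTH * recovery_cycles) / slots
        retiring = uops_retired_slots / slots
        backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
        return {"retiring": retiring, "frontend_bound": frontend_bound,
                "bad_speculation": bad_speculation, "backend_bound": backend_bound}

    # Hypothetical counter values for one measurement interval.
    for bucket, fraction in topdown_level1(
            cycles=1_000_000, uops_issued=2_600_000, uops_retired_slots=2_400_000,
            idq_uops_not_delivered=600_000, recovery_cycles=20_000).items():
        print(f"{bucket:>16}: {fraction:.1%}")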

Dhalion

  • This paper was actually quite dull
  • The interesting bit is the split: metrics -> symptoms -> (many-to-many) -> diagnoses -> (many-to-many) -> resolvers (see the sketch after this list)
    • This whole thing is called a “policy”
  • Control loop system, explodes in complexity
    • Part of the control loop is blacklisting actions that previously didn’t move you towards the desired solution, so they don’t keep occurring
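
A minimal sketch of what the metrics -> symptoms -> diagnoses -> resolvers policy loop could look like, including the blacklisting of actions that previously didn't help. Class and method names here are illustrative, not Dhalion's actual API.

    from dataclasses import dataclass

    @dataclass
    class Symptom:
        name: str
        details: dict

    @dataclass
    class Diagnosis:
        name: str
        symptoms: list

    class Policy:
        """One Dhalion-style policy: detectors, diagnosers, and resolvers in a loop."""

        def __init__(self, detectors, diagnosers, resolvers):
            self.detectors = detectors    # callables: metrics -> [Symptom]
            self.diagnosers = diagnosers  # callables: [Symptom] -> [Diagnosis]
            self.resolvers = resolvers    # diagnosis name -> action callable
            self.blacklist = set()        # (diagnosis name, action name) pairs that didn't help

        def step(self, metrics):
            symptoms = [s for det in self.detectors for s in det(metrics)]
            diagnoses = [dg for dia in self.diagnosers for dg in dia(symptoms)]
            for diagnosis in diagnoses:
                action = self.resolvers.get(diagnosis.name)
                if action and (diagnosis.name, action.__name__) not in self.blacklist:
                    action(diagnosis)  # apply one resolution per loop iteration
                    return diagnosis
            return None

        def report_no_improvement(self, diagnosis_name, action_name):
            # If a resolution didn't move the system towards the goal, don't retry it.
            self.blacklist.add((diagnosis_name, action_name))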