SWE Tea
- Weekly paper club/book club/video club
- Need to figure out timing (Tuesday nights?)
- Model it off of the eBPF reading groups and the distributed systems reading groups
Profiling a Warehouse-Scale Computer
- Kanev, Svilen, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. “Profiling a Warehouse-Scale Computer.” In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 158–69. Portland, Oregon: ACM, 2015. https://doi.org/10.1145/2749469.2750392.
- They only examined C++ binaries, since it made the analysis simpler
- Measurements come only from Ivy Bridge machines
- No “killer application” to optimize for; large chunks of compute are data-locality bound and CPU-stall bound, which suggests that 2-wide SMT is not sufficient to eliminate the bulk of the overheads
- What is 2-wide SMT anyways?
- My first guess was 2 instructions at once, but SMT here means two hardware threads sharing one core (Hyper-Threading style); how many instructions issue per cycle is a separate, superscalar property
- Workload diversity is very real; the range of compute is now wide enough that no single application matters
- At the start of the study, the 50 hottest binaries account for ~80% of execution
- Three years later, the top 50 cover only 60%
- Coverage drops by more than 5 percentage points per year over those 3 years
- This also doesn’t include public cloud workloads
- As applications grow more diverse and fatter, their individual profiles have flattened too
- What would this look like for chatd?
- The “datacenter tax” is very real: large chunks of your machines are devoted to logging, RPC, and serialization/deserialization
- Yacine: top-down measurement? Never heard of this before
- Yasin, Ahmad. “A Top-Down Method for Performance Analysis and Counters Architecture.” In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. https://doi.org/10.1109/ISPASS.2014.6844459.
- Talks about the core front end and core back end; what are those?
- Front end:
- instruction fetch
- decode unit
- branch prediction
- uop cache
- loop stream detector (?): replays decoded uops for tight loops so fetch/decode can idle
- Back end:
- sched/reservation station
- execution units
- reorder buffer
- register file
- load/store units
- Top-down classifies pipeline slots into retiring (useful work), frontend bound, backend bound, and bad speculation (sketch of the level-1 math after these notes)
- They believe instruction-cache problems (lots of lukewarm code) are why the frontend is the primary staller
- i.e., binaries hundreds of MB in size
- memcpy() and memmove() are 4-5% of datacenter cycles, as is encryption
- 25% of the datacenter tax is compressing and decompressing data
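Since the top-down method was new to some of us: below is a minimal sketch of Yasin’s level-1 classification, assuming the Ivy Bridge-era counter names the method is usually described with and a 4-wide issue core. The function and its structure are my own framing, not code from either paper.

```python
# Minimal sketch of top-down level-1 classification (Yasin 2014).
# Counter names below are the Intel events the method is usually
# described with; SLOT_WIDTH = 4 matches a 4-wide core like Ivy Bridge.

SLOT_WIDTH = 4  # issue slots per cycle

def topdown_level1(clocks, uops_issued, uops_retired,
                   fetch_bubbles, recovery_cycles):
    """Split pipeline slots into the four top-down buckets.

    clocks          -> CPU_CLK_UNHALTED.THREAD
    uops_issued     -> UOPS_ISSUED.ANY
    uops_retired    -> UOPS_RETIRED.RETIRE_SLOTS
    fetch_bubbles   -> IDQ_UOPS_NOT_DELIVERED.CORE
    recovery_cycles -> INT_MISC.RECOVERY_CYCLES
    """
    total_slots = SLOT_WIDTH * clocks
    retiring = uops_retired / total_slots         # useful work
    frontend_bound = fetch_bubbles / total_slots  # fetch/decode starved the core
    bad_speculation = (uops_issued - uops_retired
                       + SLOT_WIDTH * recovery_cycles) / total_slots
    backend_bound = 1.0 - retiring - frontend_bound - bad_speculation
    return {"retiring": retiring,
            "frontend_bound": frontend_bound,
            "bad_speculation": bad_speculation,
            "backend_bound": backend_bound}
```

I believe modern Linux perf can report these buckets directly on supported CPUs (`perf stat --topdown`), which saves wiring up the raw counters yourself.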
Dhalion
- This paper was actually quite dull
- Interesting bit is the split: metrics -> symptoms -> diagnoses -> resolvers, with many-to-many mappings between each stage (see the sketch after these notes)
- This whole thing is called a “policy”
- It’s a control-loop system, and the interactions explode in complexity
- Part of the control loop is blacklisting actions that previously didn’t move the system toward the desired state, so they aren’t retried
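To make that split concrete, here is a minimal sketch of one policy iteration, including the blacklist. Every class and method name is invented for illustration; the paper specifies the phases and their many-to-many wiring, not this API.

```python
# Hedged sketch of a Dhalion-style "policy": detectors map metrics to
# symptoms, diagnosers map symptoms to diagnoses (many-to-many), and
# resolvers map diagnoses to actions (many-to-many). Names are made up.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Symptom:
    name: str
    evidence: Dict[str, float]  # metric values that triggered it

@dataclass
class Diagnosis:
    name: str
    symptoms: List[Symptom]  # one diagnosis can explain several symptoms

@dataclass
class Action:
    name: str
    apply: Callable[[], None]

class Policy:
    """One control-loop iteration of a Dhalion-style policy."""

    def __init__(self, detectors, diagnosers, resolvers):
        self.detectors = detectors    # metrics -> [Symptom]
        self.diagnosers = diagnosers  # [Symptom] -> [Diagnosis]
        self.resolvers = resolvers    # [Diagnosis] -> [Action]
        self.blacklist = set()        # action names that failed to help before

    def run_once(self, metrics: Dict[str, float]) -> None:
        symptoms = [s for det in self.detectors for s in det(metrics)]
        diagnoses = [d for dia in self.diagnosers for d in dia(symptoms)]
        for res in self.resolvers:
            for action in res(diagnoses):
                if action.name in self.blacklist:
                    continue  # skip moves that previously made no progress
                action.apply()

    def mark_unhelpful(self, action_name: str) -> None:
        # Called after observing that a past action didn't improve things.
        self.blacklist.add(action_name)
```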