عجفت الغور

docker/containers

Tags: computers

Overview

  • Built on Linux namespaces and cgroups
    • Namespaces segregate the system by giving each container its own view of it: process IDs, network interfaces, etc.
    • Control groups (cgroups) divide the physical hardware resources between containers
    • Capabilities split the privileges of the root user into distinct units that can be managed individually
    • seccomp filters the system calls made by processes to restrict the kernel functionality they can reach
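
A minimal sketch of how two of these features look from inside a process, assuming a Linux host with /proc mounted. The helper names (decode_caps, effective_caps, namespace_ids) are made up for illustration, and the capability table lists only a subset of the bit positions from <linux/capability.h>.

```python
import os

# Bit positions from <linux/capability.h>; deliberately a subset.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask_hex):
    """Return the known capability names set in a hex mask such as CapEff."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(CAP_NAMES.items()) if (mask >> bit) & 1]


def effective_caps(status_path="/proc/self/status"):
    """Decode the CapEff line of the current process; [] if it cannot be read."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return decode_caps(line.split()[1])
    except OSError:
        pass
    return []


def namespace_ids(ns_dir="/proc/self/ns"):
    """Map namespace type to its identifier link; {} if /proc is unavailable."""
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    print(namespace_ids())
    print(effective_caps())
```

Two processes in the same namespace report the same identifier for that namespace type, so comparing the output on the host and inside a container shows which views are actually separated. A container started with a reduced capability set should show a shorter list from effective_caps than a root shell on the host.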

Runtimes

  • LXC - tools, templates, and library and language bindings that use the kernel containment features directly - https://linuxcontainers.org/
  • libcontainer - Go library created for Docker that uses the kernel containment features directly
  • runc - portable container runtime that implements the OCI spec; a CLI wrapper around libcontainer

High Levels

  • Docker - container engine; originally used LXC, then switched to libcontainer/runc
  • rkt - pod-native design that was built by CoreOS (no longer developed)
  • containerd - daemon that manages the complete container lifecycle, from image transfer to exec to supervision. Used by Kubernetes (k8s) and by Docker internally
  • CRI-O - lightweight container engine built specifically for Kubernetes; an alternative to running Docker or containerd as the runtime under kube

Debian

Opening Docker Images

Tupperware vs Borg vs Kube

  • Twine: A Unified Cluster Management System for Shared Infrastructure

  • Regional level (Borg is cluster level, but presents a virtual abstract regional level API)

    • Kubernetes only bothers presenting a “cluster” level of roughly 4-5k nodes
    • Regional level control also introduces significant risk. Rate-limiting destructive operations is a hard-learned lesson that people relearn many times over. Borg favors external sharding (each Borg group is at the cluster level), whereas Tupperware shards internally into 20 per region.
  • Most interesting part of the paper was that the compute infra is focused on single-CPU, 64 GB RAM boxes, with 4 per sled

  • Task control is interesting. Borg and Kube handle preemption via SIGINT or SIGTERM. Task control is much more customizable, but it can lead to people not handling unplanned maintenance. Since the expectation for Borg is that you get 5 minutes to vacate a machine, even unplanned maintenance events are fine for most applications. Tupperware does not set this expectation, which causes some issues with developer incentives. The hard problem (unplanned maintenance) should be solved first, since the simpler problem (handling planned maintenance) is a reduced form of the hard one

  • Host profiles are interesting, because they allow developers to have fine-grained control over their machines. Borg does not offer this, mostly because the foundational teams at Google provide fleet-wide improvements (e.g. hugepages). Leaving this up to the user is interesting, but it definitely presents stacking problems, since some of these settings must be enabled at the host level and the cartesian product of configs leads to maintenance headaches.

  • The rebalancer is interesting because it iteratively moves you towards a placement that fits your usage profile, rather than waiting for the solver first. Borg will wait for the solver before actually starting your task, whereas Tupperware first starts your job somewhere and then moves it to an appropriate place.

    • Borg keeps track of what your job historically looks like and incorporates that into the solver, rather than trying to look at it each time
  • Entitlement fragmentation seems like a huge problem, with noisy neighbors and an overloaded abstraction. It’s good they’re changing it.

  • Evaluation of TCO is a little sketchy? The 33% reduction is from shard manager and the 48% one is from offloading; are these real measured numbers?

  • Overall, like rocks, this is another FB project with quite a few knobs, which I think seems good in the moment but does not bode well for the future.
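
The SIGTERM-style preemption contract from the task control note can be sketched as below. This is an illustration under assumptions, not how Borg, Kubernetes, or Tupperware implement it: the GracefulWorker class and the handle and checkpoint callbacks are hypothetical names. The handler only records the request, and the main loop stops taking new work and hands the remainder to a checkpoint callback, which must finish inside whatever grace window the scheduler allows.

```python
import signal


class GracefulWorker:
    """Processes items until a stop is requested, then checkpoints the rest."""

    def __init__(self):
        self.stop_requested = False

    def on_signal(self, signum, frame):
        # Keep the handler trivial: just flip a flag for the main loop.
        self.stop_requested = True

    def install(self):
        # Treat SIGTERM and SIGINT the same way, as a request to vacate.
        signal.signal(signal.SIGTERM, self.on_signal)
        signal.signal(signal.SIGINT, self.on_signal)

    def run(self, items, handle, checkpoint):
        """Handle items in order; return whatever was left unprocessed.

        `checkpoint` is always called with the unprocessed remainder,
        which is an empty list when the run finished normally.
        """
        pending = list(items)
        while pending and not self.stop_requested:
            handle(pending.pop(0))
        checkpoint(pending)
        return pending


if __name__ == "__main__":
    worker = GracefulWorker()
    worker.install()
    leftover = worker.run(range(5), print, lambda rest: None)
    print("unprocessed:", leftover)
```

Because the same path runs whether the signal comes from planned or unplanned maintenance, an application written this way handles the hard case by default, which is the incentive argument made in the notes.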

Borg Autopilot

  • https://dl.acm.org/doi/pdf/10.1145/3342195.3387524
  • The combination of vertical and horizontal autoscaling is the most interesting part, since most systems only do horizontal autoscaling
  • The difficulty is a programmer problem: vertical autoscaling provides a good developer experience, since you don’t need to do anything different or rearchitect your thing
  • Horizontal demands that you become totally stateless, which limits how far you can usefully scale
  • Borg and other schedulers are then put into a strange domain where
  • Why not cluster oversubscription?
    • Cluster oversubscription operates on a faster timescale, this needs to reduce into a different timescale
    • Size limits are required to respond, not necessarily to be oversubscribed. Prod tasks also cannot be oversubscribed
    • However, better cluster limits means that you have less slack which means that you won’t have as much free batch tier
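
To make the vertical autoscaling idea concrete, here is a hedged sketch of a limit recommender: take recent usage samples, weight newer ones more heavily, pick a high percentile, and add headroom. It illustrates the general shape only and is not the algorithm from the Autopilot paper; the function names, the 95th percentile, the half-life, and the 15% margin are all assumed values chosen for the example.

```python
def decayed_percentile(samples, percentile, half_life):
    """Weighted percentile over (age, usage) pairs.

    A sample of age `a` gets weight 0.5 ** (a / half_life), so age 0 is
    the newest sample and older samples count for progressively less.
    """
    if not samples:
        raise ValueError("need at least one sample")
    weighted = sorted((usage, 0.5 ** (age / half_life)) for age, usage in samples)
    total = sum(weight for _, weight in weighted)
    threshold = total * percentile / 100.0
    running = 0.0
    for usage, weight in weighted:
        running += weight
        if running >= threshold:
            return usage
    # Guard against floating-point shortfall at percentile 100.
    return weighted[-1][0]


def recommend_limit(samples, percentile=95.0, half_life=12.0, margin=0.15):
    """Suggest a resource limit: decayed percentile plus a safety margin."""
    return decayed_percentile(samples, percentile, half_life) * (1.0 + margin)


if __name__ == "__main__":
    # (age in hours, CPU cores used); the old spike carries little weight.
    history = [(0, 1.2), (1, 1.1), (2, 1.3), (72, 6.0)]
    print(recommend_limit(history))
```

A tighter limit frees capacity for other jobs but leaves less slack, which is the same trade-off the last bullet raises about the batch tier.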