Containers/Mini-summit 2008 notes
Intros (8:36am)
Dave Hansen Eric Biederman Jason Byron, Red Hat Joe Ruscio, Evergrid Joe McDonald HP China Sonny Rao HP HP Matine Silberman HP Sandy Harris NEC Japan John Schultz, AOL Pavel Emelyanov, Parallels/OpenVZ Denis Lunev, Parallels/OpenVZ Andrey Mirkin, Parallels/OpenVZ Constant Chan Benjamin Thery, Bull Daniel Lezcano, IBM Serge Hallyn, IBM Oren Laadan, Columbia University
On Phone:
Amy Griffis, HP
Dhaval Giani, IBM
Peter Zijlstra
(Later walk-ins):
Paul Menage, Google
Namespaces and containers
Why do various companies want containers?
- IBM, Google: workload management
- EB: using containers as an improved chroot
- HP: wants similar to IBM, plus security
- Parallels: hosted providers
sysfs issues
EB gives status: should go into next merge window
mini-namespaces
- NFS clients should behave differently in different containers
  - currently uses a single sunrpc transport for all containers
- Dave: is there a list of all OpenVZ mini-ns?
- EB: proposal: create little filesystems
  - still store everything in nsproxy
  - currently: some people want the same process in different netns's
    - almost possible now, but can't open new sockets
  - namespace enter: 3 purposes
    - login
    - monitoring
    - configuring
  - may be worth prototyping the proposal
    - address mqns, or sunrpc, or fuse?
- DH: OpenVZ addresses this using one big clone(), right? (yes)
userid namespaces
- EB summarizes his proposal
  - userid ns is unshare-able without privilege
  - userids, capabilities, security labels become ns-local
  - hierarchical like pidns
- OpenVZ: just does chroot
- DH: observes that system vs. app containers have different requirements
- EB: so with userid namespaces, the user has god-like powers over created namespaces
- EB+SH will talk about hacking something this week during OLS
- Uses:
  - user untrusted mounts
  - build systems
device namespaces
- tty namespaces rejected
  - should be solved with generic device namespaces
- virtualize the major:minor->device mapping
- reserved device numbers (unnamed)
  - created with /proc?
  - get_unnamed_device()
- tty ideas:
  - use selinux ptys
  - use user namespaces
  - use legacy ptys
  - leverage ptyfs
- Suka is not on, so he gets volunteered to do the pure /dev/pts fs approach
per-container LSMs:
- SH: thinks LSMs should handle it
- EB: original purpose of chroot
- set up policies from inside container
- creating a smack container inside selinux would be ideal
entering a container
- netns: identified using pid of a ns
- SH: can we solve this using EB's namespace filesystems proposal?
  (EB goes to the board to demonstrate his proposal)
- PM: Can we use control groups?
- PE: Can we re-use /proc/pid/ ?
- EB: could have a ns with no processes in it
- Example of command using this:
    ip link set eth0 netns <pid>
  becomes
    ip link set eth0 netns /proc/<pid>/
- DL: a real netns problem is knowing when a child ns has died
  - the netnsfs mount could solve that
- PE: EB, can you send POC patches for the namespace?
- EB and EM will both send their own POC.
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns
EB: example, setuid root sysvipc-using program could be fooled
PE: Entering a container:
- reasons:
  - monitoring
  - enter an administrative command
- DH: how do you do it now?
- PE: numerical ID for each VE, use it to enter
- EB: one need for entering: /sbin/hotplug
- (someone): does hijack suffice?
- EB: two cases:
  - partial entering
  - full entering
  - sys_hijack does not address partial entering
- DH: why need partial entering? fs stuff can be done without entering
- PM: privileged process
- PE: will look at hijack patches
- someone will re-send hijack to containers@
- EB: if we can do sys_hijack cleanly, we can use it to solve the kthread problem
Control Groups and Resource Management
Checkpoint/Restart [CR]
Uses of CR
- migration and live migration: e.g. for load balancing, maintenance, clusters and SSIs, etc. may or may not assume a shared file system between endpoints
- suspend/resume (aka hibernation): e.g. for hibernation, gang-scheduling and priority running, OS maintenance
- failure recovery / fault tolerance: periodic checkpoints, and restart from most recent (unlike the previous scenarios, here the applications continue to execute after the checkpoint, perhaps modify the file system)
- time-travel: periodic checkpoints and restart from any previous checkpoint (here, too, attention is required to capturing the state of the file system as well)
- [PE] fast-launch: reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
- [EB] remote fork: e.g. in a cluster
(the last two scenarios are likely to require adjustments during, or after, the restart to tolerate changes in the file system or otherwise in the environment)
- [EB,OL] distributed checkpoint: the ability to checkpoint and restart a distributed application across multiple nodes as a whole.
EB reminded the group that at the last kernel summit nobody objected to adding CR capabilities to the kernel. The open issues were, and remain, the technical choices.
General design
- Kernel-space vs user-space
OL: the question of kernel-space vs. user-space is pivotal to the design. Kernel support is mandatory to provide completeness and transparency. Even the recent experience with "cryo" demonstrated that a user-space approach requires the kernel to expose a very fine-grained API.
Everyone agreed to aim at a monolithic interface, such that nearly all of the CR will be done in the kernel. The kernel will return (checkpoint) or receive (restart) a blob with the image of the state of the container.
- Kernel-module ?
OL: can we implement mostly in a kernel module and then move CR into the kernel later ?
EB: better to add CR functionality gradually directly to the kernel.
- Compatibility between kernels
DLu: there is an issue with compatibility between kernels - even same kernel compiled with different options and/or compiler, and also if the kernel ABI changes.
OL: suggests using an intermediate representation for the checkpoint image to avoid the issue as much as possible; conversion, if needed, will be done by userland tools. There is no aim to bridge ABI changes in the case of migration: instead, fail the restart.
EB: format the blob so that userland tools can parse it and easily detect a version/configuration mismatch.
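One way to picture EB's suggestion is a fixed-size header at the front of the blob that a userland tool can read without understanding the rest of the image. The sketch below is only an illustration: the magic value, the field layout, and the names `pack_header`, `check_header` and `config_digest` are all assumptions, not a format anyone proposed at the meeting.

```python
import hashlib
import struct

# Hypothetical layout: magic, format version, kernel release, config digest.
MAGIC = b"CKPT"
HEADER = struct.Struct("<4sH64s32s")  # little-endian, no padding


def config_digest(config_text):
    """Reduce a kernel configuration to a 32-byte digest for comparison."""
    return hashlib.sha256(config_text.encode()).digest()


def pack_header(version, release, digest):
    """Build the header; struct pads the release string with NUL bytes."""
    return HEADER.pack(MAGIC, version, release.encode(), digest)


def check_header(blob, version, release, digest):
    """Return a list of mismatch descriptions (empty means compatible)."""
    if len(blob) < HEADER.size:
        return ["image truncated: header incomplete"]
    magic, img_version, img_release, img_digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return ["not a checkpoint image"]
    problems = []
    if img_version != version:
        problems.append(f"format version {img_version}, expected {version}")
    img_release = img_release.rstrip(b"\0").decode(errors="replace")
    if img_release != release:
        problems.append(f"kernel release {img_release}, expected {release}")
    if img_digest != digest:
        problems.append("kernel configuration differs")
    return problems
```

A restart tool built this way could refuse an incompatible image up front and report why, which also fits OL's "fail the restart" position.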
- Streaming checkpoint image ?
DLu: using a sequential (non-seekable) file, such as a socket, for the checkpoint image is a challenge.
OL: with proper planning it is not complicated to achieve, and it has the advantage that the image can be passed through a filter, e.g. for compression, encryption, format conversion etc.
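The point about streaming can be illustrated with a small userspace sketch: if the image is a sequence of length-prefixed records, a writer never needs to seek back, and the byte stream can pass through a filter such as compression. The record framing and the function names here are assumptions made for the example, not a proposed image format.

```python
import io
import struct
import zlib


def write_records(records, sink):
    """Write length-prefixed records in order through a compression filter."""
    comp = zlib.compressobj()
    for payload in records:
        sink.write(comp.compress(struct.pack("<I", len(payload)) + payload))
    sink.write(comp.flush())


def _drain(buf):
    """Split complete records off the front of buf; return (records, rest)."""
    out = []
    while len(buf) >= 4:
        (length,) = struct.unpack_from("<I", buf)
        if len(buf) < 4 + length:
            break  # record not fully received yet
        out.append(buf[4:4 + length])
        buf = buf[4 + length:]
    return out, buf


def read_records(source, chunk_size=4096):
    """Yield records from a forward-only stream; only read() is ever called."""
    decomp = zlib.decompressobj()
    buf = b""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        records, buf = _drain(buf + decomp.decompress(chunk))
        yield from records
    records, buf = _drain(buf + decomp.flush())
    yield from records
    if buf:
        raise ValueError("truncated checkpoint stream")
```

Because both sides only call `write()` and `read()`, the same code would work over a pipe or socket; `io.BytesIO` is enough to try it out.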
- Checkpoint operation
The procedure will entail five steps:
- Pre-dump
- Freeze the container
- Dump
- Thaw/Kill the container
- Post-dump
"pre-dump" runs before the container is frozen, e.g. the pre-copy phase of live migration, to minimize application downtime.
"post-dump" runs after the container resumes execution, e.g. in the case of a checkpoint (not migration), writing the data back to secondary storage, again to minimize application downtime.
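The ordering of the five steps can be sketched as a small driver. Everything here is a stand-in: the `Container` class only records what was done to it, and the `checkpoint` signature is hypothetical. Thawing the container when the dump fails is an assumption made for the example.

```python
class Container:
    """Stand-in for a real container; it only records what happens to it."""

    def __init__(self):
        self.frozen = False
        self.alive = True
        self.log = []

    def freeze(self):
        self.frozen = True
        self.log.append("freeze")

    def thaw(self):
        self.frozen = False
        self.log.append("thaw")

    def kill(self):
        self.alive = False
        self.log.append("kill")


def checkpoint(container, dump, pre_dump=None, post_dump=None, kill=False):
    """Run the five steps in order and return the image produced by dump()."""
    container.log.append("pre-dump")
    if pre_dump:
        pre_dump(container)          # container is still running here
    container.freeze()
    try:
        container.log.append("dump")
        image = dump(container)      # state is stable while frozen
    except Exception:
        container.thaw()             # assumption: a failed dump resumes the container
        raise
    if kill:
        container.kill()             # e.g. migration: the source copy goes away
    else:
        container.thaw()             # e.g. plain checkpoint: keep running
    container.log.append("post-dump")
    if post_dump:
        post_dump(container, image)  # runs outside the frozen window
    return image
```

Only the dump itself sits between freeze and thaw, which is why moving work into the pre-dump and post-dump hooks shortens application downtime.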
OL: we should be able to checkpoint from inside the container, keep that in mind for later (also relates to the freezer).
- Restart operation
Restart is done by first creating a container, then creating the process tree in it, and then having each process restore its own state. This allows existing kernel code to be re-used (e.g., restoring a memory region is a simple matter of calling mmap() and populating it).
OL: suggests that the process tree be created in userspace.
DLu: prefers to do everything, including process creation, in the kernel; his experience shows that it isn't difficult.
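OL's userspace variant can be sketched with plain fork(): walk the saved tree, fork one child per saved child, and let every process restore its own state. The tree layout (dicts with `name` and `children`) and the `restore_state` callback are assumptions made for the example. Unlike a real restart, the processes here exit once restored, so that the parent can count them.

```python
import os


def restore_tree(node, restore_state):
    """Recreate the saved process tree below the calling process.

    Returns the number of processes restored in this subtree. Unix only.
    """
    pids = []
    for child in node["children"]:
        pid = os.fork()
        if pid == 0:
            # In the child: rebuild its own subtree, then report and exit.
            count = 0
            try:
                count = restore_tree(child, restore_state)
            finally:
                os._exit(count)  # never fall back into the parent's code
        pids.append(pid)

    restore_state(node)  # each process restores its own state

    restored = 1
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        restored += os.WEXITSTATUS(status)  # child's exit code is its count
    return restored


if __name__ == "__main__":
    saved = {"name": "init", "children": [
        {"name": "shell", "children": [{"name": "worker", "children": []}]},
        {"name": "logger", "children": []},
    ]}
    print(restore_tree(saved, lambda node: None))
```

Passing the count through the exit status limits the sketch to small trees; it is only there to make the result observable.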
- Error recovery
Should checkpoint fail, the container should continue execution without noticing it. If either checkpoint or restart fails, there should be a way to inform the caller/user of the reason (something more informative than -EBUSY).
Road plan
At this point we want to create a proof of concept and CR a simple application. We will then iteratively add support for more and more kernel resources.
The first items to address:
- Create a container object (the context on which CR operates)
- Extend the container freezer (cgroup ?)
- Interface via syscall or ioctl ?
First step - a simple application: a single process, not using any files, no pending signals, no IPC etc. Need to save state (registers, IDs), memory maps and contents (except for read-only portions, e.g. text). Assume that the file system state doesn't change between checkpoint and restart.
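To make the "except for read-only portions" rule concrete, here is a userspace illustration that picks the regions whose contents would need saving from text in the /proc/<pid>/maps format. The function name and the sample addresses are made up; an in-kernel implementation would walk the process's memory regions directly rather than parse text.

```python
def regions_to_save(maps_text):
    """Return (start, end, perms, path) for every writable mapping."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # not a maps line
        start, end = (int(part, 16) for part in fields[0].split("-"))
        perms = fields[1]
        path = fields[5] if len(fields) > 5 else ""
        if "w" not in perms:
            continue  # read-only (e.g. text): assumed recoverable from the file
        regions.append((start, end, perms, path))
    return regions


# Made-up sample in the /proc/<pid>/maps layout.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
0060a000-0060b000 r--p 0000a000 08:01 1234 /bin/cat
0060b000-0060c000 rw-p 0000b000 08:01 1234 /bin/cat
01a2c000-01a4d000 rw-p 00000000 00:00 0 [heap]
7ffd3c1e0000-7ffd3c201000 rw-p 00000000 00:00 0 [stack]
"""

for start, end, perms, path in regions_to_save(SAMPLE):
    print(f"{start:#x}-{end:#x} {perms} {path}")
```

Filtering on the current permission bits is a simplification: a region that was written and later made read-only would be skipped here. It also relies on the stated assumption that the file system does not change between checkpoint and restart.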
Next steps:
- process hierarchy and relationships (multiple tasks and zombies)
- multiple threads (and shared memory)
- open files: regular file, fifo, pipe, socket-pair
- signals, timers
- TBD
Documentation
DH: proof of concept requires explicit documentation of what can be checkpointed and what cannot be checkpointed, as well as what will be the error returned in response to a failure.