Containers/Mini-summit 2008 notes
Latest revision as of 19:33, 27 July 2008
Intros (8:36am)
       Dave Hansen
       Eric Biederman
       Jason Byron, Red Hat
       Joe Ruscio, Evergrid
       Joe McDonald
       HP China
       Sonny Rao
       HP
       HP
       Matine Silberman HP
       Sandy Harris
       NEC Japan
       John Schultz, AOL
       Pavel Emelyanov, Parallels/OpenVZ
       Denis Lunev, Parallels/OpenVZ
       Andrey Mirkin, Parallels/OpenVZ
       Constant Chan
       Benjamin Thery, Bull
       Daniel Lezcano, IBM
       Serge Hallyn, IBM
       Oren Laadan, Columbia University
On Phone:
       Amy Griffis, HP
       Dhaval Giani, IBM
       Peter Zijlstra
(Later walk-ins):
Paul Menage, Google
Namespaces and containers
Why do various companies want containers?
       IBM, Google: workload management
       EB: using containers as improved chroot
       HP: wants similar to ibm, plus security
       parallels: hosted providers
sysfs issues
EB gives status: should go into next merge window
mini-namespaces
       NFS
               clients should behave differently on diff. containers
               currently uses single sunrpc transport for all containers
       Dave: is there a list of all openvz mini-ns?
       EB:
               proposal:
                       create little filesystems
                       still store everything in nsproxy
               currently:
                       some people want same process in different netns's
                       almost possible now, but can't open new sockets
               namespace enter:
                       3 purposes
                               login
                               monitoring
                               configuring
               may be worth prototyping the proposal
                       address mqns, or sunrpc, or fuse?
       DH:
               openvz addresses this using one big clone(), right?
               (yes)
userid namespaces
       EB summarizes his proposal
               userid ns is unsharable without privilege
               userids, capabilities, security labels become ns-local
               hierarchical like pidns
       openvz: just does chroot
       DH:
               observes that system vs. app containers have different requirements
       EB:
               so with userid namespaces, user has god-like powers over created namespaces
       EB+SH will talk about hacking something this week during ols
       Uses:
               untrusted user mounts
               build systems
device namespaces
       tty namespaces rejected
       should be solved with generic device namespaces
               virtualize the major:minor->device mapping
       reserved device numbers (unnamed)
               created with /proc?
               get_unnamed_device()
       tty ideas:
               use selinux ptys
               use user namespaces
               use legacy ptys
               leverage ptyfs
       Suka is not on, so he gets volunteered to do pure /dev/pts fs approach
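The generic device namespace idea above (virtualizing the major:minor -> device mapping) can be pictured with a toy model. This is only a sketch: the class and method names are hypothetical, and it says nothing about how the kernel would actually implement the lookup.

```python
class DeviceNamespace:
    """Toy model: each namespace owns its own (major, minor) -> device table."""

    def __init__(self, name):
        self.name = name
        self._table = {}  # hypothetical per-namespace mapping

    def register(self, major, minor, device):
        self._table[(major, minor)] = device

    def lookup(self, major, minor):
        # Returns None when the number is not mapped in this namespace.
        return self._table.get((major, minor))


host = DeviceNamespace("host")
guest = DeviceNamespace("guest")
host.register(136, 0, "host pty 0")
guest.register(136, 0, "guest pty 0")
```

The same device number resolves to a different device depending on the namespace doing the lookup, which is the property a per-container pty would need.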
per-container LSMs:
       SH: thinks LSMs should handle it
       EB:
               original purpose of chroot
               set up policies from inside container
               creating smack container inside selinux would be ideal
entering a container
       netns: identified using pid of a ns
       SH: can we solve this using EB's namespace filesystems proposal?
       (EB goes to the board to demonstrate his proposal)
       PM: Can we use control groups?
       PE: Can we re-use /proc/pid/ ?
       EB: could have a ns with no processes in it
       Example of command using this:
               ip set eth0 netns <pid>
               becomes
               ip set eth0 netns /proc/<pid>/
       DL:
               a real netns problem is knowing when a childns has died
               the netnsfs mount could solve that
       PE: EB, can you send POC patches for the namespace?
               EB and EM will both send their own POC.
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns
EB: example, setuid root sysvipc-using program could be fooled
PE: Entering a container:
       reasons:
               monitoring
               enter an administrative command
       DH: how do you do it now?
       PE: numerical ID for each VE, use it to enter
       EB:
               one need for entering: /sbin/hotplug
       (someone): does hijack suffice?
       EB: two cases:
               partial entering
               full entering
               sys_hijack does not address partial entering
       DH:
               why need partial entering?
               fs stuff can be done without entering
       PM: privileged process
       PE:
               will look at hijack patches
               someone will re-send hijack to containers@
               EB:
                       if we can do sys_hijack cleanly,
                       we can use it to solve kthread problem
Control Groups and Resource Management
Checkpoint/Restart [CR]
Uses of CR
- migration and live migration: e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between endpoints
- suspend/resume (aka hibernation): e.g. for hibernation, gang-scheduling and priority running, OS maintenance
- failure recovery / fault tolerance: periodic checkpoints, and restart from most recent (unlike the previous scenarios, here the applications continue to execute after the checkpoint, perhaps modify the file system)
- time-travel: periodic checkpoints and restart from any previous checkpoint (here, too, attention is required to capturing the state of the file system as well)
- [PE] fast-launch: reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
- [EB] remote fork: e.g. in a cluster
(the last two scenarios are likely to require adjustments during, or after, the restart to tolerate changes in the file system or otherwise in the environment)
- [EB,OL] distributed checkpoint: the ability to checkpoint and restart a distributed application across multiple nodes as a whole.
EB reminded that at the last kernel summit nobody complained about the wish to add CR capabilities to the kernel. The issue was and remains related to technical choices.
General design
- Kernel-space vs user-space
OL: the issue of kernel-space vs. user-space is pivotal to the design. Kernel support is mandatory to provide completeness and transparency. Even the recent experience with "cryo" demonstrated that a user-space approach requires the kernel to expose a very fine-grained API.
Everyone (except DaveHansen) agreed to aim at a monolithic interface, such that nearly all of the CR will be done in the kernel. The kernel will return (checkpoint) or receive (restart) a blob with the image of the state of the container.
- Kernel-module ?
OL: can we implement mostly in a kernel module and then move CR into the kernel later ?
EB: better to add CR functionality gradually directly to the kernel.
- Compatibility between kernels
DLu: there is an issue with compatibility between kernels, even for the same kernel compiled with different options and/or a different compiler, and also if the kernel ABI changes.
OL: suggests using an intermediate representation for the checkpoint image to avoid the issue as much as possible; conversion, if needed, will take place with userland tools. No aim to bridge ABI changes in case of migration: instead, fail the restart.
EB: format the blob such that userland tools can parse it and easily detect a version/configuration mismatch.
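EB's point about a parseable blob can be illustrated with a small header check. This is a minimal sketch under assumed names: the magic value, the field layout and the idea of a "configuration hash" are all hypothetical and were not specified at the meeting.

```python
import struct

# Hypothetical header: magic, image format version, kernel config hash.
MAGIC = b"CKPT"
HEADER = struct.Struct("<4sHI")


def pack_header(version, config_hash):
    return HEADER.pack(MAGIC, version, config_hash)


def check_header(blob, version, config_hash):
    """Return "ok", or a readable reason why the image cannot be restarted."""
    if len(blob) < HEADER.size:
        return "truncated image"
    magic, img_version, img_config = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return "not a checkpoint image"
    if img_version != version:
        return f"format version mismatch: image {img_version}, kernel {version}"
    if img_config != config_hash:
        return "kernel configuration mismatch"
    return "ok"
```

A userland tool could run such a check before attempting a restart and fail early with a specific reason instead of handing a mismatched image to the kernel.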
- Streaming checkpoint image ?
DLu: using a sequential (non-seekable) file, such as a socket, for the checkpoint image is a challenge.
OL: with proper planning it is not complicated to achieve, and it has the advantage that the image can be passed through a filter, e.g. for compression, encryption, format conversion etc.
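The streaming argument can be sketched as follows: records are written strictly in order, each prefixed with its length, so the writer never needs to seek back, and the stream can pass through a filter (zlib compression here). The record framing and the `NonSeekableSink` class are assumptions made for illustration only.

```python
import zlib


class NonSeekableSink:
    """Stand-in for a socket or pipe: it only accepts sequential writes."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def stream_checkpoint(records, sink):
    """Write length-prefixed records through a compression filter."""
    comp = zlib.compressobj()
    for rec in records:
        framed = len(rec).to_bytes(4, "little") + rec
        sink.write(comp.compress(framed))
    sink.write(comp.flush())


def read_checkpoint(data):
    """Undo the filter and split the stream back into records."""
    raw = zlib.decompress(data)
    records, pos = [], 0
    while pos < len(raw):
        size = int.from_bytes(raw[pos:pos + 4], "little")
        records.append(raw[pos + 4:pos + 4 + size])
        pos += 4 + size
    return records
```

Because every record carries its own length up front, the reader can also consume the image sequentially, which is what makes a socket usable as the transport.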
- Checkpoint operation
The procedure will entail five steps:
- Pre-dump
- Freeze the container
- Dump
- Thaw/Kill the container
- Post-dump
"pre-dump" works before freezing the container, e.g. the pre-copy phase of live migration, to minimize application downtime.
"post-dump" works after the container resumes execution, e.g. in the case of a checkpoint (not migration), writing the data back to secondary storage, again to minimize application downtime.
OL: we should be able to checkpoint from inside the container, keep that in mind for later (also relates to the freezer).
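The five steps above can be sketched as a small driver. The container object and its method names are hypothetical; `FakeContainer` only records the order of calls so the sequence can be inspected. The failure path reflects the notes' requirement that a failed checkpoint leaves the container running.

```python
class FakeContainer:
    """Hypothetical container that just records which steps ran."""

    def __init__(self, fail_dump=False):
        self.calls = []
        self.fail_dump = fail_dump

    def pre_dump(self):
        self.calls.append("pre_dump")

    def freeze(self):
        self.calls.append("freeze")

    def dump(self):
        self.calls.append("dump")
        if self.fail_dump:
            raise RuntimeError("dump failed")
        return b"image"

    def thaw(self):
        self.calls.append("thaw")

    def kill(self):
        self.calls.append("kill")

    def post_dump(self, image):
        self.calls.append("post_dump")


def checkpoint(container, kill_after=False):
    container.pre_dump()          # e.g. pre-copy while still running
    container.freeze()
    try:
        image = container.dump()
    except Exception:
        container.thaw()          # failed checkpoint: keep running
        raise
    if kill_after:
        container.kill()          # e.g. the source side of a migration
    else:
        container.thaw()
    container.post_dump(image)    # e.g. write-back to secondary storage
    return image
```

Only the freeze/dump/thaw window contributes to downtime in this ordering; the pre-dump and post-dump work happens while the application runs.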
- Restart operation
Restart is done by first creating a container, then creating the process tree in it, and then each process restores its own state. This allows re-use of existing kernel code (e.g., restoring a memory region is a simple matter of calling mmap() and populating it).
OL: suggests that the process tree be created in userspace.
DLu: prefers to do everything, including process creation, in the kernel; his experience shows that it isn't difficult.
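Whichever side creates the process tree, parents have to exist before their children. A minimal sketch of deriving a creation order from saved (pid, parent pid) pairs follows; the input format is an assumption, not something agreed at the meeting.

```python
from collections import defaultdict, deque


def creation_order(procs, root):
    """procs maps saved pid -> saved parent pid.

    Returns the pids in breadth-first order, so every parent appears
    before its children.
    """
    children = defaultdict(list)
    for pid, ppid in procs.items():
        if pid != root:
            children[ppid].append(pid)
    order, queue = [], deque([root])
    while queue:
        pid = queue.popleft()
        order.append(pid)
        queue.extend(sorted(children[pid]))
    return order
```

In a userspace restart each entry would correspond to a fork by the already-created parent, after which the new process restores its own state.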
- Error recovery
Should checkpoint fail, the container should continue execution without noticing it. If either checkpoint or restart fails, there should be a way to inform the caller/user of the reason (something more informative than -EBUSY).
Road plan
At this point we want to create a proof of concept and CR a simple application. We will iteratively add more and more kernel resources.
The first items to address:
- Create a container object (the context on which CR operates)
- Extend the container freezer cgroup ?
- Interface via syscall or ioctl ?
First step - a simple application: a single process, not using any files, no pending signals, no IPC etc. Need to save state (registers, IDs), memory maps and contents (except for read-only portions, e.g. text). Assume that the file system state doesn't change between checkpoint and restart.
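As an illustration of "memory maps and contents (except for read-only portions)", the sketch below picks the regions whose contents would need saving from text in the /proc/<pid>/maps format. Using the write permission bit as the criterion is a simplification assumed here; the sample lines are made up.

```python
def regions_to_save(maps_text):
    """Return (start, end, perms) for every writable mapping."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        perms = fields[1]
        if "w" in perms:  # skip read-only portions such as text
            regions.append((start, end, perms))
    return regions


# Made-up sample in the /proc/<pid>/maps layout.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
0060a000-0060b000 rw-p 0000a000 08:01 1234 /bin/cat
7ffd1000-7ffd2000 rw-p 00000000 00:00 0 [stack]
"""
```

Read-only file-backed regions can be re-mapped from the file at restart, which is why the notes assume the file system state does not change between checkpoint and restart.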
Next steps:
- process hierarchy and relationships (multiple tasks and zombies)
- multiple threads (and shared memory)
- open files: regular file, fifo, pipe, socket-pair
- signals, timers
- TBD
Documentation
DH: proof of concept requires explicit documentation of what can be checkpointed and what cannot be checkpointed, as well as what will be the error returned in response to a failure.
