Latest revision |
Your text |
Line 1: |
Line 1: |
− | [[Category: Containers]]
| |
− |
| |
| Intros (8:36am) | | Intros (8:36am) |
| | | |
| Dave Hansen | | Dave Hansen |
| Eric Biederman | | Eric Biederman |
− | Jason Byron, Red Hat | + | Jason Byron redhat |
− | Joe Ruscio, Evergrid | + | Joe Rusio - evergreen |
| Joe McDonald | | Joe McDonald |
| HP China | | HP China |
Line 15: |
Line 13: |
| Sandy Harris | | Sandy Harris |
| NEC Japan | | NEC Japan |
− | John Schultz, AOL | + | John Schultz aol |
| Pavel Emelyanov, Parallels/OpenVZ | | Pavel Emelyanov, Parallels/OpenVZ |
| Denis Lunev, Parallels/OpenVZ | | Denis Lunev, Parallels/OpenVZ |
− | Andrey Mirkin, Parallels/OpenVZ | + | (?) |
− | Constant Chan
| + | Benjamin |
− | Benjamin Thery, Bull | + | Daniel |
− | Daniel Lezcano, IBM | + | Serge |
− | Serge Hallyn, IBM | |
− | Oren Laadan, Columbia University
| |
| | | |
| On Phone: | | On Phone: |
− | Amy Griffis, HP | + | Amy Griffith HP |
− | Dhaval Giani, IBM
| |
− | Peter Zijlstra
| |
| | | |
− | (Later walk-ins): | + | (Later walk-ins) |
− | Paul Menage, Google
| |
| | | |
− | == Namespaces and containers ==
| + | Topics: |
| | | |
| Why do various companies want containers? | | Why do various companies want containers? |
− | IBM, Google: workload management | + | ibm: workload management |
| EB: using containers as improved chroot | | EB: using containers as improved chroot |
| HP: wants similar to ibm, plus security | | HP: wants similar to ibm, plus security |
Line 146: |
Line 139: |
| if we can do sys_hijack cleanly, | | if we can do sys_hijack cleanly, |
| we can use it to solve kthread problem | | we can use it to solve kthread problem |
− |
| |
− | == Control Groups and Resource Management ==
| |
− |
| |
− | == Checkpoint/Restart [CR] ==
| |
− |
| |
− | === Uses of CR ===
| |
− |
| |
− | * '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between endpoints
| |
− |
| |
− | * '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance
| |
− |
| |
− | * '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint, and may modify the file system)
| |
− |
| |
− | * '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention must be paid to capturing the state of the file system as well)
| |
− |
| |
− | * [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
| |
− |
| |
− | * [EB] '''remote fork:''' e.g. in a cluster
| |
− |
| |
− | (the last two scenarios are likely to require adjustments during,
| |
− | or after, the restart to tolerate changes in the file system or
| |
− | otherwise in the environment)
| |
− |
| |
− | * [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application
| |
− | across multiple nodes as a whole.
| |
− |
| |
− | EB reminded everyone that at the last kernel summit nobody complained about the
| |
− | wish to add CR capabilities to the kernel. The issue was and remains
| |
− | related to technical choices.
| |
− |
| |
− | === General design ===
| |
− |
| |
− | * '''Kernel-space vs user-space'''
| |
− |
| |
− | OL: the issue of kernel-space vs. user-space is pivotal to design.
| |
− | kernel support is mandatory to provide completeness and transparency.
| |
− | Even the recent experience with "cryo" demonstrated that user-space
| |
− | requires the kernel to expose a very fine-grained API.
| |
− |
| |
− | Everyone (except DaveHansen) agreed to aim at a monolithic interface,
| |
− | such that nearly all of the CR will be done in the kernel. The kernel
| |
− | will return (checkpoint) or receive (restart) a blob with the image
| |
− | of the state of the container.
| |
− |
| |
− | * '''Kernel-module ?'''
| |
− |
| |
− | OL: can we implement mostly in a kernel module and then move CR into
| |
− | the kernel later ?
| |
− |
| |
− | EB: better to add CR functionality gradually directly to the kernel.
| |
− |
| |
− | * '''Compatibility between kernels'''
| |
− |
| |
− | DLu: there is an issue with compatibility between kernels - even the same
| |
− | kernel compiled with different options and/or compiler, and also if
| |
− | the kernel ABI changes.
| |
− |
| |
− | OL: suggests using an intermediate representation for the checkpoint
| |
− | image to avoid the issue as much as possible; conversion, if needed,
| |
− | will take place with userland tools. No aim to bridge ABI changes in
| |
− | case of migration: instead, fail the restart.
| |
− |
| |
− | EB: format the blob such that userland tools will be able to
| |
− | parse it and easily detect a version/configuration mismatch.
| |
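A minimal sketch of EB's point about detecting a version/configuration mismatch from userland. The header layout (magic, format version, CRC32 of a configuration string) and all names here are assumptions made for illustration; the meeting did not define an image format.

```python
import struct
import zlib

# Hypothetical header: 4-byte magic, 16-bit format version, CRC32 of the
# kernel configuration string. None of this was specified at the meeting.
MAGIC = b"CKPT"
HEADER = struct.Struct("<4sHI")
FORMAT_VERSION = 1


def make_header(kernel_config: str) -> bytes:
    """Build the header that would precede the checkpoint blob."""
    return HEADER.pack(MAGIC, FORMAT_VERSION, zlib.crc32(kernel_config.encode()))


def check_header(blob: bytes, kernel_config: str) -> None:
    """Raise ValueError with a specific reason if the image cannot be used."""
    if len(blob) < HEADER.size:
        raise ValueError("image too short to contain a header")
    magic, version, config_crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a checkpoint image")
    if version != FORMAT_VERSION:
        raise ValueError(f"image format version {version}, expected {FORMAT_VERSION}")
    if config_crc != zlib.crc32(kernel_config.encode()):
        raise ValueError("kernel configuration mismatch")
```

In line with OL's remark, a mismatch here simply fails the restart rather than attempting a conversion.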
− |
| |
− | * '''Streaming checkpoint image ?'''
| |
− |
| |
− | DLu: using a sequential (non-seekable) file, like a socket, for the
| |
− | checkpoint image is a challenge.
| |
− |
| |
− | OL: with proper planning it is not complicated to achieve, and it has
| |
− | the advantage that the image can pass through a filter, e.g. for compression,
| |
− | encryption, format conversion etc.
| |
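To illustrate OL's argument, here is a hedged sketch of an image written and read strictly sequentially, with a compression filter in the path. The length-prefixed record framing is an assumption for the example, not a format agreed at the meeting; the point is only that neither side ever needs to seek.

```python
import struct
import zlib
from typing import BinaryIO, Iterable, Iterator

LEN = struct.Struct("<I")  # hypothetical 4-byte length prefix per record


def write_image(out: BinaryIO, records: Iterable[bytes]) -> None:
    """Write records in order through a compression filter; never seeks."""
    comp = zlib.compressobj()
    for rec in records:
        out.write(comp.compress(LEN.pack(len(rec)) + rec))
    out.write(comp.flush())


def read_image(src: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield records using forward reads only, so a socket or pipe works."""
    decomp = zlib.decompressobj()
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        buf += decomp.decompress(chunk) if chunk else decomp.flush()
        # Hand out every record that is completely buffered.
        while len(buf) >= LEN.size:
            (n,) = LEN.unpack_from(buf)
            if len(buf) < LEN.size + n:
                break
            yield buf[LEN.size:LEN.size + n]
            buf = buf[LEN.size + n:]
        if not chunk:
            break
    if buf:
        raise ValueError("truncated checkpoint image")
```

Because each record carries its own length up front, the reader can process the image as it arrives, which is what a migration over a socket would need.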
− |
| |
− | * '''Checkpoint operation'''
| |
− |
| |
− | The procedure will entail five steps:
| |
− | # Pre-dump
| |
− | # Freeze the container
| |
− | # Dump
| |
− | # Thaw/Kill the container
| |
− | # Post-dump
| |
− |
| |
− | "pre-dump" works before freezing the container, e.g. the pre-copy for
| |
− | live migration, which minimizes application downtime.
| |
− |
| |
− | "post-dump" works after the container resumes execution, e.g. in the
| |
− | case of a checkpoint (not migration), writing the data back to secondary
| |
− | storage, again to minimize application downtime.
| |
− |
| |
− | OL: we should be able to checkpoint from inside the container, keep
| |
− | that in mind for later (also relates to the freezer).
| |
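The five-step procedure can be sketched as a small driver. The `Container` class and the callback names below are hypothetical stand-ins that only record what happened; the sketch assumes that a failed dump must thaw the container so it keeps running.

```python
from typing import Callable, List


class Container:
    """Hypothetical stand-in for a container; it only logs operations."""

    def __init__(self) -> None:
        self.frozen = False
        self.alive = True
        self.log: List[str] = []

    def freeze(self) -> None:
        self.frozen = True
        self.log.append("freeze")

    def thaw(self) -> None:
        self.frozen = False
        self.log.append("thaw")

    def kill(self) -> None:
        self.alive = False
        self.log.append("kill")


Step = Callable[[Container], None]


def checkpoint(c: Container, pre_dump: Step, dump: Step, post_dump: Step,
               kill_after: bool = False) -> None:
    pre_dump(c)        # 1. runs while the container is still executing
    c.freeze()         # 2. stop all tasks so the state is consistent
    try:
        dump(c)        # 3. capture the state into the image
    except Exception:
        c.thaw()       # a failed checkpoint must not disturb the container
        raise
    if kill_after:
        c.kill()       # 4. e.g. after a migration
    else:
        c.thaw()       # 4. e.g. after a plain checkpoint
    post_dump(c)       # 5. runs after the container has resumed or gone
```

Keeping the frozen window limited to step 3 is what lets the pre-dump and post-dump work reduce application downtime.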
− |
| |
− | * '''Restart operation'''
| |
− |
| |
− | Restart is done by first creating a container, then creating the
| |
− | process tree in it, and then each process restores its own state.
| |
− | This allows re-use of existing kernel code (e.g., restoring a memory
| |
− | region is a simple matter of calling mmap() and populating it).
| |
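As a user-space analogy only (not kernel code), the "call mmap() and populate it" idea looks roughly like this. The function name and the page-rounding policy are assumptions for the example.

```python
import mmap


def restore_region(saved: bytes) -> mmap.mmap:
    """Map a fresh anonymous region and fill it with the saved contents."""
    page = mmap.PAGESIZE
    # Round the length up to whole pages; always map at least one page.
    length = max(page, -(-len(saved) // page) * page)
    region = mmap.mmap(-1, length)  # anonymous, zero-filled mapping
    region[:len(saved)] = saved     # populate from the checkpoint image
    return region
```

A real restart would also have to restore the original address, protection bits and sharing of each region, which this sketch leaves out.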
− |
| |
− | OL: suggest that the process tree be created in userspace.
| |
− |
| |
− | DLu: prefer to do everything, including process creation, in the
| |
− | kernel; his experience shows that it isn't difficult.
| |
− |
| |
− | * '''Error recovery'''
| |
− |
| |
− | Should checkpoint fail, the container should continue execution
| |
− | without noticing it. If either checkpoint or restart fails, there
| |
− | should be a way to inform the caller/user of the reason (something
| |
− | more informative than -EBUSY).
| |
− |
| |
− | === Road plan ===
| |
− |
| |
− | At this point we want to create a proof of concept and CR a simple
| |
− | application. We will iteratively add more and more kernel resources.
| |
− |
| |
− | The first items to address:
| |
− | # Create a container object (the context on which CR operates)
| |
− | # Extend the container freezer cgroup ?
| |
− | # Interface via syscall or ioctl ?
| |
− |
| |
− | First step - a simple application:
| |
− | a single process, not using any files, no signal pending, no IPC etc.
| |
− | Need to save state (registers, IDs), memory maps and contents (except
| |
− | for read-only portions, e.g. text).
| |
− | Assume that the file system state doesn't change between checkpoint
| |
− | and restart.
| |
− |
| |
− | Next steps:
| |
− | # process hierarchy and relationships (multiple tasks and zombies)
| |
− | # multiple threads (and shared memory)
| |
− | # open files: regular file, fifo, pipe, socket-pair
| |
− | # signals, timers
| |
− | # TBD
| |
− |
| |
− | === Documentation ===
| |
− |
| |
− | DH: proof of concept requires explicit documentation of what can be
| |
− | checkpointed and what cannot be checkpointed, as well as what will
| |
− | be the error returned in response to a failure.
| |