| Latest revision |
Your text |
| Line 6: |
Line 6: |
| | Eric Biederman | | Eric Biederman |
| | Jason Byron, Red Hat | | Jason Byron, Red Hat |
| − | Joe Ruscio, Evergrid | + | Joe Rusio, Evergreen |
| | Joe McDonald | | Joe McDonald |
| | HP China | | HP China |
| Line 18: |
Line 18: |
| | Pavel Emelyanov, Parallels/OpenVZ | | Pavel Emelyanov, Parallels/OpenVZ |
| | Denis Lunev, Parallels/OpenVZ | | Denis Lunev, Parallels/OpenVZ |
| − | Andrey Mirkin, Parallels/OpenVZ
| |
| | Constant Chan | | Constant Chan |
| | Benjamin Thery, Bull | | Benjamin Thery, Bull |
| | Daniel Lezcano, IBM | | Daniel Lezcano, IBM |
| | Serge Hallyn, IBM | | Serge Hallyn, IBM |
| − | Oren Laadan, Columbia University
| |
| | | | |
| | On Phone: | | On Phone: |
| − | Amy Griffis, HP | + | Amy Griffith HP |
| | Dhaval Giani, IBM | | Dhaval Giani, IBM |
| − | Peter Zijlstra
| |
| | | | |
| − | (Later walk-ins): | + | (Later walk-ins) |
| − | Paul Menage, Google
| |
| | | | |
| − | == Namespaces and containers ==
| + | Topics: |
| | | | |
| | Why do various companies want containers? | | Why do various companies want containers? |
| − | IBM, Google: workload management | + | ibm: workload management |
| | EB: using containers as improved chroot | | EB: using containers as improved chroot |
| | HP: wants similar to ibm, plus security | | HP: wants similar to ibm, plus security |
| Line 146: |
Line 142: |
| | if we can do sys_hijack cleanly, | | if we can do sys_hijack cleanly, |
| | we can use it to solve kthread problem | | we can use it to solve kthread problem |
| − |
| |
| − | == Control Groups and Resource Management ==
| |
| − |
| |
| − | == Checkpoint/Restart [CR] ==
| |
| − |
| |
| − | === Uses of CR ===
| |
| − |
| |
| − | * '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc. may or may not assume a shared file system between endpoints
| |
| − |
| |
| − | * '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance
| |
| − |
| |
| − | * '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from most recent (unlike the previous scenarios, here the applications continue to execute after the checkpoint, perhaps modify the file system)
| |
| − |
| |
| − | * '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention is required to capturing the state of the file system as well)
| |
| − |
| |
| − | * [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
| |
| − |
| |
| − | * [EB] '''remote fork:''' e.g. in a cluster
| |
| − |
| |
| − | (the last two scenarios are likely to require adjustments during,
| |
| − | or after, the restart to tolerate changes in the file system or
| |
| − | otherwise in the environment)
| |
| − |
| |
| − | * [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application
| |
| − | across multiple nodes as a whole.
| |
| − |
| |
| − | EB reminded everyone that at the last kernel summit nobody complained about the
| |
| − | wish to add CR capabilities to the kernel. The issue was and remains
| |
| − | related to technical choices.
| |
| − |
| |
| − | === General design ===
| |
| − |
| |
| − | * '''Kernel-space vs user-space'''
| |
| − |
| |
| − | OL: the issue of kernel-space vs. user-space is pivotal to design.
| |
| − | kernel support is mandatory to provide completeness and transparency.
| |
| − | Even the recent experience with "cryo" demonstrated that user-space
| |
| − | requires the kernel to expose a very fine-grained API.
| |
| − |
| |
| − | Everyone (except DaveHansen) agreed to aim at a monolithic interface,
| |
| − | such that nearly all of the CR will be done in the kernel. The kernel
| |
| − | will return (checkpoint) or receive (restart) a blob with the image
| |
| − | of the state of the container.
| |
| − |
| |
| − | * '''Kernel-module ?'''
| |
| − |
| |
| − | OL: can we implement mostly in a kernel module and then move CR into
| |
| − | the kernel later ?
| |
| − |
| |
| − | EB: better to add CR functionality gradually directly to the kernel.
| |
| − |
| |
| − | * '''Compatibility between kernels'''
| |
| − |
| |
| − | DLu: there is an issue with compatibility between kernels - even the same
| |
| − | kernel compiled with different options and/or compiler, and also if
| |
| − | the kernel ABI changes.
| |
| − |
| |
| − | OL: suggest to use an intermediate representation for the checkpoint
| |
| − | image to avoid the issue as much as possible; conversion, if needed,
| |
| − | will take place with userland tools. No aim to bridge ABI changes in
| |
| − | case of migration: instead, fail the restart.
| |
| − |
| |
| − | EB: format the blob such that userland tools can parse it and
| |
| − | easily detect a version/configuration mismatch.
| |
| − |
| |
| − | * '''Streaming checkpoint image ?'''
| |
| − |
| |
| − | DLu: using a sequential (non-seekable) file, such as a socket, for the
| |
| − | checkpoint image is a challenge.
| |
| − |
| |
| − | OL: with proper planning it is not complicated to achieve, and it has
| |
| − | the advantage that the image can pass through a filter, e.g. for compression,
| |
| − | encryption, format conversion etc.
| |
| − |
| |
| − | * '''Checkpoint operation'''
| |
| − |
| |
| − | The procedure will entail five steps:
| |
| − | # Pre-dump
| |
| − | # Freeze the container
| |
| − | # Dump
| |
| − | # Thaw/Kill the container
| |
| − | # Post-dump
| |
| − |
| |
| − | "pre-dump" does work before the container is frozen, e.g. the pre-copy for
| |
| − | live migration, to minimize application downtime.
| |
| − |
| |
| − | "post-dump" does work after the container resumes execution, e.g. in the
| |
| − | case of a checkpoint (not migration), writing the data back to secondary
| |
| − | storage, again to minimize application downtime.
| |
| − |
| |
| − | OL: we should be able to checkpoint from inside the container, keep
| |
| − | that in mind for later (also relates to the freezer).
| |
| − |
| |
| − | * '''Restart operation'''
| |
| − |
| |
| − | Restart is done by first creating a container, then creating the
| |
| − | process tree in it, and then each process restores its own state.
| |
| − | This allows re-use of existing kernel code (e.g., restoring a memory
| |
| − | region is a simple matter of calling mmap() and populating it).
| |
| − |
| |
| − | OL: suggest that the process tree be created in userspace.
| |
| − |
| |
| − | DLu: prefer to do everything, including process creation, in the
| |
| − | kernel; his experience shows that it isn't difficult.
| |
| − |
| |
| − | * '''Error recovery'''
| |
| − |
| |
| − | Should checkpoint fail, the container should continue execution
| |
| − | without noticing it. If either checkpoint or restart fails, there
| |
| − | should be a way to inform the caller/user of the reason (something
| |
| − | more informative than -EBUSY).
| |
| − |
| |
| − | === Road plan ===
| |
| − |
| |
| − | At this point we want to create a proof of concept and CR a simple
| |
| − | application. We will iteratively add more and more kernel resources.
| |
| − |
| |
| − | The first items to address:
| |
| − | # Create a container object (the context on which CR operates)
| |
| − | # Extend the container freezer cgroup ?
| |
| − | # Interface via syscall or ioctl ?
| |
| − |
| |
| − | First step - a simple application:
| |
| − | a single process, not using any files, no signal pending, no IPC etc.
| |
| − | Need to save state (registers, IDs), memory maps and contents (except
| |
| − | for read-only portions, e.g. text).
| |
| − | Assume that the file system state doesn't change between checkpoint
| |
| − | and restart.
| |
| − |
| |
| − | Next steps:
| |
| − | # process hierarchy and relationships (multiple tasks and zombies)
| |
| − | # multiple threads (and shared memory)
| |
| − | # open files: regular file, fifo, pipe, socket-pair
| |
| − | # signals, timers
| |
| − | # TBD
| |
| − |
| |
| − | === Documentation ===
| |
| − |
| |
| − | DH: proof of concept requires explicit documentation of what can be
| |
| − | checkpointed and what cannot be checkpointed, as well as what will
| |
| − | be the error returned in response to a failure.
| |
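
The points raised under "Compatibility between kernels" and "Streaming checkpoint image" can be illustrated with a small userland sketch: a header that is written and read strictly sequentially (so it also works over a socket or pipe) and that lets a tool reject a mismatched image before any state is restored. The layout below is purely hypothetical; the magic value, the field widths, the `ImageMismatch` exception and the free-form description string are assumptions made for illustration and were not discussed at the meeting.

```python
import io
import struct

# Hypothetical layout (an assumption, not from the minutes): magic,
# format version, and the length of a kernel/config description string.
MAGIC = b"CKPT"
FORMAT_VERSION = 1
_HDR = struct.Struct("<4sHH")  # magic, version, description length


class ImageMismatch(Exception):
    """Raised when an image cannot be restarted on this system."""


def write_header(stream, description):
    """Write the header sequentially; the stream is never seeked."""
    desc = description.encode("utf-8")
    stream.write(_HDR.pack(MAGIC, FORMAT_VERSION, len(desc)))
    stream.write(desc)


def _read_exact(stream, n):
    """Read exactly n bytes from a possibly non-seekable stream."""
    chunks = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            raise ImageMismatch("truncated checkpoint image")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def check_header(stream, expected_description):
    """Fail the restart early if the image does not match this system."""
    magic, version, desc_len = _HDR.unpack(_read_exact(stream, _HDR.size))
    if magic != MAGIC:
        raise ImageMismatch("not a checkpoint image")
    if version != FORMAT_VERSION:
        raise ImageMismatch(f"unsupported format version {version}")
    description = _read_exact(stream, desc_len).decode("utf-8")
    if description != expected_description:
        raise ImageMismatch(
            f"image built for {description!r}, running {expected_description!r}"
        )
    return description
```

Because both functions only call `write()` and `read()`, the same stream could first pass through a compression or encryption filter, as OL suggested. The header is checked before anything is restored, which corresponds to failing the restart rather than trying to bridge ABI changes.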