(Later walk-ins)
Why do various companies want containers?
if we can do sys_hijack cleanly,
we can use it to solve kthread problem
== Checkpoint/Restart [CR] ==
=== Uses of CR ===
* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between endpoints
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, care is needed to capture the state of the file system as well)
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
* [EB] '''remote fork:''' e.g. in a cluster
(the last two scenarios are likely to require adjustments during,
or after, the restart to tolerate changes in the file system or
otherwise in the environment)
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application across multiple nodes as a whole.
EB reminded everyone that at the last kernel summit nobody objected to the
wish to add CR capabilities to the kernel. The open issue was, and remains,
one of technical choices.
=== General design ===
* '''Kernel-space vs user-space'''
OL: the issue of kernel-space vs. user-space is pivotal to the design.
Kernel support is mandatory to provide completeness and transparency.
Even the recent experience with "cryo" demonstrated that a user-space
approach requires the kernel to expose a very fine-grained API.
Everyone agreed to aim at a monolithic interface, such that nearly
all of the CR will be done in the kernel. The kernel will return
(checkpoint) or receive (restart) a blob with the image of the state
of the container.
* '''Kernel-module ?'''
OL: can we implement mostly in a kernel module and then move CR into
the kernel later ?
EB: better to add CR functionality gradually directly to the kernel.
* '''Compatibility between kernels'''
DLu: there is an issue with compatibility between kernels - even the same
kernel compiled with different options and/or compiler, and also if
the kernel ABI changes.
OL: suggests using an intermediate representation for the checkpoint
image to avoid the issue as much as possible; conversion, if needed,
will take place with userland tools. There is no aim to bridge ABI changes
in case of migration: instead, fail the restart.
EB: format the blob so that userland tools can parse it and easily
detect a version/configuration mismatch.
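A minimal sketch of how a self-describing image header could let a userland tool detect a version/configuration mismatch and fail the restart early. Everything here is an assumption for illustration: the struct ckpt_header layout, the CKPT_MAGIC value, the field names and the idea of hashing the build configuration are hypothetical and were not decided at the meeting.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical values: neither the magic nor the version is a real format. */
#define CKPT_MAGIC          0x54504b43u
#define CKPT_FORMAT_VERSION 1u

/* Hypothetical header placed at the start of the checkpoint blob. */
struct ckpt_header {
    uint32_t magic;              /* identifies the blob as a checkpoint image */
    uint32_t format_version;     /* version of the intermediate representation */
    char     kernel_release[64]; /* release string of the checkpointing kernel */
    uint32_t config_hash;        /* assumed hash of the relevant build options */
};

enum ckpt_compat {
    CKPT_COMPAT_OK = 0,
    CKPT_BAD_MAGIC,
    CKPT_BAD_VERSION,
    CKPT_BAD_KERNEL,
    CKPT_BAD_CONFIG
};

/* Fill in a header describing one kernel (image side or host side). */
void ckpt_header_init(struct ckpt_header *h, const char *release,
                      uint32_t config_hash)
{
    memset(h, 0, sizeof(*h));
    h->magic = CKPT_MAGIC;
    h->format_version = CKPT_FORMAT_VERSION;
    /* memset above guarantees NUL termination of the truncated copy */
    strncpy(h->kernel_release, release, sizeof(h->kernel_release) - 1);
    h->config_hash = config_hash;
}

/* Compare the image header with the restarting host; report the first mismatch. */
enum ckpt_compat ckpt_header_check(const struct ckpt_header *img,
                                   const struct ckpt_header *host)
{
    if (img->magic != CKPT_MAGIC)
        return CKPT_BAD_MAGIC;
    if (img->format_version != host->format_version)
        return CKPT_BAD_VERSION;
    if (strncmp(img->kernel_release, host->kernel_release,
                sizeof(img->kernel_release)) != 0)
        return CKPT_BAD_KERNEL;
    if (img->config_hash != host->config_hash)
        return CKPT_BAD_CONFIG;
    return CKPT_COMPAT_OK;
}
```

A tool would read the header from the front of the blob, build a second header for the local kernel, and refuse the restart unless ckpt_header_check() returns CKPT_COMPAT_OK.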
* '''Streaming checkpoint image ?'''
DLu: using a sequential (non-seekable) file, like a socket, for the
checkpoint image is a challenge.
OL: with proper planning it is not complicated to achieve, and it has
the advantage that the image can be passed through a filter, e.g. for
compression, encryption, format conversion etc.
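One way to keep the image streamable is to write it as a sequence of self-delimiting records, so that neither the writer nor the reader ever needs to seek. The sketch below assumes a hypothetical record header (struct ckpt_rec_hdr with a type and a length); the record types and the helper names ckpt_put_record() and ckpt_get_record() are illustrative only.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record header: every record is [type, len] followed by len bytes. */
struct ckpt_rec_hdr {
    uint32_t type;
    uint32_t len;
};

/* Write exactly n bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t n)
{
    const char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Read exactly n bytes; a premature end of stream is an error. */
static int read_all(int fd, void *buf, size_t n)
{
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Append one record to the stream; works on pipes and sockets alike. */
int ckpt_put_record(int fd, uint32_t type, const void *payload, uint32_t len)
{
    struct ckpt_rec_hdr h = { type, len };
    if (write_all(fd, &h, sizeof(h)) < 0)
        return -1;
    return write_all(fd, payload, len);
}

/* Read the next record; returns the payload length, or -1 on error
 * or if the payload does not fit in the caller's buffer. */
int ckpt_get_record(int fd, uint32_t *type, void *buf, uint32_t cap)
{
    struct ckpt_rec_hdr h;
    if (read_all(fd, &h, sizeof(h)) < 0)
        return -1;
    if (h.len > cap)
        return -1;
    if (read_all(fd, buf, h.len) < 0)
        return -1;
    *type = h.type;
    return (int)h.len;
}
```

Because the reader only moves forward, the same descriptor can be the output of a compression or encryption filter without any change to the reader.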
* '''Checkpoint operation'''
The procedure will entail five steps:
# Pre-dump
# Freeze the container
# Dump
# Thaw/Kill the container
# Post-dump
"pre-dump" works before freezing the container, e.g. the pre-copy phase
of live migration, to minimize application downtime.
"post-dump" works after the container resumes execution, e.g. in the
case of a checkpoint (not migration), writing the data back to secondary
storage, again to minimize application downtime.
OL: we should be able to checkpoint from inside the container, keep
that in mind for later (also relates to the freezer).
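The five steps above, combined with the requirement that a failed checkpoint must not leave the container frozen, can be sketched as a small driver. This is only an illustration under stated assumptions: the ckpt_phase names, the single-callback interface ckpt_step_fn and the demo callback are hypothetical, and the "kill" variant of step 4 (used for migration) is not modelled.

```c
/* Hypothetical phase names mirroring the five steps listed above. */
enum ckpt_phase {
    PH_PRE_DUMP,
    PH_FREEZE,
    PH_DUMP,
    PH_THAW,
    PH_POST_DUMP,
    PH_COUNT
};

/* One callback performs the work of a given phase; returns 0 on success. */
typedef int (*ckpt_step_fn)(enum ckpt_phase ph, void *ctx);

/* Run the phases in order. After the first failure, skip everything except
 * the thaw, and run the thaw only if the freeze actually succeeded.
 * Returns 0 on success, or the first error with *failed set to its phase. */
int ckpt_run(ckpt_step_fn step, void *ctx, enum ckpt_phase *failed)
{
    int frozen = 0;
    int err = 0;

    for (int ph = PH_PRE_DUMP; ph < PH_COUNT; ph++) {
        if (ph == PH_THAW && !frozen)
            continue;
        if (err != 0 && ph != PH_THAW)
            continue;

        int rc = step((enum ckpt_phase)ph, ctx);
        if (rc != 0 && err == 0) {
            err = rc;
            *failed = (enum ckpt_phase)ph;
        }
        if (ph == PH_FREEZE && rc == 0)
            frozen = 1;
    }
    return err;
}

/* Demo callback (illustrative): records which phases ran and can be told
 * to fail at one phase. */
struct demo_ctx {
    char trace[PH_COUNT + 1];
    int n;
    int fail_at; /* phase to fail at, or -1 for none */
};

int demo_step(enum ckpt_phase ph, void *ctx)
{
    static const char tag[PH_COUNT] = { 'p', 'F', 'D', 'T', 'P' };
    struct demo_ctx *c = ctx;

    c->trace[c->n++] = tag[ph];
    c->trace[c->n] = '\0';
    return (int)ph == c->fail_at ? -1 : 0;
}
```

With the demo callback, a successful run visits all five phases, while a failure during the dump still thaws the container and skips the post-dump.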
* '''Restart operation'''
Restart is done by first creating a container, then creating the
process tree in it, and then each process restores its own state.
This allows re-use of existing kernel code (e.g., restoring a memory
region is a simple matter of calling mmap() and populating it).
OL: suggest that the process tree be created in userspace.
DLu: prefers to do everything, including process creation, in the
kernel; his experience shows that it isn't difficult.
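As a user-space illustration of the "mmap() and populate" idea, the sketch below restores one saved anonymous region. The struct saved_region and restore_region() names are hypothetical. The sketch also lets the kernel choose the address, which keeps it safe to run, whereas an actual restart would presumably have to place the region at its original address.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical description of one region as recorded in the image. */
struct saved_region {
    size_t      len;  /* length in bytes, must be non-zero */
    int         prot; /* final protection, e.g. PROT_READ | PROT_WRITE */
    const void *data; /* saved contents, len bytes */
};

/* Map a fresh anonymous region, copy the saved contents into it, then
 * apply the saved protection. Returns the mapping, or NULL on failure. */
void *restore_region(const struct saved_region *r)
{
    /* Map writable first so the contents can be populated. */
    void *p = mmap(NULL, r->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memcpy(p, r->data, r->len);

    /* Drop to the protection the region had at checkpoint time. */
    if (mprotect(p, r->len, r->prot) != 0) {
        munmap(p, r->len);
        return NULL;
    }
    return p;
}
```

The two-step protection (writable while populating, saved protection afterwards) is what allows a region that was read-only at checkpoint time to be filled in at all.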
* '''Error recovery'''
Should checkpoint fail, the container should continue execution
without noticing it. If either checkpoint or restart fails, there
should be a way to inform the caller/user of the reason (something
more informative than -EBUSY).
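One possible shape for a more informative failure report is an error record that names the stage and the object involved in addition to the errno value. The struct ckpt_error fields and ckpt_error_format() below are hypothetical; the notes do not say how the reason would be conveyed to the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error record returned alongside a failed checkpoint/restart. */
struct ckpt_error {
    int         err;    /* errno-style code, e.g. EBUSY */
    const char *stage;  /* which step failed, e.g. "dump" */
    const char *object; /* which resource caused it */
};

/* Render the record as a single human-readable line.
 * Returns the snprintf() result (length that was or would be written). */
int ckpt_error_format(const struct ckpt_error *e, char *buf, size_t cap)
{
    return snprintf(buf, cap, "%s failed on %s: %s",
                    e->stage, e->object, strerror(e->err));
}
```

The caller still gets the plain error code for programmatic handling, while the stage and object tell the user which resource prevented the operation.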
=== Road plan ===
At this point we want to create a proof of concept and CR a simple
application. We will iteratively add more and more kernel resources.
The first items to address:
# Create a container object (the context on which CR operates)
# Extend the container freezer (cgroup ?)
# Interface via syscall or ioctl ?
First step - a simple application:
a single process, not using any files, no pending signals, no IPC etc.
Need to save state (registers, IDs), memory maps and contents (except
for read-only portions, e.g. text).
Assume that the file system state doesn't change between checkpoint
and restart.
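The rule "save the memory maps, but skip the contents of read-only portions" can be illustrated from user space with the /proc/<pid>/maps line format (address range, permissions, offset, device, inode, path). The struct vma_info and both helper names are hypothetical, and an in-kernel implementation would walk the process's memory areas directly rather than parse text.

```c
#include <stdio.h>

/* Hypothetical summary of one mapping, taken from a /proc/<pid>/maps line. */
struct vma_info {
    unsigned long start;
    unsigned long end;
    char          perms[5]; /* e.g. "r-xp" plus terminating NUL */
};

/* Parse the leading "start-end perms" fields of a maps line.
 * Returns 0 on success, -1 if the line does not match. */
int parse_maps_line(const char *line, struct vma_info *v)
{
    int n = sscanf(line, "%lx-%lx %4s", &v->start, &v->end, v->perms);
    return n == 3 ? 0 : -1;
}

/* The layout of every mapping is recorded, but contents are saved only
 * for writable mappings; read-only ones (e.g. text) can be re-mapped
 * from the unchanged file system at restart. */
int vma_needs_contents(const struct vma_info *v)
{
    return v->perms[1] == 'w';
}
```

This relies on the stated assumption that the file system does not change between checkpoint and restart; otherwise the skipped read-only contents could not be recovered from their backing files.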
Next steps:
# process hierarchy and relationships (multiple tasks and zombies)
# multiple threads (and shared memory)
# open files: regular file, fifo, pipe, socket-pair
# signals, timers
# TBD
=== Documentation ===
DH: the proof of concept requires explicit documentation of what can and
cannot be checkpointed, as well as which error will be returned in
response to a failure.