Containers/Mini-summit 2008 notes

Eric Biederman
Jason Byron, Red Hat
Joe Ruscio, Evergrid
Joe McDonald
HP China
Pavel Emelyanov, Parallels/OpenVZ
Denis Lunev, Parallels/OpenVZ
Andrey Mirkin, Parallels/OpenVZ
Constant Chan
Benjamin Thery, Bull
Daniel Lezcano, IBM
Serge Hallyn, IBM
Oren Laadan, Columbia University
On Phone:
Amy Griffis, HP
Dhaval Giani, IBM
Peter Zijlstra
(Later walk-ins): Paul Menage, Google
== Namespaces and containers ==
Why do various companies want containers?
IBM, Google: workload management
EB: using containers as an improved chroot
HP: wants the same as IBM, plus security
If we can do sys_hijack cleanly,
we can use it to solve the kthread problem.
 
== Control Groups and Resource Management ==
 
== Checkpoint/Restart [CR] ==
 
=== Uses of CR ===
 
* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between the endpoints
 
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance
 
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint, and may modify the file system)
 
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, care is required to capture the state of the file system as well)
 
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.
 
* [EB] '''remote fork:''' e.g. in a cluster
 
(the last two scenarios are likely to require adjustments during,
or after, the restart to tolerate changes in the file system or
otherwise in the environment)
 
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application
across multiple nodes as a whole.
 
EB recalled that at the last kernel summit nobody objected to the
wish to add CR capabilities to the kernel. The issue was, and remains,
one of technical choices.
 
=== General design ===
 
* '''Kernel-space vs user-space'''
 
OL: the issue of kernel-space vs. user-space is pivotal to the design.
Kernel support is mandatory to provide completeness and transparency.
Even the recent experience with "cryo" demonstrated that a user-space
approach requires the kernel to expose a very fine-grained API.
 
Everyone (except Dave Hansen) agreed to aim for a monolithic interface,
such that nearly all of the CR work will be done in the kernel. The
kernel will return (on checkpoint) or receive (on restart) a blob with
the image of the state of the container.
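
A minimal sketch of what such a monolithic interface might look like;
the names sys_checkpoint/sys_restart, their signatures, and their
semantics are illustrative assumptions, not anything decided at the
summit:

<pre>
#include <sys/types.h>

/* Hypothetical user-visible interface -- names and signatures are
 * assumptions for illustration only. */

/* Write the state of the container that task 'pid' belongs to as an
 * opaque blob to the open file descriptor 'fd'.  Returns a positive
 * checkpoint identifier, or a negative errno on failure. */
long sys_checkpoint(pid_t pid, int fd, unsigned long flags);

/* Read a blob from 'fd' and recreate the container it describes.
 * Returns 0 on success, or a negative errno on failure. */
long sys_restart(pid_t pid, int fd, unsigned long flags);
</pre>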
 
* '''Kernel module?'''
 
OL: can we implement CR mostly in a kernel module, and then move it
into the kernel proper later?
 
EB: better to add the CR functionality gradually, directly to the kernel.
 
* '''Compatibility between kernels'''
 
DLu: there is an issue of compatibility between kernels - even the same
kernel compiled with different options and/or a different compiler, and
also when the kernel ABI changes.
 
OL: suggests using an intermediate representation for the checkpoint
image to avoid the issue as much as possible; conversion, if needed,
will be done with userland tools. There is no aim to bridge ABI changes
in case of migration: instead, fail the restart.
 
EB: format the blob such that userland tools can parse it and easily
detect a version/configuration mismatch.
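
One way the blob could be made self-describing, so that userland tools
can cheaply detect a mismatch before attempting a restart; the header
layout below is an assumption for illustration, not an agreed format:

<pre>
#include <stdint.h>

/* Hypothetical header at the start of every checkpoint image. */
struct cr_image_header {
        uint32_t magic;          /* identifies a checkpoint image      */
        uint32_t format_version; /* bumped on any layout change        */
        uint32_t kernel_version; /* kernel version of the dumping host */
        uint32_t arch;           /* architecture of the source machine */
        uint64_t config_flags;   /* relevant kernel config options     */
};
</pre>

A tool (or the kernel) reads the header first and simply fails the
restart on any mismatch, rather than trying to convert in place.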
 
* '''Streaming checkpoint image?'''
 
DLu: using a sequential (non-seekable) file, like a socket, for the
checkpoint image is a challenge.
 
OL: with proper planning it is not complicated to achieve, and it has
the advantage that the image can be passed through a filter, e.g. for
compression, encryption, format conversion, etc.
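
A sketch of why a sequential image helps, reusing the hypothetical
sys_checkpoint() from above: as long as the kernel only ever write()s
the image in order, it can be pointed at a pipe and filtered on the fly:

<pre>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern long sys_checkpoint(pid_t pid, int fd, unsigned long flags);

/* Checkpoint 'pid', compressing the image with gzip on the way to
 * 'out_fd'.  Illustrative only. */
int checkpoint_compressed(pid_t pid, int out_fd)
{
        int p[2];

        if (pipe(p) < 0)
                return -1;

        if (fork() == 0) {              /* filter process */
                dup2(p[0], 0);          /* reads the raw image...       */
                dup2(out_fd, 1);        /* ...writes the compressed one */
                close(p[0]);
                close(p[1]);
                execlp("gzip", "gzip", "-c", (char *)NULL);
                _exit(127);
        }

        close(p[0]);
        long ret = sys_checkpoint(pid, p[1], 0); /* sequential writes only */
        close(p[1]);
        wait(NULL);
        return ret < 0 ? -1 : 0;
}
</pre>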
 
* '''Checkpoint operation'''
 
The procedure will entail five steps:
# Pre-dump
# Freeze the container
# Dump
# Thaw/Kill the container
# Post-dump
 
"pre-dump" works before freezing the container, e.g. the pre-copy for
live migration and minimize application downtime.
 
"post-dump" works after the container resumes execution, e.g. in the
case of a checkpoint (not migration) write-back the data to secondary
storage, again to minimize application downtime.
 
OL: we should be able to checkpoint from inside the container; keep
that in mind for later (this also relates to the freezer).
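
The five steps might be driven from userspace along these lines; the
freezer.state writes follow the cgroup freezer interface, while
pre_dump(), post_dump() and sys_checkpoint() are placeholders for the
steps above:

<pre>
#include <stdio.h>
#include <sys/types.h>

extern long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
extern void pre_dump(pid_t pid, int fd);   /* e.g. pre-copy memory  */
extern void post_dump(int fd);             /* e.g. write-back image */

static void freezer_write(const char *state)
{
        /* assumes the container's tasks live in this freezer cgroup */
        FILE *f = fopen("/cgroup/mycontainer/freezer.state", "w");
        if (f) {
                fputs(state, f);
                fclose(f);
        }
}

int do_checkpoint(pid_t pid, int fd)
{
        pre_dump(pid, fd);                       /* 1. pre-dump       */
        freezer_write("FROZEN");                 /* 2. freeze         */
        long ret = sys_checkpoint(pid, fd, 0);   /* 3. dump           */
        freezer_write("THAWED");                 /* 4. thaw (or kill) */
        post_dump(fd);                           /* 5. post-dump      */
        return ret < 0 ? -1 : 0;
}
</pre>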
 
* '''Restart operation'''
 
Restart is done by first creating a container, then creating the
process tree in it, and then having each process restore its own state.
This allows re-use of existing kernel code (e.g., restoring a memory
region is a simple matter of calling mmap() and populating it, as
sketched below).
 
OL: suggests that the process tree be created in userspace.
 
DLu: prefers to do everything, including process creation, in the
kernel; his experience shows that it isn't difficult.
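
To illustrate the mmap() point above: restoring one anonymous, writable
region from the image is essentially the following. The per-region
record is an assumed format, and an in-kernel implementation would use
the corresponding internal interfaces instead:

<pre>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct cr_vma {            /* assumed per-region record in the image */
        uint64_t start;    /* original start address */
        uint64_t len;      /* region length in bytes */
};

static int restore_region(int image_fd, const struct cr_vma *vma)
{
        void *addr = mmap((void *)(uintptr_t)vma->start, vma->len,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                          -1, 0);
        if (addr == MAP_FAILED)
                return -1;

        /* populate the region with the saved contents */
        if (read(image_fd, addr, vma->len) != (ssize_t)vma->len)
                return -1;
        return 0;
}
</pre>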
 
* '''Error recovery'''
 
Should a checkpoint fail, the container should continue execution
without noticing it. If either checkpoint or restart fails, there
should be a way to inform the caller/user of the reason (something
more informative than -EBUSY).
 
=== Road plan ===
 
At this point we want to create a proof of concept and CR a simple
application. We will then iteratively add support for more and more
kernel resources.
 
The first items to address:
# Create a container object (the context on which CR operates)
# Extend the container freezer (cgroup)?
# Interface via syscall or ioctl?
 
First step - a simple application:
a single process, not using any files, with no signals pending, no IPC, etc.
We need to save its state (registers, IDs), and its memory maps and
contents (except for read-only portions, e.g. text).
Assume that the file system state doesn't change between checkpoint
and restart.
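
The image for this first step might contain little more than the
following; the layout is an assumption for illustration (the cr_vma
records are the ones sketched above in the restart example):

<pre>
#include <stdint.h>
#include <sys/types.h>
#include <asm/ptrace.h>        /* struct pt_regs */

/* Hypothetical per-process image for the proof of concept. */
struct cr_task_image {
        struct pt_regs regs;   /* CPU registers at checkpoint time  */
        pid_t pid;             /* IDs, as seen inside the container */
        uid_t uid;
        gid_t gid;
        uint32_t nr_vmas;      /* followed by nr_vmas cr_vma records
                                * plus their contents; read-only
                                * portions (e.g. text) are omitted */
};
</pre>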
 
Next steps:
# process hierarchy and relationships (multiple tasks and zombies)
# multiple threads (and shared memory)
# open files: regular file, fifo, pipe, socket-pair
# signals, timers
# TBD
 
=== Documentation ===
 
DH: the proof of concept requires explicit documentation of what can
and cannot be checkpointed, as well as of the error returned in
response to a failure.