Containers/Mini-summit 2008 notes

[[Category: Containers]]

Intros (8:36am)

        Dave Hansen
        Eric Biederman
        Jason Byron, Red Hat
        Joe Ruscio, Evergrid
        Joe McDonald
        HP China
        [...]
        Sandy Harris
        NEC Japan
        John Schultz, AOL
        Pavel Emelyanov, Parallels/OpenVZ
        Denis Lunev, Parallels/OpenVZ
        Andrey Mirkin, Parallels/OpenVZ
        Constant Chan
        Benjamin Thery, Bull
        Daniel Lezcano, IBM
        Serge Hallyn, IBM
        Oren Laadan, Columbia University

On Phone:
        Amy Griffis, HP
        Dhaval Giani, IBM
        Peter Zijlstra

(Later walk-ins):
        Paul Menage, Google

== Namespaces and containers ==

Why do various companies want containers?
        IBM, Google: workload management
        EB: using containers as improved chroot
        HP: wants similar to IBM, plus security
        [...]
                        if we can do sys_hijack cleanly,
                        we can use it to solve kthread problem
 
== Control Groups and Resource Management ==
 
 
== Checkpoint/Restart [CR] ==
 
 
=== Uses of CR ===
 
 
* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs; may or may not assume a shared file system between endpoints

* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance

* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)

* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, the state of the file system must be captured as well)

* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.

* [EB] '''remote fork:''' e.g. in a cluster

(the last two scenarios are likely to require adjustments during,
or after, the restart to tolerate changes in the file system or
otherwise in the environment)

* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application across multiple nodes as a whole.
 
 
EB reminded the group that at the last kernel summit nobody objected
to adding CR capabilities to the kernel. The open issues were, and
remain, the technical choices.
 
 
=== General design ===
 
 
* '''Kernel-space vs user-space'''
 
 
OL: the issue of kernel-space vs. user-space is pivotal to the design.
Kernel support is mandatory to provide completeness and transparency.
Even the recent experience with "cryo" demonstrated that a user-space
implementation requires the kernel to expose a very fine-grained API.
 
 
Everyone (except Dave Hansen) agreed to aim at a monolithic interface,
such that nearly all of the CR will be done in the kernel. The kernel
will return (checkpoint) or receive (restart) a blob with the image
of the state of the container.
 
 
* '''Kernel-module ?'''
 
 
OL: can we implement mostly in a kernel module and then move CR into
the kernel later ?
 
 
EB: better to add CR functionality gradually directly to the kernel.
 
 
* '''Compatibility between kernels'''
 
 
DLu: there is an issue with compatibility between kernels, even for the
same kernel compiled with different options and/or a different compiler,
and also if the kernel ABI changes.
 
 
OL: suggests using an intermediate representation for the checkpoint
image to avoid the issue as much as possible; conversion, if needed,
will be done by userland tools. There is no aim to bridge ABI changes in
case of migration: instead, the restart fails.
 
 
EB: format the blob so that userland tools can parse it and easily
detect a version/configuration mismatch.
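
One way a userland tool could detect such a mismatch is sketched below. This is only an illustration: the header layout (a magic string, a format version and a digest of the kernel configuration) and the names `MAGIC`, `HEADER`, `write_header` and `check_header` are assumptions made for the example, not something decided at the meeting.

```python
import hashlib
import struct

# Hypothetical header: magic, format version, SHA-256 of the kernel config.
MAGIC = b"CKPT"
HEADER = struct.Struct("<4sH32s")


def config_digest(config_text: str) -> bytes:
    """Digest of the configuration the image was taken under."""
    return hashlib.sha256(config_text.encode()).digest()


def write_header(version: int, config_text: str) -> bytes:
    """Build the fixed-size header that precedes the rest of the blob."""
    return HEADER.pack(MAGIC, version, config_digest(config_text))


def check_header(blob: bytes, version: int, config_text: str) -> str:
    """Return "ok" or a human-readable reason for the mismatch."""
    if len(blob) < HEADER.size:
        return "truncated header"
    magic, img_version, img_digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return "not a checkpoint image"
    if img_version != version:
        return f"version mismatch: image {img_version}, kernel {version}"
    if img_digest != config_digest(config_text):
        return "configuration mismatch"
    return "ok"
```

Because the header has a fixed size and sits at the start of the blob, the check needs nothing beyond the first bytes of the image.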
 
 
* '''Streaming checkpoint image ?'''
 
 
DLu: using a sequential (non-seekable) file, such as a socket, for the
checkpoint image is a challenge.
 
 
OL: with proper planning it is not complicated to achieve, and it has
the advantage that the image can be passed through a filter, e.g. for
compression, encryption, format conversion etc.
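
The streaming idea can be illustrated with a small userland sketch: records are written strictly in order through a compression filter and read back using forward reads only. The record layout (`RECORD`, a type and a payload length) and the function names are hypothetical; zlib stands in for whatever filter would actually be used.

```python
import io
import struct
import zlib

RECORD = struct.Struct("<HI")  # hypothetical record header: type, payload length


def write_image(out, records):
    """Write (type, payload) records through a compression filter, append-only."""
    comp = zlib.compressobj()
    for rtype, payload in records:
        out.write(comp.compress(RECORD.pack(rtype, len(payload)) + payload))
    out.write(comp.flush())


def read_image(inp, chunk_size=16):
    """Yield records from a stream using only forward reads (no seek/tell)."""
    decomp = zlib.decompressobj()
    buf = b""
    while True:
        chunk = inp.read(chunk_size)
        buf += decomp.decompress(chunk) if chunk else decomp.flush()
        while len(buf) >= RECORD.size:
            rtype, length = RECORD.unpack_from(buf)
            if len(buf) < RECORD.size + length:
                break  # record not complete yet, wait for more data
            yield rtype, buf[RECORD.size:RECORD.size + length]
            buf = buf[RECORD.size + length:]
        if not chunk:
            return
```

Each record carries its own length, so the reader never needs to look ahead or seek back; this is what makes the format usable over a pipe or a socket.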
 
 
* '''Checkpoint operation'''
 
 
The procedure will entail five steps:
# Pre-dump
# Freeze the container
# Dump
# Thaw/Kill the container
# Post-dump
 
 
"pre-dump" works before freezing the container, e.g. the pre-copy for
live migration, to minimize application downtime.
 
 
"post-dump" works after the container resumes execution, e.g. in the
case of a checkpoint (not migration) writing the data back to secondary
storage, again to minimize application downtime.
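
The ordering of the five steps can be sketched as follows. `FakeContainer` is a stand-in invented for this example and only records the calls it receives; the sketch covers the thaw case (a migration would kill the container instead) and assumes the container must always be thawed, even when the dump fails.

```python
class FakeContainer:
    """Stand-in for a real container; it only records the calls it receives."""

    def __init__(self, fail_dump=False):
        self.calls = []
        self.frozen = False
        self.fail_dump = fail_dump

    def pre_dump(self):
        self.calls.append("pre-dump")

    def freeze(self):
        self.frozen = True
        self.calls.append("freeze")

    def dump(self):
        assert self.frozen, "dump requires a frozen container"
        self.calls.append("dump")
        if self.fail_dump:
            raise RuntimeError("dump failed")
        return b"image"

    def thaw(self):
        self.frozen = False
        self.calls.append("thaw")

    def post_dump(self, image):
        self.calls.append("post-dump")


def checkpoint(container):
    """Run the five steps in order; only the dump runs on a frozen container."""
    container.pre_dump()         # e.g. pre-copy while the application runs
    container.freeze()
    try:
        image = container.dump()
    finally:
        container.thaw()         # resume even if the dump failed
    container.post_dump(image)   # e.g. write back to storage after resume
    return image
```

Keeping pre-dump and post-dump outside the frozen window is what bounds the application downtime to the dump itself.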
 
 
OL: we should be able to checkpoint from inside the container; keep
that in mind for later (also relates to the freezer).
 
 
* '''Restart operation'''
 
 
Restart is done by first creating a container, then creating the
process tree in it, and then each process restores its own state.
This allows re-use of existing kernel code (e.g., restoring a memory
region is a simple matter of calling mmap() and populating it).
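
As a userland analogy for the mmap() remark, the sketch below creates an anonymous mapping and fills it with saved contents. The function name `restore_region` is made up for the example, and unlike a real restart it does not place the mapping at its original address.

```python
import mmap


def restore_region(saved: bytes) -> mmap.mmap:
    """Create an anonymous mapping and fill it with the saved contents."""
    # Round the length up to a whole number of pages.
    pages = max(1, -(-len(saved) // mmap.PAGESIZE))
    region = mmap.mmap(-1, pages * mmap.PAGESIZE)  # anonymous, read/write
    region[:len(saved)] = saved                    # populate from the image
    return region
```

The rest of the mapping stays zero-filled, which matches what a fresh anonymous mapping provides.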
 
 
OL: suggest that the process tree be created in userspace.
 
 
DLu: prefers to do everything, including process creation, in the
kernel; his experience shows that it isn't difficult.
 
 
* '''Error recovery'''
 
 
Should a checkpoint fail, the container should continue execution
without noticing it. If either checkpoint or restart fails, there
should be a way to inform the caller/user of the reason (something
more informative than -EBUSY).
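
A rough illustration of "more informative than -EBUSY": the error carries the offending object and a reason alongside the errno value. `CheckpointError`, `dump_task` and the dictionary describing a task are hypothetical names chosen for this sketch.

```python
import errno


class CheckpointError(Exception):
    """Hypothetical error carrying more context than a bare errno."""

    def __init__(self, code, obj, reason):
        self.code, self.obj, self.reason = code, obj, reason
        super().__init__(f"{errno.errorcode[code]}: {obj}: {reason}")


def dump_task(task):
    """Refuse to dump a task that uses a resource not yet supported."""
    for kind in task["resources"]:
        if kind not in ("registers", "memory"):
            raise CheckpointError(errno.EBUSY, f"pid {task['pid']}",
                                  f"unsupported resource '{kind}'")
    return b"image"
```

The caller still gets a conventional error code, but can also report which task and which resource blocked the checkpoint.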
 
 
=== Road plan ===
 
 
At this point we want to create a proof of concept and CR a simple
application. We will iteratively add more and more kernel resources.
 
 
The first items to address:
# Create a container object (the context on which CR operates)
# Extend the container freezer cgroup ?
# Interface via syscall or ioctl ?
 
 
First step - a simple application:
a single process, not using any files, no signal pending, no IPC etc.
Need to save state (registers, IDs), memory maps and contents (except
for read-only portions, e.g. text).
Assume that the file system state doesn't change between checkpoint
and restart.
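
To make "except for read-only portions" concrete, the sketch below picks the regions whose contents would need saving from text in the /proc/<pid>/maps format. The sample is made up for the example, and treating every non-writable region as skippable is a simplification assumed here, in line with the assumption above that the file system does not change.

```python
# Made-up sample in the /proc/<pid>/maps format:
# address perms offset dev inode pathname
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/app
0060a000-0060b000 r--p 0000a000 08:01 1234 /bin/app
0060b000-0060c000 rw-p 0000b000 08:01 1234 /bin/app
01a2c000-01a4d000 rw-p 00000000 00:00 0 [heap]
7ffd1000-7ffd3000 rw-p 00000000 00:00 0 [stack]
"""


def regions_to_save(maps_text):
    """Return (start, end, name) for writable regions; others are skipped."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        addr, perms = fields[0], fields[1]
        name = fields[5] if len(fields) > 5 else ""
        if "w" not in perms:
            continue  # e.g. text: can be re-mapped from the file at restart
        start, end = (int(x, 16) for x in addr.split("-"))
        regions.append((start, end, name))
    return regions
```

For the sample above this keeps the data segment, the heap and the stack, and leaves the text and read-only data to be mapped again from the binary.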
 
 
Next steps:
# process hierarchy and relationships (multiple tasks and zombies)
# multiple threads (and shared memory)
# open files: regular file, fifo, pipe, socket-pair
# signals, timers
# TBD
 
 
=== Documentation ===
 
 
DH: the proof of concept requires explicit documentation of what can
and what cannot be checkpointed, as well as which error will be
returned in response to a failure.
 
