Containers/Mini-summit 2008 notes

5,369 bytes added, 18:46, 23 July 2008

== Namespaces and containers ==

Why do various companies want containers?

If we can do sys_hijack cleanly, we can use it to solve the kthread problem.

== Checkpoint/Restart [CR] ==

=== Uses of CR ===

* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between endpoints

* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance

* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)

* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, the state of the file system must be captured as well)

* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.

* [EB] '''remote fork:''' e.g. in a cluster

(the last two scenarios are likely to require adjustments during,

or after, the restart to tolerate changes in the file system or

otherwise in the environment)

* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application across multiple nodes as a whole.

EB reminded everyone that at the last kernel summit nobody objected to the wish to add CR capabilities to the kernel. The issue was, and remains, one of technical choices.

=== General design ===

* '''Kernel-space vs user-space'''

OL: the issue of kernel space vs. user space is pivotal to the design. Kernel support is mandatory to provide completeness and transparency. Even the recent experience with "cryo" demonstrated that a user-space implementation requires the kernel to expose a very fine-grained API.

Everyone agreed to aim at a monolithic interface, such that nearly

all of the CR will be done in the kernel. The kernel will return

(checkpoint) or receive (restart) a blob with the image of the state

of the container.

* '''Kernel-module ?'''

OL: can we implement mostly in a kernel module and then move CR into

the kernel later ?

EB: better to add CR functionality gradually directly to the kernel.

* '''Compatibility between kernels'''

DLu: there is an issue with compatibility between kernels, even for the same kernel compiled with different options and/or a different compiler, and also if the kernel ABI changes.

OL: suggests using an intermediate representation for the checkpoint image to avoid the issue as much as possible; conversion, if needed, will be done by userland tools. There is no aim to bridge ABI changes in the case of migration: instead, the restart fails.

EB: format the blob so that userland tools can parse it and easily detect a version/configuration mismatch.
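The idea of a blob that userland tools can parse to detect a mismatch could look roughly like the sketch below. It is only an illustration: the header layout, the <code>CKPT</code> magic and the names <code>pack_header</code> and <code>check_header</code> are assumptions made up for this example, not a format agreed on at the mini-summit.

```python
import hashlib
import struct

# Hypothetical header layout (an assumption, not from the notes):
# magic, format version, kernel release string, digest of the kernel config.
HEADER = struct.Struct("<4sH64s16s")
MAGIC = b"CKPT"
FORMAT_VERSION = 1


def _digest(config):
    """Shorten the kernel configuration text to a fixed 16-byte fingerprint."""
    return hashlib.sha256(config).digest()[:16]


def pack_header(kernel_release, config):
    """Build the fixed-size header that precedes the checkpoint blob."""
    return HEADER.pack(MAGIC, FORMAT_VERSION, kernel_release.encode(), _digest(config))


def check_header(blob, kernel_release, config):
    """Return a list of mismatch descriptions; an empty list means compatible."""
    if len(blob) < HEADER.size:
        return ["image too short to contain a header"]
    magic, version, release, digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return ["bad magic: not a checkpoint image"]
    problems = []
    if version != FORMAT_VERSION:
        problems.append(f"format version {version}, expected {FORMAT_VERSION}")
    if release.rstrip(b"\0").decode() != kernel_release:
        problems.append("kernel release differs")
    if digest != _digest(config):
        problems.append("kernel configuration differs")
    return problems
```

A restart tool would call <code>check_header</code> first and refuse to continue if the list is non-empty, which matches the "fail the restart" policy for incompatible kernels.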

* '''Streaming checkpoint image ?'''

DLu: using sequential file (non seek-able) like a socket for the

checkpoint image is a challenge.

OL: with proper planning it is not complicated to achieve, and it has the advantage that the image can be passed through a filter, e.g. for compression, encryption, format conversion etc.
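To illustrate the streaming argument, here is a minimal userland sketch that writes an image in a single forward pass through a compression filter onto a non-seekable sink. The length-prefixed record format and the names <code>WriteOnlyStream</code>, <code>stream_checkpoint</code> and <code>read_checkpoint</code> are assumptions for the example; only <code>zlib</code> and <code>io</code> are real library modules.

```python
import io
import zlib


class WriteOnlyStream(io.RawIOBase):
    """Stands in for a socket or pipe: sequential writes only, no seeking."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def seekable(self):
        return False

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def stream_checkpoint(records, out):
    """Write length-prefixed records through a compression filter in one pass."""
    comp = zlib.compressobj()
    for rec in records:
        # The length prefix lets the reader find record boundaries without seeking.
        out.write(comp.compress(len(rec).to_bytes(4, "little") + rec))
    out.write(comp.flush())


def read_checkpoint(data):
    """Decode the compressed stream back into the original records."""
    raw = zlib.decompress(data)
    pos, records = 0, []
    while pos < len(raw):
        n = int.from_bytes(raw[pos:pos + 4], "little")
        records.append(raw[pos + 4:pos + 4 + n])
        pos += 4 + n
    return records
```

The writer never needs to go back and patch earlier bytes, which is the planning constraint a sequential image format imposes.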

* '''Checkpoint operation'''

The procedure will entail five steps:

# Pre-dump
# Freeze the container
# Dump
# Thaw/Kill the container
# Post-dump

"Pre-dump" runs before the container is frozen, e.g. the pre-copy phase of live migration, to minimize application downtime.

"Post-dump" runs after the container resumes execution, e.g. in the case of a checkpoint (not migration), writing the data back to secondary storage, again to minimize application downtime.

OL: we should be able to checkpoint from inside the container, keep

that in mind for later (also relates to the freezer).
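The five-step procedure can be sketched as a small driver in which only the dump itself runs while the container is frozen. This is a toy model: <code>Container</code>, <code>frozen</code> and <code>checkpoint</code> are hypothetical names, the "freeze" is just a flag, and the kill variant of step 4 (as used for migration) is left out.

```python
from contextlib import contextmanager


class Container:
    """Toy stand-in for a container; tracks only a frozen flag and a log."""

    def __init__(self):
        self.frozen = False
        self.log = []


@contextmanager
def frozen(container):
    """Freeze on entry and always thaw on exit, even if the dump fails."""
    container.frozen = True
    container.log.append("freeze")
    try:
        yield container
    finally:
        container.frozen = False
        container.log.append("thaw")


def checkpoint(container, dump, pre_dump=None, post_dump=None):
    """Run pre-dump, freeze, dump, thaw, post-dump in that order."""
    if pre_dump:
        pre_dump(container)          # step 1: container is still running
        container.log.append("pre-dump")
    with frozen(container):          # steps 2 and 4
        image = dump(container)      # step 3: the only work done while frozen
        container.log.append("dump")
    if post_dump:
        post_dump(container, image)  # step 5: container is running again
        container.log.append("post-dump")
    return image
```

Keeping pre-dump and post-dump outside the frozen section is what keeps the downtime limited to the dump step.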

* '''Restart operation'''

Restart is done by first creating a container, then creating the

process tree in it, and then each process restores its own state.

This allows existing kernel code to be reused (e.g., restoring a memory region is a simple matter of calling mmap() and populating it).

OL: suggests that the process tree be created in userspace.

DLu: prefers to do everything, including process creation, in the kernel; his experience shows that it isn't difficult.
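As a userland analogy for the restart flow, the sketch below shows two pieces: an ordering that creates every parent before its children, and restoring one memory region by mapping it and then populating it. The function names and the dictionary representation of the process tree are assumptions for illustration; the real work would happen per process inside the new container.

```python
import mmap


def restart_order(tree, root):
    """Yield task IDs so that every parent is created before its children."""
    pending = [root]
    while pending:
        task = pending.pop()
        yield task
        # Reverse so that children come out in their original order.
        pending.extend(reversed(tree.get(task, [])))


def restore_region(saved):
    """Recreate one saved memory region: map it, then populate it."""
    # Round the length up to whole pages; always map at least one page.
    pages = max(1, -(-len(saved) // mmap.PAGESIZE))
    region = mmap.mmap(-1, pages * mmap.PAGESIZE)  # anonymous read-write mapping
    region[:len(saved)] = saved                    # fill from the checkpoint image
    return region
```

For example, <code>list(restart_order({1: [2, 3], 2: [4]}, 1))</code> gives <code>[1, 2, 4, 3]</code>, so task 2 exists before task 4 is created.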

* '''Error recovery'''

Should checkpoint fail, the container should continue execution without noticing it. If either checkpoint or restart fails, there should be a way to inform the caller/user of the reason (something more informative than -EBUSY).
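One way to picture "more informative than -EBUSY" is an error report that keeps the errno but also names the object that blocked the operation and why. The class below is a hypothetical sketch, not a proposed kernel interface.

```python
import errno
import os


class CheckpointError(Exception):
    """Failure report that names the offending object, not just an errno."""

    def __init__(self, code, obj, reason):
        self.code, self.obj, self.reason = code, obj, reason
        super().__init__(
            f"{errno.errorcode[code]} ({os.strerror(code)}): {obj}: {reason}"
        )
```

A caller could then see, for instance, that EBUSY was caused by a specific file descriptor of an unsupported type rather than having to guess.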

=== Road plan ===

At this point we want to create a proof of concept and CR a simple application. We will iteratively add support for more and more kernel resources.

The first items to address:

# Create a container object (the context on which CR operates)
# Extend the container freezer cgroup ?
# Interface via syscall or ioctl ?

First step - a simple application:

a single process, not using any files, no signal pending, no IPC etc.

Need to save state (registers, IDs), memory maps and contents (except

for read-only portions, e.g. text).

Assume that the file system state doesn't change between checkpoint

and restart.
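For the memory part of this first step, a tool needs to decide which mappings must have their contents saved. The sketch below parses text in the <code>/proc/&lt;pid&gt;/maps</code> format and keeps only writable mappings. Treating "not writable" as "read-only and recoverable from the file system" is a simplifying assumption of this example, and <code>regions_to_save</code> and <code>SAMPLE</code> are made-up names and data.

```python
# Illustrative input in /proc/<pid>/maps format (addresses are made up).
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
0060a000-0060b000 rw-p 0000a000 08:01 1234 /bin/cat
7ffd1000-7ffd3000 rw-p 00000000 00:00 0 [stack]
"""


def regions_to_save(maps_text):
    """Return (start, end, perms) for mappings whose contents must be dumped."""
    selected = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        perms = fields[1]
        if "w" in perms:  # writable: contents may differ from the backing file
            selected.append((start, end, perms))
    return selected
```

With the sample above, the text mapping is skipped and the data and stack mappings are selected, which relies on the stated assumption that the file system is unchanged at restart.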

Next steps:

# process hierarchy and relationships (multiple tasks and zombies)
# multiple threads (and shared memory)
# open files: regular file, fifo, pipe, socket-pair
# signals, timers
# TBD

=== Documentation ===

DH: the proof of concept requires explicit documentation of what can and cannot be checkpointed, as well as which error is returned in response to a failure.

Orenl

OpenVZ Virtuozzo Containers Wiki β
