Containers/Zap Patch

2008-08-08T19:23:12Z

DaveHansen:

> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> + mm_segment_t oldfs;
> + int ret;
> +
> + oldfs = get_fs();
> + set_fs(KERNEL_DS);
> + ret = cr_uwrite(ctx, buf, count);
> + set_fs(oldfs);
> +
> + return ret;
> +}

get_fs()/set_fs() always feels a bit ouch, and this way you have
to use __force to avoid the warnings about __user pointer casts
in sparse.
I wonder if you can use splice_read/splice_write to get around
this problem.

Containers/Mini-summit 2008 notes

2008-07-27T19:33:48Z

DaveHansen:

[[Category: Containers]]

Intros (8:36am)

Dave Hansen
Eric Biederman
Jason Byron, Red Hat
Joe Ruscio, Evergrid
Joe McDonald
HP China
Sonny Rao
HP
HP
Matine Silberman HP
Sandy Harris
NEC Japan
John Schultz, AOL
Pavel Emelyanov, Parallels/OpenVZ
Denis Lunev, Parallels/OpenVZ
Andrey Mirkin, Parallels/OpenVZ
Constant Chan
Benjamin Thery, Bull
Daniel Lezcano, IBM
Serge Hallyn, IBM
Oren Laadan, Columbia University

On Phone:
Amy Griffis, HP
Dhaval Giani, IBM
Peter Zijlstra

(Later walk-ins):
Paul Menage, Google

== Namespaces and containers ==

Why do various companies want containers?
IBM, Google: workload management
EB: using containers as improved chroot
HP: wants similar to ibm, plus security
parallels: hosted providers

sysfs issues
EB gives status: should go into next merge window

mini-namespaces
NFS
clients should behave differently on diff. containers
currently uses single sunrpc transport for all containers
Dave: is there a list of all openvz mini-ns?
EB:
proposal:
create little filesystems
still store everything in nsproxy
currently:
some people want same process in different netns's
almost possible now, but can't open new sockets
namespace enter:
3 purposes
login
monitoring
configuring
may be worth prototyping the proposal
address mqns, or sunrpc, or fuse?
DH:
openvz addresses this using one big clone(), right?
(yes)

userid namespaces
EB summarizes his proposal
userid ns is unsharable without privilege
userids, capabilities, security labels become ns-local
hierarchical like pidns
openvz: just does chroot
DH:
observes that system vs. app containers have different requirements
EB:
so with userid namespaces, user has god-like powers over created namespaces
EB+SH will talk about hacking something this week during ols
Uses:
untrusted user mounts
build systems

device namespaces
tty namespaces rejected
should be solved with generic device namespaces
virtualize the major:minor->device mapping
reserved device numbers (unnamed)
created with /proc?
get_unnamed_device()
tty ideas:
use selinux ptys
use user namespaces
use legacy ptys
leverage ptyfs
Suka is not on, so he gets volunteered to do pure /dev/pts fs approach

per-container LSMs:
SH: thinks LSMs should handle it
EB:
original purpose of chroot
set up policies from inside container
creating smack container inside selinux would be ideal

entering a container
netns: identified using pid of a ns
SH: can we solve this using EB's namespace filesystems proposal?
(EB goes to the board to demonstrate his proposal)
PM: Can we use control groups?
PE: Can we re-use /proc/pid/ ?
EB: could have a ns with no processes in it
Example of command using this:
ip set eth0 netns <pid>
becomes
ip set eth0 netns /proc/<pid>/
DL:
a real netns problem is knowing when a childns has died
the netnsfs mount could solve that
PE: EB, can you send POC patches for the namespace?
EB and EM will both send their own POC.

DL: people have complained about needing CAP_SYS_ADMIN to unshare ns
EB: example, setuid root sysvipc-using program could be fooled

PE: Entering a container:
reasons:
monitoring
enter an administrative command
DH: how do you do it now?
PE: numerical ID for each VE, use it to enter
EB:
one need for entering: /sbin/hotplug
(someone): does hijack suffice?
EB: two cases:
partial entering
full entering
sys_hijack does not address partial entering
DH:
why need partial entering?
fs stuff can be done without entering
PM: privileged process
PE:
will look at hijack patches
someone will re-send hijack to containers@
EB:
if we can do sys_hijack cleanly,
we can use it to solve kthread problem

== Control Groups and Resource Management ==

== Checkpoint/Restart [CR] ==

=== Uses of CR ===

* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs, etc.; may or may not assume a shared file system between endpoints

* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance

* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)

* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention is required to capturing the state of the file system as well)

* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.

* [EB] '''remote fork:''' e.g. in a cluster

(the last two scenarios are likely to require adjustments during,
or after, the restart to tolerate changes in the file system or
otherwise in the environment)

* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application across multiple nodes as a whole.

EB reminded everyone that at the last kernel summit nobody complained
about the wish to add CR capabilities to the kernel. The issue was, and
remains, one of technical choices.

=== General design ===

* '''Kernel-space vs user-space'''

OL: the issue of kernel-space vs. user-space is pivotal to design.
kernel support is mandatory to provide completeness and transparency.
Even the recent experience with "cryo" demonstrated that user-space
requires the kernel to expose a very fine-grained API.

Everyone (except DaveHansen) agreed to aim at a monolithic interface,
such that nearly all of the CR will be done in the kernel. The kernel
will return (checkpoint) or receive (restart) a blob with the image
of the state of the container.

* '''Kernel-module ?'''

OL: can we implement mostly in a kernel module and then move CR into
the kernel later ?

EB: better to add CR functionality gradually directly to the kernel.

* '''Compatibility between kernels'''

DLu: there is an issue with compatibility between kernels - even same
kernel compiled with different options and/or compiler, and also if
the kernel ABI changes.

OL: suggests using an intermediate representation for the checkpoint
image to avoid the issue as much as possible; conversion, if needed,
will take place with userland tools. No aim to bridge ABI changes in
case of migration: instead, fail the restart.

EB: format the blob such that userland tools can parse it and easily
detect a version/configuration mismatch.
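
One way to picture EB's point is a small fixed header at the front of the blob that a userland tool can read before touching anything else. The sketch below is only an illustration: the struct layout, the magic value, the field names and the compatibility rule (same major version, image minor not newer than the host) are all assumptions made for this example, not a format agreed at the meeting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical image header; every field here is an assumption of this sketch. */
#define CR_MAGIC 0x43524942u            /* arbitrary tag chosen for the example */

struct cr_hdr {
    uint32_t magic;        /* marks the blob as a checkpoint image */
    uint16_t fmt_major;    /* bumped on incompatible format changes */
    uint16_t fmt_minor;    /* bumped on compatible additions */
    char     arch[16];     /* e.g. "x86_64", NUL-padded */
    uint32_t config_hash;  /* hypothetical digest of relevant build options */
};

enum cr_hdr_status {
    CR_HDR_OK = 0,
    CR_HDR_BAD_MAGIC,
    CR_HDR_BAD_VERSION,
    CR_HDR_BAD_ARCH,
    CR_HDR_BAD_CONFIG
};

/* Compare an image header against what the restarting host supports. */
static enum cr_hdr_status cr_hdr_check(const struct cr_hdr *img,
                                       const struct cr_hdr *host)
{
    if (img->magic != CR_MAGIC)
        return CR_HDR_BAD_MAGIC;
    if (img->fmt_major != host->fmt_major || img->fmt_minor > host->fmt_minor)
        return CR_HDR_BAD_VERSION;
    if (strncmp(img->arch, host->arch, sizeof(img->arch)) != 0)
        return CR_HDR_BAD_ARCH;
    if (img->config_hash != host->config_hash)
        return CR_HDR_BAD_CONFIG;
    return CR_HDR_OK;
}
```

A distinct status per mismatch lets the tool say why a restart is being refused, which fits OL's preference to fail the restart rather than try to bridge ABI changes.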

* '''Streaming checkpoint image ?'''

DLu: using a sequential (non-seekable) file, such as a socket, for the
checkpoint image is a challenge.

OL: with proper planning it is not complicated to achieve, and it has
the advantage that the image can be passed through a filter, e.g. for
compression, encryption, format conversion etc.
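
A userspace sketch of what "proper planning" could look like: if every piece of state is written as a self-describing record (type and length first, payload after), the reader never needs to seek, so the image can travel over a pipe or socket. The record layout and the `cr_put_record`/`cr_get_record` names are hypothetical, chosen for this example only; the header is written in host byte order for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record header: type tag plus payload length, host byte order. */
struct cr_rec_hdr {
    uint32_t type;
    uint32_t len;
};

/* write() may be short on pipes and sockets; loop until all bytes are out. */
static int write_full(int fd, const void *buf, size_t n)
{
    const char *p = buf;

    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Read exactly n bytes; end-of-stream before that counts as an error. */
static int read_full(int fd, void *buf, size_t n)
{
    char *p = buf;

    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Append one record to the stream: header first, then payload. */
static int cr_put_record(int fd, uint32_t type, const void *payload, uint32_t len)
{
    struct cr_rec_hdr h = { type, len };

    if (write_full(fd, &h, sizeof(h)) < 0)
        return -1;
    return write_full(fd, payload, len);
}

/* Consume the next record strictly in order; no lseek() is ever needed. */
static int cr_get_record(int fd, uint32_t *type, void *buf, uint32_t cap,
                         uint32_t *len)
{
    struct cr_rec_hdr h;

    if (read_full(fd, &h, sizeof(h)) < 0)
        return -1;
    if (h.len > cap)
        return -1;
    if (read_full(fd, buf, h.len) < 0)
        return -1;
    *type = h.type;
    *len = h.len;
    return 0;
}
```

Because both ends only ever move forward, the same descriptors could just as well be connected through a compression or encryption filter.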

* '''Checkpoint operation'''

The procedure will entail five steps:
# Pre-dump
# Freeze the container
# Dump
# Thaw/Kill the container
# Post-dump

"pre-dump" works before freezing the container, e.g. the pre-copy for
live migration, to minimize application downtime.

"post-dump" works after the container resumes execution, e.g. in the
case of a checkpoint (not migration), writing the data back to
secondary storage, again to minimize application downtime.
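
The five steps can be sketched as a small userspace driver that runs the stages in order. Everything below is hypothetical (the enum, the callback type and the tracing callback exist only for this example); the one design choice it illustrates is that a failure after the freeze still thaws the container, so the application keeps running.

```c
#include <assert.h>
#include <string.h>

/* The five stages from the notes, in order. Names are made up for this sketch. */
enum cr_stage {
    CR_PRE_DUMP,
    CR_FREEZE,
    CR_DUMP,
    CR_THAW,
    CR_POST_DUMP,
    CR_NSTAGES
};

/* One callback handles every stage; it returns 0 on success. */
typedef int (*cr_stage_fn)(enum cr_stage stage, void *ctx);

/*
 * Run the stages in order. If a stage fails while the container is frozen,
 * thaw it before returning so that it keeps executing.
 */
static int cr_checkpoint(cr_stage_fn fn, void *ctx, enum cr_stage *failed)
{
    int frozen = 0;

    for (int s = CR_PRE_DUMP; s < CR_NSTAGES; s++) {
        int ret;

        if (s == CR_THAW)
            frozen = 0;     /* thaw is being attempted; never retry it */
        ret = fn((enum cr_stage)s, ctx);
        if (ret == 0 && s == CR_FREEZE)
            frozen = 1;
        if (ret != 0) {
            if (failed)
                *failed = (enum cr_stage)s;
            if (frozen)
                fn(CR_THAW, ctx);
            return ret;
        }
    }
    return 0;
}

/* Demo callback: records each stage as a letter and can be told to fail. */
struct cr_trace {
    char log[16];
    int n;
    int fail_at;            /* stage number to fail at, or -1 for none */
};

static int cr_trace_stage(enum cr_stage stage, void *ctx)
{
    static const char tag[CR_NSTAGES] = { 'P', 'F', 'D', 'T', 'O' };
    struct cr_trace *t = ctx;

    if (t->n < (int)sizeof(t->log) - 1)
        t->log[t->n++] = tag[stage];
    return (int)stage == t->fail_at ? -1 : 0;
}
```

With `fail_at` set to `CR_DUMP`, the trace callback sees pre-dump, freeze, dump and then a thaw, and post-dump is skipped.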

OL: we should be able to checkpoint from inside the container, keep
that in mind for later (also relates to the freezer).

* '''Restart operation'''

Restart is done by first creating a container, then creating the
process tree in it, and then each process restores its own state.
This allows re-use of existing kernel code (e.g., restoring a memory
region is a simple matter of calling mmap() and populating it).
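
As a rough userspace illustration of the mmap() remark: the restarting process maps a fresh anonymous region, copies the saved bytes into it, and then applies the saved protection. The `cr_vma_image` struct and `cr_restore_region()` are hypothetical names for this sketch. A real restart would presumably also pin the region at its original address; here the kernel picks the address.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical description of one saved region; len must be non-zero. */
struct cr_vma_image {
    size_t      len;    /* size of the region in bytes */
    int         prot;   /* final protection, e.g. PROT_READ | PROT_WRITE */
    const void *data;   /* saved contents, len bytes */
};

/* Map a fresh anonymous region, populate it, then apply the saved protection. */
static void *cr_restore_region(const struct cr_vma_image *img)
{
    void *p = mmap(NULL, img->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, img->data, img->len);
    if (mprotect(p, img->len, img->prot) != 0) {
        munmap(p, img->len);
        return NULL;
    }
    return p;
}
```

The region is mapped writable first and tightened afterwards, so read-only regions can be populated with the same code path.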

OL: suggest that the process tree be created in userspace.

DLu: prefers to do everything, including process creation, in the
kernel; his experience shows that it isn't difficult.

* '''Error recovery'''

Should checkpoint fail, the container should continue execution
without noticing it. If either checkpoint or restart fails, there
should be a way to inform the caller/user of the reason (something
more informative than -EBUSY).
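
One possible shape for "more informative than -EBUSY" is to pair the errno-style code with a human-readable reason that is handed back to the caller. The `cr_error` struct and `cr_fail()` helper below are hypothetical and written as plain userspace C; how such a record would actually cross the kernel boundary was not discussed.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error record returned alongside the numeric result. */
struct cr_error {
    int  code;          /* positive errno value, e.g. EBUSY */
    char reason[128];   /* what exactly could not be checkpointed */
};

/* Fill in the record and return the usual negative errno for the caller. */
static int cr_fail(struct cr_error *err, int code, const char *fmt, ...)
{
    va_list ap;

    err->code = code;
    va_start(ap, fmt);
    vsnprintf(err->reason, sizeof(err->reason), fmt, ap);
    va_end(ap);
    return -code;
}
```

A checkpoint path would then end with something like `return cr_fail(err, EBUSY, "task %d holds an unsupported resource: %s", pid, what);` instead of a bare `return -EBUSY;`.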

=== Road plan ===

At this point we want to create a proof of concept and CR a simple
application. We will iteratively add more and more kernel resources.

The first items to address:
# Create a container object (the context on which CR operates)
# Extend the container freezer cgroup ?
# Interface via syscall or ioctl ?

First step - a simple application:
a single process, not using any files, no signal pending, no IPC etc.
Need to save state (registers, IDs), memory maps and contents (except
for read-only portions, e.g. text).
Assume that the file system state doesn't change between checkpoint
and restart.
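
To make the "except for read-only portions" rule concrete, here is a userspace sketch that parses one line in the /proc/<pid>/maps format and decides whether the region's contents need to go into the image. The helper names are hypothetical, and the rule used (save only regions that are currently writable) is a simplifying assumption for the single-process first step; it would miss a region that was modified and later made read-only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The part of a /proc/<pid>/maps line this sketch cares about. */
struct cr_map {
    unsigned long start;
    unsigned long end;
    char perms[5];      /* e.g. "r-xp" or "rw-p" */
};

/* Parse "start-end perms ..." from one maps line; returns 0 on success. */
static int cr_parse_maps_line(const char *line, struct cr_map *m)
{
    if (sscanf(line, "%lx-%lx %4s", &m->start, &m->end, m->perms) != 3)
        return -1;
    if (strlen(m->perms) != 4 || m->end <= m->start)
        return -1;
    return 0;
}

/*
 * Assumption of this sketch: only writable regions carry state that cannot
 * be re-read from the (unchanged) file system at restart.
 */
static int cr_map_needs_contents(const struct cr_map *m)
{
    return m->perms[1] == 'w';
}
```

Under this rule a text mapping such as `r-xp` is recorded by address range only, while the stack and heap (`rw-p`) are saved in full.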

Next steps:
# process hierarchy and relationships (multiple tasks and zombies)
# multiple threads (and shared memory)
# open files: regular file, fifo, pipe, socket-pair
# signals, timers
# TBD

=== Documentation ===

DH: proof of concept requires explicit documentation of what can be
checkpointed and what cannot be checkpointed, as well as what will
be the error returned in response to a failure.
