<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.openvz.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Paulmenage</id>
	<title>OpenVZ Virtuozzo Containers Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.openvz.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Paulmenage"/>
	<link rel="alternate" type="text/html" href="https://wiki.openvz.org/Special:Contributions/Paulmenage"/>
	<updated>2026-06-13T19:43:29Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6245</id>
		<title>Containers/Mini-summit 2008 notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6245"/>
		<updated>2008-07-23T19:32:47Z</updated>

		<summary type="html">&lt;p&gt;Paulmenage: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: Containers]]&lt;br /&gt;
&lt;br /&gt;
Intros (8:36am)&lt;br /&gt;
&lt;br /&gt;
        Dave Hansen&lt;br /&gt;
        Eric Biederman&lt;br /&gt;
        Jason Byron, Red Hat&lt;br /&gt;
        Joe Ruscio, Evergrid&lt;br /&gt;
        Joe McDonald&lt;br /&gt;
        HP China&lt;br /&gt;
        Sonny Rao&lt;br /&gt;
        HP&lt;br /&gt;
        HP&lt;br /&gt;
        Matine Silberman HP&lt;br /&gt;
        Sandy Harris&lt;br /&gt;
        NEC Japan&lt;br /&gt;
        John Schultz, AOL&lt;br /&gt;
        Pavel Emelyanov, Parallels/OpenVZ&lt;br /&gt;
        Denis Lunev, Parallels/OpenVZ&lt;br /&gt;
        Andrey Mirkin, Parallels/OpenVZ&lt;br /&gt;
        Constant Chan&lt;br /&gt;
        Benjamin Thery, Bull&lt;br /&gt;
        Daniel Lezcano, IBM&lt;br /&gt;
        Serge Hallyn, IBM&lt;br /&gt;
        Oren Laadan, Columbia University&lt;br /&gt;
&lt;br /&gt;
On Phone:&lt;br /&gt;
        Amy Griffis, HP&lt;br /&gt;
        Dhaval Giani, IBM&lt;br /&gt;
        Peter Zijlstra&lt;br /&gt;
&lt;br /&gt;
(Later walk-ins):&lt;br /&gt;
        Paul Menage, Google&lt;br /&gt;
&lt;br /&gt;
== Namespaces and containers ==&lt;br /&gt;
&lt;br /&gt;
Why do various companies want containers?&lt;br /&gt;
        IBM, Google: workload management&lt;br /&gt;
        EB: using containers as improved chroot&lt;br /&gt;
        HP: wants similar to ibm, plus security&lt;br /&gt;
        parallels: hosted providers&lt;br /&gt;
&lt;br /&gt;
sysfs issues&lt;br /&gt;
        EB gives status: should go into next merge window&lt;br /&gt;
&lt;br /&gt;
mini-namespaces&lt;br /&gt;
        NFS&lt;br /&gt;
                clients should behave differently on diff. containers&lt;br /&gt;
                currently uses single sunrpc transport for all containers&lt;br /&gt;
        Dave: is there a list of all openvz mini-ns?&lt;br /&gt;
        EB:&lt;br /&gt;
                proposal:&lt;br /&gt;
                        create little filesystems&lt;br /&gt;
                        still store everything in nsproxy&lt;br /&gt;
                currently:&lt;br /&gt;
                        some people want same process in different netns's&lt;br /&gt;
                        almost possible now, but can't open new sockets&lt;br /&gt;
                namespace enter:&lt;br /&gt;
                        3 purposes&lt;br /&gt;
                                login&lt;br /&gt;
                                monitoring&lt;br /&gt;
                                configuring&lt;br /&gt;
                may be worth prototyping the proposal&lt;br /&gt;
                        address mqns, or sunrpc, or fuse?&lt;br /&gt;
        DH:&lt;br /&gt;
                openvz addresses this using one big clone(), right?&lt;br /&gt;
                (yes)&lt;br /&gt;
&lt;br /&gt;
userid namespaces&lt;br /&gt;
        EB summarizes his proposal&lt;br /&gt;
                userid ns is unsharable without privilege&lt;br /&gt;
                userids, capabilities, security labels become ns-local&lt;br /&gt;
                hierarchical like pidns&lt;br /&gt;
        openvz: just does chroot&lt;br /&gt;
        DH:&lt;br /&gt;
                observes that system vs. app containers have different requirements&lt;br /&gt;
        EB:&lt;br /&gt;
                so with userid namespaces, user has god-like powers over created namespaces&lt;br /&gt;
        EB+SH will talk about hacking something this week during ols&lt;br /&gt;
        Uses:&lt;br /&gt;
                untrusted user mounts&lt;br /&gt;
                build systems&lt;br /&gt;
&lt;br /&gt;
device namespaces&lt;br /&gt;
        tty namespaces rejected&lt;br /&gt;
        should be solved with generic device namespaces&lt;br /&gt;
                virtualize the major:minor-&amp;gt;device mapping&lt;br /&gt;
        reserved device numbers (unnamed)&lt;br /&gt;
                created with /proc?&lt;br /&gt;
                get_unnamed_device()&lt;br /&gt;
        tty ideas:&lt;br /&gt;
                use selinux ptys&lt;br /&gt;
                use user namespaces&lt;br /&gt;
                use legacy ptys&lt;br /&gt;
                leverage ptyfs&lt;br /&gt;
        Suka is not on, so he gets volunteered to do pure /dev/pts fs approach&lt;br /&gt;
&lt;br /&gt;
per-container LSMs:&lt;br /&gt;
        SH: thinks LSMs should handle it&lt;br /&gt;
        EB:&lt;br /&gt;
                original purpose of chroot&lt;br /&gt;
                set up policies from inside container&lt;br /&gt;
                creating smack container inside selinux would be ideal&lt;br /&gt;
&lt;br /&gt;
entering a container&lt;br /&gt;
        netns: identified using pid of a ns&lt;br /&gt;
        SH: can we solve this using EB's namespace filesystems proposal?&lt;br /&gt;
        (EB goes to the board to demonstrate his proposal)&lt;br /&gt;
        PM: Can we use control groups?&lt;br /&gt;
        PE: Can we re-use /proc/pid/ ?&lt;br /&gt;
        EB: could have a ns with no processes in it&lt;br /&gt;
        Example of command using this:&lt;br /&gt;
                ip link set eth0 netns &amp;lt;pid&amp;gt;&lt;br /&gt;
                becomes&lt;br /&gt;
                ip link set eth0 netns /proc/&amp;lt;pid&amp;gt;/&lt;br /&gt;
        DL:&lt;br /&gt;
                a real netns problem is knowing when a childns has died&lt;br /&gt;
                the netnsfs mount could solve that&lt;br /&gt;
        PE: EB, can you send POC patches for the namespace?&lt;br /&gt;
                EB and EM will both send their own POC.&lt;br /&gt;
&lt;br /&gt;
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns&lt;br /&gt;
        EB: example, setuid root sysvipc-using program could be fooled&lt;br /&gt;
&lt;br /&gt;
PE: Entering a container:&lt;br /&gt;
        reasons:&lt;br /&gt;
                monitoring&lt;br /&gt;
                enter an administrative command&lt;br /&gt;
        DH: how do you do it now?&lt;br /&gt;
        PE: numerical ID for each VE, use it to enter&lt;br /&gt;
        EB:&lt;br /&gt;
                one need for entering: /sbin/hotplug&lt;br /&gt;
        (someone): does hijack suffice?&lt;br /&gt;
        EB: two cases:&lt;br /&gt;
                partial entering&lt;br /&gt;
                full entering&lt;br /&gt;
                sys_hijack does not address partial entering&lt;br /&gt;
        DH:&lt;br /&gt;
                why need partial entering?&lt;br /&gt;
                fs stuff can be done without entering&lt;br /&gt;
        PM: privileged process&lt;br /&gt;
        PE:&lt;br /&gt;
                will look at hijack patches&lt;br /&gt;
                someone will re-send hijack to containers@&lt;br /&gt;
                EB:&lt;br /&gt;
                        if we can do sys_hijack cleanly,&lt;br /&gt;
                        we can use it to solve kthread problem&lt;br /&gt;
&lt;br /&gt;
== Control Groups and Resource Management ==&lt;br /&gt;
&lt;br /&gt;
== Checkpoint/Restart [CR] ==&lt;br /&gt;
&lt;br /&gt;
=== Uses of CR ===&lt;br /&gt;
&lt;br /&gt;
* '''migration and live migration:'''  e.g. for load balancing, maintenance, clusters and SSIs, etc. may or may not assume a shared file system between endpoints&lt;br /&gt;
&lt;br /&gt;
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance&lt;br /&gt;
&lt;br /&gt;
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)&lt;br /&gt;
&lt;br /&gt;
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention must be paid to capturing the state of the file system)&lt;br /&gt;
&lt;br /&gt;
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.&lt;br /&gt;
&lt;br /&gt;
* [EB] '''remote fork:''' e.g. in a cluster&lt;br /&gt;
&lt;br /&gt;
(the last two scenarios are likely to require adjustments during,&lt;br /&gt;
or after, the restart to tolerate changes in the file system or&lt;br /&gt;
otherwise in the environment)&lt;br /&gt;
&lt;br /&gt;
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application&lt;br /&gt;
across multiple nodes as a whole.&lt;br /&gt;
&lt;br /&gt;
EB reminded everyone that at the last kernel summit nobody objected to&lt;br /&gt;
the wish to add CR capabilities to the kernel. The open issues were, and&lt;br /&gt;
remain, the technical choices.&lt;br /&gt;
&lt;br /&gt;
=== General design ===&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-space vs user-space'''&lt;br /&gt;
&lt;br /&gt;
OL: the issue of kernel-space vs. user-space is pivotal to the design.&lt;br /&gt;
Kernel support is mandatory to provide completeness and transparency.&lt;br /&gt;
Even the recent experience with &amp;quot;cryo&amp;quot; demonstrated that a user-space&lt;br /&gt;
implementation requires the kernel to expose a very fine-grained API.&lt;br /&gt;
&lt;br /&gt;
Everyone agreed to aim at a monolithic interface, such that nearly&lt;br /&gt;
all of the CR will be done in the kernel. The kernel will return&lt;br /&gt;
(checkpoint) or receive (restart) a blob with the image of the state&lt;br /&gt;
of the container.&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-module ?'''&lt;br /&gt;
&lt;br /&gt;
OL: can we implement mostly in a kernel module and then move CR into&lt;br /&gt;
the kernel later ?&lt;br /&gt;
&lt;br /&gt;
EB: better to add CR functionality gradually directly to the kernel.&lt;br /&gt;
&lt;br /&gt;
* '''Compatibility between kernels'''&lt;br /&gt;
&lt;br /&gt;
DLu: there is an issue with compatibility between kernels - even same&lt;br /&gt;
kernel compiled with different options and/or compiler, and also if&lt;br /&gt;
the kernel ABI changes.&lt;br /&gt;
&lt;br /&gt;
OL: suggests using an intermediate representation for the checkpoint&lt;br /&gt;
image to avoid the issue as much as possible; conversion, if needed,&lt;br /&gt;
will be done by userland tools. There is no aim to bridge ABI changes in&lt;br /&gt;
the case of migration: instead, the restart fails.&lt;br /&gt;
&lt;br /&gt;
EB: format the blob so that userland tools can parse it and easily&lt;br /&gt;
detect a version/configuration mismatch.&lt;br /&gt;
&lt;br /&gt;
* '''Streaming checkpoint image ?'''&lt;br /&gt;
&lt;br /&gt;
DLu: using a sequential (non-seekable) file, such as a socket, for the&lt;br /&gt;
checkpoint image is a challenge.&lt;br /&gt;
&lt;br /&gt;
OL: with proper planning it is not complicated to achieve, and it has the&lt;br /&gt;
advantage that the image can be passed through a filter, e.g. for compression,&lt;br /&gt;
encryption, format conversion, etc.&lt;br /&gt;
&lt;br /&gt;
* '''Checkpoint operation'''&lt;br /&gt;
&lt;br /&gt;
The procedure will entail five steps:&lt;br /&gt;
# Pre-dump&lt;br /&gt;
# Freeze the container&lt;br /&gt;
# Dump&lt;br /&gt;
# Thaw/Kill the container&lt;br /&gt;
# Post-dump&lt;br /&gt;
&lt;br /&gt;
&amp;quot;pre-dump&amp;quot; runs before the container is frozen, e.g. the pre-copy for&lt;br /&gt;
live migration, to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;post-dump&amp;quot; runs after the container resumes execution, e.g. in the&lt;br /&gt;
case of a checkpoint (not migration), writing the data back to secondary&lt;br /&gt;
storage, again to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
OL: we should be able to checkpoint from inside the container, keep &lt;br /&gt;
that in mind for later (also relates to the freezer).&lt;br /&gt;
&lt;br /&gt;
* '''Restart operation'''&lt;br /&gt;
&lt;br /&gt;
Restart is done by first creating a container, then creating the&lt;br /&gt;
process tree in it, and then having each process restore its own state.&lt;br /&gt;
This allows existing kernel code to be re-used (e.g., restoring a memory&lt;br /&gt;
region is a simple matter of calling mmap() and populating it).&lt;br /&gt;
&lt;br /&gt;
OL: suggests that the process tree be created in userspace.&lt;br /&gt;
&lt;br /&gt;
DLu: prefers to do everything, including process creation, in the&lt;br /&gt;
kernel; his experience shows that it isn't difficult.&lt;br /&gt;
&lt;br /&gt;
* '''Error recovery'''&lt;br /&gt;
&lt;br /&gt;
Should a checkpoint fail, the container should continue execution&lt;br /&gt;
without noticing it. If either checkpoint or restart fails, there&lt;br /&gt;
should be a way to inform the caller/user of the reason (something&lt;br /&gt;
more informative than -EBUSY).&lt;br /&gt;
&lt;br /&gt;
=== Road plan ===&lt;br /&gt;
&lt;br /&gt;
At this point we want to create a proof of concept and CR a simple&lt;br /&gt;
application. We will then iteratively add support for more kernel resources.&lt;br /&gt;
&lt;br /&gt;
The first items to address:&lt;br /&gt;
# Create a container object (the context on which CR operates)&lt;br /&gt;
# Extend the container freezer cgroup ?&lt;br /&gt;
# Interface via syscall or ioctl ?&lt;br /&gt;
&lt;br /&gt;
First step - a simple application:&lt;br /&gt;
a single process, not using any files, no signal pending, no IPC etc.&lt;br /&gt;
Need to save state (registers, IDs), memory maps and contents (except&lt;br /&gt;
for read-only portions, e.g. text).&lt;br /&gt;
Assume that the file system state doesn't change between checkpoint&lt;br /&gt;
and restart.&lt;br /&gt;
&lt;br /&gt;
Next steps:&lt;br /&gt;
# process hierarchy and relationships (multiple tasks and zombies)&lt;br /&gt;
# multiple threads (and shared memory)&lt;br /&gt;
# open files: regular file, fifo, pipe, socket-pair&lt;br /&gt;
# signals, timers&lt;br /&gt;
# TBD&lt;br /&gt;
&lt;br /&gt;
=== Documentation ===&lt;br /&gt;
&lt;br /&gt;
DH: proof of concept requires explicit documentation of what can be&lt;br /&gt;
checkpointed and what cannot be checkpointed, as well as what will&lt;br /&gt;
be the error returned in response to a failure.&lt;/div&gt;</summary>
		<author><name>Paulmenage</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6244</id>
		<title>Containers/Mini-summit 2008 notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6244"/>
		<updated>2008-07-23T19:31:38Z</updated>

		<summary type="html">&lt;p&gt;Paulmenage: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: Containers]]&lt;br /&gt;
&lt;br /&gt;
Intros (8:36am)&lt;br /&gt;
&lt;br /&gt;
        Dave Hansen&lt;br /&gt;
        Eric Biederman&lt;br /&gt;
        Jason Byron, Red Hat&lt;br /&gt;
        Joe Ruscio, Evergrid&lt;br /&gt;
        Joe McDonald&lt;br /&gt;
        HP China&lt;br /&gt;
        Sonny Rao&lt;br /&gt;
        HP&lt;br /&gt;
        HP&lt;br /&gt;
        Matine Silberman HP&lt;br /&gt;
        Sandy Harris&lt;br /&gt;
        NEC Japan&lt;br /&gt;
        John Schultz, AOL&lt;br /&gt;
        Pavel Emelyanov, Parallels/OpenVZ&lt;br /&gt;
        Denis Lunev, Parallels/OpenVZ&lt;br /&gt;
        Andrey Mirkin, Parallels/OpenVZ&lt;br /&gt;
        Constant Chan&lt;br /&gt;
        Benjamin Thery, Bull&lt;br /&gt;
        Daniel Lezcano, IBM&lt;br /&gt;
        Serge Hallyn, IBM&lt;br /&gt;
        Oren Laadan, Columbia University&lt;br /&gt;
&lt;br /&gt;
On Phone:&lt;br /&gt;
        Amy Griffis, HP&lt;br /&gt;
        Dhaval Giani, IBM&lt;br /&gt;
        Peter Zijlstra&lt;br /&gt;
&lt;br /&gt;
(Later walk-ins):&lt;br /&gt;
        Paul Menage, Google&lt;br /&gt;
&lt;br /&gt;
== Namespaces and containers ==&lt;br /&gt;
&lt;br /&gt;
Why do various companies want containers?&lt;br /&gt;
        ibm: workload management&lt;br /&gt;
        EB: using containers as improved chroot&lt;br /&gt;
        HP: wants similar to ibm, plus security&lt;br /&gt;
        parallels: hosted providers&lt;br /&gt;
&lt;br /&gt;
sysfs issues&lt;br /&gt;
        EB gives status: should go into next merge window&lt;br /&gt;
&lt;br /&gt;
mini-namespaces&lt;br /&gt;
        NFS&lt;br /&gt;
                clients should behave differently on diff. containers&lt;br /&gt;
                currently uses single sunrpc transport for all containers&lt;br /&gt;
        Dave: is there a list of all openvz mini-ns?&lt;br /&gt;
        EB:&lt;br /&gt;
                proposal:&lt;br /&gt;
                        create little filesystems&lt;br /&gt;
                        still store everything in nsproxy&lt;br /&gt;
                currently:&lt;br /&gt;
                        some people want same process in different netns's&lt;br /&gt;
                        almost possible now, but can't open new sockets&lt;br /&gt;
                namespace enter:&lt;br /&gt;
                        3 purposes&lt;br /&gt;
                                login&lt;br /&gt;
                                monitoring&lt;br /&gt;
                                configuring&lt;br /&gt;
                may be worth prototyping the proposal&lt;br /&gt;
                        address mqns, or sunrpc, or fuse?&lt;br /&gt;
        DH:&lt;br /&gt;
                openvz addresses this using one big clone(), right?&lt;br /&gt;
                (yes)&lt;br /&gt;
&lt;br /&gt;
userid namespaces&lt;br /&gt;
        EB summarizes his proposal&lt;br /&gt;
                userid ns is unsharable without privilege&lt;br /&gt;
                userids, capabilities, security labels become ns-local&lt;br /&gt;
                hierarchical like pidns&lt;br /&gt;
        openvz: just does chroot&lt;br /&gt;
        DH:&lt;br /&gt;
                observes that system vs. app containers have different requirements&lt;br /&gt;
        EB:&lt;br /&gt;
                so with userid namespaces, user has god-like powers over created namespaces&lt;br /&gt;
        EB+SH will talk about hacking something this week during ols&lt;br /&gt;
        Uses:&lt;br /&gt;
                untrusted user mounts&lt;br /&gt;
                build systems&lt;br /&gt;
&lt;br /&gt;
device namespaces&lt;br /&gt;
        tty namespaces rejected&lt;br /&gt;
        should be solved with generic device namespaces&lt;br /&gt;
                virtualize the major:minor-&amp;gt;device mapping&lt;br /&gt;
        reserved device numbers (unnamed)&lt;br /&gt;
                created with /proc?&lt;br /&gt;
                get_unnamed_device()&lt;br /&gt;
        tty ideas:&lt;br /&gt;
                use selinux ptys&lt;br /&gt;
                use user namespaces&lt;br /&gt;
                use legacy ptys&lt;br /&gt;
                leverage ptyfs&lt;br /&gt;
        Suka is not on, so he gets volunteered to do pure /dev/pts fs approach&lt;br /&gt;
&lt;br /&gt;
per-container LSMs:&lt;br /&gt;
        SH: thinks LSMs should handle it&lt;br /&gt;
        EB:&lt;br /&gt;
                original purpose of chroot&lt;br /&gt;
                set up policies from inside container&lt;br /&gt;
                creating smack container inside selinux would be ideal&lt;br /&gt;
&lt;br /&gt;
entering a container&lt;br /&gt;
        netns: identified using pid of a ns&lt;br /&gt;
        SH: can we solve this using EB's namespace filesystems proposal?&lt;br /&gt;
        (EB goes to the board to demonstrate his proposal)&lt;br /&gt;
        PM: Can we use control groups?&lt;br /&gt;
        PE: Can we re-use /proc/pid/ ?&lt;br /&gt;
        EB: could have a ns with no processes in it&lt;br /&gt;
        Example of command using this:&lt;br /&gt;
                ip link set eth0 netns &amp;lt;pid&amp;gt;&lt;br /&gt;
                becomes&lt;br /&gt;
                ip link set eth0 netns /proc/&amp;lt;pid&amp;gt;/&lt;br /&gt;
        DL:&lt;br /&gt;
                a real netns problem is knowing when a childns has died&lt;br /&gt;
                the netnsfs mount could solve that&lt;br /&gt;
        PE: EB, can you send POC patches for the namespace?&lt;br /&gt;
                EB and EM will both send their own POC.&lt;br /&gt;
&lt;br /&gt;
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns&lt;br /&gt;
        EB: example, setuid root sysvipc-using program could be fooled&lt;br /&gt;
&lt;br /&gt;
PE: Entering a container:&lt;br /&gt;
        reasons:&lt;br /&gt;
                monitoring&lt;br /&gt;
                enter an administrative command&lt;br /&gt;
        DH: how do you do it now?&lt;br /&gt;
        PE: numerical ID for each VE, use it to enter&lt;br /&gt;
        EB:&lt;br /&gt;
                one need for entering: /sbin/hotplug&lt;br /&gt;
        (someone): does hijack suffice?&lt;br /&gt;
        EB: two cases:&lt;br /&gt;
                partial entering&lt;br /&gt;
                full entering&lt;br /&gt;
                sys_hijack does not address partial entering&lt;br /&gt;
        DH:&lt;br /&gt;
                why need partial entering?&lt;br /&gt;
                fs stuff can be done without entering&lt;br /&gt;
        PM: privileged process&lt;br /&gt;
        PE:&lt;br /&gt;
                will look at hijack patches&lt;br /&gt;
                someone will re-send hijack to containers@&lt;br /&gt;
                EB:&lt;br /&gt;
                        if we can do sys_hijack cleanly,&lt;br /&gt;
                        we can use it to solve kthread problem&lt;br /&gt;
&lt;br /&gt;
== Checkpoint/Restart [CR] ==&lt;br /&gt;
&lt;br /&gt;
=== Uses of CR ===&lt;br /&gt;
&lt;br /&gt;
* '''migration and live migration:'''  e.g. for load balancing, maintenance, clusters and SSIs, etc. may or may not assume a shared file system between endpoints&lt;br /&gt;
&lt;br /&gt;
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance&lt;br /&gt;
&lt;br /&gt;
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)&lt;br /&gt;
&lt;br /&gt;
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention must be paid to capturing the state of the file system)&lt;br /&gt;
&lt;br /&gt;
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.&lt;br /&gt;
&lt;br /&gt;
* [EB] '''remote fork:''' e.g. in a cluster&lt;br /&gt;
&lt;br /&gt;
(the last two scenarios are likely to require adjustments during,&lt;br /&gt;
or after, the restart to tolerate changes in the file system or&lt;br /&gt;
otherwise in the environment)&lt;br /&gt;
&lt;br /&gt;
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application&lt;br /&gt;
across multiple nodes as a whole.&lt;br /&gt;
&lt;br /&gt;
EB reminded everyone that at the last kernel summit nobody objected to&lt;br /&gt;
the wish to add CR capabilities to the kernel. The open issues were, and&lt;br /&gt;
remain, the technical choices.&lt;br /&gt;
&lt;br /&gt;
=== General design ===&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-space vs user-space'''&lt;br /&gt;
&lt;br /&gt;
OL: the issue of kernel-space vs. user-space is pivotal to the design.&lt;br /&gt;
Kernel support is mandatory to provide completeness and transparency.&lt;br /&gt;
Even the recent experience with &amp;quot;cryo&amp;quot; demonstrated that a user-space&lt;br /&gt;
implementation requires the kernel to expose a very fine-grained API.&lt;br /&gt;
&lt;br /&gt;
Everyone agreed to aim at a monolithic interface, such that nearly&lt;br /&gt;
all of the CR will be done in the kernel. The kernel will return&lt;br /&gt;
(checkpoint) or receive (restart) a blob with the image of the state&lt;br /&gt;
of the container.&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-module ?'''&lt;br /&gt;
&lt;br /&gt;
OL: can we implement mostly in a kernel module and then move CR into&lt;br /&gt;
the kernel later ?&lt;br /&gt;
&lt;br /&gt;
EB: better to add CR functionality gradually directly to the kernel.&lt;br /&gt;
&lt;br /&gt;
* '''Compatibility between kernels'''&lt;br /&gt;
&lt;br /&gt;
DLu: there is an issue with compatibility between kernels - even same&lt;br /&gt;
kernel compiled with different options and/or compiler, and also if&lt;br /&gt;
the kernel ABI changes.&lt;br /&gt;
&lt;br /&gt;
OL: suggests using an intermediate representation for the checkpoint&lt;br /&gt;
image to avoid the issue as much as possible; conversion, if needed,&lt;br /&gt;
will be done by userland tools. There is no aim to bridge ABI changes in&lt;br /&gt;
the case of migration: instead, the restart fails.&lt;br /&gt;
&lt;br /&gt;
EB: format the blob so that userland tools can parse it and easily&lt;br /&gt;
detect a version/configuration mismatch.&lt;br /&gt;
&lt;br /&gt;
* '''Streaming checkpoint image ?'''&lt;br /&gt;
&lt;br /&gt;
DLu: using a sequential (non-seekable) file, such as a socket, for the&lt;br /&gt;
checkpoint image is a challenge.&lt;br /&gt;
&lt;br /&gt;
OL: with proper planning it is not complicated to achieve, and it has the&lt;br /&gt;
advantage that the image can be passed through a filter, e.g. for compression,&lt;br /&gt;
encryption, format conversion, etc.&lt;br /&gt;
&lt;br /&gt;
* '''Checkpoint operation'''&lt;br /&gt;
&lt;br /&gt;
The procedure will entail five steps:&lt;br /&gt;
# Pre-dump&lt;br /&gt;
# Freeze the container&lt;br /&gt;
# Dump&lt;br /&gt;
# Thaw/Kill the container&lt;br /&gt;
# Post-dump&lt;br /&gt;
&lt;br /&gt;
&amp;quot;pre-dump&amp;quot; runs before the container is frozen, e.g. the pre-copy for&lt;br /&gt;
live migration, to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;post-dump&amp;quot; runs after the container resumes execution, e.g. in the&lt;br /&gt;
case of a checkpoint (not migration), writing the data back to secondary&lt;br /&gt;
storage, again to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
OL: we should be able to checkpoint from inside the container, keep &lt;br /&gt;
that in mind for later (also relates to the freezer).&lt;br /&gt;
&lt;br /&gt;
* '''Restart operation'''&lt;br /&gt;
&lt;br /&gt;
Restart is done by first creating a container, then creating the&lt;br /&gt;
process tree in it, and then having each process restore its own state.&lt;br /&gt;
This allows existing kernel code to be re-used (e.g., restoring a memory&lt;br /&gt;
region is a simple matter of calling mmap() and populating it).&lt;br /&gt;
&lt;br /&gt;
OL: suggests that the process tree be created in userspace.&lt;br /&gt;
&lt;br /&gt;
DLu: prefers to do everything, including process creation, in the&lt;br /&gt;
kernel; his experience shows that it isn't difficult.&lt;br /&gt;
&lt;br /&gt;
* '''Error recovery'''&lt;br /&gt;
&lt;br /&gt;
Should a checkpoint fail, the container should continue execution&lt;br /&gt;
without noticing it. If either checkpoint or restart fails, there&lt;br /&gt;
should be a way to inform the caller/user of the reason (something&lt;br /&gt;
more informative than -EBUSY).&lt;br /&gt;
&lt;br /&gt;
=== Road plan ===&lt;br /&gt;
&lt;br /&gt;
At this point we want to create a proof of concept and CR a simple&lt;br /&gt;
application. We will then iteratively add support for more kernel resources.&lt;br /&gt;
&lt;br /&gt;
The first items to address:&lt;br /&gt;
# Create a container object (the context on which CR operates)&lt;br /&gt;
# Extend the container freezer cgroup ?&lt;br /&gt;
# Interface via syscall or ioctl ?&lt;br /&gt;
&lt;br /&gt;
First step - a simple application:&lt;br /&gt;
a single process, not using any files, no signal pending, no IPC etc.&lt;br /&gt;
Need to save state (registers, IDs), memory maps and contents (except&lt;br /&gt;
for read-only portions, e.g. text).&lt;br /&gt;
Assume that the file system state doesn't change between checkpoint&lt;br /&gt;
and restart.&lt;br /&gt;
&lt;br /&gt;
Next steps:&lt;br /&gt;
# process hierarchy and relationships (multiple tasks and zombies)&lt;br /&gt;
# multiple threads (and shared memory)&lt;br /&gt;
# open files: regular file, fifo, pipe, socket-pair&lt;br /&gt;
# signals, timers&lt;br /&gt;
# TBD&lt;br /&gt;
&lt;br /&gt;
=== Documentation ===&lt;br /&gt;
&lt;br /&gt;
DH: proof of concept requires explicit documentation of what can be&lt;br /&gt;
checkpointed and what cannot be checkpointed, as well as what will&lt;br /&gt;
be the error returned in response to a failure.&lt;/div&gt;</summary>
		<author><name>Paulmenage</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6243</id>
		<title>Containers/Mini-summit 2008 notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6243"/>
		<updated>2008-07-23T19:04:54Z</updated>

		<summary type="html">&lt;p&gt;Paulmenage: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: Containers]]&lt;br /&gt;
&lt;br /&gt;
Intros (8:36am)&lt;br /&gt;
&lt;br /&gt;
        Dave Hansen&lt;br /&gt;
        Eric Biederman&lt;br /&gt;
        Jason Byron, Red Hat&lt;br /&gt;
        Joe Ruscio, Evergrid&lt;br /&gt;
        Joe McDonald&lt;br /&gt;
        HP China&lt;br /&gt;
        Sonny Rao&lt;br /&gt;
        HP&lt;br /&gt;
        HP&lt;br /&gt;
        Matine Silberman HP&lt;br /&gt;
        Sandy Harris&lt;br /&gt;
        NEC Japan&lt;br /&gt;
        John Schultz, AOL&lt;br /&gt;
        Pavel Emelyanov, Parallels/OpenVZ&lt;br /&gt;
        Denis Lunev, Parallels/OpenVZ&lt;br /&gt;
        Andrey Mirkin, Parallels/OpenVZ&lt;br /&gt;
        Constant Chan&lt;br /&gt;
        Benjamin Thery, Bull&lt;br /&gt;
        Daniel Lezcano, IBM&lt;br /&gt;
        Serge Hallyn, IBM&lt;br /&gt;
        Oren Laadan, Columbia University&lt;br /&gt;
&lt;br /&gt;
On Phone:&lt;br /&gt;
        Amy Griffis, HP&lt;br /&gt;
        Dhaval Giani, IBM&lt;br /&gt;
&lt;br /&gt;
(Later walk-ins):&lt;br /&gt;
        Paul Menage, Google&lt;br /&gt;
&lt;br /&gt;
== Namespaces and containers ==&lt;br /&gt;
&lt;br /&gt;
Why do various companies want containers?&lt;br /&gt;
        ibm: workload management&lt;br /&gt;
        EB: using containers as improved chroot&lt;br /&gt;
        HP: wants similar to ibm, plus security&lt;br /&gt;
        parallels: hosted providers&lt;br /&gt;
&lt;br /&gt;
sysfs issues&lt;br /&gt;
        EB gives status: should go into next merge window&lt;br /&gt;
&lt;br /&gt;
mini-namespaces&lt;br /&gt;
        NFS&lt;br /&gt;
                clients should behave differently on diff. containers&lt;br /&gt;
                currently uses single sunrpc transport for all containers&lt;br /&gt;
        Dave: is there a list of all openvz mini-ns?&lt;br /&gt;
        EB:&lt;br /&gt;
                proposal:&lt;br /&gt;
                        create little filesystems&lt;br /&gt;
                        still store everything in nsproxy&lt;br /&gt;
                currently:&lt;br /&gt;
                        some people want same process in different netns's&lt;br /&gt;
                        almost possible now, but can't open new sockets&lt;br /&gt;
                namespace enter:&lt;br /&gt;
                        3 purposes&lt;br /&gt;
                                login&lt;br /&gt;
                                monitoring&lt;br /&gt;
                                configuring&lt;br /&gt;
                may be worth prototyping the proposal&lt;br /&gt;
                        address mqns, or sunrpc, or fuse?&lt;br /&gt;
        DH:&lt;br /&gt;
                openvz addresses this using one big clone(), right?&lt;br /&gt;
                (yes)&lt;br /&gt;
&lt;br /&gt;
userid namespaces&lt;br /&gt;
        EB summarizes his proposal&lt;br /&gt;
                userid ns is unsharable without privilege&lt;br /&gt;
                userids, capabilities, security labels become ns-local&lt;br /&gt;
                hierarchical like pidns&lt;br /&gt;
        openvz: just does chroot&lt;br /&gt;
        DH:&lt;br /&gt;
                observes that system vs. app containers have different requirements&lt;br /&gt;
        EB:&lt;br /&gt;
                so with userid namespaces, user has god-like powers over created namespaces&lt;br /&gt;
        EB+SH will talk about hacking something this week during ols&lt;br /&gt;
        Uses:&lt;br /&gt;
                user untrusted mounts&lt;br /&gt;
                build systems&lt;br /&gt;
&lt;br /&gt;
device namespaces&lt;br /&gt;
        tty namespaces rejected&lt;br /&gt;
        should be solved with generic device namespaces&lt;br /&gt;
                virtualize the major:minor-&amp;gt;device mapping&lt;br /&gt;
        reserved device numbers (unnamed)&lt;br /&gt;
                created with /proc?&lt;br /&gt;
                get_unnamed_device()&lt;br /&gt;
        tty ideas:&lt;br /&gt;
                use selinux ptys&lt;br /&gt;
                use user namespaces&lt;br /&gt;
                use legacy ptys&lt;br /&gt;
                leverage ptyfs&lt;br /&gt;
        Suka is not on, so he gets volunteered to do pure /dev/pts fs approach&lt;br /&gt;
&lt;br /&gt;
per-container LSMs:&lt;br /&gt;
        SH: thinks LSMs should handle it&lt;br /&gt;
        EB:&lt;br /&gt;
                original purpose of chroot&lt;br /&gt;
                set up policies from inside container&lt;br /&gt;
                creating smack container inside selinux would be ideal&lt;br /&gt;
&lt;br /&gt;
entering a container&lt;br /&gt;
        netns: identified using pid of a ns&lt;br /&gt;
        sh: can we solve this using EB's namespace filesystems proposal?&lt;br /&gt;
        (EB goes to the board to demonstrate his proposal)&lt;br /&gt;
        PM: Can we use control groups?&lt;br /&gt;
        PE: Can we re-use /proc/pid/ ?&lt;br /&gt;
        EB: could have a ns with no processes in it&lt;br /&gt;
        Example of command using this:&lt;br /&gt;
                ip set eth0 netns &amp;lt;pid&amp;gt;&lt;br /&gt;
                becomes&lt;br /&gt;
                ip set eth0 netns /proc/&amp;lt;pid&amp;gt;/&lt;br /&gt;
        DL:&lt;br /&gt;
                a real netns problem is knowing when a childns has died&lt;br /&gt;
                the netnsfs mount could solve that&lt;br /&gt;
        PE: EB, can you send POC patches for the namespace?&lt;br /&gt;
                EB and EM will both send their own POC.&lt;br /&gt;
&lt;br /&gt;
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns&lt;br /&gt;
        EB: example, setuid root sysvipc-using program could be fooled&lt;br /&gt;
&lt;br /&gt;
PE: Entering a container:&lt;br /&gt;
        reasons:&lt;br /&gt;
                monitoring&lt;br /&gt;
                enter an administrative command&lt;br /&gt;
        DH: how do you do it now?&lt;br /&gt;
        PE: numerical ID for each VE, use it to enter&lt;br /&gt;
        EB:&lt;br /&gt;
                one need for entering: /sbin/hotplug&lt;br /&gt;
        (someone): does hijack suffice?&lt;br /&gt;
        EB: two cases:&lt;br /&gt;
                partial entering&lt;br /&gt;
                full entering&lt;br /&gt;
                sys_hijack does not address partial entering&lt;br /&gt;
        DH:&lt;br /&gt;
                why need partial entering?&lt;br /&gt;
                fs stuff can be done without entering&lt;br /&gt;
        PM: privileged process&lt;br /&gt;
        PE:&lt;br /&gt;
                will look at hijack patches&lt;br /&gt;
                someone will re-send hijack to containers@&lt;br /&gt;
                EB:&lt;br /&gt;
                        if we can do sys_hijack cleanly,&lt;br /&gt;
                        we can use it to solve kthread problem&lt;br /&gt;
&lt;br /&gt;
== Checkpoint/Restart [CR] ==&lt;br /&gt;
&lt;br /&gt;
=== Uses of CR ===&lt;br /&gt;
&lt;br /&gt;
* '''migration and live migration:'''  e.g. for load balancing, maintenance, clusters and SSIs, etc. may or may not assume a shared file system between endpoints&lt;br /&gt;
&lt;br /&gt;
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance&lt;br /&gt;
&lt;br /&gt;
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from most recent (unlike the previous scenarios, here the applications continue to execute after the checkpoint, perhaps modify the file system)&lt;br /&gt;
&lt;br /&gt;
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, attention is required to capturing the state of the file system as well)&lt;br /&gt;
&lt;br /&gt;
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.&lt;br /&gt;
&lt;br /&gt;
* [EB] '''remote fork:''' e.g. in a cluster&lt;br /&gt;
&lt;br /&gt;
(the last two scenarios are likely to require adjustments during,&lt;br /&gt;
or after, the restart to tolerate changes in the file system or&lt;br /&gt;
otherwise in the environment)&lt;br /&gt;
&lt;br /&gt;
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application&lt;br /&gt;
across multiple nodes as a whole.&lt;br /&gt;
&lt;br /&gt;
EB reminded the group that at the last kernel summit nobody complained about the&lt;br /&gt;
wish to add CR capabilities to the kernel. The issue was and remains&lt;br /&gt;
related to technical choices. &lt;br /&gt;
&lt;br /&gt;
=== General design ===&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-space vs user-space'''&lt;br /&gt;
&lt;br /&gt;
OL: the issue of kernel-space vs. user-space is pivotal to design.&lt;br /&gt;
kernel support is mandatory to provide completeness and transparency.&lt;br /&gt;
Even the recent experience with &amp;quot;cryo&amp;quot; demonstrated that user-space&lt;br /&gt;
requires the kernel to expose a very fine-grained API.&lt;br /&gt;
&lt;br /&gt;
Everyone agreed to aim at a monolithic interface, such that nearly&lt;br /&gt;
all of the CR will be done in the kernel. The kernel will return&lt;br /&gt;
(checkpoint) or receive (restart) a blob with the image of the state&lt;br /&gt;
of the container.&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-module ?'''&lt;br /&gt;
&lt;br /&gt;
OL: can we implement mostly in a kernel module and then move CR into&lt;br /&gt;
the kernel later ?&lt;br /&gt;
&lt;br /&gt;
EB: better to add CR functionality gradually directly to the kernel.&lt;br /&gt;
&lt;br /&gt;
* '''Compatibility between kernels'''&lt;br /&gt;
&lt;br /&gt;
DLu: there is an issue with compatibility between kernels - even same&lt;br /&gt;
kernel compiled with different options and/or compiler, and also if&lt;br /&gt;
the kernel ABI changes.&lt;br /&gt;
&lt;br /&gt;
OL: suggests using an intermediate representation for the checkpoint&lt;br /&gt;
image to avoid the issue as much as possible; conversion, if needed,&lt;br /&gt;
will take place with userland tools. No aim to bridge ABI changes in&lt;br /&gt;
case of migration: instead, fail the restart. &lt;br /&gt;
&lt;br /&gt;
EB: format the blob so that userland tools can parse it and&lt;br /&gt;
easily detect a version/configuration mismatch.&lt;br /&gt;
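&lt;br /&gt;
A minimal sketch of how a userland tool might detect such a mismatch, assuming a&lt;br /&gt;
hypothetical fixed-size header (magic, format version, CRC32 of the kernel&lt;br /&gt;
configuration text). The layout, the magic value and the function names are&lt;br /&gt;
illustrative assumptions, not something agreed at the meeting.&lt;br /&gt;
&lt;br /&gt;
```python
import struct
import zlib

# Hypothetical layout: magic, format version, CRC32 of the kernel config text.
# "!" selects network byte order, so the header reads the same on any host.
HEADER = struct.Struct("!4sHI")
MAGIC = b"CKPT"          # assumed magic value, not from the notes
FORMAT_VERSION = 1       # assumed version number


def config_id(config_text):
    """Reduce a kernel configuration string to a 32-bit fingerprint."""
    return zlib.crc32(config_text.encode("utf-8"))


def write_header(config_text):
    """Build the fixed-size header that precedes the checkpoint payload."""
    return HEADER.pack(MAGIC, FORMAT_VERSION, config_id(config_text))


def check_header(blob, config_text):
    """Return None if the image matches this kernel, else a reason string."""
    magic, version, cfg = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return "not a checkpoint image"
    if version != FORMAT_VERSION:
        return "format version mismatch"
    if cfg != config_id(config_text):
        return "kernel configuration mismatch"
    return None
```
&lt;br /&gt;
Returning a reason string rather than a bare error code lets the restart fail&lt;br /&gt;
early with a message the user can act on.&lt;br /&gt;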
&lt;br /&gt;
* '''Streaming checkpoint image ?'''&lt;br /&gt;
&lt;br /&gt;
DLu: using a sequential (non-seekable) file such as a socket for the&lt;br /&gt;
checkpoint image is a challenge.&lt;br /&gt;
&lt;br /&gt;
OL: with proper planning it is not complicated to achieve, and it has&lt;br /&gt;
the advantage that the image can be passed through a filter, e.g. for&lt;br /&gt;
compression, encryption, format conversion etc.&lt;br /&gt;
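&lt;br /&gt;
The filter idea can be sketched in userland as follows, assuming the image is&lt;br /&gt;
produced as a sequence of byte chunks. The function names are hypothetical and&lt;br /&gt;
zlib stands in for any filter; the point is that only sequential write() and&lt;br /&gt;
read() calls are needed, never a seek.&lt;br /&gt;
&lt;br /&gt;
```python
import io
import zlib


def stream_through_compressor(chunks, sink):
    """Write checkpoint chunks to a write-only sink, compressing on the fly.

    Only sink.write() is used, never seek() or tell(), so the sink may be
    a socket or a pipe.
    """
    compressor = zlib.compressobj()
    for chunk in chunks:
        sink.write(compressor.compress(chunk))
    sink.write(compressor.flush())


def read_through_decompressor(source, chunk_size=4096):
    """Yield decompressed data from a read-only, non-seekable source."""
    decompressor = zlib.decompressobj()
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        yield decompressor.decompress(data)
    yield decompressor.flush()


# Usage: io.BytesIO stands in for a socket here.
sink = io.BytesIO()
stream_through_compressor([b"regs", b"memory" * 100], sink)
restored = b"".join(read_through_decompressor(io.BytesIO(sink.getvalue()), 16))
```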
&lt;br /&gt;
* '''Checkpoint operation'''&lt;br /&gt;
&lt;br /&gt;
The procedure will entail five steps:&lt;br /&gt;
# Pre-dump&lt;br /&gt;
# Freeze the container&lt;br /&gt;
# Dump&lt;br /&gt;
# Thaw/Kill the container&lt;br /&gt;
# Post-dump&lt;br /&gt;
&lt;br /&gt;
&amp;quot;pre-dump&amp;quot; runs before the container is frozen, e.g. the pre-copy for&lt;br /&gt;
live migration, to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;post-dump&amp;quot; runs after the container resumes execution, e.g. in the&lt;br /&gt;
case of a checkpoint (not migration), writing the data back to secondary&lt;br /&gt;
storage, again to minimize application downtime.&lt;br /&gt;
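&lt;br /&gt;
The ordering of the five steps can be sketched as below. The container object&lt;br /&gt;
and its method names are hypothetical stand-ins, and the choice to kill rather&lt;br /&gt;
than thaw for migration is an assumption; the sketch only shows the sequence&lt;br /&gt;
and that a failed dump still thaws the container.&lt;br /&gt;
&lt;br /&gt;
```python
class RecordingContainer:
    """Stand-in container that only records which methods were called."""

    def __init__(self, fail_dump=False):
        self.calls = []
        self.fail_dump = fail_dump

    def pre_dump(self):
        self.calls.append("pre_dump")

    def freeze(self):
        self.calls.append("freeze")

    def dump(self):
        self.calls.append("dump")
        if self.fail_dump:
            raise RuntimeError("dump failed")

    def thaw(self):
        self.calls.append("thaw")

    def kill(self):
        self.calls.append("kill")

    def post_dump(self):
        self.calls.append("post_dump")


def checkpoint(container, thaw=True):
    """Run the five steps in order and return the names of the steps run."""
    done = []
    container.pre_dump()          # e.g. pre-copy memory while still running
    done.append("pre-dump")
    container.freeze()
    done.append("freeze")
    try:
        container.dump()
        done.append("dump")
    except Exception:
        container.thaw()          # a failed checkpoint must not stop the container
        raise
    if thaw:
        container.thaw()
        done.append("thaw")
    else:
        container.kill()          # assumed migration case: source copy discarded
        done.append("kill")
    container.post_dump()         # e.g. write the image back to storage
    done.append("post-dump")
    return done
```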
&lt;br /&gt;
OL: we should be able to checkpoint from inside the container, keep &lt;br /&gt;
that in mind for later (also relates to the freezer).&lt;br /&gt;
&lt;br /&gt;
* '''Restart operation'''&lt;br /&gt;
&lt;br /&gt;
Restart is done by first creating a container, then creating the&lt;br /&gt;
process tree in it, and then each process restores its own state. &lt;br /&gt;
This allows re-use of existing kernel code (e.g., restoring a memory&lt;br /&gt;
region is a simple matter of calling mmap() and populating it). &lt;br /&gt;
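&lt;br /&gt;
As a userland illustration only (the real restore would run in the restarting&lt;br /&gt;
process, possibly in the kernel), mapping a region and populating it from saved&lt;br /&gt;
contents might look like this. The function name and the page rounding are&lt;br /&gt;
assumptions made for the sketch.&lt;br /&gt;
&lt;br /&gt;
```python
import mmap


def restore_region(saved):
    """Map an anonymous region and fill it with the saved contents.

    The region is rounded up to whole pages, as a real mapping would be.
    """
    pages = -(-len(saved) // mmap.PAGESIZE)      # ceiling division
    region = mmap.mmap(-1, max(pages, 1) * mmap.PAGESIZE)
    region.write(saved)                           # populate from the image
    region.seek(0)
    return region
```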
&lt;br /&gt;
OL: suggest that the process tree be created in userspace. &lt;br /&gt;
&lt;br /&gt;
DLu: prefers to do everything, including process creation, in the&lt;br /&gt;
kernel; his experience shows that it isn't difficult.&lt;br /&gt;
&lt;br /&gt;
* '''Error recovery'''&lt;br /&gt;
&lt;br /&gt;
Should checkpoint fail, the container should continue execution&lt;br /&gt;
without noticing it. If either checkpoint or restart fails, there&lt;br /&gt;
should be a way to inform the caller/user of the reason (something&lt;br /&gt;
more informative than -EBUSY). &lt;br /&gt;
&lt;br /&gt;
=== Road plan ===&lt;br /&gt;
&lt;br /&gt;
At this point we want to create a proof of concept and CR a simple&lt;br /&gt;
application. We will iteratively add more and more kernel resources.&lt;br /&gt;
&lt;br /&gt;
The first items to address:&lt;br /&gt;
# Create a container object (the context on which CR operates)&lt;br /&gt;
# Extend the container freezer (cgroup ?)&lt;br /&gt;
# Interface via syscall or ioctl ?&lt;br /&gt;
&lt;br /&gt;
First step - a simple application:&lt;br /&gt;
a single process, not using any files, no signal pending, no IPC etc.&lt;br /&gt;
Need to save state (registers, IDs), memory maps and contents (except&lt;br /&gt;
for read-only portions, e.g. text).&lt;br /&gt;
Assume that the file system state doesn't change between checkpoint&lt;br /&gt;
and restart.&lt;br /&gt;
&lt;br /&gt;
Next steps:&lt;br /&gt;
# process hierarchy and relationships (multiple tasks and zombies)&lt;br /&gt;
# multiple threads (and shared memory)&lt;br /&gt;
# open files: regular file, fifo, pipe, socket-pair&lt;br /&gt;
# signals, timers&lt;br /&gt;
# TBD&lt;br /&gt;
&lt;br /&gt;
=== Documentation ===&lt;br /&gt;
&lt;br /&gt;
DH: proof of concept requires explicit documentation of what can be&lt;br /&gt;
checkpointed and what cannot be checkpointed, as well as what will&lt;br /&gt;
be the error returned in response to a failure.&lt;/div&gt;</summary>
		<author><name>Paulmenage</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008&amp;diff=6206</id>
		<title>Containers/Mini-summit 2008</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008&amp;diff=6206"/>
		<updated>2008-07-18T18:02:25Z</updated>

		<summary type="html">&lt;p&gt;Paulmenage: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There will be a containers mini-summit at the [http://www.linuxsymposium.org/2008/ OLS'08]. This page is for organizing this mini-summit. Feel free to edit.&lt;br /&gt;
&lt;br /&gt;
'''When''': 22nd of July 2008, 8:30-16:30&amp;lt;br/&amp;gt;&lt;br /&gt;
'''Where''': Ottawa, ON, Canada, Novotel Hotel (Albion A).&lt;br /&gt;
&lt;br /&gt;
== Proposal ==&lt;br /&gt;
&lt;br /&gt;
The mini-summit proposal sent to OLS organizers. '''See [[/Proposal|proposal]]'''.&lt;br /&gt;
&lt;br /&gt;
== Topics to discuss ==&lt;br /&gt;
&lt;br /&gt;
* Device accessibility cgroup (maybe with remap ability)&lt;br /&gt;
* TTYs&lt;br /&gt;
* Syslog&lt;br /&gt;
* Checkpoint/restart&lt;br /&gt;
* Memory controllers&lt;br /&gt;
* more?..&lt;br /&gt;
&lt;br /&gt;
== List of attendees ==&lt;br /&gt;
Please fill in your name here if you are going to attend, or email kir at openvz dot org if you are too lazy. Surely the list is not final, so put your name even if you are not sure you can make it.&lt;br /&gt;
&lt;br /&gt;
This list is in no particular order.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Put this in three columns if browser is smart enough --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;-moz-column-count:3; -webkit-column-count:3; column-count:3; text-align: left; background: #fefef0; border: 1px solid #ddddc0;&amp;quot;&amp;gt;&lt;br /&gt;
# Pavel Emelyanov&lt;br /&gt;
# Denis Lunev&lt;br /&gt;
# Andrey Mirkin&lt;br /&gt;
# Serge Hallyn&lt;br /&gt;
# Dave Hansen&lt;br /&gt;
# Daniel Lezcano&lt;br /&gt;
# Srivatsa Vaddagiri&lt;br /&gt;
# Balbir Singh&lt;br /&gt;
# Sukadev Bhattiprolu&lt;br /&gt;
# Paul Menage&lt;br /&gt;
# Eric W. Biederman&lt;br /&gt;
# Oren Laadan&lt;br /&gt;
# Yamamoto Takashi&lt;br /&gt;
# Kamezawa Hiroyuki&lt;br /&gt;
# Benjamin Thery&lt;br /&gt;
# Herbert Pötzl&lt;br /&gt;
# Oleg Nesterov&lt;br /&gt;
# Dhaval Giani&lt;br /&gt;
# Bart Trojanowski&lt;br /&gt;
# Joseph Ruscio&lt;br /&gt;
# Constant Chan&lt;br /&gt;
# Linda Knippers&lt;br /&gt;
# Satoshi Uchida&lt;br /&gt;
# Masahiko Takahashi&lt;br /&gt;
# Martine Silbermann&lt;br /&gt;
# Benoit des Ligneris&lt;br /&gt;
# Patrick Naubert&lt;br /&gt;
# Daisuke Nishimura&lt;br /&gt;
# Sudhir Kumar&lt;br /&gt;
# Munehiro Ikeda&lt;br /&gt;
# Kamalesh Babulal&lt;br /&gt;
# John Schulz&lt;br /&gt;
# Poornima Nayak&lt;br /&gt;
# Gyuil Cha&lt;br /&gt;
# YoungHo Kim&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Agenda ==&lt;br /&gt;
&lt;br /&gt;
* Namespaces/Containers  (8:30am-11am)&lt;br /&gt;
** sysfs issues (and any /proc issues)&lt;br /&gt;
*** uevents/hotplug&lt;br /&gt;
** Network namespaces issues&lt;br /&gt;
*** multiple namespaces in one process&lt;br /&gt;
** Device namespace design?&lt;br /&gt;
** User namespace&lt;br /&gt;
** Additional needed namespaces&lt;br /&gt;
*** Small namespaces ''What to do with small subsystems that might need virtualization. E.g. in openvz we have FUSE, binfmt_misc and some other small stuff virtualized. But how to merge it in mainline? Create a separate namespace for each? Merge them into one? What to call it then?''&lt;br /&gt;
** Handling filesystem/namespace synchronization  (not sure what the issue is)&lt;br /&gt;
** Container design&lt;br /&gt;
*** How to enter a container&lt;br /&gt;
*** Nature of a 'container' — kernel object or userspace fiction&lt;br /&gt;
&lt;br /&gt;
* Cgroups+Resource management  (11:30-2pm)&lt;br /&gt;
** Cgroup implementation&lt;br /&gt;
*** Locking (don't let cgroup_lock() become the BKL)&lt;br /&gt;
*** Transactional attachment&lt;br /&gt;
*** &amp;quot;procs&amp;quot; file&lt;br /&gt;
*** User-space notification API&lt;br /&gt;
**** Resource counter hit soft/hard limit&lt;br /&gt;
**** Task entered/left cgroup&lt;br /&gt;
**** OOM occurred&lt;br /&gt;
*** Binary statistics API&lt;br /&gt;
** Existing cgroups&lt;br /&gt;
*** Memory&lt;br /&gt;
**** Supporting over-commit and guarantees&lt;br /&gt;
**** Soft-limits&lt;br /&gt;
**** Hierarchical borrowing - in kernel or userspace?&lt;br /&gt;
**** Per-cgroup refault information?&lt;br /&gt;
*** Kernel memory&lt;br /&gt;
*** Device&lt;br /&gt;
*** Memrlimit&lt;br /&gt;
**** Some push-back over this - can we give real use cases?&lt;br /&gt;
*** CPU scheduler&lt;br /&gt;
** Additional cgroups and their design&lt;br /&gt;
*** Swap (separate subsystem or merge with memory?)&lt;br /&gt;
*** Disk I/O (several proposed designs)&lt;br /&gt;
*** Network traffic classification&lt;br /&gt;
*** Freezer&lt;br /&gt;
*** Signaller&lt;br /&gt;
*** OOM Handler&lt;br /&gt;
** libcg - userspace exploitation of control groups/resource management&lt;br /&gt;
*** Overview so far&lt;br /&gt;
*** Is kernel-based reclassification needed?&lt;br /&gt;
*** Real use-cases&lt;br /&gt;
*** Future directions&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Checkpoint/Restart  (2:30pm-5pm)&lt;br /&gt;
** Documentation : Look at &amp;quot;See Also&amp;quot; section below&lt;br /&gt;
** Goals and expectations of this summit&lt;br /&gt;
*** identify, discuss and (if possible) agree on the general design&lt;br /&gt;
*** identify, discuss and (if possible) agree on the technical points&lt;br /&gt;
*** decide on priorities for different components (e.g. high, medium, low) such that the final outcome is a practical road-map that would keep us busy (at least) until the next OLS (though the &amp;quot;O&amp;quot; may change ;)&lt;br /&gt;
** What are the problems that the linux community can solve with the checkpoint/restart ?&lt;br /&gt;
** Preparing the kernel internals&lt;br /&gt;
*** How do we implement it without affecting long-term maintainability ?&lt;br /&gt;
*** What are the kernel subsystems, process resources and framework for CR ?&lt;br /&gt;
*** Which pieces to target first ?&lt;br /&gt;
&lt;br /&gt;
The following technical points can be discussed during the mini-summit if we have time or later at the OLS.&lt;br /&gt;
&lt;br /&gt;
** Checkpointing / Restarting&lt;br /&gt;
*** Reaching a quiescent point - network, processes, aio, avoiding side effects of quiesce/revive&lt;br /&gt;
*** Checkpoint - signal handler ? syscall ? crfs ? process hierarchy, resource dependencies, system and process resources&lt;br /&gt;
*** Restarting - New binary format handler ? converting between formats (from older kernel to newer)&lt;br /&gt;
*** Notification to processes which explicitly wish to be notified about quiesce, checkpoint and restart - container state ? new signals ?&lt;br /&gt;
** Determining the userspace API - Posix 1003.1m ?&lt;br /&gt;
** Passing the kernel internal state to/from userspace - coredump-like file ? swap per container ? netlinks, CR filesystem ? army of different calls for the CR (proc, existing syscalls, ...)&lt;br /&gt;
** Hopefully we can continue to discuss in the next days and get a bit of a hackfest going during OLS :)&lt;br /&gt;
&lt;br /&gt;
== Moderators ==&lt;br /&gt;
&lt;br /&gt;
* Namespaces/containers: Serge Hallyn, Dave Hansen&lt;br /&gt;
* Cgroups and resource management: Paul Menage, Balbir Singh, Dhaval Giani&lt;br /&gt;
* Checkpoint/restart: Daniel Lezcano, Oren Laadan&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* http://www.linuxsymposium.org/2008/cfp.php — OLS call for papers&lt;br /&gt;
* https://lists.linux-foundation.org/pipermail/containers/2008-January/009688.html&lt;br /&gt;
* http://openvz.org/pipermail/devel/2008-July/012891.html&lt;br /&gt;
* Checkpoint/Restart&lt;br /&gt;
** Zap : http://www.ncl.cs.columbia.edu/publications/usenix2007_fordist.pdf&lt;br /&gt;
** Metacluster : http://lxc.sourceforge.net/doc/ols2006/lxc-ols2006.pdf&lt;br /&gt;
** OpenVZ : [[Checkpointing and live migration]]&lt;br /&gt;
** Checkpoint/Restart technology : http://en.wikipedia.org/wiki/Application_checkpointing&lt;br /&gt;
** Virtual Servers and Checkpoint/Restart in Mainstream Linux : Sigops document&lt;br /&gt;
** Remote fork: http://www.cse.nd.edu/~dthain/courses/classconf/wowsys2004/talks/rfork.pdf&lt;br /&gt;
** Vmadump : http://bproc.sourceforge.net/c268.html&lt;br /&gt;
** Posix CR : http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Admin/CPR_OG/sgi_html/ch03.html&lt;br /&gt;
** An OS services overview : http://sw-eng.falls-church.va.us/itsg/P08V31.htm&lt;br /&gt;
&lt;br /&gt;
[[Category: Containers]]&lt;br /&gt;
[[Category: Events]]&lt;/div&gt;</summary>
		<author><name>Paulmenage</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008&amp;diff=6205</id>
		<title>Containers/Mini-summit 2008</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008&amp;diff=6205"/>
		<updated>2008-07-18T17:58:40Z</updated>

		<summary type="html">&lt;p&gt;Paulmenage: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There will be a containers mini-summit at the [http://www.linuxsymposium.org/2008/ OLS'08]. This page is for organizing this mini-summit. Feel free to edit.&lt;br /&gt;
&lt;br /&gt;
'''When''': 22nd of July 2008, 8:30-16:30&amp;lt;br/&amp;gt;&lt;br /&gt;
'''Where''': Ottawa, ON, Canada, Novotel Hotel (Albion A).&lt;br /&gt;
&lt;br /&gt;
== Proposal ==&lt;br /&gt;
&lt;br /&gt;
The mini-summit proposal sent to OLS organizers. '''See [[/Proposal|proposal]]'''.&lt;br /&gt;
&lt;br /&gt;
== Topics to discuss ==&lt;br /&gt;
&lt;br /&gt;
* Device accessibility cgroup (maybe with remap ability)&lt;br /&gt;
* TTYs&lt;br /&gt;
* Syslog&lt;br /&gt;
* Checkpoint/restart&lt;br /&gt;
* Memory controllers&lt;br /&gt;
* more?..&lt;br /&gt;
&lt;br /&gt;
== List of attendees ==&lt;br /&gt;
Please fill in your name here if you are going to attend, or email kir at openvz dot org if you are too lazy. Surely the list is not final, so put your name even if you are not sure you can make it.&lt;br /&gt;
&lt;br /&gt;
This list is in no particular order.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Put this in three columns if browser is smart enough --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;-moz-column-count:3; -webkit-column-count:3; column-count:3; text-align: left; background: #fefef0; border: 1px solid #ddddc0;&amp;quot;&amp;gt;&lt;br /&gt;
# Pavel Emelyanov&lt;br /&gt;
# Denis Lunev&lt;br /&gt;
# Andrey Mirkin&lt;br /&gt;
# Serge Hallyn&lt;br /&gt;
# Dave Hansen&lt;br /&gt;
# Daniel Lezcano&lt;br /&gt;
# Srivatsa Vaddagiri&lt;br /&gt;
# Balbir Singh&lt;br /&gt;
# Sukadev Bhattiprolu&lt;br /&gt;
# Paul Menage&lt;br /&gt;
# Eric W. Biederman&lt;br /&gt;
# Oren Laadan&lt;br /&gt;
# Yamamoto Takashi&lt;br /&gt;
# Kamezawa Hiroyuki&lt;br /&gt;
# Benjamin Thery&lt;br /&gt;
# Herbert Pötzl&lt;br /&gt;
# Oleg Nesterov&lt;br /&gt;
# Dhaval Giani&lt;br /&gt;
# Bart Trojanowski&lt;br /&gt;
# Joseph Ruscio&lt;br /&gt;
# Constant Chan&lt;br /&gt;
# Linda Knippers&lt;br /&gt;
# Satoshi Uchida&lt;br /&gt;
# Masahiko Takahashi&lt;br /&gt;
# Martine Silbermann&lt;br /&gt;
# Benoit des Ligneris&lt;br /&gt;
# Patrick Naubert&lt;br /&gt;
# Daisuke Nishimura&lt;br /&gt;
# Sudhir Kumar&lt;br /&gt;
# Munehiro Ikeda&lt;br /&gt;
# Kamalesh Babulal&lt;br /&gt;
# John Schulz&lt;br /&gt;
# Poornima Nayak&lt;br /&gt;
# Gyuil Cha&lt;br /&gt;
# YoungHo Kim&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Agenda ==&lt;br /&gt;
&lt;br /&gt;
* Namespaces/Containers  (8:30am-11am)&lt;br /&gt;
** sysfs issues (and any /proc issues)&lt;br /&gt;
*** uevents/hotplug&lt;br /&gt;
** Network namespaces issues&lt;br /&gt;
*** multiple namespaces in one process&lt;br /&gt;
** Device namespace design?&lt;br /&gt;
** User namespace&lt;br /&gt;
** Additional needed namespaces&lt;br /&gt;
*** Small namespaces ''What to do with small subsystems that might need virtualization. E.g. in openvz we have FUSE, binfmt_misc and some other small stuff virtualized. But how to merge it in mainline? Create a separate namespace for each? Merge them into one? What to call it then?''&lt;br /&gt;
** Handling filesystem/namespace synchronization  (not sure what the issue is)&lt;br /&gt;
** Container design&lt;br /&gt;
*** How to enter a container&lt;br /&gt;
*** Nature of a 'container' — kernel object or userspace fiction&lt;br /&gt;
&lt;br /&gt;
* Cgroups+Resource management  (11:30-2pm)&lt;br /&gt;
** Cgroup implementation&lt;br /&gt;
*** Locking (don't let cgroup_lock() become the BKL)&lt;br /&gt;
*** Transactional attachment&lt;br /&gt;
*** &amp;quot;procs&amp;quot; file&lt;br /&gt;
*** User-space notification API&lt;br /&gt;
*** Binary statistics API&lt;br /&gt;
** Existing cgroups&lt;br /&gt;
*** Memory&lt;br /&gt;
**** Supporting over-commit and guarantees&lt;br /&gt;
**** Soft-limits&lt;br /&gt;
**** Hierarchical borrowing - in kernel or userspace?&lt;br /&gt;
**** Per-cgroup refault information?&lt;br /&gt;
*** Kernel memory&lt;br /&gt;
*** Device&lt;br /&gt;
*** Memrlimit&lt;br /&gt;
**** Some push-back over this - can we give real use cases?&lt;br /&gt;
*** CPU scheduler&lt;br /&gt;
** Additional cgroups and their design&lt;br /&gt;
*** Swap (separate subsystem or merge with memory?)&lt;br /&gt;
*** Disk I/O (competing designs)&lt;br /&gt;
*** Network traffic classification&lt;br /&gt;
*** Freezer&lt;br /&gt;
*** Signaller&lt;br /&gt;
*** OOM Handler&lt;br /&gt;
** libcg - userspace exploitation of control groups/resource management&lt;br /&gt;
*** Overview so far&lt;br /&gt;
*** Real use-cases&lt;br /&gt;
*** Future directions&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Checkpoint/Restart  (2:30pm-5pm)&lt;br /&gt;
** Documentation : Look at &amp;quot;See Also&amp;quot; section below&lt;br /&gt;
** Goals and expectations of this summit&lt;br /&gt;
*** identify, discuss and (if possible) agree on the general design&lt;br /&gt;
*** identify, discuss and (if possible) agree on the technical points&lt;br /&gt;
*** decide on priorities for different components (e.g. high, medium, low) such that the final outcome is a practical road-map that would keep us busy (at least) until the next OLS (though the &amp;quot;O&amp;quot; may change ;)&lt;br /&gt;
** What are the problems that the linux community can solve with the checkpoint/restart ?&lt;br /&gt;
** Preparing the kernel internals&lt;br /&gt;
*** How do we implement it without affecting long-term maintainability ?&lt;br /&gt;
*** What are the kernel subsystems, process resources and framework for CR ?&lt;br /&gt;
*** Which pieces to target first ?&lt;br /&gt;
&lt;br /&gt;
The following technical points can be discussed during the mini-summit if we have time or later at the OLS.&lt;br /&gt;
&lt;br /&gt;
** Checkpointing / Restarting&lt;br /&gt;
*** Reaching a quiescent point - network, processes, aio, avoiding side effects of quiesce/revive&lt;br /&gt;
*** Checkpoint - signal handler ? syscall ? crfs ? process hierarchy, resource dependencies, system and process resources&lt;br /&gt;
*** Restarting - New binary format handler ? converting between formats (from older kernel to newer)&lt;br /&gt;
*** Notification to processes which explicitly wish to be notified about quiesce, checkpoint and restart - container state ? new signals ?&lt;br /&gt;
** Determining the userspace API - Posix 1003.1m ?&lt;br /&gt;
** Passing the kernel internal state to/from userspace - coredump-like file ? swap per container ? netlinks, CR filesystem ? army of different calls for the CR (proc, existing syscalls, ...)&lt;br /&gt;
** Hopefully we can continue to discuss in the next days and get a bit of a hackfest going during OLS :)&lt;br /&gt;
&lt;br /&gt;
== Moderators ==&lt;br /&gt;
&lt;br /&gt;
* Namespaces/containers: Serge Hallyn, Dave Hansen&lt;br /&gt;
* Cgroups and resource management: Paul Menage, Balbir Singh, Dhaval Giani&lt;br /&gt;
* Checkpoint/restart: Daniel Lezcano, Oren Laadan&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* http://www.linuxsymposium.org/2008/cfp.php — OLS call for papers&lt;br /&gt;
* https://lists.linux-foundation.org/pipermail/containers/2008-January/009688.html&lt;br /&gt;
* http://openvz.org/pipermail/devel/2008-July/012891.html&lt;br /&gt;
* Checkpoint/Restart&lt;br /&gt;
** Zap : http://www.ncl.cs.columbia.edu/publications/usenix2007_fordist.pdf&lt;br /&gt;
** Metacluster : http://lxc.sourceforge.net/doc/ols2006/lxc-ols2006.pdf&lt;br /&gt;
** OpenVZ : [[Checkpointing and live migration]]&lt;br /&gt;
** Checkpoint/Restart technology : http://en.wikipedia.org/wiki/Application_checkpointing&lt;br /&gt;
** Virtual Servers and Checkpoint/Restart in Mainstream Linux : Sigops document&lt;br /&gt;
** Remote fork: http://www.cse.nd.edu/~dthain/courses/classconf/wowsys2004/talks/rfork.pdf&lt;br /&gt;
** Vmadump : http://bproc.sourceforge.net/c268.html&lt;br /&gt;
** Posix CR : http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Admin/CPR_OG/sgi_html/ch03.html&lt;br /&gt;
** An OS services overview : http://sw-eng.falls-church.va.us/itsg/P08V31.htm&lt;br /&gt;
&lt;br /&gt;
[[Category: Containers]]&lt;br /&gt;
[[Category: Events]]&lt;/div&gt;</summary>
		<author><name>Paulmenage</name></author>
		
	</entry>
</feed>