<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.openvz.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DaveHansen</id>
	<title>OpenVZ Virtuozzo Containers Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.openvz.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DaveHansen"/>
	<link rel="alternate" type="text/html" href="https://wiki.openvz.org/Special:Contributions/DaveHansen"/>
	<updated>2026-06-10T00:59:08Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Zap_Patch&amp;diff=6306</id>
		<title>Containers/Zap Patch</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Zap_Patch&amp;diff=6306"/>
		<updated>2008-08-08T19:23:12Z</updated>

		<summary type="html">&lt;p&gt;DaveHansen: New page:  &amp;gt; +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count) &amp;gt; +{ &amp;gt; +     mm_segment_t oldfs; &amp;gt; +     int ret; &amp;gt; + &amp;gt; +     oldfs = get_fs(); &amp;gt; +     set_fs(KERNEL_DS); &amp;gt; +     ret = cr_uwri...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&amp;gt; +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)&lt;br /&gt;
&amp;gt; +{&lt;br /&gt;
&amp;gt; +     mm_segment_t oldfs;&lt;br /&gt;
&amp;gt; +     int ret;&lt;br /&gt;
&amp;gt; +&lt;br /&gt;
&amp;gt; +     oldfs = get_fs();&lt;br /&gt;
&amp;gt; +     set_fs(KERNEL_DS);&lt;br /&gt;
&amp;gt; +     ret = cr_uwrite(ctx, buf, count);&lt;br /&gt;
&amp;gt; +     set_fs(oldfs);&lt;br /&gt;
&amp;gt; +&lt;br /&gt;
&amp;gt; +     return ret;&lt;br /&gt;
&amp;gt; +}&lt;br /&gt;
&lt;br /&gt;
get_fs()/set_fs() always feels a bit ouch, and this way you have&lt;br /&gt;
to use __force to avoid the warnings about __user pointer casts&lt;br /&gt;
in sparse.&lt;br /&gt;
I wonder if you can use splice_read/splice_write to get around&lt;br /&gt;
this problem.&lt;/div&gt;</summary>
		<author><name>DaveHansen</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6251</id>
		<title>Containers/Mini-summit 2008 notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.openvz.org/index.php?title=Containers/Mini-summit_2008_notes&amp;diff=6251"/>
		<updated>2008-07-27T19:33:48Z</updated>

		<summary type="html">&lt;p&gt;DaveHansen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: Containers]]&lt;br /&gt;
&lt;br /&gt;
Intros (8:36am)&lt;br /&gt;
&lt;br /&gt;
        Dave Hansen&lt;br /&gt;
        Eric Biederman&lt;br /&gt;
        Jason Byron, Red Hat&lt;br /&gt;
        Joe Ruscio, Evergrid&lt;br /&gt;
        Joe McDonald&lt;br /&gt;
        HP China&lt;br /&gt;
        Sonny Rao&lt;br /&gt;
        HP&lt;br /&gt;
        HP&lt;br /&gt;
        Matine Silberman, HP&lt;br /&gt;
        Sandy Harris&lt;br /&gt;
        NEC Japan&lt;br /&gt;
        John Schultz, AOL&lt;br /&gt;
        Pavel Emelyanov, Parallels/OpenVZ&lt;br /&gt;
        Denis Lunev, Parallels/OpenVZ&lt;br /&gt;
        Andrey Mirkin, Parallels/OpenVZ&lt;br /&gt;
        Constant Chan&lt;br /&gt;
        Benjamin Thery, Bull&lt;br /&gt;
        Daniel Lezcano, IBM&lt;br /&gt;
        Serge Hallyn, IBM&lt;br /&gt;
        Oren Laadan, Columbia University&lt;br /&gt;
&lt;br /&gt;
On Phone:&lt;br /&gt;
        Amy Griffis, HP&lt;br /&gt;
        Dhaval Giani, IBM&lt;br /&gt;
        Peter Zijlstra&lt;br /&gt;
&lt;br /&gt;
(Later walk-ins):&lt;br /&gt;
        Paul Menage, Google&lt;br /&gt;
&lt;br /&gt;
== Namespaces and containers ==&lt;br /&gt;
&lt;br /&gt;
Why do various companies want containers?&lt;br /&gt;
        IBM, Google: workload management&lt;br /&gt;
        EB: using containers as improved chroot&lt;br /&gt;
        HP: wants similar to IBM, plus security&lt;br /&gt;
        Parallels: hosting providers&lt;br /&gt;
&lt;br /&gt;
sysfs issues&lt;br /&gt;
        EB gives status: should go into next merge window&lt;br /&gt;
&lt;br /&gt;
mini-namespaces&lt;br /&gt;
        NFS&lt;br /&gt;
                clients should behave differently in different containers&lt;br /&gt;
                currently uses single sunrpc transport for all containers&lt;br /&gt;
        Dave: is there a list of all openvz mini-ns?&lt;br /&gt;
        EB:&lt;br /&gt;
                proposal:&lt;br /&gt;
                        create little filesystems&lt;br /&gt;
                        still store everything in nsproxy&lt;br /&gt;
                currently:&lt;br /&gt;
                        some people want same process in different netns's&lt;br /&gt;
                        almost possible now, but can't open new sockets&lt;br /&gt;
                namespace enter:&lt;br /&gt;
                        3 purposes&lt;br /&gt;
                                login&lt;br /&gt;
                                monitoring&lt;br /&gt;
                                configuring&lt;br /&gt;
                may be worth prototyping the proposal&lt;br /&gt;
                        address mqns, or sunrpc, or fuse?&lt;br /&gt;
        DH:&lt;br /&gt;
                openvz addresses this using one big clone(), right?&lt;br /&gt;
                (yes)&lt;br /&gt;
&lt;br /&gt;
userid namespaces&lt;br /&gt;
        EB summarizes his proposal&lt;br /&gt;
                userid ns is unsharable without privilege&lt;br /&gt;
                userids, capabilities, security labels become ns-local&lt;br /&gt;
                hierarchical like pidns&lt;br /&gt;
        openvz: just does chroot&lt;br /&gt;
        DH:&lt;br /&gt;
                observes that system vs. app containers have different requirements&lt;br /&gt;
        EB:&lt;br /&gt;
                so with userid namespaces, user has god-like powers over created namespaces&lt;br /&gt;
        EB+SH will talk about hacking something together this week during OLS&lt;br /&gt;
        Uses:&lt;br /&gt;
                user untrusted mounts&lt;br /&gt;
                build systems&lt;br /&gt;
&lt;br /&gt;
device namespaces&lt;br /&gt;
        tty namespaces rejected&lt;br /&gt;
        should be solved with generic device namespaces&lt;br /&gt;
                virtualize the major:minor-&amp;gt;device mapping&lt;br /&gt;
        reserved device numbers (unnamed)&lt;br /&gt;
                created with /proc?&lt;br /&gt;
                get_unnamed_device()&lt;br /&gt;
        tty ideas:&lt;br /&gt;
                use selinux ptys&lt;br /&gt;
                use user namespaces&lt;br /&gt;
                use legacy ptys&lt;br /&gt;
                leverage ptyfs&lt;br /&gt;
        Suka is not on, so he gets volunteered to do pure /dev/pts fs approach&lt;br /&gt;
&lt;br /&gt;
per-container LSMs:&lt;br /&gt;
        SH: thinks LSMs should handle it&lt;br /&gt;
        EB:&lt;br /&gt;
                original purpose of chroot&lt;br /&gt;
                set up policies from inside container&lt;br /&gt;
                creating smack container inside selinux would be ideal&lt;br /&gt;
&lt;br /&gt;
entering a container&lt;br /&gt;
        netns: identified using pid of a ns&lt;br /&gt;
        SH: can we solve this using EB's namespace filesystems proposal?&lt;br /&gt;
        (EB goes to the board to demonstrate his proposal)&lt;br /&gt;
        PM: Can we use control groups?&lt;br /&gt;
        PE: Can we re-use /proc/pid/ ?&lt;br /&gt;
        EB: could have a ns with no processes in it&lt;br /&gt;
        Example of command using this:&lt;br /&gt;
                ip link set eth0 netns &amp;lt;pid&amp;gt;&lt;br /&gt;
                becomes&lt;br /&gt;
                ip link set eth0 netns /proc/&amp;lt;pid&amp;gt;/&lt;br /&gt;
        DL:&lt;br /&gt;
                a real netns problem is knowing when a childns has died&lt;br /&gt;
                the netnsfs mount could solve that&lt;br /&gt;
        PE: EB, can you send POC patches for the namespace?&lt;br /&gt;
                EB and EM will both send their own POC.&lt;br /&gt;
&lt;br /&gt;
DL: people have complained about needing CAP_SYS_ADMIN to unshare ns&lt;br /&gt;
        EB: example, setuid root sysvipc-using program could be fooled&lt;br /&gt;
&lt;br /&gt;
PE: Entering a container:&lt;br /&gt;
        reasons:&lt;br /&gt;
                monitoring&lt;br /&gt;
                enter an administrative command&lt;br /&gt;
        DH: how do you do it now?&lt;br /&gt;
        PE: numerical ID for each VE, use it to enter&lt;br /&gt;
        EB:&lt;br /&gt;
                one need for entering: /sbin/hotplug&lt;br /&gt;
        (someone): does hijack suffice?&lt;br /&gt;
        EB: two cases:&lt;br /&gt;
                partial entering&lt;br /&gt;
                full entering&lt;br /&gt;
                sys_hijack does not address partial entering&lt;br /&gt;
        DH:&lt;br /&gt;
                why need partial entering?&lt;br /&gt;
                fs stuff can be done without entering&lt;br /&gt;
        PM: privileged process&lt;br /&gt;
        PE:&lt;br /&gt;
                will look at hijack patches&lt;br /&gt;
                someone will re-send hijack to containers@&lt;br /&gt;
                EB:&lt;br /&gt;
                        if we can do sys_hijack cleanly,&lt;br /&gt;
                        we can use it to solve kthread problem&lt;br /&gt;
&lt;br /&gt;
== Control Groups and Resource Management ==&lt;br /&gt;
&lt;br /&gt;
== Checkpoint/Restart [CR] ==&lt;br /&gt;
&lt;br /&gt;
=== Uses of CR ===&lt;br /&gt;
&lt;br /&gt;
* '''migration and live migration:''' e.g. for load balancing, maintenance, clusters and SSIs; may or may not assume a shared file system between endpoints&lt;br /&gt;
&lt;br /&gt;
* '''suspend/resume (aka hibernation):''' e.g. for hibernation, gang-scheduling and priority running, OS maintenance&lt;br /&gt;
&lt;br /&gt;
* '''failure recovery / fault tolerance:''' periodic checkpoints, and restart from the most recent one (unlike the previous scenarios, here the applications continue to execute after the checkpoint and may modify the file system)&lt;br /&gt;
&lt;br /&gt;
* '''time-travel:''' periodic checkpoints and restart from any previous checkpoint (here, too, the state of the file system must be captured)&lt;br /&gt;
&lt;br /&gt;
* [PE] '''fast-launch:''' reduce start-up time of heavy applications by restarting from a preset checkpoint instead of launching from scratch.&lt;br /&gt;
&lt;br /&gt;
* [EB] '''remote fork:''' e.g. in a cluster&lt;br /&gt;
&lt;br /&gt;
(the last two scenarios are likely to require adjustments during,&lt;br /&gt;
or after, the restart to tolerate changes in the file system or&lt;br /&gt;
otherwise in the environment)&lt;br /&gt;
&lt;br /&gt;
* [EB,OL] '''distributed checkpoint:''' the ability to checkpoint and restart a distributed application across multiple nodes as a whole.&lt;br /&gt;
&lt;br /&gt;
EB reminded the group that at the last kernel summit nobody objected to&lt;br /&gt;
adding CR capabilities to the kernel. The open issues were, and remain,&lt;br /&gt;
the technical choices.&lt;br /&gt;
&lt;br /&gt;
=== General design ===&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-space vs user-space'''&lt;br /&gt;
&lt;br /&gt;
OL: the issue of kernel-space vs. user-space is pivotal to the design.&lt;br /&gt;
Kernel support is mandatory to provide completeness and transparency.&lt;br /&gt;
Even the recent experience with &amp;quot;cryo&amp;quot; demonstrated that a user-space&lt;br /&gt;
approach requires the kernel to expose a very fine-grained API.&lt;br /&gt;
&lt;br /&gt;
Everyone (except DaveHansen) agreed to aim at a monolithic interface,&lt;br /&gt;
such that nearly all of the CR will be done in the kernel. The kernel&lt;br /&gt;
will return (checkpoint) or receive (restart) a blob with the image&lt;br /&gt;
of the state of the container.&lt;br /&gt;
&lt;br /&gt;
* '''Kernel-module ?'''&lt;br /&gt;
&lt;br /&gt;
OL: can we implement CR mostly in a kernel module and then move it into&lt;br /&gt;
the kernel later?&lt;br /&gt;
&lt;br /&gt;
EB: better to add CR functionality gradually directly to the kernel.&lt;br /&gt;
&lt;br /&gt;
* '''Compatibility between kernels'''&lt;br /&gt;
&lt;br /&gt;
DLu: there is an issue with compatibility between kernels - even the same&lt;br /&gt;
kernel compiled with different options and/or a different compiler, and&lt;br /&gt;
also when the kernel ABI changes.&lt;br /&gt;
&lt;br /&gt;
OL: suggests using an intermediate representation for the checkpoint&lt;br /&gt;
image to avoid the issue as much as possible; conversion, if needed,&lt;br /&gt;
will be done by userland tools. There is no aim to bridge ABI changes&lt;br /&gt;
in the case of migration: instead, the restart fails.&lt;br /&gt;
&lt;br /&gt;
EB: format the blob so that userland tools can parse it and easily&lt;br /&gt;
detect a version/configuration mismatch.&lt;br /&gt;
&lt;br /&gt;
* '''Streaming checkpoint image ?'''&lt;br /&gt;
&lt;br /&gt;
DLu: using a sequential (non-seekable) file, such as a socket, for the&lt;br /&gt;
checkpoint image is a challenge.&lt;br /&gt;
&lt;br /&gt;
OL: with proper planning it is not complicated to achieve, and it has the&lt;br /&gt;
advantage that the image can be passed through a filter, e.g. for&lt;br /&gt;
compression, encryption, format conversion etc.&lt;br /&gt;
&lt;br /&gt;
* '''Checkpoint operation'''&lt;br /&gt;
&lt;br /&gt;
The procedure will entail five steps:&lt;br /&gt;
# Pre-dump&lt;br /&gt;
# Freeze the container&lt;br /&gt;
# Dump&lt;br /&gt;
# Thaw/Kill the container&lt;br /&gt;
# Post-dump&lt;br /&gt;
&lt;br /&gt;
&amp;quot;pre-dump&amp;quot; runs before the container is frozen, e.g. the pre-copy phase&lt;br /&gt;
of live migration, to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;post-dump&amp;quot; runs after the container resumes execution, e.g. for a&lt;br /&gt;
checkpoint (not a migration), writing the data back to secondary&lt;br /&gt;
storage, again to minimize application downtime.&lt;br /&gt;
&lt;br /&gt;
OL: we should be able to checkpoint from inside the container, keep &lt;br /&gt;
that in mind for later (also relates to the freezer).&lt;br /&gt;
&lt;br /&gt;
* '''Restart operation'''&lt;br /&gt;
&lt;br /&gt;
Restart is done by first creating a container, then creating the&lt;br /&gt;
process tree in it, and then having each process restore its own state.&lt;br /&gt;
This allows existing kernel code to be reused (e.g., restoring a memory&lt;br /&gt;
region is a simple matter of calling mmap() and populating it).&lt;br /&gt;
&lt;br /&gt;
OL: suggests that the process tree be created in userspace.&lt;br /&gt;
&lt;br /&gt;
DLu: prefers to do everything, including process creation, in the&lt;br /&gt;
kernel; his experience shows that it isn't difficult.&lt;br /&gt;
&lt;br /&gt;
* '''Error recovery'''&lt;br /&gt;
&lt;br /&gt;
Should a checkpoint fail, the container should continue execution&lt;br /&gt;
without noticing it. If either checkpoint or restart fails, there&lt;br /&gt;
should be a way to inform the caller/user of the reason (something&lt;br /&gt;
more informative than -EBUSY).&lt;br /&gt;
&lt;br /&gt;
=== Road plan ===&lt;br /&gt;
&lt;br /&gt;
At this point we want to create a proof of concept and CR a simple&lt;br /&gt;
application. We will iteratively add more and more kernel resources.&lt;br /&gt;
&lt;br /&gt;
The first items to address:&lt;br /&gt;
# Create a container object (the context on which CR operates)&lt;br /&gt;
# Extend the container freezer cgroup?&lt;br /&gt;
# Interface via syscall or ioctl?&lt;br /&gt;
&lt;br /&gt;
First step - a simple application:&lt;br /&gt;
a single process, not using any files, no signal pending, no IPC etc.&lt;br /&gt;
Need to save state (registers, IDs), memory maps and contents (except&lt;br /&gt;
for read-only portions, e.g. text).&lt;br /&gt;
Assume that the file system state doesn't change between checkpoint&lt;br /&gt;
and restart.&lt;br /&gt;
&lt;br /&gt;
Next steps:&lt;br /&gt;
# process hierarchy and relationships (multiple tasks and zombies)&lt;br /&gt;
# multiple threads (and shared memory)&lt;br /&gt;
# open files: regular file, fifo, pipe, socket-pair&lt;br /&gt;
# signals, timers&lt;br /&gt;
# TBD&lt;br /&gt;
&lt;br /&gt;
=== Documentation ===&lt;br /&gt;
&lt;br /&gt;
DH: the proof of concept requires explicit documentation of what can be&lt;br /&gt;
checkpointed and what cannot be checkpointed, as well as which error is&lt;br /&gt;
returned in response to a failure.&lt;/div&gt;</summary>
		<author><name>DaveHansen</name></author>
		
	</entry>
</feed>