Checkpointing and live migration

4,875 bytes added, 12:42, 6 September 2006
CPT is an extension to the OpenVZ kernel that can save the full state of a running VE and restore it later, on the same or a different host, in a way that is transparent to running applications and network connections. This technique has several applications, the most important being live (zero-downtime) migration of VEs and taking an instant snapshot of a running VE for later resumption, i.e. CheckPoinTing.

Before CPT, a VE could only be migrated by shutting it down and then booting it again on the new host. This procedure not only introduces a fairly long downtime of network services, it is also not transparent to clients using the VE, which makes migration impossible when clients run tasks that cannot tolerate a shutdown.

Compared to this old scheme, CPT makes it possible to migrate a VE in a way that is essentially invisible both to users of the VE and to external clients using network services located inside it. It still introduces a short delay in service, needed for the actual checkpoint and restore of the processes, but this delay is indistinguishable from a short interruption of network connectivity.

== Online migration ==

The OpenVZ distribution includes a special utility, <code>vzmigrate</code>, intended to support VE migration. With its help one can perform live (zero-downtime) migration: during migration the VE freezes for a short while, and afterwards it continues working as though nothing had happened. Online migration is performed with the command:
<pre>vzmigrate --online <host> VEID</pre>
During online migration all VE private data is saved to an image file, which is transferred to the target host.
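
A minimal sketch of a wrapper around the command above. The function name <code>migrate_online</code> and the <code>RUN</code> variable are hypothetical and not part of OpenVZ. The sketch assumes VE IDs are numeric and, by default, only prints the command instead of running it.

```shell
# RUN is an assumption of this sketch: it defaults to "echo" (dry run).
# Set RUN="" in the environment to really execute the command.
RUN="${RUN-echo}"

# migrate_online <host> <VEID>  (hypothetical helper, not part of OpenVZ)
migrate_online() {
    host="$1"
    veid="$2"
    # Assumption: VE IDs are numeric; reject anything else early.
    case "$veid" in
        ''|*[!0-9]*) echo "invalid VEID: '$veid'" >&2; return 2 ;;
    esac
    if [ -z "$host" ]; then
        echo "missing target host" >&2
        return 2
    fi
    # The actual migration command, exactly as given in the text.
    $RUN vzmigrate --online "$host" "$veid"
}
```

For example, <code>migrate_online node2.example.com 101</code> (both values are placeholders) prints the <code>vzmigrate</code> command line it would run.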

== Manual Checkpoint and Restore Functions ==

<code>vzmigrate</code> is not strictly required to perform online migration. The <code>vzctl</code> utility, together with some file system backup tools, provides enough power to do all the tasks.

A VE can be checkpointed with the command:
<pre>vzctl chkpnt VEID --dumpfile <path></pre>
This command saves the entire state of the running VE to a dump file and stops the VE. If the <code>--dumpfile</code> option is not given, <code>vzctl</code> uses the default path <code>/var/tmp/Dump.VEID</code>.

After this, the VE can be restored in exactly the same state by executing:
<pre>vzctl restore VEID --dumpfile <path></pre>
If the dump file and the file system are transferred to another hardware node, the same command restores the VE there just as well.

It is a critical requirement that the file system at the moment of restore be identical to the file system at the moment of checkpointing. If this requirement is not met then, depending on the severity of the changes, the restore can be aborted or the processes inside the VE can perceive the changes as external corruption of open files. When a VE is restored on the same node where it was checkpointed, it is enough not to touch the file system accessible by the VE. When a VE is transferred to another node, the VE file system must be synchronized before the restore. <code>vzctl</code> does not provide this functionality, so external tools (e.g. <code>rsync</code>) are required.
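
The manual procedure described in this section could be scripted roughly as follows. This is only a sketch under several assumptions that the text does not state: the VE private area is at <code>/vz/private/VEID</code>, the VE configuration is at <code>/etc/vz/conf/VEID.conf</code>, and the target node is reachable as <code>root</code> over <code>ssh</code>. The function name <code>manual_migrate</code> and the <code>RUN</code> variable are hypothetical. By default the commands are only printed.

```shell
# RUN is an assumption of this sketch: it defaults to "echo" (dry run).
# Set RUN="" in the environment to really execute the commands.
RUN="${RUN-echo}"

# manual_migrate <VEID> <target-host>  (hypothetical helper)
manual_migrate() {
    veid="$1"
    target="$2"
    dump="/var/tmp/Dump.$veid"        # vzctl's default dump path, per the text
    private="/vz/private/$veid"       # ASSUMPTION: location of the VE private area
    conf="/etc/vz/conf/$veid.conf"    # ASSUMPTION: location of the VE config file

    # Save the full VE state and stop the VE. From here on the file
    # system no longer changes, so it can be copied consistently.
    $RUN vzctl chkpnt "$veid" --dumpfile "$dump" || return 1

    # Synchronize file system, config and dump file, then restore remotely.
    if $RUN rsync -aH --numeric-ids --delete "$private/" "root@$target:$private/" &&
       $RUN rsync -a "$conf" "root@$target:$conf" &&
       $RUN rsync -a "$dump" "root@$target:$dump" &&
       $RUN ssh "root@$target" vzctl restore "$veid" --dumpfile "$dump"
    then
        return 0
    fi

    # A step failed: bring the VE back on the source node. Its file
    # system there has not been touched since the checkpoint.
    $RUN vzctl restore "$veid" --dumpfile "$dump"
    return 1
}
```

The sketch does not clean up the source node after a successful migration. The file system is copied only after the checkpoint because the restore requires it to be identical to the state at checkpoint time.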

== Step-by-step Checkpoint and Restore ==

Checkpointing can also be performed in stages. It consists of three steps.

The first step is suspending the VE. At this stage CPT moves all the processes to a special, well-defined state and stops the VE network interfaces. This step is performed with the command:
<pre>vzctl chkpnt VEID --suspend</pre>
The second step is dumping the VE. At this stage CPT saves the state of the processes and the global state of the VE to an image file. All process private data needs to be saved: address space, register set, open files/pipes/sockets, System V IPC structures, current working directory, signal handlers, timers, terminal settings, user identities (uid, gid, etc.), process identities (pid, pgrp, sid, etc.), rlimits and other data. This step is performed with the command:
<pre>vzctl chkpnt VEID --dump --dumpfile <path></pre>
The third step is killing or resuming the processes. If the migration succeeds, the VE can be stopped with the command:
<pre>vzctl chkpnt VEID --kill</pre>
If the migration failed for some reason, or if the goal was to take a snapshot of the VE state for later restore, CPT can resume the VE with:
<pre>vzctl chkpnt VEID --resume</pre>
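
The three stages above can be chained as in the following sketch. The function name <code>staged_checkpoint</code>, its third argument and the <code>RUN</code> variable are hypothetical. Only the <code>vzctl chkpnt</code> invocations come from the text. By default the commands are only printed.

```shell
# RUN is an assumption of this sketch: it defaults to "echo" (dry run).
# Set RUN="" in the environment to really execute the commands.
RUN="${RUN-echo}"

# staged_checkpoint <VEID> <dumpfile> [kill|resume]  (hypothetical helper)
# The third argument selects the final step; the default is "resume",
# i.e. take a snapshot and let the VE continue running.
staged_checkpoint() {
    veid="$1"
    dump="$2"
    after="${3:-resume}"
    case "$after" in
        kill|resume) ;;
        *) echo "final step must be 'kill' or 'resume'" >&2; return 2 ;;
    esac

    # Step 1: freeze the processes and stop the VE network interfaces.
    $RUN vzctl chkpnt "$veid" --suspend || return 1

    # Step 2: write process and VE state to the image file.
    # If dumping fails, resume the VE so that it is not left frozen.
    if ! $RUN vzctl chkpnt "$veid" --dump --dumpfile "$dump"; then
        $RUN vzctl chkpnt "$veid" --resume
        return 1
    fi

    # Step 3: kill the VE (migration succeeded) or resume it (snapshot).
    $RUN vzctl chkpnt "$veid" "--$after"
}
```

Splitting the checkpoint this way lets a migration script decide between <code>--kill</code> and <code>--resume</code> only after it knows whether the VE was restored successfully elsewhere.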

Restoring consists of two steps. The first step restores the processes and leaves them in a special frozen state. After this step the processes are ready to continue execution; however, in some cases CPT has to perform some operations after a process is woken up, so CPT sets the process return point to a function in the CPT module. This step is performed with the command:
<pre>vzctl restore VEID --undump --dumpfile <path></pre>
The second step either wakes the processes up or, if the restore failed, kills them. When CPT wakes a process up, the process first performs the necessary operations in that function and then continues normal execution. This step is performed with one of the commands:
<pre>vzctl restore VEID --resume
vzctl restore VEID --kill</pre>
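
The two restore steps could be combined as in the sketch below. The function name <code>staged_restore</code> and the <code>RUN</code> variable are hypothetical. Only the <code>vzctl restore</code> invocations come from the text. By default the commands are only printed.

```shell
# RUN is an assumption of this sketch: it defaults to "echo" (dry run).
# Set RUN="" in the environment to really execute the commands.
RUN="${RUN-echo}"

# staged_restore <VEID> <dumpfile>  (hypothetical helper)
staged_restore() {
    veid="$1"
    dump="$2"
    # Step 1: recreate the processes from the image, leaving them frozen.
    if $RUN vzctl restore "$veid" --undump --dumpfile "$dump"; then
        # Step 2a: the undump succeeded, so wake the processes up.
        $RUN vzctl restore "$veid" --resume
    else
        # Step 2b: the undump failed, so kill whatever was restored.
        $RUN vzctl restore "$veid" --kill
        return 1
    fi
}
```

In a dry run, <code>staged_restore 101 /var/tmp/Dump.101</code> prints the <code>--undump</code> command followed by the <code>--resume</code> command.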