Checkpointing internals

The process of checkpoint/restore consists of two phases. The first phase is to save the running state of a process. This usually includes the register set, address space, allocated resources, and other private process data. The second phase is to reconstruct the original running process from the saved image and resume execution at exactly the point where it was suspended.
There are several problems with existing checkpoint/restore systems. First, except for some process-migration operating systems written from scratch (such as Sprite), they cannot preserve open network connections. Second, general-purpose operating systems such as Unix were not designed to support process migration, so checkpoint/restore systems built on top of existing OSes usually support only a limited set of applications. Third, none of these systems can guarantee that processes will be restored on the other side, because of resource conflicts (e.g. there may already be a process with the same PID). OpenVZ gives a unique chance to solve all these problems and to implement a full-fledged, universal checkpoint/restore system: its intrinsic capability to isolate and virtualize groups of processes makes it possible to define a self-consistent state for essentially any configuration of containers using all the kinds of resources available inside a container.
The main features of OpenVZ checkpoint/restore system are:
* No run-time overhead besides actual checkpoint/restore
* Network connection migration support
* Virtualization of PIDs
* Image size minimization
== Checkpoint Module ==
The checkpoint module (<code>vzcpt</code>) provides the following general functionality through <code>ioctl</code>s (a userspace usage sketch is given at the end of this section):
* <code>CPT_SUSPEND</code> – moving processes to the frozen state (container suspending);
* <code>CPT_DUMP</code> – collecting and saving all the container's data to an image file (container dumping);
* <code>CPT_KILL</code> – killing a container;
* <code>CPT_RESUME</code> – resuming processes from frozen state to running state.
Freezing all the processes before saving the container state is necessary because processes inside the container can be connected via IPC, can send signals, and can share files, virtual memory and other objects. To guarantee self-consistency of the saved state, all the processes must be suspended and network connections must be stopped.
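As an illustration, a userspace checkpoint driven through these calls might look roughly like the sketch below. Only the four <code>CPT_*</code> commands listed above come from this page; the <code>/proc/cpt</code> file and the <code>CPT_SET_VEID</code>/<code>CPT_SET_DUMPFD</code> setup ioctls are assumptions about how the module is addressed, so treat this as a sketch rather than the actual <code>vzctl</code> code.
<pre>
/* Sketch: suspend a container and dump its state to an image file.
 * Assumes the vzcpt module is driven through /proc/cpt and that
 * CPT_SET_VEID / CPT_SET_DUMPFD exist as setup ioctls (assumed names);
 * only CPT_SUSPEND, CPT_DUMP, CPT_RESUME and CPT_KILL are taken from
 * the text above. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/cpt_ioctl.h>   /* assumed location of the CPT_* definitions */

int checkpoint_container(unsigned int veid, const char *image)
{
    int cpt = open("/proc/cpt", O_RDWR);
    int dump = open(image, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (cpt < 0 || dump < 0) {
        perror("open");
        return -1;
    }

    /* Tell the module which container to act on and where to write the image. */
    if (ioctl(cpt, CPT_SET_VEID, veid) < 0 ||
        ioctl(cpt, CPT_SET_DUMPFD, dump) < 0)
        goto err;

    /* Phase 1: freeze every process so the saved state is self-consistent. */
    if (ioctl(cpt, CPT_SUSPEND, 0) < 0)
        goto err;

    /* Phase 2: collect and write the frozen container's state to the image. */
    if (ioctl(cpt, CPT_DUMP, 0) < 0) {
        ioctl(cpt, CPT_RESUME, 0);   /* dump failed: let the container run again */
        goto err;
    }

    /* For live migration the container is killed on the source node;
     * for a plain checkpoint it could be resumed instead. */
    ioctl(cpt, CPT_KILL, 0);

    close(dump);
    close(cpt);
    return 0;
err:
    perror("ioctl");
    close(dump);
    close(cpt);
    return -1;
}
</pre>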
== Restore Module ==
The restore module (<code>vzrst</code>) provides the following general functionality through <code>ioctl</code>s:
* <code>CPT_UNDUMP</code> – reconstructing processes and the container's private data from an image file (container undumping);
* <code>CPT_KILL</code> – killing a container;
* <code>CPT_RESUME</code> – resuming processes from frozen state to running state.
After all the necessary kernel structures have been reconstructed, processes are placed in an uninterruptible state, so that they cannot run before the reconstruction of the whole container is complete. Only after the whole container has been restored is it possible to resume network connectivity and wake up the processes. It should be emphasized that migration latency cannot be reduced by waking up some processes before the whole container is restored.
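By analogy with the checkpoint side, restoring a container from an image could look roughly like the sketch below; again <code>/proc/rst</code> and the <code>CPT_SET_*</code> setup commands are assumptions about the interface, and only <code>CPT_UNDUMP</code>, <code>CPT_KILL</code> and <code>CPT_RESUME</code> come from the list above.
<pre>
/* Sketch: rebuild a container from an image file and wake it up.
 * /proc/rst and the CPT_SET_* setup ioctls are assumptions about the
 * vzrst interface; CPT_UNDUMP, CPT_KILL and CPT_RESUME are the commands
 * described in this section. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/cpt_ioctl.h>   /* assumed location of the CPT_* definitions */

int restore_container(unsigned int veid, const char *image)
{
    int rst = open("/proc/rst", O_RDWR);
    int dump = open(image, O_RDONLY);
    if (rst < 0 || dump < 0) {
        perror("open");
        return -1;
    }

    if (ioctl(rst, CPT_SET_VEID, veid) < 0 ||
        ioctl(rst, CPT_SET_DUMPFD, dump) < 0)
        goto err;

    /* Reconstruct processes and container data; they stay uninterruptible. */
    if (ioctl(rst, CPT_UNDUMP, 0) < 0) {
        ioctl(rst, CPT_KILL, 0);     /* partial restore: destroy the remains */
        goto err;
    }

    /* Only after the whole container is rebuilt: restore networking and
     * let the processes run. */
    if (ioctl(rst, CPT_RESUME, 0) < 0)
        goto err;

    close(dump);
    close(rst);
    return 0;
err:
    perror("ioctl");
    close(dump);
    close(rst);
    return -1;
}
</pre>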
== Virtualization of PIDs ==
A process has an identifier (<code>PID</code>), which remains unaltered throughout the process lifecycle. So, the PID has to be restored after migration. But this is impossible if another process with the same PID already exists. This problem was solved in the following way.
Processes created inside a container are assigned a pair of PIDs: one is the traditional PID, i.e. a global value which uniquely identifies the process in the host OS. The other is a virtual PID, which is unique only inside its container but can be reused by several containers. Processes inside a container communicate using only their virtual PIDs, so, provided the virtual PIDs are preserved during checkpointing and restore, the whole container can be transferred to another hardware node without causing PID conflicts. Migrated processes get new global PIDs, but these are invisible from inside the container.
The main drawback of this solution is that the mapping of virtual PIDs to global PIDs has to be maintained, which introduces additional overhead for all the syscalls using a PID as an argument or a return value (e.g. <code>kill()</code>, <code>wait4()</code>, <code>fork()</code>). The overhead (~0.3%) is visible in tests which essentially do nothing but fork and stop processes. This overhead appears only for containers that have been live-migrated; there is no overhead at all for containers which have never migrated.
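Conceptually, the mapping is just a per-container translation table consulted by every PID-related syscall, which is where that small constant overhead comes from. The sketch below is an illustration only, not the kernel implementation; all structure and function names in it are invented for the example.
<pre>
/* Conceptual illustration only: a per-container table translating virtual
 * PIDs (seen inside the container) to global PIDs (used by the host kernel).
 * Names and layout are invented for this example; the real kernel code differs. */
struct pid_map_entry {
    int vpid;   /* PID visible inside the container; preserved across migration */
    int gpid;   /* PID in the host OS; changes after migration, never exposed */
};

struct container_pid_map {
    struct pid_map_entry *entries;
    int count;
};

/* Every syscall that takes or returns a PID (kill(), wait4(), fork(), ...)
 * has to pass through a lookup like this for a migrated container, which is
 * the source of the small constant overhead mentioned above. */
static int vpid_to_gpid(const struct container_pid_map *map, int vpid)
{
    for (int i = 0; i < map->count; i++)
        if (map->entries[i].vpid == vpid)
            return map->entries[i].gpid;
    return -1;   /* no such process inside this container */
}
</pre>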
== Image size minimization ==
CPT needs to save all the container's data to a file after the container has been suspended and to transfer this file to the destination node before the container can be restored. This means that migration latency is proportional to the total size of the image file. However, actual image sizes are surprisingly small for typical tasks.
Image sizes can be much larger when the container being migrated runs processes which use a lot of virtual memory.
== Limitations ==
CPT implements migration of almost all kernel objects, but not all of them. When CPT sees that a process in the container makes use of an unimplemented facility, it does not allow migration.
Another kind of limitation arises when a process uses facilities which are not available on the target node. For example, a process can auto-detect the CPU at runtime and start using instructions specific to this CPU: SSE2, CMOV, etc. In this case migration is possible only when the destination node also supports those facilities.
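For illustration, the kind of runtime feature detection that creates this dependency might look like the sketch below; it uses GCC's <code>__builtin_cpu_supports()</code>, which is just one possible way to probe the CPU and is not part of OpenVZ.
<pre>
/* Sketch: runtime CPU feature detection, the pattern that ties a process to
 * CPUs offering the same features.  Uses GCC builtins; not OpenVZ-specific. */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        printf("using the SSE2 code path (restore needs an SSE2-capable node)\n");
    else
        printf("using the generic code path\n");
    return 0;
}
</pre>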