Checkpointing internals
Checkpoint/restore consists of two phases. The first phase saves the running state of a process. This usually includes the register set, address space, allocated resources, and other process-private data. The second phase reconstructs the original running process from the saved image and resumes execution at exactly the point where it was suspended.
Existing checkpoint/restore systems have several problems. First, except for some process-migration operating systems written from scratch (such as Sprite), they cannot preserve open network connections. Second, general-purpose operating systems such as Unix were not designed to support process migration, so checkpoint/restore systems built on top of existing OSes usually support only a limited set of applications. Third, no system guarantees that processes can be restored on the other side, because of resource conflicts (e.g. a process with the same pid may already exist there). OpenVZ offers a unique chance to solve all of these problems and to implement a full-fledged, universal checkpoint/restore system: its intrinsic capability to isolate and virtualize groups of processes makes it possible to define a self-consistent state for essentially any configuration of VEs, using all the kinds of resources available inside a VE.
The main features of the OpenVZ checkpoint/restore system are:
- No run time overhead besides actual checkpoint/restore
- Network connection migration support
- Virtualization of pids
- Image size minimization
Modular Structure
The main functionality (the checkpoint and restore functions) is implemented as two separate kernel modules:
vzcpt - provides checkpoint functionality;
vzrst - provides restore functionality.
Checkpoint and restore are controlled via ioctl() calls on the pseudo-files /proc/cpt and /proc/rst, which are created in procfs. The ioctl commands are listed in <linux/cpt_ioctl.h>.
Checkpoint Module
The checkpoint module (vzcpt) provides the following general functionality through ioctls:
- CPT_SUSPEND – moves processes to the frozen state (VE suspending);
- CPT_DUMP – collects all VE data and saves it to an image file (VE dumping);
- CPT_KILL – kills the VE;
- CPT_RESUME – resumes processes from the frozen state to the running state.
Freezing all the processes before saving the VE state is necessary because processes inside a VE can be connected via IPC, can send signals to each other, and can share files, virtual memory and other objects. To guarantee that the saved state is self-consistent, all the processes must be suspended and network connections must be stopped.
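The ordering constraint above (freeze first, then dump, then either resume or kill) can be illustrated with a small user-space model. This is only a sketch: `ToyVE`, `State` and `checkpoint` are hypothetical names invented for the example, and nothing here calls the real vzcpt ioctls.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FROZEN = "frozen"
    KILLED = "killed"


class ToyVE:
    """Toy stand-in for a VE (hypothetical; not the vzcpt data model)."""

    def __init__(self, processes):
        self.processes = dict(processes)  # virtual pid -> process data
        self.state = State.RUNNING

    def suspend(self):  # plays the role of CPT_SUSPEND
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running VE can be suspended")
        self.state = State.FROZEN

    def dump(self):  # plays the role of CPT_DUMP
        # Dumping a running VE could capture an inconsistent state,
        # so the model refuses unless every process is frozen.
        if self.state is not State.FROZEN:
            raise RuntimeError("VE must be frozen before dumping")
        return dict(self.processes)

    def resume(self):  # plays the role of CPT_RESUME
        if self.state is not State.FROZEN:
            raise RuntimeError("only a frozen VE can be resumed")
        self.state = State.RUNNING

    def kill(self):  # plays the role of CPT_KILL
        self.state = State.KILLED
        self.processes.clear()


def checkpoint(ve, keep_running):
    """Suspend, dump, then either resume locally or kill the source VE."""
    ve.suspend()
    image = ve.dump()
    if keep_running:
        ve.resume()
    else:
        ve.kill()
    return image
```

In this model, `keep_running=True` corresponds to taking a checkpoint and continuing on the same node, while `keep_running=False` corresponds to the source side of a migration.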
Restore Module
The restore module (vzrst) provides the following general functionality through ioctls:
- CPT_UNDUMP – reconstructs processes and VE private data from an image file (VE undumping);
- CPT_KILL – kills the VE;
- CPT_RESUME – resumes processes from the frozen state to the running state.
After all the necessary kernel structures have been reconstructed, processes are placed in an uninterruptible state, so that they cannot run before reconstruction of the full VE is complete. Only after the whole VE is restored is it possible to resume network connectivity and wake up the processes. It must be emphasized that migration latency cannot be reduced by waking up some processes before the whole VE is restored.
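The "rebuild everything, then wake everything" rule can be sketched as follows. The function `restore`, its `rebuild` callback and the dictionary layout are assumptions made for illustration; they do not reflect how vzrst is implemented.

```python
def restore(image, rebuild):
    """Rebuild every process first; wake them only if all of them succeeded.

    `image` maps virtual pids to saved process data and `rebuild` is a
    caller-supplied function that reconstructs one process (both hypothetical).
    """
    restored = {}
    for vpid, data in image.items():
        # Each rebuilt process is parked, mirroring the uninterruptible state.
        restored[vpid] = {"data": rebuild(data), "state": "uninterruptible"}
    # This point is reached only if no rebuild raised an exception,
    # i.e. the whole VE has been reconstructed.
    for proc in restored.values():
        proc["state"] = "running"
    return restored
```

If any single `rebuild` call fails, the exception propagates before the wake-up loop runs, so no process of a partially restored VE is ever set to running.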
Virtualization of pids
A process has an identifier (pid) that does not change during the process's lifetime, so the pid must be restored after migration. This is impossible, however, if another process with the same pid already exists. The problem was solved in the following way.
Each process created inside a VE is assigned a pair of pids. One is the traditional pid, a global value that uniquely identifies the process in the host OS. The other is a virtual pid, which is unique only inside the VE and can be reused by several VEs. Processes inside a VE communicate using only their virtual pids, so as long as virtual pids are preserved across checkpoint/restore, the whole VE can be transferred to another hardware node without causing pid conflicts. Migrated processes get new global pids, but these are invisible from inside the VE.
The main drawback of this solution is that a mapping of virtual pids to global pids must be maintained, which introduces additional overhead for all syscalls that take a pid as an argument or return one (e.g. kill(), wait4(), fork()). The overhead (~0.3%) is visible in tests that do essentially nothing but fork and stop processes. It appears only for VEs that have been migrated online; there is no overhead at all for VEs that were never migrated.
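The pid pairing described above can be modelled with a lookup table keyed by VE and virtual pid. This is a minimal sketch: the `Node` class, its methods and the pid values are hypothetical and only illustrate why two VEs (or two nodes) can reuse the same virtual pids without conflict.

```python
import itertools


class Node:
    """Toy hardware node that hands out global pids (not kernel code)."""

    def __init__(self, first_free_pid):
        self._next_pid = itertools.count(first_free_pid)
        self.vpid_to_pid = {}  # (ve_id, virtual pid) -> global pid

    def adopt(self, ve_id, vpids):
        # Each process gets a fresh global pid; its virtual pid is preserved.
        for vpid in vpids:
            self.vpid_to_pid[(ve_id, vpid)] = next(self._next_pid)

    def global_pid(self, ve_id, vpid):
        # The extra lookup that pid-taking syscalls have to pay for.
        return self.vpid_to_pid[(ve_id, vpid)]


source = Node(first_free_pid=1000)
source.adopt(ve_id=101, vpids=[1, 2])

destination = Node(first_free_pid=5000)
destination.adopt(ve_id=101, vpids=[1, 2])  # migrated VE keeps vpids 1 and 2
destination.adopt(ve_id=102, vpids=[1, 2])  # another VE reuses the same vpids
```

After the simulated migration, virtual pid 2 of VE 101 still exists on the destination, but it maps to a different global pid than it did on the source.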
Image size minimization
CPT needs to save all the VE data to a file after the VE is suspended and to transfer this file to the destination node before the VE can be restored. Migration latency is therefore proportional to the total size of this image file. Actual image sizes are surprisingly small for typical tasks:
| Size | Applications |
|---|---|
| 2 Mb | idle apache with 8 preforked children |
| 20 Mb | screen + bash + cxoffice + winword + small document |
| 24 Mb | screen + bash + mozilla + Java VM |
| 0.7 Mb | screen + 1 bash |
| 3 Mb | screen + 8 bashes in chain |
| 13 Mb | screen + acroread on 7 Mb pdf file |
| 2 Mb | running full Zeus-4.2r4 |
| 0.9 Mb | sshd with one forked child and bash |
| 3 Mb | mysqld |
| 1 Mb | postgresql server |
| 2.5 Mb | CommuniGate Pro 4.0.6 |
| 25 Mb | phhttps with LinuxThreads. Doc set is /var/www/manual/*.en |
Images can be much larger, however, when the VE being migrated runs processes that use a lot of virtual memory.
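As a rough illustration of how image size translates into latency, the transfer part of a migration can be estimated from the image size and the link speed. The sketch below reads the sizes above as megabytes and assumes a 100 Mbit/s link; both are assumptions, and the estimate ignores dump time, restore time and protocol overhead.

```python
def transfer_seconds(image_megabytes, link_megabits_per_second):
    """Lower bound on transfer time: image size in bits divided by link speed."""
    return image_megabytes * 8 / link_megabits_per_second


# Largest and smallest images listed above, over an assumed 100 Mbit/s link.
largest = transfer_seconds(25, 100)   # 2.0 seconds
smallest = transfer_seconds(0.7, 100)
```

Doubling the image size doubles the estimate, which is why keeping the image small matters for migration latency.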
Limitations
CPT implements migration of almost all kernel objects, but not all of them. When CPT detects that a process in a VE uses a facility for which migration is not implemented, it does not allow the migration.
Another kind of limitation arises when a process uses facilities that are not available on the target node. For example, a process can detect the CPU at run time and start using instructions specific to that CPU, such as SSE2 or CMOV. In this case migration is possible only when the destination node also supports those facilities.
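A compatibility check of this kind boils down to a set comparison between the CPU features the processes rely on and the features of the destination CPU. The sketch below is an assumption about how such a check could look, not a description of what CPT does; `parse_flags` and `missing_features` are hypothetical helpers, and the input follows the `flags` line format of Linux's /proc/cpuinfo.

```python
def parse_flags(cpuinfo_text):
    """Return the CPU flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, destination_flags):
    """Features the processes rely on that the destination node lacks."""
    return set(required) - set(destination_flags)


# Hand-written sample of a destination node without SSE2.
sample = "processor\t: 0\nflags\t\t: fpu cmov mmx\n"
blockers = missing_features({"sse2", "cmov"}, parse_flags(sample))
```

Migration would be acceptable in this model only when `blockers` is empty.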
A third kind of limitation is caused by applications that use non-virtualized capabilities directly accessible at user level. For example, a process could use CPU timestamps to measure time intervals, or it could use the SMP CPU ID to optimize memory accesses. In this case completely transparent migration would be possible only with the virtualization techniques provided by the latest Intel CPUs, and it is not even clear whether the unavoidable overhead introduced at this level of virtualization is worth paying to support such exotic applications.