Checkpointing internals
Checkpoint/restore consists of two phases. The first phase saves the running state of a process, which usually includes the register set, address space, allocated resources, and other process-private data. The second phase reconstructs the original running process from the saved image and resumes execution at exactly the point where it was suspended.
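The two phases can be illustrated with a toy model. This is only a sketch: the "process" is a plain dictionary with a counter and an accumulator, and `run`, `checkpoint`, and `restore` are hypothetical names, not part of OpenVZ. It shows that a run interrupted by checkpoint and restore ends in the same state as an uninterrupted run.

```python
import json

def run(state, steps):
    """Advance a toy 'process' that adds 0, 1, 2, ... into `acc`."""
    for _ in range(steps):
        state["acc"] += state["pc"]   # "memory" updated by the current step
        state["pc"] += 1              # "program counter" moves on
    return state

def checkpoint(state):
    """Phase 1: serialize the complete running state into an image."""
    return json.dumps(state)

def restore(image):
    """Phase 2: rebuild the state from the image so execution can continue."""
    return json.loads(image)

# Run 3 steps, checkpoint, restore from the image, then run 2 more steps.
original = run({"pc": 0, "acc": 0}, 3)
image = checkpoint(original)
resumed = run(restore(image), 2)

# An uninterrupted run of 5 steps is the reference.
uninterrupted = run({"pc": 0, "acc": 0}, 5)
print(resumed == uninterrupted)
```

A real checkpoint must capture far more than two integers, but the requirement is the same: the image has to contain everything needed to continue from the suspension point.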
Existing checkpoint/restore systems have several problems. First, except for some process-migration operating systems written from scratch (such as Sprite), they cannot preserve open network connections. Second, general-purpose operating systems such as Unix were not designed to support process migration, so checkpoint/restore systems built on top of existing OSs usually support only a limited set of applications. Third, none of these systems guarantees that processes can be restored on the other side, because of resource conflicts (e.g. a process with the same PID may already exist there). OpenVZ offers a unique chance to solve all of these problems and to implement a full-fledged, universal checkpoint/restore system: its intrinsic ability to isolate and virtualize groups of processes makes it possible to define a self-consistent state for essentially any container configuration, using all the kinds of resources that are available inside a container.
The main features of the OpenVZ checkpoint/restore system are:
- No runtime overhead apart from the actual checkpoint/restore
- Network connection migration support
- Virtualization of PIDs
- Image size minimization
Modular Structure
The main functionality (the checkpoint and restore functions) is implemented as two separate kernel modules:
- vzcpt: provides checkpoint functionality;
- vzrst: provides restore functionality.
Checkpoint and restore are controlled via ioctl() calls on the regular pseudo-files /proc/cpt and /proc/rst, which are created in procfs. The ioctl commands are listed in <linux/cpt_ioctl.h>.
Checkpoint Module
The checkpoint module (vzcpt) provides the following general functionality via ioctls:
- CPT_SUSPEND – moving processes to the frozen state (container suspending);
- CPT_DUMP – collecting and saving all of the container's data to an image file (container dumping);
- CPT_KILL – killing a container;
- CPT_RESUME – resuming processes from the frozen state to the running state.
Freezing all processes before saving the container state is necessary because processes inside the container can be connected via IPC, can send signals to one another, and can share files, virtual memory, and other objects. To guarantee that the saved state is self-consistent, all processes must be suspended and network connections must be stopped.
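The ordering constraint described above (freeze everything, then dump) can be sketched as a small state model. This is an illustration only, not the kernel interface: the `Container` class and its methods are hypothetical, and the choice to make `dump` refuse a container that is not fully frozen is an assumption made to express the requirement.

```python
class Container:
    """Toy model of the checkpoint-side command ordering (not the kernel API)."""

    def __init__(self, processes):
        # Map each process name to its state: "running" or "frozen".
        self.processes = {name: "running" for name in processes}
        self.alive = True

    def suspend(self):                      # models CPT_SUSPEND
        for name in self.processes:
            self.processes[name] = "frozen"

    def dump(self):                         # models CPT_DUMP
        if any(s != "frozen" for s in self.processes.values()):
            raise RuntimeError("dump refused: container is not fully frozen")
        return sorted(self.processes)       # stand-in for the image contents

    def resume(self):                       # models CPT_RESUME
        for name in self.processes:
            self.processes[name] = "running"

    def kill(self):                         # models CPT_KILL
        self.processes.clear()
        self.alive = False


ct = Container(["init", "sshd", "bash"])
try:
    ct.dump()                               # attempted while processes still run
    dumped_while_running = True
except RuntimeError:
    dumped_while_running = False

ct.suspend()
image = ct.dump()                           # every process is frozen now
ct.resume()
print(dumped_while_running, image, ct.processes["bash"])
```

After a successful dump the container is either resumed in place (as here) or killed once its image has been moved elsewhere.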
Restore Module
The restore module (vzrst) provides the following general functionality via ioctls:
- CPT_UNDUMP – reconstructing processes and the container's private data from an image file (container undumping);
- CPT_KILL – killing a container;
- CPT_RESUME – resuming processes from the frozen state to the running state.
After all the necessary kernel structures are reconstructed, the processes are placed in an uninterruptible state so that they cannot run before the whole container has been reconstructed. Only after the whole container is restored is it possible to resume network connectivity and wake up the processes. It must be emphasized that migration latency cannot be reduced by waking up some processes before the whole container is restored.
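The restore-side rule can be sketched in the same toy style. The functions `undump` and `resume` below are hypothetical stand-ins for the effect of CPT_UNDUMP and CPT_RESUME, and representing the image as a list of process names is an assumption made for brevity.

```python
def undump(image):
    """Rebuild every process in an uninterruptible state (models CPT_UNDUMP)."""
    return {name: "uninterruptible" for name in image}

def resume(processes, expected):
    """Wake processes only if the whole container was rebuilt (models CPT_RESUME)."""
    if set(processes) != set(expected):
        raise RuntimeError("resume refused: container is only partly restored")
    return {name: "running" for name in processes}

image = ["init", "sshd", "bash"]
restored = undump(image)
running = resume(restored, image)

# A partly restored container must not be woken up.
partial = undump(["init"])
try:
    resume(partial, image)
    woke_partial = True
except RuntimeError:
    woke_partial = False
print(restored["sshd"], running["sshd"], woke_partial)
```

The check in `resume` mirrors the point made above: no process may run until every process it might interact with exists again.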
Virtualization of PIDs
A process has an identifier (PID) that does not change during the lifetime of the process, so the PID must be restored after migration. This is impossible if another process with the same PID already exists. The problem was solved in the following way.
Each process created inside a container is assigned a pair of PIDs. One is the traditional PID, i.e. a global value that uniquely identifies the process in the host OS. The other is a virtual PID, which is unique only inside the container, so the same value can be used in several containers. Processes inside a container communicate using only their virtual PIDs. Provided that virtual PIDs are preserved across checkpoint/restore, the whole container can therefore be transferred to another hardware node without causing PID conflicts. Migrated processes get new global PIDs, but these are invisible from inside the container.
The main drawback of this solution is that a mapping from virtual PIDs to global PIDs must be maintained, which introduces additional overhead for all syscalls that take a PID as an argument or return one (e.g. kill(), wait4(), fork()). The overhead (~0.3%) is visible in tests that do essentially nothing but fork and stop processes. It appears only for containers that have been migrated online. There is no overhead at all for containers that have never been migrated.
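The PID pairing can be sketched as a per-container lookup table. This is a simplified illustration rather than the kernel data structure: `PidNamespace`, `migrate`, and the PID values are hypothetical, and the destination's PID allocator is modeled by a simple counter.

```python
import itertools

class PidNamespace:
    """Toy per-container table mapping virtual PIDs to global PIDs."""

    def __init__(self):
        self.vpid_to_pid = {}

    def attach(self, vpid, global_pid):
        self.vpid_to_pid[vpid] = global_pid

    def to_global(self, vpid):
        # The lookup a PID-taking syscall would need; this is the extra cost.
        return self.vpid_to_pid[vpid]

    def virtual_pids(self):
        return sorted(self.vpid_to_pid)

def migrate(source, allocate_global_pid):
    """Recreate the container elsewhere: same virtual PIDs, new global PIDs."""
    target = PidNamespace()
    for vpid in source.virtual_pids():
        target.attach(vpid, allocate_global_pid())
    return target

src = PidNamespace()
src.attach(1, 4001)
src.attach(57, 4002)

# Hypothetical allocator on the destination node handing out free global PIDs.
free_pids = itertools.count(9100)
dst = migrate(src, lambda: next(free_pids))

print(dst.virtual_pids() == src.virtual_pids())   # virtual PIDs are preserved
print(src.to_global(57), dst.to_global(57))       # global PIDs differ per node
```

Processes inside the container only ever see the keys of this table, which is why the values can change freely during migration.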
Image size minimization
CPT needs to save all the container data to a file after the container is suspended and to transfer this file to the destination node before the container can be restored. This means that migration latency is proportional to the total size of the image file. Actual image sizes, however, are surprisingly small for typical tasks.
Size, MB | Applications |
---|---|
2 | idle apache with 8 preforked children |
20 | screen + bash + cxoffice + winword + small document |
24 | screen + bash + mozilla + Java VM |
0.7 | screen + 1 bash |
3 | screen + 8 bashes in chain |
13 | screen + acroread on a 7 MB PDF file |
2 | running full Zeus-4.2r4 |
0.9 | sshd with one forked child and bash |
3 | mysqld |
1 | postgresql server |
2.5 | CommuniGate Pro 4.0.6 |
25 | phhttps with LinuxThreads. Doc set is /var/www/manual/*.en |
Images can be much larger when the container being migrated runs processes that use a lot of virtual memory.
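To get a feel for what these sizes mean for latency, the transfer part alone can be estimated as size divided by bandwidth. This is a rough sketch under stated assumptions: the table sizes are read as megabytes, the link speed of 100 Mbit/s is an arbitrary example value that does not come from the text, and the time spent dumping and restoring is ignored.

```python
def transfer_seconds(image_mb, link_mbit_per_s):
    """Time to move an image of `image_mb` megabytes over the given link."""
    return image_mb * 8 / link_mbit_per_s   # 8 bits per byte

LINK = 100  # assumed link speed in Mbit/s; not a figure from the text

samples = [
    (0.7, "screen + 1 bash"),
    (3, "mysqld"),
    (25, "largest image in the table"),
]
for size_mb, workload in samples:
    print(f"{workload}: {transfer_seconds(size_mb, LINK):.2f} s")
```

Because the estimate is linear in the image size, a process with hundreds of megabytes of virtual memory pushes the transfer time up by the same factor.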
Limitations
CPT implements migration of almost all kernel objects, but not all of them. When CPT sees that a process in the container uses an unimplemented facility, it does not allow migration.
Another kind of limitation arises when a process uses facilities that are not available on the target node. For example, a process can detect the CPU at runtime and start using instructions specific to that CPU, such as SSE2 or CMOV. In this case migration is possible only when the destination node also supports those facilities.
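One conservative way to check this condition before migrating is to compare the CPU feature flags of the two nodes. The sketch below is an assumption about how such a check could look, not a description of what OpenVZ does: `cpu_flags` and `missing_on_target` are hypothetical helpers, and the input strings imitate the `flags` line of `/proc/cpuinfo` on x86 Linux.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

def missing_on_target(source_flags, target_flags):
    """Features the source node offers that the target node lacks."""
    return source_flags - target_flags

# Example inputs; on a real node the text would come from /proc/cpuinfo.
source = cpu_flags("model name : example CPU A\nflags : fpu cmov mmx sse sse2\n")
target = cpu_flags("model name : example CPU B\nflags : fpu cmov mmx sse\n")

missing = missing_on_target(source, target)
print(sorted(missing))
```

Comparing whole flag sets is stricter than necessary, since a given process may not use every feature, but it avoids having to know which instructions the process has already chosen.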
A third kind of limitation is caused by applications that use non-virtualized capabilities directly accessible at user level. For example, a process could use CPU timestamps to calculate timings, or it could use the SMP CPU ID to optimize memory accesses. In this case completely transparent migration would be possible only with the virtualization techniques provided by the latest Intel CPUs, and it is not even clear whether the unavoidable overhead introduced at this level of virtualization is worth paying to support such exotic applications.