
Checkpointing internals

713 bytes added, 08:35, 26 December 2015

Marked this version for translation

<translate>The process of checkpoint/restore consists of two phases. The first phase is to save the running state of a process. This usually includes the register set, address space, allocated resources, and other process-private data. The second phase is to reconstruct the original running process from the saved image and resume execution at exactly the point where it was suspended.

There are several problems with existing checkpoint/restore systems. First, except for some process migration operating systems written from scratch (such as Sprite), they cannot preserve open network connections. Second, general-purpose operating systems such as Unix were not designed to support process migration, so checkpoint/restore systems built on top of existing OSs usually support only a limited set of applications. Third, no system guarantees that processes can be restored on the other side, because of resource conflicts (e.g. a process with the same PID may already exist there). OpenVZ gives a unique chance to resolve all these problems and to implement a full-fledged universal checkpoint/restore system: its intrinsic capability to isolate and virtualize groups of processes makes it possible to define a self-consistent state for essentially any configuration of containers, using all the kinds of resources that are available inside a container.

<!--T:3-->
The main features of the OpenVZ checkpoint/restore system are:
* No run-time overhead besides the actual checkpoint/restore
* Network connection migration support
* Virtualization of PIDs
* Image size minimization

== Modular Structure ==

The main functionality (checkpoint and restore) is implemented as two separate kernel modules:
* '''vzcpt''' provides checkpoint functionality;
* '''vzrst''' provides restore functionality.

Checkpoint and restore are controlled via <code>ioctl()</code> calls on regular pseudo-files <code>/proc/cpt</code> and <code>/proc/rst</code> created in <code>procfs</code>. Ioctl commands are listed in <code><linux/cpt_ioctl.h></code>.

== Checkpoint Module ==

The checkpoint module (<code>vzcpt</code>) provides the following general functionality through <code>ioctl</code>s:
* <code>CPT_SUSPEND</code> – moving processes to the frozen state (container suspending);
* <code>CPT_DUMP</code> – collecting and saving all of the container's data to an image file (container dumping);
* <code>CPT_KILL</code> – killing a container;
* <code>CPT_RESUME</code> – resuming processes from the frozen state to the running state.

Freezing all the processes before saving the container state is necessary because processes inside the container can be connected via IPC, can send signals to each other, and can share files, virtual memory and other objects. To guarantee self-consistency of the saved state, all the processes must be suspended and network connections must be stopped.
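The ordering constraint described in this section (suspend first, dump only a frozen container, then resume or kill) can be illustrated with a small model. This is only a sketch: the <code>Container</code> class, its methods and the <code>checkpoint()</code> helper are hypothetical names that mimic the roles of <code>CPT_SUSPEND</code>, <code>CPT_DUMP</code>, <code>CPT_RESUME</code> and <code>CPT_KILL</code>; it does not call the real <code>ioctl</code> interface.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FROZEN = "frozen"
    KILLED = "killed"


class Container:
    """Toy model of a container; all names are illustrative, not OpenVZ APIs."""

    def __init__(self, processes):
        self.processes = dict(processes)  # hypothetical: name -> private data
        self.state = State.RUNNING

    def suspend(self):
        # Plays the role of CPT_SUSPEND: everything stops before any saving.
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running container can be suspended")
        self.state = State.FROZEN

    def dump(self):
        # Plays the role of CPT_DUMP: refuse to save state that may still change.
        if self.state is not State.FROZEN:
            raise RuntimeError("container must be frozen before dumping")
        return dict(self.processes)  # snapshot standing in for the image file

    def resume(self):
        # Plays the role of CPT_RESUME: frozen processes run again.
        if self.state is not State.FROZEN:
            raise RuntimeError("only a frozen container can be resumed")
        self.state = State.RUNNING

    def kill(self):
        # Plays the role of CPT_KILL: the container is destroyed.
        self.state = State.KILLED


def checkpoint(container):
    """Suspend, dump, then resume; returns the saved snapshot."""
    container.suspend()
    try:
        return container.dump()
    finally:
        container.resume()
```

In this model, calling <code>dump()</code> on a running container raises an error, which mirrors the requirement that the saved state must be self-consistent.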

== Restore Module ==

The restore module (<code>vzrst</code>) provides the following general functionality through <code>ioctl</code>s:
* <code>CPT_UNDUMP</code> – reconstructing processes and the container's private data from an image file (container undumping);
* <code>CPT_KILL</code> – killing a container;
* <code>CPT_RESUME</code> – resuming processes from the frozen state to the running state.

After all the necessary kernel structures have been reconstructed, processes are placed in an uninterruptible state, so that they cannot run before reconstruction of the whole container is completed. Only after the whole container is restored is it possible to resume network connectivity and to wake up the processes. It must be emphasized that migration latency cannot be reduced by waking up some processes before the whole container is restored.

== Virtualization of PIDs ==

A process has an identifier (<code>PID</code>), which does not change during the lifetime of the process. So it is necessary to restore the PID after migration. But this is impossible if there is another process with the same PID on the destination node. This problem was solved in the following way.

<!--T:17-->
Processes created inside a container are assigned a pair of PIDs. One is the traditional PID, i.e. a global value which uniquely identifies the process in the host OS. The other is a virtual PID, which is unique only inside the container and can be reused by several containers. Processes inside a container communicate using only their virtual PIDs, so, provided that virtual PIDs are preserved during checkpoint/restore, the whole container can be transferred to another hardware node without causing PID conflicts. Migrated processes get new global PIDs, but these PIDs are invisible from inside the container.

The main drawback of this solution is that it is necessary to maintain a mapping of virtual PIDs to global PIDs, which introduces additional overhead for all the syscalls using a PID as an argument or a return value (e.g. <code>kill()</code>, <code>wait4()</code>, <code>fork()</code>). The overhead (~0.3%) is visible in tests which essentially do nothing but fork and stop processes. This overhead appears only for containers that have been migrated online. There is no overhead at all for containers which have never been migrated.
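The virtual-to-global PID mapping described in this section can be sketched as follows. This is a simplified illustration, not the kernel implementation: the <code>Node</code> and <code>Container</code> classes and their methods are hypothetical, and global PID allocation is reduced to a counter.

```python
import itertools


class Node:
    """Toy hardware node that hands out global PIDs (illustrative only)."""

    def __init__(self, first_pid=1):
        self._next_pid = itertools.count(first_pid)

    def allocate_global_pid(self):
        return next(self._next_pid)


class Container:
    """Toy container holding a virtual PID -> global PID map."""

    def __init__(self, node):
        self.node = node
        self.vpid_to_pid = {}
        self._next_vpid = itertools.count(1)

    def spawn(self):
        # A new process gets both a virtual PID and a global PID.
        vpid = next(self._next_vpid)
        self.vpid_to_pid[vpid] = self.node.allocate_global_pid()
        return vpid

    def resolve(self, vpid):
        # The extra lookup that PID-taking syscalls would have to pay.
        return self.vpid_to_pid[vpid]

    def migrate_to(self, node):
        # Virtual PIDs are kept; only the global side of the map changes.
        self.node = node
        self.vpid_to_pid = {
            vpid: node.allocate_global_pid() for vpid in self.vpid_to_pid
        }
```

After <code>migrate_to()</code>, the keys of the map (what processes inside the container see) are unchanged, while the values are fresh global PIDs taken from the destination node, so they cannot collide with PIDs already in use there.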

== Image size minimization == <!--T:19-->

CPT needs to save all the container data to a file after the container has been suspended, and to transfer this file to the destination node before the container can be restored. This means that migration latency is proportional to the total size of the image file. However, actual image sizes are surprisingly small for typical tasks.

<!--T:21-->
{| class="wikitable sortable"
! Size, MB !! Applications
|-
| 2 || idle apache with 8 preforked children
|-
| 20 || screen + bash + cxoffice + winword + small document
|-
| 24 || screen + bash + mozilla + Java VM
|-
| 0.7 || screen + 1 bash
|-
| 3 || screen + 8 bashes in chain
|-
| 13 || screen + acroread on 7 MB PDF file
|-
| 2 || running full Zeus-4.2r4
|-
| 0.9 || sshd with one forked child and bash
|-
| 3 || mysqld
|-
| 1 || postgresql server
|-
| 2.5 || CommuniGate Pro 4.0.6
|-
| 25 || phhttps with LinuxThreads. Doc set is /var/www/manual/*.en
|}

Images can be much larger when the container being migrated runs processes which use a lot of virtual memory.
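To give a feeling for how image size translates into latency, the transfer part can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: sizes are taken to be megabytes, the link speed is a made-up parameter, and the time spent dumping and restoring is ignored. The function name <code>transfer_seconds</code> is hypothetical.

```python
def transfer_seconds(image_mb, link_mbit_per_s):
    """Estimate the time to send an image of image_mb megabytes.

    Assumes 1 MB = 8 Mbit and a link that sustains link_mbit_per_s;
    dump and restore time are not included.
    """
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    return image_mb * 8 / link_mbit_per_s
```

For example, on an assumed 100 Mbit/s link, a 24 MB image needs about 1.92 seconds of transfer time and a 2 MB image about 0.16 seconds; doubling the image size doubles the estimate.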

== Limitations ==

CPT implements migration of almost all kernel objects, but not all of them. When CPT sees that a process in a container makes use of an unimplemented facility, it does not allow migration.

Another kind of limitation arises when a process uses facilities which are not available on the target node. For example, a process can detect the CPU at run time and start using instructions specific to this CPU, such as SSE2 or CMOV. In this case migration is possible only when the destination node also supports those facilities.
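A conservative pre-migration check for this kind of limitation can be sketched by comparing CPU feature flags of the two nodes. This is an assumption-laden illustration, not what CPT itself does: it relies on the <code>flags</code> line that Linux exposes in <code>/proc/cpuinfo</code> on x86, it compares all flags rather than only those a process really uses, and the function names are hypothetical.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_flags, target_flags):
    """List the features present on the source node but absent on the target."""
    return sorted(set(source_flags) - set(target_flags))
```

An empty result means that every feature of the source CPU is also present on the target; a non-empty result (for example <code>sse2</code>) signals that a process relying on that feature could fail after migration.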

A third kind of limitation is caused by applications which use non-virtualized capabilities that are directly accessible at user level. For example, a process could use CPU timestamps to calculate timings, or it could use the SMP CPU ID to optimize memory accesses. In this case completely transparent migration would be possible only with the virtualization techniques provided by the latest Intel CPUs, and it is not even clear whether the unavoidable overhead introduced at this level of virtualization is worth paying to support such exotic applications.

== See also ==

* [[Checkpointing and live migration]]

* [http://criu.org/Main_Page CRIU (Checkpoint and Restore in Userspace)]

</translate>

[[Category:Kernel]]

[[Category:Kernel_internals]]

Sergey Bronnikov

