The architecture of OpenVZ VEs differs from the traditional virtual machine architecture because a VE always runs the same OS kernel as the host system (while still allowing different Linux distributions in individual VEs). This single-kernel implementation enables running Virtual Environments with near-zero overhead. Thus, OpenVZ offers an order of magnitude higher efficiency and manageability than traditional virtualization technologies.
OS Virtualization
From the point of view of applications and Virtual Environment users, each VE is an independent system. This independence is provided by a virtualization layer in the kernel of the host OS. Note that only a negligible part of the CPU resources (around 1-2%) is spent on virtualization. The main features of the virtualization layer implemented in OpenVZ are the following:
- A VE looks and behaves like a regular Linux system. It has standard startup scripts; software from vendors can run inside a VE without OpenVZ-specific modifications or adjustment;
- A user can change any configuration file and install additional software;
- Virtual Environments are completely isolated from each other (file system, processes, Inter Process Communication (IPC), sysctl variables);
- Processes belonging to a VE are scheduled for execution on all available CPUs. Consequently, VEs are not bound to only one CPU and can use all available CPU power.
Network virtualization
The OpenVZ network virtualization layer is designed to isolate VEs from each other and from the physical network:
- Each VE has its own IP address; multiple IP addresses per VE are allowed;
- Network traffic of a VE is isolated from the other VEs. In other words, Virtual Environments are protected from each other in a way that makes traffic snooping impossible;
- Firewalling may be used inside a VE: the user can create rules limiting access to some services using the canonical iptables tool from inside the VE;
- Routing table manipulations and advanced routing features are supported for individual VEs, for example, setting different maximum transmission units (MTUs) for different destinations, specifying different source addresses for different destinations, and so on.
Resource Management
OpenVZ resource management controls the amount of resources available for Virtual Environments. The controlled resources include such parameters as CPU power, disk space, a set of memory-related parameters, etc. Resource management allows OpenVZ to:
- Effectively share available Hardware Node resources among VEs
- Guarantee Quality-of-Service (QoS)
- Provide performance and resource isolation and protect from denial-of-service attacks
- Collect usage information for system health monitoring
Resource management is much more important for OpenVZ than for a standalone computer, since resource utilization in an OpenVZ-based system is considerably higher than in a typical system. Because all the VEs use the same kernel, each VE must stay within its boundaries and not affect other VEs in any way, and this is what resource management ensures.
OpenVZ resource management consists of three components: two-level disk quota, fair CPU scheduler, and user beancounters. Note that all of these resources can be changed while a VE is running; there is no need to reboot. For example, if you want to give your VE less memory, you just change the appropriate parameters on the fly. This is either very hard or not possible at all with other virtualization approaches such as virtual machines or hypervisors.
Two-Level Disk Quota
The owner (root) of the host system (OpenVZ) can set up per-VE disk quotas, in terms of disk blocks and i-nodes (roughly, the number of files). This is the first level of disk quota. In addition, a VE owner (root) can use the usual quota tools inside their own VE to set standard UNIX per-user and per-group disk quotas. This is the second level.
If you want to give your VE more disk space, you just increase its disk quota. There is no need to resize disk partitions.
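The two quota levels can be pictured with a small model. The following is only a sketch of the idea, not OpenVZ code: the `Quota` and `VE` classes and the `allocate` method are hypothetical names invented for illustration, and per-group quotas are left out. An allocation succeeds only if it fits both the per-VE quota set by the host owner and, where one exists, the per-user quota set by the VE owner.

```python
from dataclasses import dataclass, field


@dataclass
class Quota:
    """A limit on disk blocks and i-nodes, plus current usage (hypothetical model)."""
    block_limit: int
    inode_limit: int
    blocks: int = 0
    inodes: int = 0

    def fits(self, blocks, inodes):
        return (self.blocks + blocks <= self.block_limit
                and self.inodes + inodes <= self.inode_limit)

    def charge(self, blocks, inodes):
        self.blocks += blocks
        self.inodes += inodes


@dataclass
class VE:
    ve_quota: Quota                                   # first level, set by host root
    user_quotas: dict = field(default_factory=dict)   # second level, set by VE root

    def allocate(self, user, blocks, inodes=1):
        """Charge an allocation to both levels, or refuse it without side effects."""
        user_quota = self.user_quotas.get(user)  # None means no per-user quota
        if not self.ve_quota.fits(blocks, inodes):
            return False
        if user_quota is not None and not user_quota.fits(blocks, inodes):
            return False
        self.ve_quota.charge(blocks, inodes)
        if user_quota is not None:
            user_quota.charge(blocks, inodes)
        return True


ve = VE(Quota(block_limit=1000, inode_limit=100),
        {"alice": Quota(block_limit=300, inode_limit=50)})
first = ve.allocate("alice", 200)      # fits both levels
second = ve.allocate("alice", 200)     # refused by alice's per-user quota
third = ve.allocate("bob", 700)        # bob has no per-user quota
fourth = ve.allocate("bob", 200)       # refused by the per-VE quota
ve.ve_quota.block_limit = 2000         # "just increase its disk quota"
fifth = ve.allocate("bob", 200)        # now fits
```

In this model, raising `block_limit` takes effect on the very next allocation, which mirrors the point that no partition resizing is involved.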
Fair CPU scheduler
The CPU scheduler in OpenVZ is a two-level implementation of the fair-share scheduling strategy.
On the first level, the scheduler decides which VE to give the CPU time slice to, based on per-VE cpuunits values. On the second level, the standard Linux scheduler decides which process to run in that VE, using standard Linux process priorities.
The OpenVZ administrator can set different cpuunits values for different VEs, and CPU time is distributed among them in proportion to those values.
There is also a way to limit CPU time, for example, to restrict a VE to 10% of the available CPU time.
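The proportional split can be made concrete with a little arithmetic. The sketch below is not the OpenVZ scheduler; it only computes the long-run fraction of CPU time each VE would receive if every VE were constantly busy. The function name `cpu_shares` and the `cpulimit` argument (a cap expressed as a fraction between 0 and 1) are assumptions made for this example, as is the choice to hand time freed by a capped VE to the remaining VEs in proportion to their cpuunits.

```python
def cpu_shares(cpuunits, cpulimit=None):
    """Return the fraction of CPU time per VE, assuming all VEs are busy.

    cpuunits: dict mapping VE id -> positive cpuunits weight.
    cpulimit: optional dict mapping VE id -> hard cap as a fraction (0..1).
    """
    cpulimit = cpulimit or {}
    shares = {}
    active = dict(cpuunits)   # VEs whose share is not fixed yet
    remaining = 1.0           # CPU time not assigned yet
    while active:
        total = sum(active.values())
        # Split the remaining time among active VEs in proportion to cpuunits.
        proposal = {ve: remaining * units / total for ve, units in active.items()}
        capped = [ve for ve, share in proposal.items()
                  if share > cpulimit.get(ve, 1.0)]
        if not capped:
            shares.update(proposal)
            break
        # Pin capped VEs at their limit, then redistribute what is left.
        for ve in capped:
            shares[ve] = cpulimit[ve]
            remaining -= cpulimit[ve]
            del active[ve]
    return shares


uncapped = cpu_shares({101: 1000, 102: 2000, 103: 1000})
capped = cpu_shares({101: 1000, 102: 2000, 103: 1000}, cpulimit={102: 0.10})
```

With cpuunits of 1000, 2000 and 1000, the three VEs get a quarter, a half and a quarter of the CPU. Once VE 102 is limited to 10%, the other two split the remaining 90% equally because their cpuunits are equal.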
User Beancounters
User Beancounters is a set of per-VE counters, limits, and guarantees. There are about 20 parameters, carefully chosen to cover all aspects of VE operation, so that no single VE can abuse a resource that is limited for the whole node and thereby harm other VEs.
The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared memory segments and network buffers. Each resource can be seen in /proc/user_beancounters and has five values associated with it: current usage, maximum usage (over the lifetime of a VE), barrier, limit, and fail counter. The meaning of barrier and limit is parameter-dependent; in short, they can be thought of as a soft limit and a hard limit. If any resource hits its limit, its fail counter is increased, so the VE owner can see whether something bad is happening by analyzing the output of /proc/user_beancounters in her VE.
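Checking the fail counters can be automated. The sketch below parses text in the layout that /proc/user_beancounters is assumed to have here: a version line, a header line, and then one line per resource with five integers (current usage, maximum usage, barrier, limit, fail counter), where the first resource line is prefixed with the VE id. The sample text, its numbers, and the function names `parse_beancounters` and `failed_resources` are made up for illustration, and the parser handles the single-VE view seen from inside a VE.

```python
# Illustrative sample only; the layout and the values are assumptions.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize        2297344    2711552   11055923   11377049          0
            numproc              21         30        240        240          0
            privvmpages        65536      65536      65536      69632          3
"""

FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text):
    """Return {resource: {field: value}} for one VE's beancounter listing."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # The first resource line of a VE starts with "<id>:"; drop that token.
        if parts and parts[0].endswith(":") and parts[0][:-1].isdigit():
            parts = parts[1:]
        # Data lines are a resource name followed by five integers;
        # the version and header lines do not match and are skipped.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            counters[parts[0]] = dict(zip(FIELDS, map(int, parts[1:])))
    return counters


def failed_resources(counters):
    """Names of resources whose fail counter is non-zero."""
    return sorted(name for name, c in counters.items() if c["failcnt"] > 0)


counters = parse_beancounters(SAMPLE)
problems = failed_resources(counters)
```

On a real system the text would come from reading /proc/user_beancounters rather than from `SAMPLE`. A non-empty `problems` list indicates that some allocation was refused at least once.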
Checkpointing and live migration
A live migration and checkpointing feature was released for OpenVZ in the middle of April 2006. It allows a VE to be migrated from one physical server to another without shutting down or restarting the VE. The process is known as checkpointing: a VE is frozen and its whole state is saved to a file on disk. This file can then be transferred to another machine, where the VE can be unfrozen (restored). The delay is a few seconds, and it is a delay rather than downtime.
Since every piece of VE state, including open network connections, is saved, from the user's perspective it looks like a delay in response: say, one database transaction takes longer than usual, then everything continues as normal and the user does not notice that the database is already running on another machine.
This feature makes possible scenarios such as upgrading your server without any need to reboot it: if your database needs more memory or CPU resources, you just buy a newer, better server, live migrate your VE to it, and then increase its limits. If you want to add more RAM to your server, you migrate all VEs to another one, shut the server down, add memory, start it again, and migrate all VEs back.
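The freeze, save, transfer, restore sequence can be illustrated with a toy model. This is not how OpenVZ implements checkpointing, which saves kernel-level state such as processes and network connections. Here the "VE" is just a Python object whose state is a dictionary, and `ToyVE`, `checkpoint`, `restore` and `migrate` are hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class ToyVE:
    """A stand-in for a VE whose entire state is one dictionary."""

    def __init__(self, state=None):
        self.state = state if state is not None else {}
        self.frozen = False

    def step(self):
        """Do one unit of work; refused while the VE is frozen."""
        if self.frozen:
            raise RuntimeError("VE is frozen")
        self.state["counter"] = self.state.get("counter", 0) + 1

    def checkpoint(self, path):
        """Freeze the VE and save its whole state to a file."""
        self.frozen = True
        with open(path, "w") as f:
            json.dump(self.state, f)

    @classmethod
    def restore(cls, path):
        """Create a running VE from a previously saved state file."""
        with open(path) as f:
            return cls(json.load(f))


def migrate(ve, workdir):
    dump = os.path.join(workdir, "ve.dump")
    ve.checkpoint(dump)
    # A real migration would copy the dump file to the destination node here.
    return ToyVE.restore(dump)


source_ve = ToyVE()
source_ve.step()
source_ve.step()
with tempfile.TemporaryDirectory() as workdir:
    target_ve = migrate(source_ve, workdir)
target_ve.step()   # work resumes where it stopped, on the "new machine"
```

The restored object continues counting from the saved value, which is the toy equivalent of a database transaction that merely takes a little longer while the VE moves.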