The OpenVZ architecture differs from that of traditional virtual machines because a container always runs on the same OS kernel as the host system (while still allowing different Linux distributions in individual containers). This single-kernel design lets containers run with near-zero overhead. As a result, OpenVZ can offer an order of magnitude higher efficiency and manageability than traditional virtualization technologies.
From the point of view of applications and container users, each container is an independent system. This independence is provided by a virtualization layer in the kernel of the host OS. Note that only a negligible part of the CPU resources is spent on virtualization (around 1-2%). The main features of the virtualization layer implemented in OpenVZ are the following:
- A container (CT) looks and behaves like a regular Linux system. It has standard startup scripts, and software from vendors can run inside a container without OpenVZ-specific modifications or adjustments;
- A user can change any configuration file and install additional software;
- Containers are completely isolated from each other (file system, processes, Inter Process Communication (IPC), sysctl variables);
- Processes belonging to a container are scheduled for execution on all available CPUs. Consequently, CTs are not bound to only one CPU and can use all available CPU power.
The OpenVZ network virtualization layer is designed to isolate CTs from each other and from the physical network:
- Each CT has its own IP address; multiple IP addresses per CT are allowed;
- Network traffic of a CT is isolated from the other CTs. In other words, containers are protected from each other in a way that makes traffic snooping impossible;
- Firewalling may be used inside a CT: the CT user can create rules limiting access to some services using the canonical iptables tool;
- Routing table manipulations and advanced routing features are supported for individual containers. Examples include setting different maximum transmission units (MTUs) for different destinations and specifying different source addresses for different destinations.
OpenVZ resource management controls the amount of resources available for containers. The controlled resources include such parameters as CPU power, disk space, a set of memory-related parameters, etc. Resource management allows OpenVZ to:
- Effectively share available host system resources among CTs
- Guarantee Quality-of-Service (QoS)
- Provide performance and resource isolation and protect from denial-of-service attacks
- Collect usage information for system health monitoring
Resource management is much more important for OpenVZ than for a standalone computer, since resource utilization in an OpenVZ-based system is considerably higher than in a typical system. Because all CTs share the same kernel, each CT must stay within its boundaries and not affect other CTs in any way, and this is what resource management ensures.
OpenVZ resource management consists of four main components: two-level disk quota, fair CPU scheduler, disk I/O scheduler, and user beancounters. All of these settings can be changed while a CT is running, with no need to reboot. For example, to give a CT less memory, you change the appropriate parameters on the fly. This is either very hard or impossible with other virtualization approaches, such as virtual machines or hypervisors.
Two-level disk quota
The host system administrator (HW root) can set up per-container disk quotas, in terms of disk blocks and inodes (roughly, the number of files). This is the first level of disk quota. In addition, a container administrator (CT root) can use the usual quota tools inside their own CT to set standard UNIX per-user and per-group disk quotas.
To give a CT more disk space, you just increase its disk quota. There is no need to resize disk partitions.
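The interplay of the two quota levels can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: the `Quota` class and `try_allocate` function are hypothetical names invented for this example, and real quotas also involve soft limits and grace periods that are ignored here. It shows that an allocation must fit both the per-container quota and the per-user quota, and that raising a limit takes effect immediately.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """A disk quota expressed in blocks and inodes (hypothetical model)."""
    block_limit: int
    inode_limit: int
    blocks_used: int = 0
    inodes_used: int = 0

    def allows(self, blocks: int, inodes: int) -> bool:
        # Both the block count and the inode count must stay within limits.
        return (self.blocks_used + blocks <= self.block_limit
                and self.inodes_used + inodes <= self.inode_limit)

    def charge(self, blocks: int, inodes: int) -> None:
        self.blocks_used += blocks
        self.inodes_used += inodes


def try_allocate(ct_quota: Quota, user_quota: Quota,
                 blocks: int, inodes: int = 1) -> bool:
    """Charge an allocation only if both quota levels permit it."""
    if not (ct_quota.allows(blocks, inodes)
            and user_quota.allows(blocks, inodes)):
        return False
    ct_quota.charge(blocks, inodes)
    user_quota.charge(blocks, inodes)
    return True


ct = Quota(block_limit=1000, inode_limit=100)   # first level, set by HW root
alice = Quota(block_limit=800, inode_limit=50)  # second level, set by CT root

first = try_allocate(ct, alice, blocks=700)     # fits both levels
second = try_allocate(ct, alice, blocks=200)    # exceeds the per-user quota
ct.block_limit = 2000                           # HW root raises the CT quota
alice.block_limit = 1500                        # CT root raises the user quota
third = try_allocate(ct, alice, blocks=200)     # fits after the increase
```

Note that no resizing step appears anywhere in the model: changing a limit is a plain assignment, which mirrors the point made above about not having to resize partitions.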
Fair CPU scheduler
The CPU scheduler in OpenVZ is a two-level implementation of the fair-share scheduling strategy.
On the first level, the scheduler decides which CT to give the CPU time slice to, based on per-CT cpuunits values. On the second level, the standard Linux scheduler decides which process to run in that container, using standard Linux process priorities.
The OpenVZ administrator can set different cpuunits values for different containers, and CPU time is distributed among them proportionally.
There is also a way to limit CPU time, for example to restrict a container to 10% of the available CPU time.
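The arithmetic behind proportional sharing can be sketched in a few lines of Python. This is an illustrative model, not the kernel's algorithm: the function name `cpu_shares` and the `cpulimit` argument (expressed as a fraction of total CPU time) are assumptions made for this example. It also assumes every container is busy, all cpuunits values are positive, and CPU time freed by a capped container is redistributed among the others in proportion to their cpuunits.

```python
def cpu_shares(cpuunits, cpulimit=None):
    """Return the fraction of CPU time each busy container would receive.

    cpuunits: dict mapping container ID -> positive cpuunits weight
    cpulimit: optional dict mapping container ID -> hard cap (0.0 to 1.0)
    """
    cpulimit = cpulimit or {}
    shares = {}
    active = dict(cpuunits)   # containers whose share is not yet decided
    remaining = 1.0           # CPU time still to be distributed
    while active:
        total = sum(active.values())
        # Containers whose proportional share would exceed their cap.
        capped = {
            ct: cpulimit[ct]
            for ct, units in active.items()
            if ct in cpulimit and remaining * units / total > cpulimit[ct]
        }
        if not capped:
            # Nobody is over a cap: split the rest proportionally.
            for ct, units in active.items():
                shares[ct] = remaining * units / total
            break
        # Pin capped containers at their limit and redistribute the rest.
        for ct, cap in capped.items():
            shares[ct] = cap
            remaining -= cap
            del active[ct]
    return shares


plain = cpu_shares({101: 1000, 102: 2000, 103: 1000})
limited = cpu_shares({101: 1000, 102: 2000, 103: 1000}, cpulimit={102: 0.10})
```

In the first call container 102 has twice the cpuunits of the others, so it receives half of the CPU time and the other two a quarter each. In the second call it is capped at 10%, and the remaining 90% is split evenly between containers 101 and 103.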
I/O scheduler
Like the fair CPU scheduler described above, the I/O scheduler in OpenVZ is two-level, using Jens Axboe's CFQ I/O scheduler on its second level.
Each container is assigned an I/O priority, and the I/O scheduler distributes the available I/O bandwidth according to the priorities assigned. Thus no single container can saturate an I/O channel.
User beancounters
User beancounters are a set of per-CT counters, limits, and guarantees. There are about 20 parameters, carefully chosen to cover all aspects of CT operation, so that no single container can abuse a resource that is limited for the whole node and thereby harm other CTs.
The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared memory segments and network buffers. Each resource can be seen in /proc/user_beancounters and has five values associated with it: current usage, maximum usage (for the lifetime of a container), barrier, limit, and fail counter. The meaning of barrier and limit is parameter-dependent; in short, they can be thought of as a soft limit and a hard limit. If any resource hits the limit, its fail counter is increased, so the CT administrator can see whether something bad is happening by analyzing the output of /proc/user_beancounters in their container.
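As an illustration of checking the fail counters, the following Python sketch parses text in the layout of /proc/user_beancounters and reports the resources whose fail counter is non-zero. The column layout (held, maxheld, barrier, limit, failcnt) follows the five values listed above; the sample text and its numbers are made up for this example, and the helper names `parse_beancounters` and `failed_resources` are hypothetical. On a real system you would read the file itself instead of `SAMPLE`, and the exact layout may differ between kernel versions.

```python
from typing import NamedTuple


class Beancounter(NamedTuple):
    held: int      # current usage
    maxheld: int   # maximum usage over the container's lifetime
    barrier: int   # soft limit
    limit: int     # hard limit
    failcnt: int   # number of refused requests


def parse_beancounters(text):
    """Parse beancounter text into {ctid: {resource: Beancounter}}."""
    result = {}
    ctid = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the version line, and the column header.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # A field such as "101:" starts the rows of a new container.
        if fields[0].endswith(":"):
            ctid = int(fields[0][:-1])
            fields = fields[1:]
        if ctid is None or len(fields) != 6:
            continue
        name, *values = fields
        result.setdefault(ctid, {})[name] = Beancounter(*map(int, values))
    return result


def failed_resources(counters):
    """Return the names of resources whose fail counter is non-zero."""
    return sorted(name for name, bc in counters.items() if bc.failcnt > 0)


# Illustrative sample only; the numbers are invented.
SAMPLE = """\
Version: 2.5
       uid  resource        held   maxheld   barrier     limit  failcnt
      101:  kmemsize     2871296   3245056  14372700  14790164        0
            privvmpages    65000     65536     65536     69632        3
            numproc           24        31       240       240        0
"""

parsed = parse_beancounters(SAMPLE)
problems = failed_resources(parsed[101])
```

With the sample above, `problems` contains only `privvmpages`, the one resource whose fail counter is greater than zero.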
Checkpointing and live migration
A live migration and checkpointing feature was released for OpenVZ in the middle of April 2006. It allows a container to be migrated from one physical server to another without shutting down or restarting it. The underlying process is known as checkpointing: a CT is frozen and its whole state is saved to a file on disk. This file can then be transferred to another machine, where the CT can be unfrozen (restored). The delay is a few seconds, and it is not downtime, just a delay.
Since every piece of the container state, including open network connections, is saved, from the user's perspective it looks like a delay in response: say, one database transaction takes longer than usual, then processing continues as normal, and the user doesn't notice that the database is already running on another machine.
This feature makes possible scenarios such as upgrading your server without any need to reboot it: if your database needs more memory or CPU resources, you buy a newer, better server, live migrate your container to it, and then increase its limits. If you want to add more RAM to your server, you migrate all containers to another one, shut the server down, add memory, start it again, and migrate all containers back.
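The freeze, save, transfer, and restore cycle can be mimicked with a toy Python model. This is only an analogy, not how OpenVZ works internally: real checkpointing captures kernel-level state such as processes, memory, and network connections, whereas this sketch saves a small dictionary as JSON. The `ToyContainer` class and the `checkpoint` and `restore` functions are hypothetical names invented for this example.

```python
import json
import os
import tempfile


class ToyContainer:
    """A stand-in for a container whose whole state is one dictionary."""

    def __init__(self, ctid, state=None):
        self.ctid = ctid
        self.state = state if state is not None else {"requests_served": 0}
        self.frozen = False

    def serve(self):
        # A frozen container does no work until it is restored.
        if self.frozen:
            raise RuntimeError("container is frozen")
        self.state["requests_served"] += 1


def checkpoint(ct, path):
    """Freeze the container and save its whole state to a file."""
    ct.frozen = True
    with open(path, "w") as f:
        json.dump({"ctid": ct.ctid, "state": ct.state}, f)


def restore(path):
    """Recreate an unfrozen container from a saved state file."""
    with open(path) as f:
        dump = json.load(f)
    return ToyContainer(dump["ctid"], dump["state"])


# "Source server": the container handles some requests, then is frozen.
source = ToyContainer(101)
for _ in range(3):
    source.serve()

dump_path = os.path.join(tempfile.mkdtemp(), "ct101.dump")
checkpoint(source, dump_path)

# "Destination server": the state file has been transferred and is restored.
migrated = restore(dump_path)
migrated.serve()  # work continues from where it stopped
```

After the restore, the request counter continues from its saved value rather than starting from zero, which corresponds to the user seeing only a delay rather than a restart.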