WP/What are containers

From OpenVZ Virtuozzo Containers Wiki
Revision as of 14:48, 24 March 2011 by Kir (talk | contribs) (Namespaces: make image bigger)

OpenVZ Linux Containers technology whitepaper

OpenVZ is an open source virtualization technology for Linux that enables the partitioning of a single physical Linux machine into multiple smaller independent units called containers.

Technically, OpenVZ consists of three major building blocks:

  • Namespaces
  • Resource management
  • Checkpointing / live migration

Namespaces

A namespace is an abstract environment created to hold a logical grouping of unique identifiers or symbols (i.e., names). An identifier defined in a namespace is associated with that namespace. The same identifier can be independently defined in multiple namespaces. An example of a namespace is a directory on a file system: two files with the same name can be stored on the same device as long as they are stored in different directories.

For OpenVZ, Linux kernel namespaces are used as the building blocks of containers. A simple case of a namespace is chroot.

[Figure: Chroot.png]

Chroot

The traditional UNIX chroot() system call changes the root of the file system of the calling process to a particular directory. This limits the scope of the file system for the process, so it can only see and access a limited subtree of files and directories.

Chroot is still used for application isolation (although, unlike a container, it does not provide full isolation).

Chroot is also used by containers, so a container filesystem is just a directory on the host. The consequences are:

  • there is no need for a separate block device, hard drive partition, or filesystem-in-a-file setup;
  • the host system administrator can see all the containers' files;
  • container backup and restore are trivial;
  • mass deployment is easy.
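
The "container filesystem is just a directory on the host" idea can be sketched in a few lines of Python. This is only a model of the path mapping that chroot() achieves, not a call to chroot() itself. The helper name resolve_in_root and the host directory /vz/private/101 are assumptions made for illustration, and symbolic links are deliberately ignored.

```python
import posixpath


def resolve_in_root(container_root: str, path: str) -> str:
    """Map a path as seen inside a container to a path on the host.

    Hypothetical helper: it models the effect of chroot() on path
    lookup and does not handle symbolic links.
    """
    # Anchor the path at "/" and normalize it, so that ".." components
    # can never climb above the container root ("/../etc" becomes "/etc").
    inside = posixpath.normpath(posixpath.join("/", path))
    return posixpath.join(container_root, inside.lstrip("/"))


# The container sees /etc/passwd; the host sees a file in a subdirectory.
print(resolve_in_root("/vz/private/101", "/etc/passwd"))
# An attempt to escape with ".." still lands inside the container root.
print(resolve_in_root("/vz/private/101", "../../etc/shadow"))
```

Because the mapping is a plain prefix, the host administrator can back up or copy a container with ordinary file tools operating on that directory.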

Other namespaces

OpenVZ builds on the chroot idea and extends it to everything else that applications have access to. In other words, every API that the kernel provides to applications is “namespaced”, making sure every container has its own isolated subset of a resource. Examples include:

  • File system namespace: this is chroot() itself, making sure containers cannot see each other's files.
  • Process ID namespace: every container has its own set of process IDs, and the first process inside a container has a PID of 1 (usually the /sbin/init process, which relies on its PID being 1). A container can only see its own processes; it cannot see (or access in any way) processes in other containers.
  • IPC namespace: every container has its own Inter-Process Communication (IPC) shared memory segments, semaphores, and messages.
  • Networking namespace: every container has its own network devices, IP addresses, routing rules, firewall (iptables) rules, network caches and so on.
  • /proc and /sys namespaces: every container has its own representation of /proc and /sys, the special filesystems used to export kernel information to applications. In a nutshell, those are subsets of what a physical Linux host system has.
  • UTS namespace: every container can have its own hostname.
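
As an illustration of the process ID namespace described in the list above, here is a minimal Python model. It is a toy bookkeeping table, not kernel code. The class name PidNamespaceModel, the container IDs 101 and 102, and the host PID starting value of 1000 are all assumptions chosen for the example.

```python
import itertools


class PidNamespaceModel:
    """Toy model: every container numbers its own processes from 1."""

    def __init__(self):
        self._host_pids = itertools.count(1000)  # arbitrary starting value
        self._ct_pids = {}   # container id -> per-container PID counter
        self.table = []      # (host_pid, ct_id, ct_pid, name)

    def spawn(self, ct_id, name):
        # Each container gets its own counter, so the first process
        # created in any container always receives PID 1.
        counter = self._ct_pids.setdefault(ct_id, itertools.count(1))
        entry = (next(self._host_pids), ct_id, next(counter), name)
        self.table.append(entry)
        return entry

    def visible_from_container(self, ct_id):
        # A container sees only its own processes, under its own PIDs.
        return {ct_pid: name
                for _, cid, ct_pid, name in self.table if cid == ct_id}

    def visible_from_host(self):
        # The host sees every process of every container.
        return {host_pid: (cid, name)
                for host_pid, cid, _, name in self.table}


model = PidNamespaceModel()
model.spawn(101, "init")
model.spawn(102, "init")
model.spawn(101, "sshd")
print(model.visible_from_container(101))  # {1: 'init', 2: 'sshd'}
print(model.visible_from_container(102))  # {1: 'init'}
```

The same identifier (PID 1) exists independently in both containers, which is exactly the namespace property defined earlier.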

Note that memory and CPU need not be namespaced: the existing virtual memory and multitasking mechanisms already address this.

[Figure: CTs.svg]

Single kernel approach

To put it simply, a container is the sum of all its namespaces. Therefore, only a single OS kernel is running, and on top of it there are multiple isolated containers sharing that kernel.

The single kernel approach is much more lightweight than traditional VM-style virtualization. The consequences of having only one kernel are:

  1. Waiving the need to run multiple OS kernels leads to a higher density of containers (compared to VMs).
  2. The software stack that lies between the hardware and an end-user application is as thin as in a usual non-virtualized system (see the image), which means higher performance of containers (compared to VMs).
  3. A container can only run the same OS as the host, i.e. Linux in the case of OpenVZ. Nevertheless, different Linux distributions can be used in different containers.

Resource management

Because a single kernel model is used, there is one entity that controls all of the resources: the kernel. All the containers share the same set of resources: CPU, memory, disk and network. All these resources need to be controlled on a per-container basis, so that the containers do not step on each other's toes.

All such resources are accounted for and controlled by the kernel.

It is important to understand that resources are not pre-allocated; they are just limited. That means:

  • all the resource limits can be changed dynamically (at run time);
  • if a resource is not used by one container, it is available to the others.

CPU

The kernel CPU scheduler is modified to be container-aware. When it is time for a context switch, the scheduler selects a process to give a CPU time slice to. A traditional scheduler just chooses one among all the runnable tasks in the system. The OpenVZ scheduler implements a two-level scheme: it chooses a container first, then it chooses a task inside that container. That way, all the containers get a fair share of CPU resources, regardless of the number of processes inside each container.
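
The two-level choice can be sketched as follows. This is a simplified model, not the OpenVZ scheduler: it assumes plain round-robin at both levels, equal weights, and that every container always has at least one runnable task. The function name two_level_schedule and the container names are hypothetical.

```python
from collections import Counter, deque


def two_level_schedule(containers, slices):
    """Hand out `slices` time slices; return a list of (container, task).

    `containers` maps a container name to its list of runnable tasks.
    """
    ct_queue = deque(containers)
    task_queues = {ct: deque(tasks) for ct, tasks in containers.items()}
    timeline = []
    for _ in range(slices):
        # Level 1: pick the next container, then move it to the back.
        ct = ct_queue[0]
        ct_queue.rotate(-1)
        # Level 2: pick the next task inside that container only.
        tasks = task_queues[ct]
        task = tasks[0]
        tasks.rotate(-1)
        timeline.append((ct, task))
    return timeline


timeline = two_level_schedule({"ct101": ["a"], "ct102": ["b", "c", "d"]}, 12)
print(Counter(ct for ct, _ in timeline))    # both containers get 6 slices
print(Counter(task for _, task in timeline))  # a: 6, while b, c, d get 2 each
```

A flat scheduler choosing among the four tasks would give ct102 three quarters of the CPU; choosing the container first keeps the split even.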

The following CPU scheduler settings are available per container:

  • CPU units: a proportional "weight" of a container. The more units a container has, the more CPU time it gets. With 2 containers that have equal CPU units, when both want CPU time (e.g. by running busy loops), each one gets 50%. If we double the CPU units of one container, it gets twice as much CPU time as the other (i.e. about 67%, while the other gets about 33%). Note, however, that if the other containers are idle, a single container can have as much as 100% of the available CPU time.
  • CPU limit: a hard limit on the share of CPU time. For example, if it is set to 50%, a container cannot use more than 50% of CPU time even if the CPU would otherwise be idle. By default this limit is not set, i.e. a single container can have as much as 100% of the available CPU time.
  • CPU mask: tells the kernel the exact CPUs that this container can run on. This can also be used as a CPU limiting factor, and it helps performance on non-uniform memory access (NUMA) systems.
  • VCPU affinity: tells the kernel the maximum number of CPUs a container can use. The difference from the previous option is that you cannot specify the exact CPUs, only their number.
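
The arithmetic behind CPU units and the CPU limit can be checked with a short Python sketch. It only models the shares described in the list above for containers that are all busy at the same time; the function name cpu_shares, the unit values, and the way leftover time is redistributed after a limit is hit are assumptions, not the kernel's algorithm.

```python
def cpu_shares(units, limits=None):
    """Return the fraction of CPU time each busy container receives.

    units  : container -> CPU units (weight), busy containers only
    limits : optional container -> hard cap as a fraction (0..1)
    """
    limits = limits or {}
    shares = {}
    active = dict(units)
    remaining = 1.0
    while active:
        total = sum(active.values())
        # Containers whose proportional share would exceed their hard limit.
        capped = [ct for ct, u in active.items()
                  if ct in limits and remaining * u / total > limits[ct]]
        if not capped:
            # Nobody hits a limit: split what is left by weight.
            for ct, u in active.items():
                shares[ct] = remaining * u / total
            break
        # Pin capped containers at their limit, then redistribute the rest.
        for ct in capped:
            shares[ct] = limits[ct]
            remaining -= limits[ct]
            del active[ct]
    return shares


print(cpu_shares({"ct101": 1000, "ct102": 1000}))  # 0.5 each
print(cpu_shares({"ct101": 2000, "ct102": 1000}))  # about 0.67 and 0.33
print(cpu_shares({"ct101": 1000}))                 # alone: 1.0
print(cpu_shares({"ct101": 1000}, {"ct101": 0.5}))  # capped at 0.5
```

Only the ratio of the units matters: 2000 and 1000 give the same split as 2 and 1.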

Disk

  • Disk space. In a default setup, all containers reside on the same hard drive partition (since a container is just a subdirectory). OpenVZ introduces a per-container disk space limit to control disk usage. So, to increase the disk space available to a container, one just needs to increase that limit -- dynamically, on the fly, without the need to resize a partition or a filesystem.
  • Disk I/O priority. Containers compete for I/O operations, and can affect each other if they use the same disk drive. OpenVZ introduces a per-container I/O priority, which can be used to decrease the I/O rate of a "bad guy" so that it does not slow down the other containers.
  • Disk I/O bandwidth. I/O bandwidth (in bytes per second) can be limited per container (currently only available in commercial Parallels Virtuozzo Containers).

Memory

All the containers share the same physical memory and swap space, as well as other similar resources such as the page cache. All that memory is managed by a single kernel, which makes the memory distribution model very elastic: if memory is not used by one container, it can be used by another.

The two major memory resource control parameters set per container are RAM and swap. If a container exceeds its RAM limit, the kernel tries to free some memory, either by shrinking the page cache or by swapping out. This reclamation mechanism is the same as the one used by a non-containerized kernel. The only difference is that the swap-out is "virtual", in the sense that the kernel does not write physical pages to the disk but just removes them from the container context (in order to avoid unnecessary I/O), while slowing down the container (to emulate the effect of a real swap-out). If a global (not per-container) memory shortage then occurs, such pages are really swapped out into a swap file on disk.
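
The difference between a "virtual" and a real swap-out can be modelled with three counters. This is a deliberately crude sketch: the class ContainerMemory and its attribute names are hypothetical, pages are just integers, and page cache shrinking and the artificial slowdown are left out.

```python
class ContainerMemory:
    """Toy page accounting for one container (units are pages)."""

    def __init__(self, ram_limit_pages):
        self.ram_limit = ram_limit_pages
        self.resident = 0      # pages charged to the container's RAM
        self.virtual_swap = 0  # removed from the container context, no disk I/O
        self.on_disk = 0       # really written to the swap file

    def allocate(self, pages):
        self.resident += pages
        excess = self.resident - self.ram_limit
        if excess > 0:
            # Over the per-container limit: "swap out" virtually,
            # which in this model costs no disk I/O at all.
            self.resident -= excess
            self.virtual_swap += excess

    def global_shortage(self):
        # The host as a whole is short of memory: only now are the
        # virtually swapped pages written to the swap file.
        self.on_disk += self.virtual_swap
        self.virtual_swap = 0


mem = ContainerMemory(ram_limit_pages=100)
mem.allocate(80)
mem.allocate(50)
print(mem.resident, mem.virtual_swap, mem.on_disk)  # 100 30 0
mem.global_shortage()
print(mem.resident, mem.virtual_swap, mem.on_disk)  # 100 0 30
```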

This memory control mechanism is efficient and easy for an administrator to understand and use.

In addition, some memory-related resources can be controlled at a finer grain, such as the size of IPC shared memory mappings, network buffers, the number of processes, etc. There are about 20 such parameters in total, called User Beancounters.

Miscellaneous resources

There are also the following per-container counters/limits:

  • number of processes
  • number of open files
  • number of iptables rules
  • number of sockets
  • etc.

Read more

Resource management is covered in greater detail in the Resource management whitepaper.

Checkpointing and live migration

Live migration is the ability to move a running container from one physical server to another without a shutdown or service interruption. Live migration is based on checkpointing; it does not require any special hardware, disk or networking setup.

Checkpointing is an OpenVZ kernel feature that makes it possible to freeze a running container (i.e. pause all its processes) and dump its complete in-kernel state into a file on disk. Such a dump file contains everything about the processes inside a container: their memory, open files, network connections, states, etc. It is possible to restore a container from a dump file and resume its execution. From inside the container this looks like a mere jump forward in time; there are no other side effects. A container can also be restored on a different system, and if its file system is the same, it will continue to run as is on the new system.

Checkpointing is currently supported on x86 and x86_64 (previously it was also supported on IA64).
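
The freeze, dump and restore cycle can be illustrated with a toy container whose whole state is a Python dictionary. Real checkpointing captures in-kernel state; here JSON serialization merely stands in for the dump file, and the functions checkpoint and restore as well as the state layout are assumptions made for this sketch.

```python
import json


def checkpoint(container):
    """Freeze the toy container and return its state as a dump (bytes)."""
    container["frozen"] = True  # pause: nothing may change the state now
    return json.dumps(container["state"], sort_keys=True).encode("utf-8")


def restore(dump):
    """Rebuild a running container from a dump, possibly on another host."""
    return {"frozen": False, "state": json.loads(dump.decode("utf-8"))}


source = {
    "frozen": False,
    "state": {"processes": {"1": {"name": "init", "counter": 41}}},
}
dump = checkpoint(source)   # on the source server
target = restore(dump)      # on the destination server
print(target["state"] == source["state"])  # True: execution resumes as is
```

Live migration then amounts to transferring the dump (and making the same file system available) between the two steps.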

Miscellaneous topics

[Figure: CT-networking.png]

Networking

Each container has its own network stack. This includes network device(s), a routing table, firewall rules (iptables), network caches, hash tables, etc.

Three major modes of operation are possible.

Route-based (venet)

This mode works at Layer 3 (the network layer) of the OSI model. This means that a container has a MAC-less network device (called venet), with the host system acting as a router. Each IP packet traverses both the host's and the container's IP stack.

The major features of this setup are:

  • High security. It is the host system administrator who specifies the container IP(s) and routing rule(s).
  • High control. The host system administrator fully controls container networking by means of routing, firewall, traffic shaper, etc.
  • NOARP. A container cannot use broadcasts or multicasts (since these features work at Layer 2 and require a MAC address).
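
The host's role as a router in this mode can be sketched as a lookup table from IP address to container. The class VenetHost, its methods, and the addresses are hypothetical; the point is only that the table is owned by the host, so a container receives traffic solely for addresses the host administrator assigned to it.

```python
import ipaddress


class VenetHost:
    """Toy Layer 3 router: the host decides which container owns which IP."""

    def __init__(self):
        self._routes = {}  # IP address object -> container id

    def assign_ip(self, ct_id, ip):
        # Only the host system administrator calls this.
        self._routes[ipaddress.ip_address(ip)] = ct_id

    def route(self, dst_ip):
        # Return the owning container, or None if no container has this IP.
        return self._routes.get(ipaddress.ip_address(dst_ip))


host = VenetHost()
host.assign_ip(101, "10.0.0.101")
host.assign_ip(102, "10.0.0.102")
print(host.route("10.0.0.102"))  # 102
print(host.route("10.0.0.200"))  # None: nobody was given this address
```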

Bridge-based (veth)

This mode works at OSI Layer 2. For a container, a Virtual Ethernet (veth) device is used. This device can be thought of as a pipe with two ends, one in the host system and the other in a container (CT): a packet that goes into one end comes out of the other. The host system acts as a bridge, so veth is usually bridged together with eth0 or a similar interface.

The major features of this setup are:

  • High configurability: the container administrator can set up all the networking.
  • Ability to use broadcasts/multicasts.
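
The "pipe with two ends" picture of a veth device can be written down directly. The sketch below is a pure in-memory model with hypothetical names (VethEnd, make_veth_pair); frames are plain byte strings, and bridging with a physical interface is not modelled.

```python
from collections import deque


class VethEnd:
    """One end of a virtual Ethernet pair."""

    def __init__(self):
        self.peer = None
        self.inbox = deque()

    def send(self, frame):
        # Whatever goes into this end comes out of the other end.
        self.peer.inbox.append(frame)

    def receive(self):
        # Return the oldest pending frame, or None if nothing arrived.
        return self.inbox.popleft() if self.inbox else None


def make_veth_pair():
    """Create two connected ends: one for the host, one for the container."""
    host_end, ct_end = VethEnd(), VethEnd()
    host_end.peer, ct_end.peer = ct_end, host_end
    return host_end, ct_end


host_end, ct_end = make_veth_pair()
host_end.send(b"frame from host")
print(ct_end.receive())    # b'frame from host'
print(host_end.receive())  # None: the sender does not see its own frame
```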

Real network device in a container

The host system administrator can move a network device (such as eth1) into a container. The container administrator can then manage it as usual.

Major features are:

  • Best performance
  • Low security
  • Container is tied to hardware

Limitations

From the point of view of a container owner, a container looks and feels like a real system. Nevertheless, it is important to understand the limitations of a container:

  • A container is constrained by the limits set by the host system administrator. That includes usage of CPU, memory, disk space and bandwidth, network bandwidth, etc.
  • An OpenVZ container only runs Linux (Windows or FreeBSD is not an option), although using different distributions is not an issue.
  • A container can't boot or use its own kernel (it uses the host system kernel).
  • A container can't load its own kernel modules (it uses the host system kernel modules).
  • A container can't set the system time unless explicitly configured to do so (say, to run ntpd in a CT).
  • A container does not have direct access to hardware such as a hard drive, network card, or PCI device. Such access can be granted by the host system administrator if needed.

OpenVZ host system scope

From the host system, the processes of all containers are visible.