WP/What are containers
OpenVZ Linux Containers technology whitepaper
OpenVZ is a virtualization technology for Linux that lets you partition a single physical Linux machine into multiple smaller units called containers.
Technically, it consists of three major building blocks:
- Namespaces
- Resource management
- Checkpointing / live migration
Namespaces
A namespace is a feature that limits the scope of something. Here, namespaces are used as the building blocks of containers. A simple case of a namespace is chroot.
Chroot
The traditional UNIX chroot() system call changes the root of the file system for the calling process to a particular directory. This limits the scope of the file system for the process, so it can only see and access a limited subtree of files and directories.
Chroot is still used for application isolation. For example, ftpd is often run in a chroot to limit the damage from a potential security breach.
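As a minimal sketch of the mechanism described above, the following Python function starts a program whose file system root is confined to a given directory. The function name run_in_chroot is hypothetical; it assumes a Linux host, root privileges (chroot() fails without them), and that the program path exists inside the new root. It illustrates plain chroot() only, not how OpenVZ itself sets up a container.

```python
import os


def run_in_chroot(new_root, argv):
    """Run argv with new_root as its file system root and return its exit code.

    Hypothetical helper for illustration. Needs root privileges, and
    argv[0] must be a path that exists inside new_root.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: confine ourselves, then replace the process image.
        try:
            os.chroot(new_root)  # everything outside new_root becomes invisible
            os.chdir("/")        # make sure the working directory is inside it
            os.execv(argv[0], argv)
        except OSError:
            pass
        os._exit(127)            # only reached if chroot() or exec failed
    # Parent process: wait for the confined child to finish.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Called as run_in_chroot("/srv/ftp-root", ["/usr/sbin/ftpd"]) (both paths are made up), the child would see /srv/ftp-root as "/". Without the required privileges the call returns 127, because the child exits before exec.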
Chroot is also used in containers, which has the following consequences:
- there is no need for a separate block device, hard drive partition or filesystem-in-a-file setup
- the host system administrator can see all the containers' files
- container backup/restore is trivial
- mass deployment is easy
Other namespaces
OpenVZ builds on the chroot idea and extends it to everything else that applications have access to. In other words, every API that the kernel provides to applications is «namespaced», making sure every container has its own isolated subset of each resource. Examples include:
- File system namespace — this one is chroot() itself, making sure containers can't see each other's files.
- Process ID namespace, so processes in every container have their own set of process IDs, and the first process inside a container has a PID of 1 (usually the /sbin/init process, which relies on its PID being 1). Containers can only see their own processes; they can't see processes in other containers or access them in any way, for example by sending a signal.
- IPC namespace, so every container has its own IPC (Inter-Process Communication) shared memory segments, semaphores, and messages.
- Networking namespace, so every container has its own network devices, IP addresses, routing rules, firewall (iptables) rules, network caches and so on.
- /proc and /sys namespaces, so every container has its own representation of /proc and /sys, the special filesystems used to export some kernel information to applications. In a nutshell, these are subsets of what the host system has.
- UTS namespace, so every container can have its own hostname.
Note that memory and CPU need not be namespaced. The existing virtual memory and multitasking mechanisms already take care of that.
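To make the idea of "one namespace per kind of resource" more concrete, the sketch below lists the namespaces the current process belongs to. It relies on the /proc/<pid>/ns directory of mainline Linux kernels; whether a given OpenVZ kernel exposes the same interface is an assumption, so the function simply returns an empty result when the directory is missing. The name list_namespaces is hypothetical.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace type: identifier} for a process.

    Reads /proc/<pid>/ns, an interface of mainline Linux kernels.
    Returns an empty dict if the kernel does not provide it.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in sorted(names):
        try:
            # Each entry is a symlink whose target names the namespace instance.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for ns_type, ident in list_namespaces().items():
        print(ns_type, ident)
```

Two processes that share a namespace of a given type report the same identifier for it; processes in different containers would be expected to report different ones.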
Single kernel approach
To put it simply, a container is the sum of all its namespaces. There is therefore only one OS kernel running, with multiple isolated containers on top sharing that single kernel.
The single-kernel approach is much more lightweight than traditional VM-style virtualization. The consequences of having only one kernel are:
- Waiving the need to run multiple OS kernels leads to a higher density of containers (compared to VMs).
- The software stack between an application and the hardware is much thinner, which means higher performance for containers (compared to VMs).
- A container can only run Linux.
Resource management
Because of the single-kernel model, there is one single entity that controls all of the resources: the kernel. All the containers share the same set of resources: CPU, memory, disk and network. All these resources need to be controlled on a per-container basis so that containers do not step on each other's toes.
All such resources are accounted for and controlled by the kernel.
It is important to understand that resources are not pre-allocated, but just limited. That means:
- all the resources can be changed dynamically (run-time);
- if a resource is not used, it is available to other containers.
Let's see what resources are controlled and how.
CPU
The kernel CPU scheduler is modified to be container-aware. When it is time for a context switch, the scheduler decides which task gets the next CPU time slice. A traditional scheduler just chooses one among all the runnable tasks in the system. The OpenVZ scheduler implements a two-level scheme: it chooses a container first, then a task inside that container. That way, all containers get a fair share of CPU resources, regardless of the number of processes inside each container.
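The two-level choice can be illustrated with a toy simulation. This is only a sketch: it uses plain round-robin at both levels, which is an assumption made for brevity (the real scheduler lives in the kernel and also honors the per-container settings). The container names and the function two_level_schedule are hypothetical.

```python
from collections import Counter, deque


def two_level_schedule(containers, slices):
    """Return a list of (container, task) pairs, one per time slice.

    containers maps a container name to its list of runnable tasks.
    Level 1 picks a container, level 2 picks a task inside it.
    """
    ct_queue = deque(name for name, tasks in containers.items() if tasks)
    task_queues = {name: deque(tasks) for name, tasks in containers.items()}
    timeline = []
    for _ in range(slices):
        if not ct_queue:
            break                      # nothing runnable at all
        ct = ct_queue[0]               # level 1: choose a container
        ct_queue.rotate(-1)
        task = task_queues[ct][0]      # level 2: choose a task inside it
        task_queues[ct].rotate(-1)
        timeline.append((ct, task))
    return timeline


if __name__ == "__main__":
    timeline = two_level_schedule(
        {"ct101": ["init"], "ct102": ["a", "b", "c"]}, slices=12
    )
    print(Counter(ct for ct, _ in timeline))
```

In this run both containers receive the same number of time slices even though ct102 has three times as many runnable tasks; the extra tasks only split ct102's own share among themselves.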
The following CPU scheduler settings are available per container:
- CPU units: a proportional "weight" of a container. The more units a container has, the more CPU time it gets. With 2 containers that have equal CPU units and both want CPU time (e.g. by running busy loops), each one gets 50%. If we double the CPU units of one container, it gets twice as much CPU time as the other (about 66% versus 33%). Note, however, that if the other containers are idle, a single container can get as much as 100% of the available CPU time.
- CPU limit: a hard limit on a container's share of CPU time. For example, if we set it to 50%, the container will not be able to use more than 50% of CPU time even if the CPU would otherwise be idle. By default, this limit is not set, i.e. a single container can get as much as 100% of the available CPU time.
- CPU mask: tells the kernel the exact CPUs that this container can run on. This can also be used as a CPU limiting factor, and it helps performance on non-uniform memory access (NUMA) systems.
- VCPU affinity: tells the kernel the maximum number of CPUs a container can use. The difference from the previous option is that you cannot specify the exact CPUs, only how many of them.
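The arithmetic behind CPU units and the CPU limit can be checked with a small model. This is an illustration of the proportions described above, not the kernel's algorithm: the function cpu_shares and its parameters are hypothetical, it treats the host as a single pool of 100% CPU time, and it assumes that time left over by a capped container is redistributed among the other busy containers by weight.

```python
def cpu_shares(units, busy, limits=None):
    """Return {container: percent of CPU time} under a simple weight model.

    units  -- {container: CPU units (weight)}
    busy   -- containers that currently want CPU time
    limits -- optional {container: hard limit in percent}
    """
    limits = limits or {}
    shares = {name: 0.0 for name in units}
    active = {name for name in busy if units.get(name, 0) > 0}
    remaining = 100.0
    while active:
        total = sum(units[name] for name in active)
        # Split what is left among the busy containers, proportionally to units.
        fair = {name: remaining * units[name] / total for name in active}
        capped = {n for n in active if fair[n] >= limits.get(n, 100.0)}
        if not capped:
            shares.update(fair)
            break
        # Containers that hit their limit get exactly the limit; the rest
        # of the CPU time is redistributed in the next round.
        for name in capped:
            shares[name] = limits.get(name, 100.0)
            remaining -= shares[name]
        active -= capped
    return shares


if __name__ == "__main__":
    print(cpu_shares({"a": 1000, "b": 1000}, busy=["a", "b"]))
    print(cpu_shares({"a": 2000, "b": 1000}, busy=["a", "b"]))
    print(cpu_shares({"a": 1000, "b": 1000}, busy=["a"], limits={"a": 50}))
```

The first call gives each container half of the CPU time, the second gives roughly two thirds and one third, and the third shows a lone busy container held at its 50% limit while the rest of the CPU stays idle.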
Disk
- Disk space. In a default setup, all containers reside on the same hard drive partition (since a container is just a subdirectory). OpenVZ introduces a per-container disk space limit to control disk usage. So, to increase the disk space available to a container, one just needs to increase that limit, dynamically and on the fly, without resizing a partition or a filesystem.
- Disk I/O priority. Containers compete for I/O operations and can affect each other if they use the same disk drive. OpenVZ introduces a per-container I/O priority, which can be used to decrease the I/O rate of a "bad guy" so that it does not degrade the other containers.
- Disk I/O bandwidth. I/O bandwidth (in bytes per second) can be limited per-container (currently only available in commercial Parallels Virtuozzo Containers).
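Changing a disk limit on the fly is typically done with the vzctl command-line tool. The sketch below only builds the argument list for such a call and does not run it. The helper vzctl_set_args is hypothetical, container ID 101 and the values are invented, and the option names (--diskspace, --ioprio, --save) are given as recalled from the vzctl manual page, so check `man vzctl` for your version before relying on them.

```python
import subprocess


def vzctl_set_args(ctid, **params):
    """Build the argument list for a 'vzctl set' call (hypothetical helper).

    Each keyword becomes a '--name value' pair; '--save' asks vzctl to
    store the new values in the container configuration as well.
    """
    args = ["vzctl", "set", str(ctid)]
    for name, value in params.items():
        args += [f"--{name}", str(value)]
    args.append("--save")
    return args


def apply_settings(ctid, **params):
    """Run the command on the host system; needs root and vzctl installed."""
    return subprocess.run(vzctl_set_args(ctid, **params), check=True)


if __name__ == "__main__":
    # Raise the disk space limit (soft:hard) and lower the I/O priority.
    print(" ".join(vzctl_set_args(101, diskspace="20G:22G", ioprio=2)))
```

Because the limit is just a number checked by the kernel, applying such a change does not involve resizing a partition or a filesystem.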
Memory
All the containers share the same physical memory and swap space, as well as related resources such as the page cache.
FIXME : shared page cache, elastic RAM, virtual swap, RSS reclamation, kernel vs user memory(?), virtual vs physical memory(?), networking buffers(?), moar, moar...
Miscellaneous resources
There are also the following per-container counters/limits:
- number of processes
- number of open files
- number of iptables rules
- number of sockets
- etc.
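On an OpenVZ host these counters can be inspected in /proc/user_beancounters. The parser below is a sketch that assumes the usual layout of that file (a version line, a header line, then one row per resource with held, maxheld, barrier, limit and failcnt columns, the first row of each container prefixed with its ID). The sample text and all numbers in it are invented for illustration.

```python
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  numproc              25         40        240        240          0
            numfile             612        900       9312       9312          0
            numiptent            10         10        128        128          0
      102:  numproc             240        240        240        240          3
            numfile             100        150       9312       9312          0
"""

FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text):
    """Return {container ID: {resource: {field: value}}}."""
    result = {}
    ctid = None
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] in ("Version:", "uid"):
            continue                      # skip blank, version and header lines
        if parts[0].endswith(":"):
            ctid = int(parts[0][:-1])     # a new container section starts here
            parts = parts[1:]
        if ctid is None or len(parts) != 6:
            continue                      # not a resource row we understand
        values = [int(v) for v in parts[1:]]
        result.setdefault(ctid, {})[parts[0]] = dict(zip(FIELDS, values))
    return result


def over_limit(counters):
    """List (container ID, resource) pairs whose failure counter is non-zero."""
    return [(ct, res) for ct, rows in counters.items()
            for res, row in rows.items() if row["failcnt"] > 0]


if __name__ == "__main__":
    print(over_limit(parse_beancounters(SAMPLE)))
```

In the invented sample, container 102 has reached its process limit, so the failure counter for numproc is non-zero and over_limit reports it.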
Read more
Resource management is covered in greater detail in the Resource management whitepaper.
Checkpointing and live migration
Miscellaneous topics
Containers overhead
OpenVZ works almost as fast as a usual Linux system. The only overhead comes from networking and the additional resource management (see above), and in most cases it is negligible.
OpenVZ host system scope
From the host system, the processes of all containers are visible.
Networking (routed/bridged)
Does it differ much from VMs?
Limitations
From the point of view of a container owner, a container looks and feels like a real system. Nevertheless, it is important to understand its limitations:
- A container is constrained by the limits set by the host system administrator. That includes usage of CPU, memory, disk space and bandwidth, network bandwidth, etc.
- A container only runs Linux (Windows or FreeBSD is not an option), although running different distributions is not an issue.
- A container can't boot or use its own kernel (it uses the host system kernel).
- A container can't load its own kernel modules (it uses the host system kernel modules).
- A container can't set the system time unless explicitly configured to do so (say, to run ntpd in a CT).
- A container does not have direct access to hardware such as a hard drive, network card, or PCI device. Such access can be granted by the host system administrator if needed.