
WP/What are containers

7,261 bytes added, 20:48, 8 May 2015

io bandwidth limit is avail in openvz

OpenVZ is an open source virtualization technology for Linux that enables the partitioning of a single physical Linux machine into multiple smaller independent units called containers.

Technically, OpenVZ consists of three major building blocks:

* Namespaces

* Resource management

* Checkpointing and live migration

== Namespaces ==

A namespace is an abstract environment created to hold a logical grouping of unique identifiers or symbols (i.e., names). An identifier defined in a namespace is associated with that namespace. The same identifier can be independently defined in multiple namespaces. An example of a namespace is a directory on a file system. This allows two files with the same name to be stored on the same device as long as they are stored in different directories. For OpenVZ, Linux kernel namespaces are used as container building blocks. As an example of namespace usage, let's take a look at chroot.

[[Image:Chroot.png|right|400px]]

=== Chroot ===

The traditional UNIX <code>chroot()</code> system call changes the root of the file system of the calling process to a particular directory. That way it limits the scope of the file system for the process, so it can only see and access a limited subtree of files and directories.

Chroot is still used for application isolation (although, unlike a container, it does not provide full isolation: an application can escape from chroot under certain circumstances).

Chroot is also used by containers, so a container filesystem is just a directory on the host. The consequences are:

* there is no need for a separate block device, hard drive partition or filesystem-in-a-file setup

* the host system administrator can see all the containers' files

* container backup/restore is trivial
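As a rough illustration of the point that a container filesystem is just a directory on the host, the following Python sketch maps a path as seen inside a container to the corresponding path on the host. The function name <code>host_path</code> and the container root <code>/vz/root/101</code> are assumptions made for the example, and symbolic links are not modelled, so this is a simplified picture rather than what the kernel actually does.

```python
import posixpath


def host_path(ct_root, path_in_ct):
    """Map a path seen inside a container to the path on the host.

    ct_root is the host directory used as the container root.
    Leading ".." components cannot climb above "/", so the result
    always stays under ct_root (symlinks are not modelled here).
    """
    # Normalise the path relative to the container's own "/".
    inner = posixpath.normpath("/" + path_in_ct.lstrip("/"))
    # Graft the normalised path onto the host directory.
    return posixpath.normpath(ct_root + inner)
```

For instance, <code>host_path("/vz/root/101", "/etc/passwd")</code> yields the location where the host administrator would find that container file, which is why backup can be done with ordinary file tools.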

=== Other namespaces ===

OpenVZ builds on the chroot idea and extends it to everything else that applications use. In other words, every API that the kernel provides to applications is “namespaced”, making sure every container has its own isolated subset of a resource. Examples include:

* '''File system namespace''': this one is <code>chroot()</code> itself, making sure containers cannot see each other's files.

* '''Process ID namespace''': this is so that the processes of every container have their own unique process IDs, and the first process inside a container has a PID of 1 (it is usually the <code>/sbin/init</code> process, which actually relies on its PID being 1). For every process in a container, its PID in the container is different from the one on the host. Containers can only see their own processes, and they can't see (or access in any way, like sending signals) processes in other containers.

* '''IPC namespace''': this is so that every container has its own System V IPC (Inter-Process Communication) shared memory segments, semaphores, and messages. For example, <code>ipcs</code> output is different in every container.

* '''Networking namespace''': this is so that every container has its own network devices, IP addresses, routing rules, firewall (iptables) rules, network caches and so on. See more details below at [[#Networking]].

* '''<code>/proc</code> and <code>/sys</code> namespaces''': this is so that every container has its own representation of <code>/proc</code> and <code>/sys</code>, special filesystems used to export some kernel information to applications. In a nutshell, those are subsets of what a physical Linux host system has.

* '''UTS namespace''': this is so that every container can have its own hostname.

* and so on.

Note that memory and CPU need not be namespaced: existing virtual memory and multitasking mechanisms address this.
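The process ID namespace described in the list can be pictured with a small Python toy model. The class names <code>Host</code> and <code>PidNamespace</code> are hypothetical, and the model only tracks PID numbering and visibility; it is a sketch of the idea, not of the kernel data structures.

```python
import itertools


class Host:
    """Hands out host-wide process IDs."""

    def __init__(self):
        self._pids = itertools.count(1)

    def new_pid(self):
        return next(self._pids)


class PidNamespace:
    """Per-container view of process IDs (toy model)."""

    def __init__(self, host):
        self._host = host
        self._local_pids = itertools.count(1)
        self.local_to_host = {}

    def spawn(self):
        """Start a process; return its PID as seen inside the container."""
        local_pid = next(self._local_pids)
        self.local_to_host[local_pid] = self._host.new_pid()
        return local_pid

    def can_signal(self, host_pid):
        """A container can only reach processes that belong to it."""
        return host_pid in self.local_to_host.values()
```

In this model the first process spawned in each namespace gets local PID 1, while the host sees every process under a distinct host-wide PID.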

[[Image:CTs.svg|400px|right]]

=== Single kernel approach ===

To put it simply, a container is the sum of all its namespaces. Therefore, there is only one single OS kernel running, and on top of it there are multiple isolated containers sharing that single kernel. The single kernel approach is much more lightweight than traditional VM-style virtualization (for more differences between CT and VM, see [[../Containers vs VMs/]]). The consequences of having only one kernel are:

# A container can only run the same OS as the host, i.e. Linux in case of OpenVZ. Nevertheless, multiple different Linux distributions can be used in different containers. For example, RHEL4, RHEL5, RHEL6, Fedora 14 and Ubuntu 10.10 can run inside different containers on the same host system (running e.g. Gentoo).

# Waiving the need to run multiple OS kernels leads to '''higher density''' of containers (compared to VMs). Practically that means that a few hundred typical containers can be started on a conventional notebook.

# The software stack that lies between the hardware and an end-user application is as thin as in a usual non-virtualized system (see the image). This means native performance of containers and no virtualization overhead (compared to VMs).

See more at [[#Performance and density]] below.

== Resource management ==

All resources used by containers, such as CPU, disk and memory, are accounted for and controlled by the kernel.

It is important to understand that resources are not pre-allocated; they are just limited. That means:

* all the resources can be changed dynamically (at run time);

* if a resource is not used, it is available for other containers, which makes resource overcommitting easy.

=== CPU ===

The kernel CPU scheduler is modified to be container-aware. When it's time for a context switch, the scheduler selects a process to give a CPU time slice to. A traditional scheduler just chooses one among all the runnable tasks in the system. The OpenVZ scheduler implements a two-level scheme: it chooses a container first, then it chooses a task inside the container. That way, all the containers get a fair share of CPU resources (regardless of the number of processes inside each container).
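The two-level idea can be sketched in Python as follows. This is a deliberately simplified model that uses plain round-robin at both levels; the function name <code>two_level_schedule</code> and the container names are assumptions for the example, and the real kernel scheduler is considerably more elaborate.

```python
from collections import deque


def two_level_schedule(containers, slices):
    """Return the (container, task) pair chosen for each time slice.

    containers maps a container name to its list of runnable tasks.
    """
    # Only containers with at least one runnable task take part.
    task_queues = {ct: deque(tasks) for ct, tasks in containers.items() if tasks}
    ct_queue = deque(task_queues)
    timeline = []
    if not ct_queue:
        return timeline
    for _ in range(slices):
        # Level 1: pick the next container, then move it to the back.
        ct = ct_queue[0]
        ct_queue.rotate(-1)
        # Level 2: pick the next task inside that container.
        tasks = task_queues[ct]
        timeline.append((ct, tasks[0]))
        tasks.rotate(-1)
    return timeline
```

With one container running three tasks and another running a single task, both containers receive the same number of time slices in this model, which is the fairness property the paragraph describes.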

The following CPU scheduler settings are available per container:

* '''CPU mask''': tells the kernel the exact CPUs that can be used to run this container. This can also be used as a CPU limiting factor, and helps performance on non-uniform memory access (NUMA) systems.

* '''VCPU affinity''': tells the kernel the maximum number of CPUs a container can use. The difference from the previous option is that you are not able to specify the exact CPUs, only their number, and then the kernel dynamically assigns and adjusts CPUs between containers based on the current load.

=== Disk ===

* '''Disk space'''. In a default setup, all containers reside on the same hard drive partition (since a container is just a subdirectory). OpenVZ introduces a per-container disk space limit to control disk usage. So, to increase the disk space available to a container, one just needs to increase that limit: dynamically, on the fly, without a need to resize a partition or a filesystem.

* '''Disk I/O priority'''. Containers compete for I/O operations and can affect each other if they use the same disk drive. OpenVZ introduces a per-container I/O priority, which can be used, for example, to decrease the I/O rate of a "bad guy" so that it does not slow down the other containers.

* '''Disk I/O bandwidth'''. I/O bandwidth (in bytes per second) can be limited per container.

=== Memory ===

All the containers share the same physical memory and swap space, and other similar resources like the page cache. All that memory is managed by a single kernel, which makes the memory distribution model very elastic: if memory is not used by one container, it can be used by another.

The two major memory resource control parameters that are set per container are RAM and swap. If a container exceeds its RAM limit, the kernel tries to free some memory, either by shrinking the page cache or by swapping out. This reclamation mechanism is the same as the one used by a non-containerized kernel; the only difference is that the swap-out is "virtual", in the sense that the kernel does not write physical pages to the disk but just removes them from the container context (in order to avoid unnecessary I/O), while slowing down the container (to emulate the effect of a real swap-out). Next, if a global (not per-container) memory shortage happens, such pages are really swapped out into a swap file on disk.
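The difference between a "virtual" swap-out and a real one can be pictured with a toy Python model. The names <code>Container</code> and <code>global_shortage</code> are hypothetical, memory is counted in abstract pages, and page cache shrinking, the swap limit and the artificial slowdown are all left out, so this is only a sketch of the accounting idea.

```python
class Container:
    """Toy per-container memory accounting, counted in pages."""

    def __init__(self, ram_limit):
        self.ram_limit = ram_limit   # pages the container may keep in RAM
        self.in_ram = 0              # pages charged to the container
        self.virtual_swap = 0        # "swapped out" pages still in host RAM
        self.on_disk = 0             # pages really written to the swap file

    def allocate(self, pages):
        """Charge new pages; anything above the RAM limit is virtually swapped."""
        self.in_ram += pages
        excess = max(0, self.in_ram - self.ram_limit)
        self.in_ram -= excess
        self.virtual_swap += excess


def global_shortage(containers, pages_needed):
    """Under host-wide pressure, push virtually swapped pages to disk."""
    freed = 0
    for ct in containers:
        moved = min(ct.virtual_swap, pages_needed - freed)
        ct.virtual_swap -= moved
        ct.on_disk += moved
        freed += moved
    return freed
```

In this model, exceeding the per-container limit causes no disk I/O at all; pages reach the disk only when <code>global_shortage</code> is called to represent memory pressure on the whole host.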

The above memory control mechanism is efficient, and it is easy for an administrator to use and comprehend.

In addition, some of the memory-related resources can be controlled at a finer grain, such as the size of IPC shared memory mappings, network buffers, number of processes etc.; overall there are about 20 such parameters, called User Beancounters.

=== Miscellaneous resources ===

== Checkpointing and live migration ==

'''Checkpointing''' is an OpenVZ kernel feature that makes it possible to freeze a running container (i.e. pause all its processes) and dump its complete in-kernel state into a file on disk. Such a dump file contains everything about processes inside a container: their memory, opened files, network connections, states etc. Then, a running container can be restored from the dump file and continue to run normally. The concept is somewhat similar to suspend-to-disk, only for a single container and much faster.

A container can be restored from a dump file on a different physical server, opening the door for live migration.

'''Live migration''' is the ability to move a running container from one physical server to another without a shutdown or service interruption. Network connections are migrated as well, so from a user's point of view the migration looks like a short delay in response. OpenVZ live migration does not require any special hardware, disk or networking setup. It is implemented in the following way:

# Run rsync to copy container files to the destination system (the container is still running)

# Freeze and checkpoint the container

# Run rsync again, to catch the changes in files while the container was still running

# Copy the dump file to the destination system

# Undump and resume the container on the destination system
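The reason for running the file copy twice can be seen in a small Python simulation of the steps above. Files and process state are modelled as plain dictionaries, <code>sync_files</code> is a stand-in for rsync, and all names are assumptions made for this sketch; no real OpenVZ tool is invoked.

```python
def sync_files(src, dst):
    """Copy files that are missing or different; return the names copied."""
    copied = [name for name, data in src.items() if dst.get(name) != data]
    for name in copied:
        dst[name] = src[name]
    # Remove files that no longer exist on the source.
    for name in set(dst) - set(src):
        del dst[name]
    return sorted(copied)


def live_migrate(container, write_while_running):
    """Walk through the five steps for a container modelled as a dict."""
    dest = {"files": {}, "state": None, "running": False}
    # Step 1: first copy, the container keeps running.
    first_pass = sync_files(container["files"], dest["files"])
    # The running container may modify files in the meantime.
    write_while_running(container)
    # Step 2: freeze the container and checkpoint its state.
    container["running"] = False
    dump = dict(container["state"])
    # Step 3: second copy picks up only what changed since step 1.
    second_pass = sync_files(container["files"], dest["files"])
    # Step 4: transfer the dump; step 5: restore and resume.
    dest["state"] = dump
    dest["running"] = True
    return dest, first_pass, second_pass
```

Because the bulk of the data moves during the first pass, the second pass (done while the container is frozen) only has to transfer the files that changed, which keeps the freeze time short.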

If a container resides on shared storage (like NFS or SAN), there is no need to copy its files; one just checkpoints the container on one system and restores it on another.

Network connections are fully preserved and migrated. Upon finishing migration, the destination server sends an ARP announcement telling that the container IP address now lives on a new MAC. While the container is frozen, all incoming packets for it are dropped. In the case of TCP, such packets will be retransmitted by the sending side, while UDP applications are expected to tolerate occasional packet loss anyway.

Unlike other container functionality, which is architecture-agnostic (and therefore containers on ARM or MIPS are easy to have), checkpointing is architecture-dependent. It is currently supported for x86, x86_64, and IA64.

== Miscellaneous topics ==

[[Image:CT-networking.png|right]]

=== Networking ===

Each container has its own network stack. This includes network device(s), routing table, firewall rules (iptables), network caches, hash tables, etc. From the perspective of the container owner it looks like a standalone Linux box.

Three major modes of operation are possible.

==== Route-based (venet) ====

This mode works in Layer 3 (network layer) of the [[w:OSI model]]. That means that a container has a MAC-less network device (called <code>venet</code>) with one end in the container and another end in the host system. The host system then acts as a router. Each IP packet traverses both the host's and the container's IP stack. The major features of this setup are:

* '''Host system acts as a router'''

* '''High security'''. It's the host system administrator who specifies container IP(s) and routing rule(s). No spoofing or harming is possible.

* '''High control'''. The host system administrator fully controls container networking, by means of routing, firewall, traffic shaper etc.

* '''No MAC address'''. A container cannot use broadcasts or multicasts (since these features are on Layer 2 and require a MAC address).

==== Bridge-based (veth) ====

This mode works in OSI Layer 2 (data link layer). For a container, a Virtual Ethernet (<code>veth</code>) device is used. This device can be thought of as a pipe with two ends: one end in the host system and another end in a CT, so if a packet goes into one end it will come out from the other end. The host system acts as a bridge, so veth is usually bridged together with eth0 or a similar interface. The major features of this setup are:

* '''Host system acts as a bridge'''

* '''High configurability''': the container administrator can set up all the networking.

* '''Ability to use broadcasts/multicasts'''.

* '''DHCP''' and dynamic IP addresses are possible

* Broadcasting has a negative performance impact (it is delivered separately to each CT)

==== Real network device in a container ====

The host system administrator can assign a network device (such as <code>eth1</code>) to a container. The container administrator can then manage it as usual.

Major features are:

* Best performance

* Low security

* Container is tied to hardware

=== Performance and density ===

* See [[Performance]]

* Density: {{FIXME|add graphs}}

=== Limitations ===

From the point of view of a container owner, it looks and feels like a real system. Nevertheless, it is important to understand what the container limitations are:

* An OpenVZ container only runs Linux (Windows or FreeBSD is not an option), although different distributions run perfectly.

* A container can't boot or use its own kernel (it uses the host system kernel).

* A container can't load its own kernel modules (it uses the host system kernel modules).

* By default, a container can't set system time. Such permission should be explicitly granted by the host system administrator.

* By default, a container does not have direct access to hardware such as a hard drive, network card, or a PCI device. Such access can be granted by the host system administrator if needed.

=== OpenVZ host system scope ===

From the host system, all container processes are visible, and all the container files are accessible (under <code>/vz/root/$CTID</code>). The host system administrator can set containers' parameters, access all container files, send signals to container processes etc. Container mass-management is easy with some shell scripting and commands like <code>exec</code> and <code>enter</code>. For example, to add a user jack to all running containers, the following command can be used:

 for CT in $(vzlist -H -o ctid); do vzctl set $CT --userpasswd jack:secret; done
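The same loop can also be sketched in Python, which may be convenient when the mass-management logic grows beyond a one-liner. The helper names below are hypothetical; only the <code>vzlist -H -o ctid</code> and <code>vzctl set ... --userpasswd</code> invocations are taken from the shell command above, and <code>add_user_everywhere</code> assumes it is run as root on a host with the OpenVZ tools installed.

```python
import subprocess


def parse_ctids(vzlist_output):
    """Extract container IDs from `vzlist -H -o ctid` output (one per line)."""
    return [line.strip() for line in vzlist_output.splitlines() if line.strip()]


def userpasswd_commands(ctids, user, password):
    """Build one `vzctl set ... --userpasswd` argument list per container."""
    return [["vzctl", "set", ctid, "--userpasswd", f"{user}:{password}"]
            for ctid in ctids]


def add_user_everywhere(user, password):
    """Run the commands on a real OpenVZ host (needs root and the vz tools)."""
    listing = subprocess.run(["vzlist", "-H", "-o", "ctid"],
                             capture_output=True, text=True, check=True)
    for cmd in userpasswd_commands(parse_ctids(listing.stdout), user, password):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the exact commands before anything is changed in the containers.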

Kir
