Editing Ploop/Why
This article tries to summarize why ploop is needed and why it is a better technology.

== Before ploop ==

First of all, a few facts about the pre-ploop era technologies and their
limitations.

As you are probably aware, a container file system was just a directory
on the host, which a new container was chroot()-ed into. Although it
seems like a good and natural idea, there are a number of limitations.

<ol>
<li>Since containers live on the same file system, they all
share common properties of that file system (its type, block size,
and other options). That means we cannot configure the above properties
on a per-container basis.</li>

<li>One such property that deserves a special item in this list is
the file system journal. While a journal is a good thing to have, because
Line 30: | Line 24: | ||
file truncates), all the other containers' I/O will block waiting
for the journal to be written to disk. In some extreme cases we saw
up to 15 seconds of such blockage.</li>

<li>Since many containers share the same file system with limited space,
in order to limit containers' disk space we had to develop per-directory
disk quotas (i.e. vzquota).</li>

<li>Since many containers share the same file system, and the number
of inodes on a file system is limited [for most file systems], vzquota
should also be able to limit inodes on a per-container (per-directory)
basis.</li>

<li>In order for in-container (aka second-level) disk quota
(i.e. standard per-user and per-group UNIX disk quota) to work,
Line 51: | Line 41: | ||
to work.</li>

<li>When doing a live migration without some sort of shared storage
(like NAS or SAN), we sync the files to a destination system using
Line 59: | Line 48: | ||
those apps do not survive the migration.</li>

<li>Finally, a container backup or snapshot is harder to do because
there are a lot of small files that need to be copied.</li>
</ol>

== Introducing ploop ==

In order to address the above problems and ultimately make the world a better
place, we decided to implement a container-in-a-file technology, not
Line 72: | Line 59: | ||
as effectively as all the other container bits and pieces in OpenVZ.

The main idea of ploop is to have an image file, use it as a block
device, and create and use a file system on that device. Some readers
Line 80: | Line 66: | ||
is very limited.

=== Modular design ===

The ploop implementation in the kernel has a modular and layered design.
The top layer is the main ploop module, which provides a virtual block
device to be used for the CT file system.

The middle layer is the format module, which translates
block device block numbers into image file block numbers. A simple format
Line 97: | Line 81: | ||
data stored in the container.

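To illustrate what "translation of block device block numbers into image file block numbers" can look like, here is a minimal sketch. It is not ploop's actual on-disk format: the class name <code>GrowingImageMap</code>, its methods, and the allocate-on-first-write policy are all assumptions made for this example.

```python
from typing import Dict, Optional


class GrowingImageMap:
    """Hypothetical translation table: virtual block number -> image file block number."""

    def __init__(self) -> None:
        self.table: Dict[int, int] = {}
        self.next_free = 0  # index of the next unused block in the image file

    def lookup(self, virtual_block: int) -> Optional[int]:
        """Return the image file block for a read, or None if never written."""
        return self.table.get(virtual_block)

    def map_for_write(self, virtual_block: int) -> int:
        """Return the image file block for a write, allocating one on first use."""
        if virtual_block not in self.table:
            self.table[virtual_block] = self.next_free
            self.next_free += 1
        return self.table[virtual_block]

    def image_blocks_used(self) -> int:
        """Number of blocks the image file occupies so far."""
        return self.next_free


image_map = GrowingImageMap()
image_map.map_for_write(1000)  # first write to a high virtual block
image_map.map_for_write(7)     # first write to a low virtual block
image_map.map_for_write(1000)  # rewrite: reuses the existing mapping
```

In this sketch the image file only holds blocks that were actually written, regardless of how large the virtual block numbers are.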
It is also possible to support other image formats by writing other
ploop format modules, such as one for QCOW2 (used by QEMU and KVM).

The bottom layer is the I/O module. Currently, modules for direct I/O
on an ext4 device and for NFS are available. There are plans to also
Line 108: | Line 90: | ||
in the VFS layer which is still being worked on.

=== Write tracker ===

Write tracker is a feature of ploop designed for live migration. When the write tracker is turned on, the kernel records a list of modified data blocks. This list can then be used to efficiently migrate a ploop device to a different physical server with minimal container downtime. User-space support for this is implemented in the '''ploop copy''' tool and is used by the '''vzmigrate''' utility.

The idea is to do iterative migration of an image file, in the following
way:
Line 123: | Line 103: | ||
# Freeze the container processes and repeat steps 3 and 4 one last time.

See [http://openvz.livejournal.com/41835.html Effective live migration with ploop write tracker] blog post for more details.
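
The iterative idea behind the write tracker can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the '''ploop copy''' implementation: <code>TrackedImage</code>, <code>migrate</code>, <code>shrinking_workload</code>, the <code>threshold</code> and <code>max_rounds</code> parameters, and the exact order of steps are all hypothetical. It only shows a full copy followed by repeated copies of the blocks reported as modified, with a final pass after the writer is stopped.

```python
from typing import Callable, List, Set


class TrackedImage:
    """Toy image that remembers which blocks were written while tracking is on."""

    def __init__(self, nblocks: int) -> None:
        self.blocks: List[bytes] = [b"\0"] * nblocks
        self.dirty: Set[int] = set()
        self.tracking = False

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = data
        if self.tracking:
            self.dirty.add(n)

    def take_dirty(self) -> Set[int]:
        """Return the current dirty set and start a fresh one."""
        taken, self.dirty = self.dirty, set()
        return taken


def migrate(src: TrackedImage, dst: List[bytes],
            workload: Callable[[TrackedImage], None],
            threshold: int = 1, max_rounds: int = 8) -> int:
    """Copy src into dst while the workload keeps writing; return rounds used."""
    src.tracking = True
    dst[:] = src.blocks          # initial full copy
    workload(src)                # writes that happened during the full copy
    rounds = 0
    while len(src.dirty) > threshold and rounds < max_rounds:
        for n in src.take_dirty():
            dst[n] = src.blocks[n]
        workload(src)            # container keeps running during this pass
        rounds += 1
    # "Freeze": the workload is not called any more, so this pass is final.
    for n in src.take_dirty():
        dst[n] = src.blocks[n]
    src.tracking = False
    return rounds


def shrinking_workload(write_counts: List[int]) -> Callable[[TrackedImage], None]:
    """Build a workload that writes fewer blocks on each call (illustrative only)."""
    remaining = list(write_counts)

    def workload(image: TrackedImage) -> None:
        count = remaining.pop(0) if remaining else 0
        for n in range(count):
            image.write(n, bytes([len(remaining) + 1]))

    return workload


src = TrackedImage(16)
dst: List[bytes] = []
rounds = migrate(src, dst, shrinking_workload([4, 2, 1]))
```

The point of the sketch is that only the final pass happens while the container is frozen, and by then the list of modified blocks is short.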

== Benefits ==

* The file system journal is not a bottleneck anymore
* Large-size image file I/O instead of lots of small-size file I/O on management operations
* Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
* Number of inodes doesn't have to be limited because this is not a shared resource anymore (each CT has its own file system)
* Live backup is easy and consistent
* Live migration is reliable and efficient
* Different containers may use file systems of different types and properties

In addition:
* Efficient container creation
* [Potential] support for QCOW2 and other image formats
* Support for different storage types