Difference between revisions of "Ploop/Why"
Revision as of 21:15, 23 March 2012
This article summarizes why ploop is needed and why it is a better technology.
Before ploop
First of all, a few facts about the pre-ploop era technologies and their limitations.
As you are probably aware, a container file system was just a directory on the host, which a new container was chroot()-ed into. Although it seems like a good and natural idea, there are a number of limitations.
- Since all containers live on the same file system, they share the properties of that file system (its type, block size, and other options). That means we cannot configure these properties on a per-container basis.
- One such property that deserves a separate item in this list is the file system journal. While a journal is a good thing to have, because it helps maintain file system integrity and improves reboot times (by eliminating fsck in many cases), it is also a bottleneck for containers. If one container fills up the in-memory journal (with lots of small operations leading to file metadata updates, e.g. file truncates), all the other containers' I/O will block waiting for the journal to be written to disk. In some extreme cases we saw up to 15 seconds of such blockage.
- There is no such thing as a per-directory disk quota in Linux, so in order to limit a container's disk space we had to develop one, called vzquota.
- When doing a live migration without some sort of shared storage (like NAS or SAN), we sync the files to the destination system using rsync, which makes an exact copy of all files, except that their i-node numbers on disk will change. If some apps rely on files' i-node numbers being constant (which they normally are), those apps do not survive the migration.
- Finally, a container backup or snapshot is harder to do because there are a lot of small files that need to be copied.
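The i-node issue from the live migration item can be seen on any Linux system. The following is a minimal sketch, not part of OpenVZ: it copies a file within a temporary directory (the file names and the helper `inode_after_copy` are made up for illustration) and compares the i-node numbers of the original and the copy. A file-level copy such as the one rsync performs behaves the same way in this respect.

```python
import os
import shutil
import tempfile


def inode_after_copy():
    """Copy a file and return (i-node of the source, i-node of the copy)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "original.txt")  # hypothetical file names
        dst = os.path.join(tmp, "copy.txt")
        with open(src, "w") as f:
            f.write("container data\n")
        # copy2 preserves contents and timestamps, but the copy is a new
        # file and therefore gets its own i-node.
        shutil.copy2(src, dst)
        return os.stat(src).st_ino, os.stat(dst).st_ino


src_ino, dst_ino = inode_after_copy()
print("i-node numbers differ:", src_ino != dst_ino)
```

An application that stored the original i-node number would no longer find the file under that number after such a copy.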
Introducing ploop
In order to address the above problems and ultimately make the world a better place, we decided to implement a container-in-a-file technology, not unlike what various VM products use, but working as effectively as all the other container bits and pieces in OpenVZ.
The main idea of ploop is to have an image file, use it as a block device, and create and use a file system on that device. Some readers will recognize that this is exactly what the Linux loop device does! True, but the loop device is very inefficient (for example, using it leads to double caching of data in memory) and its functionality is very limited.
Modular design
The ploop implementation in the kernel has a modular and layered design. The top layer is the main ploop module, which provides a virtual block device to be used for the CT file system.
The middle layer is the format module, which translates block device block numbers into image file block numbers. A simple format module called "raw" does a trivial 1:1 translation, the same as the existing loop device. A more sophisticated format module keeps a translation table and is able to dynamically grow and shrink the image file. That means that if you create a container with 2GB of disk space, the image file size will not be 2GB but less: the size of the actual data stored in the container.
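To illustrate the difference between the two kinds of format module, here is a minimal user-space sketch. It is not the kernel code and does not reflect the real on-disk format: the class names, the `BLOCK_SIZE` value, and the dictionary-based table are all assumptions made for the example, and only growing (not shrinking) is shown.

```python
BLOCK_SIZE = 4096  # hypothetical block size in bytes


class RawFormat:
    """Trivial 1:1 mapping, as done by the plain loop device."""

    def translate(self, device_block):
        return device_block


class TableFormat:
    """Keeps a translation table; image blocks are allocated on first write."""

    def __init__(self):
        self.table = {}      # device block number -> image block number
        self.next_free = 0   # next unused block in the image file

    def translate(self, device_block, allocate=False):
        if device_block not in self.table:
            if not allocate:
                return None  # never written: takes no space in the image
            self.table[device_block] = self.next_free
            self.next_free += 1
        return self.table[device_block]

    def image_size(self):
        """Bytes of data blocks currently held by the image file."""
        return self.next_free * BLOCK_SIZE


fmt = TableFormat()
# Write to three scattered blocks of a 2GB virtual device (524288 blocks).
for block in (0, 1000, 524287):
    fmt.translate(block, allocate=True)
print(fmt.image_size())  # 12288
```

With the raw mapping, writing to the last block would force the image to span the full 2GB; with the table, the image holds only the three blocks that were actually written.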
It is also possible to support other image formats by writing other ploop format modules, such as the one for QCOW2 (used by QEMU and KVM).
The bottom layer is the I/O module. Currently, modules for direct I/O on an ext4 device and for NFS are available. There are also plans for a generic VFS module, which would be able to store images on any decent file system, but that needs an efficient direct I/O implementation in the VFS layer, which is still being worked on.
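The way the three layers stack can be sketched in user space as follows. This is only an illustration of the layering, under stated assumptions: `FileIO`, `OneToOneFormat`, `VirtualBlockDevice`, and the 512-byte `BLOCK_SIZE` are hypothetical names and values, and ordinary buffered file I/O stands in for the direct I/O that the real modules use.

```python
import os
import tempfile

BLOCK_SIZE = 512  # hypothetical block size in bytes


class FileIO:
    """Bottom layer: reads and writes image blocks in a regular file."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

    def read_block(self, n):
        data = os.pread(self.fd, BLOCK_SIZE, n * BLOCK_SIZE)
        return data.ljust(BLOCK_SIZE, b"\0")  # reads past EOF yield zeros

    def write_block(self, n, data):
        os.pwrite(self.fd, data, n * BLOCK_SIZE)

    def close(self):
        os.close(self.fd)


class OneToOneFormat:
    """Middle layer: maps device blocks to image blocks (here 1:1)."""

    def to_image_block(self, device_block):
        return device_block


class VirtualBlockDevice:
    """Top layer: what the container file system would sit on."""

    def __init__(self, fmt, io):
        self.fmt = fmt
        self.io = io

    def read(self, block):
        return self.io.read_block(self.fmt.to_image_block(block))

    def write(self, block, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.io.write_block(self.fmt.to_image_block(block), data)


def demo():
    """Write one block through all three layers and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        io = FileIO(os.path.join(tmp, "image.raw"))
        dev = VirtualBlockDevice(OneToOneFormat(), io)
        payload = b"x" * BLOCK_SIZE
        dev.write(3, payload)
        ok = dev.read(3) == payload and dev.read(0) == b"\0" * BLOCK_SIZE
        io.close()
        return ok


print(demo())
```

Because each layer only talks to the one below it through a small interface, swapping the format or the I/O backend does not affect the other layers, which is the point of the modular design described above.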
