This article tries to summarize why ploop is needed, and why it is a better technology.

== Before ploop ==

First of all, a few facts about the pre-ploop era technologies and their limitations.

As you are probably aware, a container file system was just a directory on the host, which a new container was chroot()-ed into. Although it seems like a good and natural idea, there are a number of limitations.

<ol>
<li>Since containers live on one and the same file system, they all share the common properties of that file system (its type, block size, and other options). That means we cannot configure these properties on a per-container basis.</li>

<li>One such property that deserves a special item in this list is the file system journal. While a journal is a good thing to have, because it helps to maintain file system integrity, it is also a bottleneck: if one container overloads the journal (say, with many small operations such as file truncates), all the other containers' I/O will block waiting for the journal to be written to disk. In some extreme cases we saw up to 15 seconds of such blockage [this can be mitigated by using journal_async_commit in the mount options].</li>

<li>Since many containers share the same file system with limited space, in order to limit container disk space we had to develop per-directory disk quotas (i.e. vzquota).</li>

<li>Since many containers share the same file system, and the number of inodes on a file system is limited [though it can be increased at file system creation time], vzquota also has to be able to limit inodes on a per-container (per-directory) basis.</li>

<li>In order for in-container (aka second-level) disk quota (i.e. standard per-user and per-group UNIX disk quota) to work, we had to provide a dummy file system called simfs. Its sole purpose is to have a superblock, which is needed for disk quota to work.</li>

<li>When doing a live migration without some sort of shared storage (like NAS or SAN), we sync the files to a destination system using … those apps do not survive the migration.</li>

<li>Finally, a container backup or snapshot is harder to do, because there are a lot of small files that need to be copied.</li>
</ol>

== Introducing ploop ==

In order to address the above problems and ultimately make the world a better place, we decided to implement a container-in-a-file technology, not unlike what various VM products are using, but working as efficiently as all the other container bits and pieces in OpenVZ.

The main idea of ploop is to have an image file, use it as a block device, and create and use a file system on that device. Some readers will recognize that this is exactly what the Linux loop device does! Right; the only problem is that the loop device is very inefficient (for example, using it leads to double caching of data in memory) and its functionality is very limited.

=== Modular design ===

The ploop implementation in the kernel has a modular and layered design. The top layer is the main ploop module, which provides the virtual block device to be used for the container file system.

The middle layer is the format module, which translates block device block numbers into image file block numbers. A simple format module called "raw" does a trivial 1:1 translation, the same as the existing loop device. A more sophisticated format module keeps a translation table and is able to dynamically grow and shrink the image file. That means that if you create a container with 2 GB of disk space, the image file size will not be 2 GB, but less -- the size of the actual data stored in the container.

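The translation-table idea above can be sketched in a few lines of Python. This is a toy model, not the real kernel code: the class names and the in-memory dictionary are illustrative assumptions, standing in for the on-disk block allocation map.

```python
class RawFormat:
    """Trivial 1:1 mapping, like the classic loop device."""
    def map_block(self, dev_block):
        return dev_block  # image block number == device block number

class ExpandedFormat:
    """Keeps a translation table and grows the image on demand."""
    def __init__(self):
        self.table = {}          # device block -> image block
        self.next_image_block = 0

    def map_block(self, dev_block):
        # Allocate an image block only on first use, so the image
        # file holds just the data actually stored in the container.
        if dev_block not in self.table:
            self.table[dev_block] = self.next_image_block
            self.next_image_block += 1
        return self.table[dev_block]

fmt = ExpandedFormat()
# Writes to scattered device blocks land in densely packed image blocks:
print([fmt.map_block(b) for b in (1000, 5, 1000)])  # [0, 1, 0]
```

Because blocks are only allocated on first use, a mostly empty 2 GB container produces a much smaller image file, which is the behavior described above.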
It is also possible to support other image formats by writing additional ploop format modules, such as one for QCOW2 (the format used by QEMU and KVM).

The bottom layer is the I/O module. Currently, modules for direct I/O on an ext4 device and for NFS are available. There are plans to also have a generic VFS module, which would be able to store images on any decent file system, but that requires an efficient direct I/O implementation in the VFS layer, which is still being worked on.

=== Write tracker ===

Write tracker is a feature of ploop designed for live migration. When write tracker is turned on, the kernel memorizes the list of modified data blocks. This list can then be used to efficiently migrate a ploop device to a different physical server with minimal container downtime. User-space support for this is implemented in the '''ploop copy''' tool and is used by the '''vzmigrate''' utility.

The idea is to do iterative migration of an image file, in the following way:
# Turn the write tracker feature on. Now the kernel will keep track of ploop image blocks being modified.
# Copy all blocks of the ploop image file to the destination system.
# Ask the write tracker which blocks were modified.
# Copy only those blocks.
# Repeat steps 3 and 4 until the number of modified blocks stops decreasing.
# Freeze the container processes and repeat steps 3 and 4 one last time.

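The iterative copy loop above can be sketched as follows. This is a toy model, not the real implementation: the WriteTracker class, the dict-based images, and the freeze callback are all made-up illustrations; in reality the ploop kernel module tracks the dirty blocks and the '''ploop copy''' tool does the copying.

```python
class WriteTracker:
    """Toy stand-in for the kernel's modified-block tracking."""
    def __init__(self):
        self.dirty = set()

    def record_write(self, block):
        self.dirty.add(block)

    def fetch_and_clear(self):
        blocks, self.dirty = self.dirty, set()
        return blocks

def migrate(src_image, dst_image, tracker, freeze_container):
    dst_image.update(src_image)            # step 2: copy all blocks
    prev = None
    while True:
        dirty = tracker.fetch_and_clear()  # step 3: ask the tracker
        for b in dirty:
            dst_image[b] = src_image[b]    # step 4: copy only those
        if prev is not None and len(dirty) >= prev:
            break                          # step 5: no longer decreasing
        prev = len(dirty)
    freeze_container()                     # step 6: freeze ...
    for b in tracker.fetch_and_clear():
        dst_image[b] = src_image[b]        # ... and do the final sync
```

Because the container keeps running during the loop, only the small final pass happens while it is frozen, which is what keeps the downtime minimal.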
See the [http://openvz.livejournal.com/41835.html Effective live migration with ploop write tracker] blog post for more details.

=== Snapshots ===

With ploop, one can instantly create file system snapshots. Snapshots are described in the [http://openvz.livejournal.com/44508.html ploop snapshots and backups] blog post.

== Benefits ==

* The file system journal is no longer a bottleneck [if you were not already using the journal_async_commit mount option]
* Management operations do I/O on a few large image files instead of on lots of small files
* Disk space quota can be implemented on top of the virtual device sizes; there is no need for per-directory quotas
* The number of inodes does not have to be limited, because inodes are no longer a shared resource (each container has its own file system) [though each such file system still has its own inode limit]
* Live backup is easy and consistent
* Live migration is reliable and efficient
* Different containers may use file systems of different types and properties

In addition:
* [Potential] support for QCOW2 and other image formats
* Support for different storage types

== Disadvantages ==
* Boot delays in containers after a number of restarts, or after a system crash, due to forced fsck runs when ext3/ext4 file systems are used
* A container fails to start when fsck finds file system inconsistencies that need manual intervention
* Increased risk of unrecoverable errors after a container crash
* Greatly increased risk of unrecoverable errors when used over NFS, due to network instability
* Extra I/O when shrinking a ploop image, due to block relocation [the amount varies with file system fragmentation]
* Slightly lower performance due to the additional ploop layers
* Manual defragment and compact operations are needed to reclaim the hardware node's free space wasted by blocks that are allocated but no longer used inside each container
* Additional space is wasted on the extra file system metadata and image format overhead
* No support for bind mounts from the hardware node to other disks (e.g. for backups) [this can be worked around by using "loopback" NFS-like mounts from the hardware node, at some performance cost]

== See also ==
* [[Ploop]]

[[Category: Storage]]