This article tries to summarize why ploop is needed, and why it is a better technology.

== Before ploop ==

First of all, a few facts about the pre-ploop era technologies and their
limitations.

As you are probably aware, a container file system was just a directory
on the host, which a new container was chroot()-ed into. Although it
seems like a good and natural idea, there are a number of limitations.

<ol>
<li>Since containers live on one and the same file system, they all
share common properties of that file system (its type, block size,
and other options). That means we cannot configure these properties
on a per-container basis.</li>

<li>One such property that deserves a special item in this list is the
file system journal. While a journal is a good thing to have, because
it helps to maintain file system integrity, it is shared by all
containers, so it becomes a bottleneck: if one container floods the
journal (say, by doing lots of file truncates), all the other containers' I/O will block waiting
for the journal to be written to disk. In some extreme cases we saw
up to 15 seconds of such blockage (this can be mitigated by using
the journal_async_commit mount option).</li>

<li>Since many containers share the same file system with limited space,
in order to limit container disk space we had to develop per-directory
disk quotas (i.e. vzquota).</li>

<li>Since many containers share the same file system, and the number
of inodes on a file system is limited (although it can be increased at
file system creation time), vzquota should also be able to limit inodes
on a per-container (per-directory) basis.</li>

<li>In order for the in-container (aka second-level) disk quota
(i.e. standard per-user and per-group UNIX disk quota) to work,
we had to provide a dummy file system called simfs. Its sole
purpose is to have a superblock, which is needed for disk quota
to work.</li>

<li>When doing a live migration without some sort of shared storage
(like NAS or SAN), we sync the files to a destination system using
rsync, and then sync them again with the container frozen. The final
sync adds to the migration downtime, and since copying is done on the
file level, some file properties may not be preserved exactly;
applications relying on those properties do not survive the migration.</li>

<li>Finally, a container backup or snapshot is harder to do, because
there are a lot of small files that need to be copied.</li>
</ol>

== Introducing ploop ==

In order to address the above problems and ultimately make the world a better
place, we decided to implement a container-in-a-file technology, not
unlike what various VM products are using, but working
as efficiently as all the other container bits and pieces in OpenVZ.

The main idea of ploop is to have an image file, use it as a block
device, and create and use a file system on that device. Some readers
will recognize that this is exactly what the Linux loop device does!
Right, except that the loop device is very inefficient (for example,
using it leads to double caching of data in memory) and its
functionality is very limited.

=== Modular design ===

The ploop implementation in the kernel has a modular and layered design.
The top layer is the main ploop module, which provides a virtual block
device to be used for the CT file system.

The middle layer is the format module, which translates
block device block numbers into image file block numbers. A simple format
module called "raw" does a trivial 1:1 translation, the same as the
existing loop device. A more sophisticated format module keeps a
translation table and is able to dynamically grow and shrink the image
file. That means that if you create a container with 2GB of disk space,
the image file size will not be 2GB, but less -- the size of the actual
data stored in the container.

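The translation table described above can be sketched as follows. This is an illustrative Python model, not the real kernel code; the class and method names are made up. It shows the key property of a non-raw format module: image blocks are allocated lazily on first write, so the image file only holds as many blocks as were ever written.

```python
# Illustrative sketch of a ploop-style format module: device block
# numbers are mapped to image-file block numbers via a table, and an
# image block is allocated only the first time a device block is written.

class FormatModule:
    def __init__(self):
        self.table = {}        # device block number -> image block number
        self.next_free = 0     # next unallocated block in the image file

    def map_write(self, dev_block):
        # Allocate a fresh image block on the first write to this device
        # block; repeated writes reuse the existing mapping.
        if dev_block not in self.table:
            self.table[dev_block] = self.next_free
            self.next_free += 1
        return self.table[dev_block]

    def image_blocks_used(self):
        return self.next_free

fmt = FormatModule()
# Writes to scattered device blocks land in consecutive image blocks:
for b in (7, 1000, 7, 42):
    fmt.map_write(b)
print(fmt.image_blocks_used())  # 3 -- only three distinct blocks were written
```

A "raw" format module would instead return `dev_block` unchanged, which is why a raw image is always as large as the virtual device.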
It is also possible to support other image formats by writing other
ploop format modules, such as the one for QCOW2 (used by QEMU and KVM).

The bottom layer is the I/O module. Currently, modules for direct I/O
on an ext4 device and for NFS are available. There are also plans for
a generic VFS module, which would be able to store images on any
decent file system, but that needs an efficient direct I/O
implementation in the VFS layer, which is still being worked on.

=== Write tracker ===

The write tracker is a feature of ploop designed for live migration. When the write tracker is turned on, the kernel memorizes a list of modified data blocks. This list can then be used to efficiently migrate a ploop device to a different physical server, with minimal container downtime. User-space support for this is implemented in the '''ploop copy''' tool and is used by the '''vzmigrate''' utility.

The idea is to do iterative migration of an image file, in the following
way:
# Turn the write tracker feature on. Now the kernel will keep track of ploop image blocks being modified.
# Copy all blocks of the ploop image file to a destination system.
# Ask the write tracker which blocks were modified.
# Copy only these blocks.
# Repeat steps 3 and 4 until the number of modified blocks stops decreasing.
# Freeze the container processes and repeat steps 3 and 4 one last time.

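The steps above can be sketched as a loop. This is an illustrative Python model, assuming hypothetical stand-ins (`FakeTracker`, `send`, `freeze`) for the kernel write-tracker API and the network transfer done by '''ploop copy'''; it only demonstrates the control flow, with the container downtime limited to the final, smallest pass.

```python
# Iterative migration: full copy first, then keep re-sending the blocks
# dirtied during the previous pass until the dirty set stops shrinking,
# and finish with one last pass while the container is frozen.

def migrate(nblocks, tracker, send, freeze):
    tracker.start()                       # step 1: start tracking writes
    send(list(range(nblocks)))            # step 2: copy all blocks
    prev = nblocks
    while True:
        dirty = tracker.fetch_and_clear_dirty()   # step 3
        send(dirty)                               # step 4
        if len(dirty) >= prev:                    # step 5: stop once the
            break                                 # dirty set stops shrinking
        prev = len(dirty)
    freeze()                                      # step 6: freeze the CT,
    send(tracker.fetch_and_clear_dirty())         # then one final tiny pass

class FakeTracker:
    """Pretends the container dirties fewer blocks on each pass."""
    def __init__(self, passes):
        self.passes = list(passes)
    def start(self):
        pass
    def fetch_and_clear_dirty(self):
        return self.passes.pop(0) if self.passes else []

sent = []
migrate(8, FakeTracker([[1, 2, 3], [2, 3], [3], [3]]),
        sent.append, lambda: None)
print(sent[-1])  # [] -- nothing left to copy while the container is frozen
```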
See the [http://openvz.livejournal.com/41835.html Effective live migration with ploop write tracker] blog post for more details.

=== Snapshots ===

With ploop, one can instantly create file system snapshots. Snapshots are described in the [http://openvz.livejournal.com/44508.html ploop snapshots and backups] blog post.

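Ploop snapshots are based on stacking delta images: taking a snapshot starts a new, empty top delta; writes go only to the top delta, while reads fall through to the newest delta that holds the block. The following is a simplified Python sketch of that copy-on-write idea; the class and method names are illustrative, not the real API.

```python
# Copy-on-write over a stack of deltas: blocks are looked up top-down,
# and a snapshot is just a new empty delta pushed on top of the stack.

class DeltaStack:
    def __init__(self, base):
        self.deltas = [dict(base)]   # bottom delta holds the base data

    def snapshot(self):
        # Freeze the current state by starting a fresh, empty top delta.
        self.deltas.append({})

    def write(self, block, data):
        self.deltas[-1][block] = data    # only the top delta is modified

    def read(self, block):
        # The newest delta that contains the block wins.
        for delta in reversed(self.deltas):
            if block in delta:
                return delta[block]
        return b"\0"                      # never-written blocks read as zero

disk = DeltaStack({0: b"v1"})
disk.snapshot()           # old state is preserved in the lower delta
disk.write(0, b"v2")
print(disk.read(0))       # b'v2' -- the live view sees the new data
print(disk.deltas[0][0])  # b'v1' -- the snapshot still holds the old data
```

Because a snapshot only creates a new delta, it is a constant-time operation regardless of the container size, which is why snapshots are instant.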
== Benefits ==

* The file system journal is no longer a bottleneck (at least if you were not already using the journal_async_commit mount option)
* Management operations do I/O on a few large image files instead of lots of small files
* Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
* The number of inodes doesn't have to be limited, because it is no longer a shared resource (each CT has its own file system), although each of these file systems still has its own inode limit
* Live backup is easy and consistent
* Live migration is reliable and efficient
* Different containers may use file systems of different types and properties

In addition:
* (Potential) support for QCOW2 and other image formats
* Support for different storage types

== Disadvantages ==
* Boot delays in containers after a number of restarts, or after a system crash, due to forced fsck runs when using ext3/4 file systems
* Container start fails when fsck finds file system inconsistencies that need manual intervention
* Increased risk of unrecoverable errors after container crashes
* Greatly increased risk of unrecoverable errors when used over NFS, due to network instability
* Extra I/O when shrinking a ploop image, due to block relocation (the amount varies with file system fragmentation)
* Slightly lower performance due to the additional ploop layers
* Manual defragmentation and compaction operations are needed to recover hardware node free space wasted by allocated but no-longer-used blocks in each container
* Additional space is wasted on the extra file system metadata and image format
* No support for bind mounts from the hardware node to other disks (like backups); this can be worked around with "loopback" NFS-like mounts on the hardware node, at some performance cost

== See also ==
* [[Ploop]]

[[Category: Storage]]