Ploop/Why

<translate>
<!--T:1-->
This article tries to summarize why ploop is needed, and why it is a better technology.
== Before ploop == <!--T:2-->

<!--T:3-->
First of all, a few facts about the pre-ploop era technologies and their
limitations.
  
<!--T:4-->
As you are probably aware, a container file system was just a directory
on the host, which a new container was chroot()-ed into. Although it
seems like a good and natural idea, there are a number of limitations.
  
<!--T:5-->
<ol>
<li>Since containers live on one and the same file system, they all
share common properties of that file system (its type, block size,
and other options). That means we cannot configure these properties
on a per-container basis.</li>
  
<!--T:6-->
<li>One such property that deserves a special item in this list is the
file system journal. While a journal is a good thing to have, because
it helps maintain file system integrity, it is also a shared bottleneck:
if one container floods the journal (e.g. with lots of operations such as
file truncates), all the other containers' I/O will block waiting
for the journal to be written to disk. In some extreme cases we saw
up to 15 seconds of such blockage [this can be mitigated by using
journal_async_commit in mount options].</li>
 
  
<!--T:7-->
<li>Since many containers share the same file system with limited space,
in order to limit containers' disk space we had to develop per-directory
disk quotas (i.e. vzquota).</li>
 
  
<!--T:8-->
<li>Since many containers share the same file system, and the number
of inodes on a file system is limited [though it can be increased at
file system creation time], vzquota should also be able to limit inodes
on a per-container (per-directory) basis.</li>
 
 
<!--T:9-->
<li>In order for in-container (aka second-level) disk quota
(i.e. standard per-user and per-group UNIX disk quota) to work,
we had to provide a dummy file system called simfs. Its sole
purpose is to have a superblock, which is needed for disk quota
to work.</li>
 
 
<!--T:10-->
<li>When doing a live migration without some sort of shared storage
(like NAS or SAN), we sync the files to a destination system using
a file-level copy. Since a file-level sync cannot be made fully atomic
for running applications, some of
those apps do not survive the migration.</li>
  
<!--T:11-->
<li>Finally, a container backup or snapshot is harder to do because
there are a lot of small files that need to be copied.</li>
</ol>
  
== Introducing ploop == <!--T:12-->

<!--T:13-->
 
In order to address the above problems and ultimately make the world a better
place, we decided to implement a container-in-a-file technology, not
unlike the disk image files used by virtual machines, but working
as effectively as all the other container bits and pieces in OpenVZ.
  
<!--T:14-->
The main idea of ploop is to have an image file, use it as a block
device, and create and use a file system on that device. Some readers
will recognize this as a loop device; indeed it is similar, but the
functionality of the standard Linux loop device
is very limited.
  
=== Modular design === <!--T:15-->

<!--T:16-->
 
The ploop implementation in the kernel has a modular and layered design.
The top layer is the main ploop module, which provides a virtual block
device to be used for a container (CT) file system.
  
<!--T:17-->
The middle layer is the format module, which does translation of
block device block numbers into image file block numbers. A simple format
module provides a trivial 1:1 mapping, while the expanded format allocates
image blocks on demand, so the image file grows with the amount of
data stored in the container.
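The on-demand translation done by such a format module can be illustrated with a small Python sketch. This is a toy model for explanation only, not the actual kernel code or its on-disk data structures: device blocks are looked up in a mapping table, and a block is allocated at the end of the image the first time it is written, so the image only grows with real usage.

```python
# Toy model of an expanded-format block translation table (hypothetical,
# for illustration; real ploop keeps this mapping in the image header).

class ExpandedFormat:
    def __init__(self):
        self.mapping = {}          # device block number -> image block number
        self.next_image_block = 0  # next free block at the end of the image

    def translate(self, dev_block, write=False):
        """Return the image block backing dev_block, or None if unallocated."""
        if dev_block not in self.mapping:
            if not write:
                return None        # reading a never-written block: all zeroes
            # First write to this block: allocate space at the end of the image
            self.mapping[dev_block] = self.next_image_block
            self.next_image_block += 1
        return self.mapping[dev_block]

fmt = ExpandedFormat()
fmt.translate(100, write=True)   # first write allocates image block 0
fmt.translate(7, write=True)     # next write allocates image block 1
assert fmt.translate(100) == 0   # reads go through the same mapping
assert fmt.translate(5) is None  # unallocated block, nothing in the image
```

Note how device block numbers can be sparse and large while the image file stays compact: only blocks that were actually written consume image space.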
  
<!--T:18-->
It is also possible to support other image formats by writing other
ploop format modules, such as the one for QCOW2 (used by QEMU and KVM).
  
<!--T:19-->
The bottom layer is the I/O module. Currently, modules for direct I/O
on an ext4 device and for NFS are available. There are plans to also
support other storage back-ends, which depends on functionality
in the VFS layer that is still being worked on.
  
=== Write tracker === <!--T:20-->

<!--T:21-->
 
Write tracker is a feature of ploop designed for live migration. When the write tracker is turned on, the kernel memorizes a list of modified data blocks. This list can then be used to efficiently migrate a ploop device to a different physical server, with minimal container downtime. User-space support for this is implemented in the '''ploop copy''' tool and is used by the '''vzmigrate''' utility.
 
 
 
<!--T:22-->
The idea is to do iterative migration of an image file, in the following
way:
# Turn the write tracker feature on. Now the kernel will keep track of ploop image blocks being modified.
# Copy all blocks of the ploop image file to the destination system.
# Ask the write tracker which blocks were modified.
# Copy only these blocks.
# Repeat steps 3 and 4 until the number of modified blocks stops decreasing.
# Freeze the container processes and repeat steps 3 and 4 one last time.
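The steps above can be sketched in Python. This is a toy model of the loop only: `FakeTracker` and the copy callback are hypothetical stand-ins for the kernel write tracker and the '''ploop copy''' tool, not real interfaces.

```python
# Toy sketch of the iterative write-tracker migration described above.
# The tracker and callbacks are hypothetical stand-ins, not ploop APIs.

def migrate(tracker, copy_blocks, freeze_container):
    copy_blocks(tracker.all_blocks())       # step 2: full initial copy
    prev = None
    while True:
        dirty = tracker.modified_blocks()   # step 3: ask the write tracker
        copy_blocks(dirty)                  # step 4: copy only dirty blocks
        # step 5: stop once the dirty set is no longer shrinking
        if prev is not None and len(dirty) >= prev:
            break
        prev = len(dirty)
    freeze_container()                      # step 6: freeze the CT...
    copy_blocks(tracker.modified_blocks())  # ...and do one final, small copy

class FakeTracker:
    """Returns a scripted sequence of dirty-block sets for the demo."""
    def __init__(self, rounds):
        self.rounds = list(rounds)
    def all_blocks(self):
        return set(range(8))
    def modified_blocks(self):
        return self.rounds.pop(0) if self.rounds else set()

copied = []
tracker = FakeTracker([{1, 2, 3, 4}, {1, 2}, {1, 2}, {2}])
migrate(tracker, copied.append, freeze_container=lambda: copied.append("freeze"))
# copy order: full image, shrinking dirty sets, freeze, one final dirty set
```

The container is only frozen for the final, smallest copy, which is what keeps the downtime minimal.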
 
  
<!--T:23-->
See the [http://openvz.livejournal.com/41835.html Effective live migration with ploop write tracker] blog post for more details.
 
 
 
=== Snapshots === <!--T:24-->
 
 
 
<!--T:25-->
 
With ploop, one can instantly create file system snapshots. Snapshots are described in [http://openvz.livejournal.com/44508.html ploop snapshots and backups] blog post.
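Conceptually, a snapshot freezes the current image state and directs all subsequent writes to a new, empty top delta, which is why it can be taken instantly. The following Python sketch is a toy model of that copy-on-write idea under these assumptions, not the real ploop image format:

```python
# Toy model of snapshotting via delta layers (hypothetical, for
# illustration): writes go to the top delta, reads walk the chain
# newest-first, and a snapshot just starts a fresh empty top delta.

class DeltaChain:
    def __init__(self):
        self.deltas = [{}]          # list of {block: data}; last one is the top

    def write(self, block, data):
        self.deltas[-1][block] = data    # writes always hit the top delta

    def read(self, block):
        for delta in reversed(self.deltas):  # newest data wins
            if block in delta:
                return delta[block]
        return b"\0"                # never-written blocks read as zeroes

    def snapshot(self):
        self.deltas.append({})      # instant: no data is copied at all

chain = DeltaChain()
chain.write(0, b"v1")
chain.snapshot()                    # freeze current state
chain.write(0, b"v2")               # goes to the new top delta
assert chain.read(0) == b"v2"       # the live view sees the new data
assert chain.deltas[0][0] == b"v1"  # the snapshot still holds the old data
```

Because the frozen layers never change after a snapshot, they can be safely backed up while the container keeps running.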
 
 
 
== Benefits == <!--T:26-->
 
 
 
<!--T:27-->
 
* The file system journal is not a bottleneck anymore [if you are not already using the journal_async_commit mount option]
* Large-size image file I/O instead of lots of small-size file I/O on management operations
* Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
* The number of inodes doesn't have to be limited, because this is no longer a shared resource (each CT has its own file system) [though each such file system still has its own inode limit]
* Live backup is easy and consistent
* Live migration is reliable and efficient
* Different containers may use file systems of different types and properties
  
<!--T:28-->
 
 
In addition:
* [Potential] support for QCOW2 and other image formats
* Support for different storage types
== Disadvantages == <!--T:29-->
 
* Boot delays in each container after a number of restarts, or after a system crash, due to forced fsck runs when using ext3/4 file systems
* Container start fails when fsck finds file system inconsistencies that need manual intervention
* Increased risk of unrecoverable errors after container crashes
* Greatly increased risk of unrecoverable errors when used over NFS, due to network instability
* Extra I/O when shrinking a ploop image, due to block re-allocation [varies with file system fragmentation]
* Slightly lower performance due to the additional ploop layers
* Manual defragment and compact operations are needed to reclaim hardnode free space wasted by allocated but no-longer-used blocks in each container
* Additional space overhead due to the extra file system metadata and image format
* No support for hardnode bind mounts to other disks (e.g. for backups) [can be worked around using "loopback" NFS-like mounts from the hardnode, at some performance cost]
 
 
 
== See also == <!--T:30-->
 
* [[Ploop]]
 
</translate>
 
 
 
[[Category: Storage]]
 
