KSM (kernel same-page merging)
<translate> KSM is a memory-saving deduplication feature developed by Red Hat. It first appeared in Linux kernel 2.6.32.
KSM replaces RAM pages with identical content by a single write-protected page. If a process later writes to that page, the kernel automatically gives it a private copy first. This technique is commonly known as COW (copy-on-write), and it makes deduplication transparent to applications.
Although KSM was originally developed for use with KVM, it can be used with OpenVZ containers as well.
Performance and overhead
KSM is most effective when the hardware node (HN) has plenty of RAM and runs many containers with the same applications (e.g. an HN running 50 containers with Apache, PHP and MySQL).
The amount of RAM saved is typically in the 30-50% range, depending on the application; applications with a mixed I/O and CPU footprint (e.g. MySQL and Apache) get the best results.[1]
Note that KSM and Virtuozzo use completely different strategies for RAM deduplication. KSM runs a kernel thread (ksmd) that continuously scans the HN's memory and gradually merges identical pages over time, whereas Virtuozzo maps requests from different containers for the same binaries to the same physical files on disk. As a result, Virtuozzo incurs no scanning overhead, while KSM costs 5-10% of CPU depending on its configuration (faster scanning of the HN's RAM requires more CPU power).
Enabling KSM on the hardware node
First, you'll want to verify that KSM support is present and enabled in your OpenVZ kernel:
[root@HN ~]# grep KSM /boot/config-`uname -r`
CONFIG_KSM=y
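If /boot/config-* is missing on your HN, the presence of the sysfs directory is another indication that KSM is compiled in. The function below is a minimal sketch combining both checks; its two arguments are hypothetical parameters added so the check can be pointed at other paths, and the defaults are the real locations used above.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if KSM appears to be available on this kernel.
# Arguments are hypothetical overrides; defaults are the real HN locations.
ksm_available() {
    config="${1:-/boot/config-$(uname -r)}"
    sysfs="${2:-/sys/kernel/mm/ksm}"
    # Compile-time check: is KSM built into this kernel?
    if [ -r "$config" ] && grep -qx 'CONFIG_KSM=y' "$config"; then
        return 0
    fi
    # Runtime check: the sysfs directory only exists when KSM is compiled in.
    [ -d "$sysfs" ]
}
```

Usage: `ksm_available && echo "KSM supported" || echo "KSM not supported"`.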
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, readable by all but writable only by root.  On your hardware node, if KSM is not yet active, you'll see these default values:
[root@HN ~]# grep -H '' /sys/kernel/mm/ksm/*
/sys/kernel/mm/ksm/full_scans:0
/sys/kernel/mm/ksm/merge_across_nodes:1
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:0
/sys/kernel/mm/ksm/run:0
/sys/kernel/mm/ksm/sleep_millisecs:20
For the meaning of each parameter, please refer to https://www.kernel.org/doc/Documentation/vm/ksm.txt
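The run file is the main on/off switch. According to the kernel documentation, 0 stops ksmd but keeps already-merged pages, 1 runs ksmd, and 2 stops ksmd and unmerges all currently merged pages. A small sketch that translates the value into words; the directory argument is a hypothetical parameter for testing.

```shell
#!/bin/sh
# Sketch: describe the current value of the KSM "run" file.
# The directory argument is a hypothetical override; default is the real sysfs path.
ksm_state() {
    sysfs="${1:-/sys/kernel/mm/ksm}"
    run=$(cat "$sysfs/run") || return 1
    case "$run" in
        0) echo "stopped (merged pages kept)" ;;
        1) echo "running" ;;
        2) echo "stopped (all pages unmerged)" ;;
        *) echo "unknown ($run)" ;;
    esac
}
```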
To start ksmd, issue
[root@HN ~]# echo 1 > /sys/kernel/mm/ksm/run
You can add the same command to /etc/rc.local on the HN to make it persistent across reboots.
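A slightly more defensive version for /etc/rc.local might only write to the file when it is writable, so a kernel without KSM does not produce a confusing redirection error at boot. This is a sketch; the path argument is a hypothetical parameter and the default is the real sysfs file.

```shell
#!/bin/sh
# Sketch of a boot-time snippet for /etc/rc.local on the HN.
# The path argument is a hypothetical override; default is the real sysfs file.
ksm_start() {
    run_file="${1:-/sys/kernel/mm/ksm/run}"
    if [ -w "$run_file" ]; then
        echo 1 > "$run_file"       # 1 = start ksmd
    else
        echo "KSM not available: $run_file is not writable" >&2
        return 1
    fi
}
```

In rc.local, define the function and then call it with no arguments: `ksm_start`.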
Verify that ksmd is running:
[root@HN ~]# ps aux | grep ksmd
root       264  0.3  0.0      0     0 ?        SN   Jun14 176:52 [ksmd]
root    989187  0.0  0.0 103252   896 pts/0    S+   09:30   0:00 grep ksmd
Enabling memory deduplication libraries in containers
In order to have KSM consider a memory page as a candidate for deduplication, the application itself must mark it as mergeable:
- KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call: int madvise(addr, length, MADV_MERGEABLE).
Note that most applications do not call madvise() at all. That is why KSM is generally used together with KVM: the hypervisor marks each virtual machine's guest memory as mergeable, so the applications running inside the VM do not have to.
Luckily, we can override the default behaviour of each application using the ksm_preload package (http://vleu.net/ksm_preload/), available in the CentOS base repository.
- Linux ≥ 2.6.32 features a memory-saving mechanism that works by deduplicating areas of memory that are identical in different processes (even if they were generated at runtime and after the fork() of their common ancestors).
- This mechanism requires the application to opt in using the madvise() syscall. KSM Preload enables legacy applications (almost any current application) to leverage this system by calling madvise(…, MADV_MERGEABLE) on every heap-allocated page.
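To see whether a process has actually marked memory areas with MADV_MERGEABLE, you can count the areas whose "VmFlags:" line in /proc/<pid>/smaps contains the "mg" (mergeable advise) flag. This sketch assumes a kernel that prints VmFlags in smaps (added in Linux 3.8); older OpenVZ kernels may not print it, in which case the count is always 0 and tells you nothing.

```shell
#!/bin/sh
# Sketch: count memory areas marked MADV_MERGEABLE in an smaps file.
# Assumes the kernel prints "VmFlags:" lines (Linux 3.8+); "mg" = mergeable advise flag.
count_mergeable() {
    smaps="$1"    # e.g. /proc/<pid>/smaps for the process you want to inspect
    awk '$1 == "VmFlags:" { for (i = 2; i <= NF; i++) if ($i == "mg") { n++; break } }
         END { print n + 0 }' "$smaps"
}
```

Usage: `count_mergeable /proc/1234/smaps`, where 1234 stands for the PID of the process you want to inspect.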
Do not use the usual # yum install ksm_preload command inside your containers, as it pulls in a long chain of unnecessary dependencies. Assuming your container runs a recent CentOS 6.x template, issue the following commands instead:
[root@container /]# cd /usr/local/src
[root@container src]# yum install -y yum-downloadonly
    ...bunch of output...
[root@container src]# yum -y --downloadonly --downloaddir=/usr/local/src install ksm_preload
    ...bunch of output...
    exiting because --downloadonly specified    (this is expected)
We can now install just the ksm_preload RPM:
[root@container src]# rpm -i ksm_preload-0.10-3.el6.x86_64.rpm --nodeps
Enabling memory deduplication in applications
To make an application use ksm_preload (and therefore KSM on the HN), add the following line to its startup script (assuming your container runs CentOS 6.x x86_64). The variable must be exported, otherwise it is not passed on to the processes the script starts:
export LD_PRELOAD=/usr/lib64/libksm_preload.so
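If you would rather not edit an init script, a small wrapper can start a single command with the library preloaded while keeping any LD_PRELOAD entries already set. This is a sketch: with_ksm_preload and the KSM_PRELOAD_LIB override are hypothetical names, and the default path is the CentOS 6.x x86_64 location given above.

```shell
#!/bin/sh
# Sketch: run a command with libksm_preload.so prepended to LD_PRELOAD.
# KSM_PRELOAD_LIB is a hypothetical override for the library path.
with_ksm_preload() {
    # ${LD_PRELOAD:+ ...} appends existing entries only if LD_PRELOAD is non-empty.
    LD_PRELOAD="${KSM_PRELOAD_LIB:-/usr/lib64/libksm_preload.so}${LD_PRELOAD:+ $LD_PRELOAD}" "$@"
}
```

Usage: `with_ksm_preload /path/to/daemon`. The assignment applies only to that command's environment, so the calling shell itself is not affected.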
For example, to make Percona Server use KSM, modify its startup script as follows:
[root@container /]# nano /etc/init.d/mysql
...
PATH="/sbin:/usr/sbin:/bin:/usr/bin:$basedir/bin"
export PATH
export LD_PRELOAD=/usr/lib64/libksm_preload.so
mode=$1    # start or stop
...
Then (re)start your Percona Server as usual.
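After the restart, one way to confirm that the library was really loaded is to look for it among the shared objects mapped by the running process in /proc/<pid>/maps. A minimal sketch; has_ksm_preload is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if the given /proc/<pid>/maps file lists libksm_preload.so.
has_ksm_preload() {
    grep -q 'libksm_preload\.so' "$1"
}
```

Usage: `has_ksm_preload /proc/1234/maps && echo "ksm_preload loaded"`, where 1234 stands for the PID of the server process.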
How to check efficiency of KSM
To check if KSM is actually reducing memory usage, issue this command on the HN:
[root@HN /]# cat /sys/kernel/mm/ksm/pages_sharing
If the value is greater than 0, you're saving memory. Refer to https://www.kernel.org/doc/Documentation/vm/ksm.txt for more details.
To see all the KSM parameters, issue the following command on the HN:
grep -H '' /sys/kernel/mm/ksm/*
On this page you'll find a simple script which displays the same information in MB.
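As an illustration of the same idea (this is not the script referred to above), the counters can be converted from pages to MiB by multiplying by the page size. Per the kernel documentation, pages_sharing roughly indicates how much memory is being saved. Both arguments are hypothetical parameters; by default the page size comes from getconf.

```shell
#!/bin/sh
# Sketch: print KSM page counters in MiB.
# Arguments are hypothetical overrides: sysfs directory and page size in bytes.
ksm_report() {
    dir="${1:-/sys/kernel/mm/ksm}"
    page_size="${2:-$(getconf PAGESIZE)}"
    for name in pages_shared pages_sharing pages_unshared pages_volatile; do
        pages=$(cat "$dir/$name") || return 1
        # bytes = pages * page_size; 1 MiB = 1048576 bytes (integer division)
        echo "$name: $((pages * page_size / 1048576)) MiB"
    done
}
```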
Tuning
On a production machine, you'll want to change some of the default values. A saner value for /sys/kernel/mm/ksm/sleep_millisecs is usually between 50 and 250 (YMMV, though):
[root@HN ~]# echo 50 > /sys/kernel/mm/ksm/sleep_millisecs
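To see why sleep_millisecs matters, note that ksmd scans pages_to_scan pages and then sleeps for sleep_millisecs. That gives a rough upper bound of pages_to_scan × 1000 / sleep_millisecs pages per second (time spent scanning is ignored, so real throughput is lower). A small calculator, assuming 4 KiB pages unless told otherwise:

```shell
#!/bin/sh
# Sketch: rough upper bound on ksmd scan speed for given settings.
# Usage: ksm_scan_rate PAGES_TO_SCAN SLEEP_MILLISECS [PAGE_KIB]
# PAGE_KIB defaults to 4, an assumption that holds on typical x86_64 HNs.
ksm_scan_rate() {
    pages_to_scan=$1
    sleep_ms=$2
    page_kib=${3:-4}
    [ "$sleep_ms" -gt 0 ] || return 1     # avoid division by zero
    pps=$((pages_to_scan * 1000 / sleep_ms))
    echo "$pps pages/s, $((pps * page_kib)) KiB/s"
}
```

With the defaults shown earlier, `ksm_scan_rate 100 20` prints `5000 pages/s, 20000 KiB/s`; with sleep_millisecs set to 50, `ksm_scan_rate 100 50` prints `2000 pages/s, 8000 KiB/s`. A longer sleep therefore lowers ksmd's CPU usage at the cost of slower merging.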
Caveats
The ksmd thread takes one or two minutes to start deduplicating memory and several more minutes to reach a steady state. During the boot phase your HN could start swapping if you have heavily overcommitted your RAM. You might want to use more aggressive settings (higher pages_to_scan, lower sleep_millisecs) at first, trading CPU utilization for a lower chance of swapping to disk, and then relax them after 10 minutes or so. Another option is to place your swap on an SSD.
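The boot-time approach described above could be scripted roughly as follows. The numbers (1000 pages / 10 ms while ramping up, then 100 pages / 50 ms) and the 600-second delay are illustrative assumptions, not recommendations; both arguments are hypothetical parameters.

```shell
#!/bin/sh
# Sketch: scan aggressively after boot, then relax to production values.
# Arguments (hypothetical): sysfs directory, seconds to stay aggressive.
# All numeric values are illustrative assumptions; tune them for your HN.
ksm_ramp() {
    dir="${1:-/sys/kernel/mm/ksm}"
    delay="${2:-600}"                   # about 10 minutes
    echo 1000 > "$dir/pages_to_scan"    # more pages per wake-up
    echo 10   > "$dir/sleep_millisecs"  # shorter pause between batches
    echo 1    > "$dir/run"              # make sure ksmd is running
    sleep "$delay"
    echo 100  > "$dir/pages_to_scan"    # back to the default batch size
    echo 50   > "$dir/sleep_millisecs"  # a value in the 50-250 range above
}
```

In /etc/rc.local, run it in the background (`ksm_ramp &`) so the sleep does not hold up the rest of the boot.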
References
External links
</translate>
