From OpenVZ Virtuozzo Containers Wiki



  • UBC optimisations
  • CPT fixes and updates from stable branch.
  • Dynamic VCPU control
  • VE mass stop speedup
  • Loopback statistics
  • Compilation fixes
  • sysfs ptmx virtualization.
  • Mainstream update up to
  • Code cleanups from sparse.

For the complete list of changes in this release, see git changelog for kernel 026test018.



Patch from Alexey Kuznetsov <>:
[CPT] remove annoying printk

In 2.6.9, printk("=") in refrigerator() is commented out, so printk(">\n") in CPT should go as well. The code is commented out rather than removed, with a comment, as a reminder that it has to be restored if the printk in refrigerator() is ever uncommented.


Patch from Alexey Kuznetsov <>:
[CPT] asmlinkage attribute was forgotten

This fixes CPT when the kernel is compiled with CONFIG_REGPARM.


Patch from Pavel Emelianov <>:
[CPT] capabilities check fixes

  • namespace->sem is replaced with namespace_sem;
  • task->used_math is replaced with tsk_used_math().


Patch from Andrey Mirkin <>:
This patch adds checking for unsupported CPT features.


Patch from Alexey Kuznetsov <>:
[CPT] restoring threads with tsk->fs==NULL

If an NPTL thread is ptraced, it does not die immediately, and we can arrive at the state:

  main_thread    -----> thread1 [ptraced]

To restore such a configuration we call kernel_thread(CLONE_SIGNAL) in the context of main_thread. But if main_thread has exited, it has tsk->fs == NULL and the kernel oopses.

The suggested fix is very simple: we just attach a temporary fs_struct from the init task of the VE. We also have to delay initialization of tsk->group_exit, otherwise the kernel will not allow us to clone.

This fix is pragmatic.

A better fix would be to restructure restore so that zombification is delayed until the last stage. That is, we could restore the whole tree of processes as alive, with all the attributes of an alive task (fs, mm etc.). Once that is complete, we could make one more pass and collect garbage, killing zombie tasks and clearing fs, mm etc. It would be cleaner and safer, but it requires too many changes.

Bug #65219.


Removed hunks after optimisation patches


Patch from Pavel Emelianov <>:
[CPT] Don't leave function argument list empty - use 'void'


Patch from Andrey Mirkin <>:

This patch adds renumbering of netdev->ifindex values during restore. We can do this because the network is suspended. All manipulations are protected with rtnl_lock().


Patch from Andrey Mirkin <>:
This patch fixes iptables save/restore on SUSE.

Bug #62837.


Patch from Pavel Emelianov <>:
Export and unstatic __d_path() call for CPT capability checks.


Patch from Andrey Mirkin <>:
In tests we can see the message "mm_struct is referenced outside", after which the checkpoint fails.

This situation seems to be legal, so we return -EAGAIN to allow the checkpoint to be restarted.


Patch from Andrey Mirkin <>:
This patch removes renumbering of ifindexes of venet and loopback devices on restore.


Patch from Andrey Mirkin <>:
The network device list was not protected while checkpointing.

This patch adds necessary protection.


Patch from Alexey Kuznetsov <>:
[CPT] SMP race in detecting state of ptraced processes

When suspending a VE, we test the state of processes while they are still running. This is not a bug: we have to check for invalid state before checkpointing, and the real state is saved after the processes are scheduled out.

The impact is that we can see a process in a bad state, e.g. stopped without any reason. This is also not a bug, but it results in random checkpointing failures. The only way to fix this is to order updates of state variables. The order is correct almost everywhere.


Patch from Andrey Mirkin <>:

Mount point's mnt_flags (noexec,nosuid,nodev) were omitted and not restored correctly.

This patch should be applied together with the patch for bind mounts; otherwise we should do the following:

  1. Remove the check for bind mounts in the do_remount() function
  2. Change the procedure for restoring bind mounts in the following way:


Patch from Alexey Kuznetsov <>:
[CPT] do not keep open cwd while restore

From the viewpoint of CPT, cwd/root are very similar to an open file: each is just a dentry/mnt pair. Normally, when opening a file, we store it and its inode in a special object cache to resolve opening of the same inode when some of its aliases (dentries) are deleted.

But it is useless for directories, which cannot be hardlinked ever. And this consumes numfile UBC, so that restore can fail easily. So, do not store cwd/root file, unless it is deleted. This does not solve problem with restoring VE hitting numfiles, but relieves it a lot.

Now we can temporarily increase numfile limit while cpt/rst by 2 and everything should be OK.


Patch from Alexey Kuznetsov <>:
[CPT] save/restore even SIG_DFL handlers

Linux has a funny feature: when SA_ONESHOT signal resets handler, flags are not set to default. And LTP tests verify this pathology.


Patch from Alexey Kuznetsov <>:
[CPT] restore mm->dumpable correctly

mm->dumpable is not boolean in >=2.6.9, but tri-state. Just save and restore raw value.


Patch from Kirill Korotaev <>:
Fix of compilation of diff-cpt-suspend-cleanup.


Patch from Pavel Emelianov <>:
[CPT] Remove printk("|\n") from refrigerator. (#55914)


Patch from Alexey Kuznetsov <>:
[CPT] tcp sockets were bind()ed incorrectly during restore

This case was totally missed. Fortunately, this happens rarely.

If checkpoint happens after some listening socket was closed, but it left behind some children (including timewait buckets), restore fails to bind them, unless the service used SO_REUSEADDR.

Stress checkpointing of LTP tests did not catch this earlier only because... I repaired the tests not to fail upon exhaustion of port space some time ago. Before that they failed with obvious and harmless diagnosis long before the first binding conflict happened.


Patch from Andrey Mirkin <>:

The feature set was not saved in CPT, so VEs based on the SUSE template could fail after restore (VE_FEATURE_SYSFS was lost). Save the feature set in a place that was not used before (the cpt_os_version and cpt_os_features fields in the image header).


Patch from Andrey Mirkin <>:
[CPT] This patch adds veth support in CPT


Patch from Pavel Emelianov <>:
[CPT] Remove ifdefs around wait_task_inactive()


Patch from Alexey Kuznetsov <>:
[CPT] fix compilation with CONFIG_DEBUG_INFO

Just #undef it.


Patch from Alexey Kuznetsov <>:
[CPT] process priority was restored incorrectly on x86_64

Ugly type casting bug. u32 was implicitly cast to long, and on 64-bit archs negative nice values were rejected as huge positive ones.


Patch from Pavel Emelianov <>:
Show info about the largest kmem caches in OOM killer and SysRq-M handler.


Patch from Pavel Emelianov <>:
Compilation fixes for the show-top-slabs functionality:

  • lock w/o irqsave and flags;
  • correct loop counter;
  • names: objsize -> buffer_size.


Patch from Andrey Mirkin <>:
This patch virtualizes /proc/cpuinfo.

Added a sysctl that controls whether CPU frequency is scaled inside a VE.


Patch from Pavel Emelianov <>:
Add prototype for ve_scale_khz() in vsched.h (comp)


Patch from Kirill Korotaev <>:
This patch adds new fairsched syscalls which allow changing the number of VCPUs inside a VE dynamically, on the fly.


  • per FS-node task list
  • do_fairsched_vcpus: adjust rate
  • __migrate_task doesn't return any error code and can fail
  • empty flag in vcpu_del / synchronize optimization
  • finish diff-cpuinfo
  • /proc file with vcpus field?


Patch from Dmitry Mishin <>:
This patch fixes iowait_time statistics for both VE0 and VEs.

  • removes redundant nr_iowait field in VE_CPU_STATS (bug noticed by Matt Loschert)
  • after schedule, a task may be activated on another processor.

Port on 2.6.16 by Xemul.


Patch from Pavel Emelianov <>:
Compilation fix for nr_iowait_ve() modifications.


Patch from Kir Kolyshkin <>:
[PPC] fixes the typo and the formatting in powerpc's show_regs().


Patch from Kir Kolyshkin <>:
[PPC] adds fairsched syscalls for powerpc


Patch from Pavel Emelianov <>:
Cleanups in fairsched code found by sparse

  • rq->push_vcpu = NULL;
  • __user attribute in sysctl handler argument.


Patch from OpenVZ team <>:
Merged from /linux/kernel/git/stable/linux-2.6.16.y


Patch from Andrey Mirkin <>:

This patch adds support for 3 mount flags to bind mounts. Now we can do bind mounts with the noexec, nosuid and nodev options without needing a remount.


Patch from Patrick McHardy <>:
[NETFILTER] x_tables: fix compat related crash on non-x86

When iptables userspace adds an ipt_standard_target, it calculates the size of the entire entry as:

sizeof(struct ipt_entry) + XT_ALIGN(sizeof(struct ipt_standard_target))

ipt_standard_target looks like this:

struct xt_standard_target
{
      struct xt_entry_target target;
      int verdict;
};

xt_entry_target contains a pointer, so when compiled for 64 bit the structure gets an extra 4 byte of padding at the end. On 32 bit architectures where iptables aligns to 8 byte it will also have 4 byte padding at the end because it is only 36 bytes large.

The compat_ipt_standard_fn in the kernel adjusts the offsets by

 sizeof(struct ipt_standard_target) -
     sizeof(struct compat_ipt_standard_target),

which will always result in 4, even if the structure from userspace was already padded to a multiple of 8. On x86 this works out by accident because userspace only aligns to 4, on all other architectures this is broken and causes incorrect adjustments to the size and following offsets.
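
The size arithmetic above can be checked with a small sketch. It uses only the 36-byte figure from the text; align_up() and userspace_padding() are hypothetical helpers written for this illustration, not kernel or iptables functions.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/* From the text: the 32-bit layout of the standard target is 36 bytes. */
enum { COMPAT_TARGET_SIZE = 36 };

/* Padding that userspace has already added for a given alignment. */
static size_t userspace_padding(size_t user_align)
{
	return align_up(COMPAT_TARGET_SIZE, user_align) - COMPAT_TARGET_SIZE;
}
```

With 4-byte alignment userspace adds no padding, so a fixed adjustment of 4 happens to produce the right size. With 8-byte alignment userspace has already padded 36 up to 40, and adding another 4 overshoots.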

Thanks to Linus for lots of debugging help and testing.

Signed-off-by: Patrick McHardy <>
Signed-off-by: Linus Torvalds <>


Patch from Kir Kolyshkin <>:
[PPC] adds needed TIF_FREEZE define to powerpc


Patch from Pavel Emelianov <>:
Added __user attribute to sysctl handler's args in softirqd disabling code

Found by sparse.


Patch from Andrey Savochkin <>:
This patch fixes currently incorrect comments about locking in dcache.


Patch from Andrey Savochkin <>:
Dcachesize accounting optimization.

The accounting becomes conditional, and dentries start to be accounted only when a given fraction of normal zone is consumed by dcache. On switching accounting on and off, all dentries are walked in stop_machine and charged to ub0 or uncharged.

Port for 2.6.16 by Pavel Emelianov <>


Patch from Andrey Savochkin <>:
Main part of file accounting optimization.

  • files are charged by quants;
  • pre-charged but not used amount is kept in task_beancounter.
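
The idea of charging by quants can be sketched as follows. This is a simplified userspace model under stated assumptions: the structures, the FILE_QUANT value and the function names are hypothetical, there is no locking, and the real task_beancounter code may differ considerably.

```c
#include <assert.h>
#include <stddef.h>

#define FILE_QUANT 8UL /* hypothetical batch size */

struct beancounter {
	unsigned long held;  /* amount currently charged */
	unsigned long limit; /* maximum allowed */
};

struct task_precharge {
	struct beancounter *bc;
	unsigned long file_precharged; /* charged to bc but not yet used */
};

/* Charge one file. Returns 0 on success, -1 if the limit would be exceeded. */
static int charge_file(struct task_precharge *tp)
{
	assert(tp != NULL && tp->bc != NULL);
	if (tp->file_precharged == 0) {
		/* Slow path: charge a whole quant to the shared beancounter.
		 * This sketch simply fails if the quant does not fit. */
		if (tp->bc->held + FILE_QUANT > tp->bc->limit)
			return -1;
		tp->bc->held += FILE_QUANT;
		tp->file_precharged = FILE_QUANT;
	}
	/* Fast path: consume from the per-task reserve only. */
	tp->file_precharged--;
	return 0;
}

/* Return the unused reserve to the beancounter, e.g. when the task exits. */
static void flush_precharge(struct task_precharge *tp)
{
	assert(tp != NULL && tp->bc != NULL);
	tp->bc->held -= tp->file_precharged;
	tp->file_precharged = 0;
}
```

Only one charge in every FILE_QUANT touches the shared counter; the others are served from the per-task reserve.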


Patch from Andrey Savochkin <>:
Additional optimizations of file and kmemsize accounting, fixes.

  • files are now charged to kmemsize explicitly, not through SLAB_UBC;
  • certain amount of numfile and their kmemsize is precharged at fork;
  • poll tables of small size are not charged at all;
  • get_beancounter_batch and put_beancounter_batch are introduced to adjust refcounts at precharge/uncharge time, in batches, instead of at each allocation/deallocation.


Patch from Pavel Emelianov <>:
Take kmem memory usage for file_cachep directly from cachep.


Patch from Andrey Savochkin <>:

More file/kmemsize accounting fixes related to charges/uncharges to wrong beancounters, as seen when testing optimisation.


Patch from Pavel Emelianov <>:
[UBC] Use gfp_t type where appropriate in ub_mem.c

Found by sparse.


Patch from Andrey Savochkin <>:
Start of kmemsize accounting optimization.

  • kmemsize is accounted by quants;
  • pre-charged amounts are kept in task_beancounter for faster and lockless charge/uncharge operations.


Patch from Andrey Savochkin <>:
File and kmemsize accounting optimization fixes and improvements.

  • missing uncharge added;
  • a lot of likely/unlikely added;
  • files are really charged into kmemsize;
  • the problem of atomicity of per-task field is resolved by shifting irq_disable/enable around kmemsize charge calls.


Patch from Andrey Savochkin <>:
Another small but important optimization of kmemsize charges.

The maintenance of SLAB_UBC infrastructure is costly, so kmalloc caches were duplicated, one for !SLAB_UBC allocations and one for SLAB_UBC ones. Deallocations in the former avoid the extra work of checking whether the object was charged.


Patch from Pavel Emelianov <>:
Typo in mm/slab.c after kmem optimisation patch port.


Patch from Kir Kolyshkin <>:
[PPC] fixes the following compilation issue on ppc platform

In file included from include/asm/tlb.h:20,
                 from arch/powerpc/platforms/pseries/lpar.c:37:
include/asm/pgalloc.h:97: error: conflicting types for '__pte_alloc'
include/linux/mm.h:819: error: previous declaration of '__pte_alloc' was
make[2]: *** [arch/powerpc/platforms/pseries/lpar.o] Error 1


Patch from Pavel Emelianov <>:
UBC socket buffers accounting locking fix.

All sock beancounters are stored in the list, starting at top beancounter, and thus top's lock must be used to protect the list.


Patch from Pavel Emelianov <>:
[UBC] Cleanups in networking accounting

  • remove unused gfp var from sock_alloc_send_skb2
  • gfp_t type in ub_skb_alloc_bc()


Patch from Pavel Emelianov <>:

Network buffers (un)charging logic is

  1. work with top beancounter
  2. update all the rest with (un)charge_beancounter_notop

In ub_sock_tcp_chargesend() it was broken (#65495)
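
The two-step logic can be modelled roughly as below. This is a sketch under assumptions: the structure and function names are hypothetical, limits are checked only on the top beancounter, and the real (un)charge_beancounter_notop helpers may behave differently.

```c
#include <assert.h>
#include <stddef.h>

struct beancounter {
	struct beancounter *parent; /* NULL for the top beancounter */
	unsigned long held;
	unsigned long limit;
};

/* Walk up the hierarchy to the top beancounter. */
static struct beancounter *top_of(struct beancounter *bc)
{
	assert(bc != NULL);
	while (bc->parent != NULL)
		bc = bc->parent;
	return bc;
}

/* Step 2: update every beancounter from bc upwards, excluding the top one. */
static void charge_notop(struct beancounter *bc, unsigned long val)
{
	for (; bc->parent != NULL; bc = bc->parent)
		bc->held += val;
}

/* Step 1 followed by step 2. Returns 0 on success, -1 if top is over limit. */
static int charge_sock(struct beancounter *bc, unsigned long val)
{
	struct beancounter *top = top_of(bc);

	if (top->held + val > top->limit)
		return -1;              /* nothing has been changed yet */
	top->held += val;           /* 1. work with the top beancounter */
	charge_notop(bc, val);      /* 2. update all the rest */
	return 0;
}
```

Keeping the limit check in the first step means that a failed charge leaves every counter in the hierarchy untouched.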


Patch from Pavel Emelianov <>:

Return the sk_stream_wait_memory() prototype to its original state so that the InfiniBand driver (and any other caller) compiles. Places that use the new version call __sk_stream_wait_memory().


Patch from Andrey Savochkin <>:

Make (un)charge_xxx_notop functions inline to avoid call and IRQ disabling for top beancounters. Spotted in profiles by Den.


Patch from Pavel Emelianov <>:
Memset file to 0 before charging it to prevent f_ub erasing.


Patch from Pavel Emelianov <>:
Fix of nrfiles accounting.

Since file_cachep is not SLAB_UBC after Andrey's optimisations, slab_ub(file) will BUG_ON inside slab_ub_ref. Use file->f_ub instead.


Patch from Pavel Emelianov <>:
[UBC] Fix UB_NUMFILE accounting optimisation leak

In 2.6.16 files are put via RCU, so ub_file_uncharge() is called in IRQ context. Thus non-atomic decrement of file_precharged must be done with IRQs disabled.


Patch from Kir Kolyshkin <>:
[PPC] fix ubc syscalls declaration for powerpc


Patch from Andrey Savochkin <>:
This patch prints more sensible warning on bad refcounter in __put_beancounter.


Patch from Andrey Savochkin <>:
Various changes in socket buffer accounting.

  • likely/unlikely added;
  • internal code organization improved;
  • skb->sk never follows for netlink sockets (it's almost always wrong);
  • ub_wcharged and optimizations should never be used for netlink sockets.


Patch from Andrey Savochkin <>:

This patch removes the skb accounting speed-up for UNIX sockets. It doesn't work (kfree_skb is called in a different socket's context). Along with this, charge severity is fixed in tcp_chargepage (#63650)


Patch from Kir Kolyshkin <>:
[PPC] asm-powerpc/unistd.h mistype fix


Patch from Andrey Savochkin <>:

This patch fixes an apparent accounting bug in ub_sock_tcp_chargepage. Should help with the problems at DefenderHosting.


Patch from Andrey Savochkin <>:
Optimization of tcprcvbuf accounting.

Keep pre-charged amount in per-socket forw_space.


Patch from Andrey Savochkin <>:
Tcpsndbuf optimization.

  • Keep more in per-socket poll_reserve, do not hurry to return to beancounter if limits are high enough;
  • Certain unification and streamlining of charge/uncharge functions;

minor: severity renamed to ub_severity, to keep this name in proper namespace.


Patch from Denis Lunev <>:
Per-UB limitation to the number of TCP timewait buckets.

This is done to prevent them from eating up all of the VE's kernel memory. Unfortunately, a virtualized sysctl can't help, as TW buckets live on after the actual VE death, so a counter on the UB is used.

So, the number of TW buckets is limited by

  • number of buckets allowed for a UB
  • the fraction of kernel memory limit (in 1024th)

whichever is reached first (#61789)
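
A check combining the two limits could look like the following sketch. The structure, its field names and the function are hypothetical, and the way memory use is estimated here (bucket count times object size) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tw_limits {
	unsigned long max_buckets;   /* number of buckets allowed for a UB */
	unsigned long kmem_limit;    /* UB kernel memory limit, in bytes */
	unsigned long kmem_fraction; /* allowed share of kmem_limit, in 1/1024 units */
};

/* May one more timewait bucket be created for this UB? */
static bool tw_bucket_allowed(const struct tw_limits *lim,
			      unsigned long nr_buckets,
			      unsigned long bucket_size)
{
	unsigned long mem_cap;

	assert(lim != NULL);

	/* Limit 1: plain bucket count. */
	if (nr_buckets >= lim->max_buckets)
		return false;

	/* Limit 2: fraction of the kernel memory limit.
	 * Divide first to avoid overflow; this rounds the cap down. */
	mem_cap = lim->kmem_limit / 1024 * lim->kmem_fraction;
	return (nr_buckets + 1) * bucket_size <= mem_cap;
}
```

Whichever limit is hit first causes the request to be refused, so a UB with a small kernel memory limit is capped well below its nominal bucket count.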

Ported on 2.6.16 by Xemul.


Patch from Pavel Emelianov <>:
Fix ub_timewait_check() to get kmem_cache objuse directly and from correct slab.


Patch from Pavel Emelianov <>:
[UBC] Added __user attribute to UBC syscalls arguments

Found by sparse.


Patch from Kir Kolyshkin <>:
[PPC] adding MAP_EXECPRIO define for powerpc


Patch from Denis Lunev <>:
Deadlock on beancounter lock.

sk_stream_write_space() sends signal to a task, so it can take a beancounter lock. ub_tcp_snd_wakeup()/ub_sock_snd_wakeup() was called with a lock held.


Patch from Pavel Emelianov <>:
Use gfp_t type in __alloc_collect_stats()

Found by sparse.


Patch from Alexey Kuznetsov <>:
[PATCH] memory leakage in fib_hash

FIB hash tables and zone structs were never freed. Each time, when VE is stopped, they leak.

vzctl chkpnt/restore tests bring down a system with 4G of RAM quite soon. Of course, vzctl start/stop is not fast enough to bring down a system with a decent amount of RAM, but hundreds of thousands of slab entries are still clearly visible.

The patch solves leakage in size-128 and most of leakage in size-64.

We still leak two objects in size-64 and 6 entries in size-32.


Patch from Pavel Emelianov <>:

Try to clean up each VE in a separate thread. This allows many VEs to be stopped simultaneously (#60673)


Patch from Andrey Savochkin <>:
Better print message on promiscuous mode change by ve_printk (possible DoS?)


Patch from Dmitry Mishin <>:
This patch allows VE owner to use net.ipv4.conf.<net_device>.xxx sysctls.

Bug #66842.


Patch from Pavel Emelianov <>:
Fix memory leak in case of CONFIG_VE_NETDEV=n

Do not create fib rules if we're not going to use them.


Patch from Pavel Emelianov <>:
void argument in declarations of fib_rules_create()/destroy()


Patch from Dmitry Mishin <>:
Virtualized loopback_stats

Bug #66571.


Patch from Dmitry Mishin <>:
MTU manipulations on VE's devices

  • removed mtu restore logic for moved devices
  • added possibility to set mtu > 1500 for veth devices (#66836)


Patch from Andrey Savochkin <>:

Fix of broken virtualization of /proc/net/rt_cache.

Bug #65528.


Patch from Andrey Savochkin <>:
Fix for accidentally broken /proc/sys/net/ipv4/route/flush.

Fixes permissions as well.


Patch from Pavel Emelianov <>:
[VETH] Add __user attribute to the 2nd copy_from_user()'s argument

Found by sparse.


Patch from Vasily Tarasov <>:

Since the size can change in ipt_flush_table(), xt_free_table_info() will then fail to free the memory.

OpenVZ Bug #191.
Bug #65721.

Port on 2.6.16 by Xemul.


Patch from Dmitry Mishin <>:
[PPC] enabled usage of ip_tables compat layer on ppc64


Patch from Denis Lunev <>:

This patch virtualizes ip_local_port_range sysctl to allow specification of different port range for auto-binding inside VE.


Patch from Vasily Tarasov <>:
Add /sys/class/tty/ptmx device

It's necessary, 'cause otherwise udev doesn't create /dev/ptmx

OpenVZ Bug #243.

Ported patch from Umka by Vasily.


Patch from Pavel Emelianov <>:
Add prototypes for init/fini_ve_tty_class() calls.


Patch from Pavel Emelianov <>:
Cleanups in vecalls.c and vzcalluser.h

  • C99 syntax in structures init;
  • __user attribute where appropriate;
  • pass NULL as pointer arg, not 0.

Also define an empty __user macro for userspace in vzcalluser.h

Found by sparse.
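
For readers unfamiliar with these cleanups, the sketch below shows the general style involved: C99 designated initializers and NULL for pointer arguments. The structure and functions are hypothetical and are not taken from vecalls.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table used only for this illustration. */
struct demo_ops {
	const char *name;
	int (*start)(void *data);
	int (*stop)(void *data);
};

static int demo_start(void *data)
{
	return data == NULL ? 0 : 1;
}

/* C99 designated initializers: fields are named with a leading dot.
 * Fields that are not mentioned (.stop here) are zero-initialized. */
static const struct demo_ops demo = {
	.name  = "demo",
	.start = demo_start,
};

/* Pass NULL, not 0, when the argument is a pointer. */
static int demo_run(void)
{
	assert(demo.start != NULL);
	return demo.start(NULL);
}
```

The older GNU-style "field: value" initializer syntax and a bare 0 passed as a pointer are both things sparse warns about.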


Patch from Pavel Emelianov <>:

Small venet cleanups

  • C99 syntax in structures declarations
  • __user attribute

Found by sparse.


Patch from Pavel Emelianov <>:
Cleanups in veowner.c

  • use C99 syntax in struct fields initialization;
  • use NULL instead of 0 for pointer arg.

Found by sparse.


Patch from Denis Lunev <>:

This patch fixes a small potential information leak. The check should guard not against a core dump of a VE0 process inside a VE, but against a core dump of a VE0 process inside a VE filesystem. So, let's prevent core dumps of such processes altogether.


Patch from Vasily Averin <>:
/proc/interrupt file should be closed if kernel_thread() fails

Bug #68096.


Patch from Pavel Emelianov <>:
vfs_read() wants the 2nd argument to have __user attribute

Found by sparse.


Patch from Vasily Tarasov <>:
[VZDQ] Oops because vzquota format operations are not implemented.

If the usual quota is launched, it uses the usual vfs_quota_on, which uses format operations == NULL, and this causes an oops.

OpenVZ Bug #184.


Patch from Pavel Emelianov <>:
[VZDQ] Compilation fix for CONFIG_VZ_QUOTA_UGID=n case

  • ifdefs in a couple of places;
  • moved some code out of compiled-out file;
  • 'ifdef' instead of 'if defined'.

OpenVZ Bug #222.


Patch from Vasily Tarasov <>:
Turns off quota in spite of errors while syncing inodes.

Bug #65186.


Patch from Pavel Emelianov <>:
[VZDQ] Cleanups in vzquota code

  • C99 syntax in structures initialization
  • __user attribute where appropriate

Found by sparse.


Patch from Kir Kolyshkin <>:
Fixes vzwdog compilation in case CONFIG_FLAT_NODE_MEM_MAP is not set.


Patch from Andrey Savochkin <>:
Restore showing IRQ information in vzwdog.