From OpenVZ Virtuozzo Containers Wiki
< Download‎ | kernel‎ | rhel5‎ | 028stab060.2
Revision as of 13:41, 1 April 2009 by Kir (talk | contribs) (close the noinclude tag)



Since 028stab059.6:

  • Rebased on 2.6.18-92.1.18 RHEL5 update (RHSA-2008-0957)
  • Backported some patches from RHEL5 update 92.1.22 (RHSA-2008-1017)
  • Fixed utimensat system call (OpenVZ Bug #970)
  • Fixed CAP_AUDIT capability in CT (for dbus)
  • Added UB_SWAPINFO resource (for Oracle in CTs, needs vzctl >= 3.0.24)
  • NFS deadlocks fixed
  • Many small fixes in CPT code


Same as in 028stab059.6, plus:



No new issues.


Ported from RHEL5 2.6.18-92.1.22.el5 kernel

  • linux-2.6-nfs-v4-credential-ref-leak-in-nfs4_get_state_owner.patch
  • linux-2.6-net-ipv4-fix-byte-value-boundary-check.patch
  • linux-2.6-fs-don-t-allow-splice-to-files-opened-with-o_append.patch
  • linux-2.6-drm-i915-driver-arbitrary-ioremap.patch


Patch from Vitaliy Gusev <>
[PATCH] CPT: Fix ip_conntrack_ftp usage counter leak

Function ip_conntrack_helper_find_get() takes a reference on the helper module. So put the conntrack after inserting it into the hash and handling the conntrack's expect list.


Patch from Vitaliy Gusev <>

Don't allow checkpointing a VE with mounted ext2/ext3 and similar filesystems.

Allow checkpointing only for mounted nodev and "external" filesystems.

This check protects from error on restore:

  CPT ERR: ffff810007113000,102 :-2 mounting /root/some_dir ext3 40000000

as do_one_mount() doesn't pass mntdev to mount().

[xemul: actually, the reason we don't support filesystems other than virtual and tmpfs is because we simply can't (easily) get the mount options for them to cpt and restore ]

Bug #131737.


Patch from Vasily Averin <>
cpt: incorrect printk modifier in iter_one_mm

The printk inside iter_one_mm() used "%lx" for pgprot_val(), but that value is "long long" on i386 PAE kernels. CPT_FID contains a %s, so mismatched argument lengths can cause an oops when the string pointer is dereferenced.

Bug #128474.
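
The width mismatch described above can be sketched in user space. This is only an illustration under stated assumptions: demo_pgprot_t and demo_format_prot() are hypothetical names, and snprintf() stands in for printk(). The point is that a 64-bit value is passed with an explicit cast and printed with "%llx", so the following "%s" reads the pointer that was actually passed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a page-protection value that is 64 bits
 * wide, as pgprot_val() is on i386 PAE kernels. */
typedef unsigned long long demo_pgprot_t;

/*
 * Format "<prot-in-hex> <name>" into buf and return what snprintf()
 * returns. The 64-bit value is cast explicitly and printed with
 * "%llx", so the argument width matches the format on both 32-bit
 * and 64-bit builds and "%s" consumes the right argument.
 */
int demo_format_prot(char *buf, size_t len,
                     demo_pgprot_t prot, const char *name)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, len, "%llx %s",
                    (unsigned long long)prot, name);
}
```

With "%lx" on a 32-bit build, only half of the 64-bit argument would be consumed and "%s" would pick up the other half as a pointer, which is the failure mode the patch addresses.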


Patch from Pavel Emelianov <>
cpt: compilation fix for sit restoring in !IPv6 case

OpenVZ Bug #1060.


Patch from Vitaliy Gusev <>
cpt: Fix leak during checkpointing overmounted /dev/null

Bug #130958.


Patch from Vitaliy Gusev <>
[PATCH] CPT: put 'expect' after insert to the 'conntrack'

When restoring a conntrack, we need to put the expect after allocating the ip_conntrack_expect and inserting it into the conntrack. The expect is then freed either immediately (if nobody else holds it) or later in the cleanup/timer hooks. Otherwise the expect is never freed.

Note: Approaches for kernels 2.6.18 and 2.6.9 are different. For example see help() in "net/ipv4/netfilter/ip_conntrack_netbios_ns.c"
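
The reference counting rule behind this fix can be modelled in a few lines. This is a simplified user-space sketch, not the kernel code: all demo_ names are hypothetical, and a single pointer stands in for the conntrack's expect list. The allocation holds one reference, the insert takes a second one, and the restore path must drop its own reference so that cleanup can bring the count to zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified model of a reference-counted "expect". */
struct demo_expect {
    int refcnt;       /* number of holders */
    int *freed_flag;  /* set to 1 when the object is released */
};

struct demo_conntrack {
    struct demo_expect *expect;  /* reference held by the conntrack */
};

struct demo_expect *demo_expect_alloc(int *freed_flag)
{
    struct demo_expect *e = malloc(sizeof(*e));

    if (!e)
        return NULL;
    e->refcnt = 1;               /* reference held by the allocating code */
    e->freed_flag = freed_flag;
    *freed_flag = 0;
    return e;
}

void demo_expect_put(struct demo_expect *e)
{
    assert(e->refcnt > 0);
    if (--e->refcnt == 0) {
        *e->freed_flag = 1;
        free(e);
    }
}

void demo_expect_insert(struct demo_conntrack *ct, struct demo_expect *e)
{
    e->refcnt++;                 /* the conntrack takes its own reference */
    ct->expect = e;
}

void demo_conntrack_cleanup(struct demo_conntrack *ct)
{
    if (ct->expect) {
        demo_expect_put(ct->expect);
        ct->expect = NULL;
    }
}

/* Restore path: allocate, insert, then drop the allocation reference. */
int demo_restore_expect(struct demo_conntrack *ct, int *freed_flag)
{
    struct demo_expect *e = demo_expect_alloc(freed_flag);

    if (!e)
        return -1;
    demo_expect_insert(ct, e);
    demo_expect_put(e);          /* without this put the count never reaches 0 */
    return 0;
}
```

After demo_restore_expect() the only remaining reference belongs to the conntrack, so demo_conntrack_cleanup() releases the object.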


Patch from Vitaliy Gusev <>
Restore information about tcp listening sockets (cpt_state == TCP_LISTEN)

Not all options are important. Only a missing ipv6only can cause an error, if another application wants to listen on the same port on the IPv4 "any" address.

tp->XXX are inherited by children (noticed by Alexey Kuznetsov), so we also need to restore these options.

Comment from Alexey:
It [everything before] was not OK. The features which are broken are important, but not actually critical except for ipv6only.

F.e. DEFER_ACCEPT is broken -> but nobody will notice, it just will not be deferred.


Patch from Pavel Emelianov <>
cpt: dump udp stats and udp6, not just udp6 twice

This is actually harmless, since both stats have equal size, although a somewhat incorrect result is produced on restore.

Found when compiling kernel with no IPv6 support.

OpenVZ Bug #1060.


Patch from Konstantin Khlebnikov <>
cpt: restore only bc resources really presented in cpt image.

store UB_RESOURCES in cpt_beancounter_image while checkpointing. (leave all new added resources with default limits filled at bc alloc)

change cpt_content of cpt_beancounter_image to CPT_CONTENT_ARRAY to detect structure version without bumping cpt image version, because in old images the __cpt_pad field (reused for cpt_ub_resources) is uninitialized.

add missed error handling inside rst_undump_ubc -- toss errors from restore_one_bc to higher level.

Bug #115800.
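
A minimal sketch of the restore rule described in this patch, assuming a simplified layout: the image records how many resources it contains, only that many are restored, newer resources keep the defaults set at allocation, and errors are passed up to the caller. The demo_ structures, DEMO_UB_RESOURCES and DEMO_UB_MAXVALUE are hypothetical and do not match the real cpt_beancounter_image format.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_UB_RESOURCES 4                          /* resources known to the running kernel (assumed) */
#define DEMO_UB_MAXVALUE  ((unsigned long)LONG_MAX)  /* assumed "unlimited" default */

struct demo_bc {
    unsigned long limit[DEMO_UB_RESOURCES];
};

struct demo_bc_image {
    unsigned int nr_resources;   /* count stored at checkpoint time */
    const unsigned long *limit;  /* nr_resources entries */
};

/* Allocation fills every resource with the default limit. */
void demo_bc_alloc(struct demo_bc *bc)
{
    unsigned int i;

    for (i = 0; i < DEMO_UB_RESOURCES; i++)
        bc->limit[i] = DEMO_UB_MAXVALUE;
}

/* Restore only the resources that are really present in the image. */
int demo_restore_one_bc(struct demo_bc *bc, const struct demo_bc_image *img)
{
    unsigned int i, n;

    assert(bc != NULL && img != NULL);
    if (img->nr_resources && !img->limit)
        return -1;               /* malformed image */
    n = img->nr_resources < DEMO_UB_RESOURCES ?
        img->nr_resources : DEMO_UB_RESOURCES;
    for (i = 0; i < n; i++)
        bc->limit[i] = img->limit[i];
    return 0;
}

/* Pass errors from demo_restore_one_bc() up instead of ignoring them. */
int demo_undump_ubc(struct demo_bc *bcs, const struct demo_bc_image *imgs,
                    size_t count)
{
    size_t i;
    int err;

    for (i = 0; i < count; i++) {
        demo_bc_alloc(&bcs[i]);
        err = demo_restore_one_bc(&bcs[i], &imgs[i]);
        if (err)
            return err;
    }
    return 0;
}
```

An image written by an older kernel with three resources leaves the fourth limit at DEMO_UB_MAXVALUE.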


Patch from Pavel Emelianov <>
CPT: Fix VDSO page handling wrt new VDSO setup in RHEL5

The main difference is that now we have an array of whole *one* page, rather than just a virtual address. The other thing is that vma->vm_ops now points to vma_special_ops.


Patch from Pavel Emelianov <>
fairsched: Sanitize fairsched manipulations on ve startup

First of all, we won't be able to call them after we fix the capability checks. Second, taking the fairsched mutex 4 times on startup is overkill.


Patch from Konstantin Ozerkov <>
quota: Properly split compat (i.e. v1) declarations from all the others

In short, this patch moves the CONFIG_QUOTA_COMPAT stuff from <linux/quota.h> into a separate include file. This is needed to fix a compilation error when the CONFIG_SECCOMP option is enabled (declaration cross-reference).

OpenVZ Bug #972.


Patch from Denis Lunev <>
br: do not always transmit packets to real Ethernet via bridge

A bridge in via_phys_dev mode always transmits packets via master_dev, even when this is not actually required, because master_dev->dev_hard_xmit is called unconditionally.

This patch does a simple thing: when a packet is about to be sent via master_dev for the first time, master_dev is replaced with bridge->dev. IMHO this approach should have been used from the very beginning.

Additionally, locking on the TX path is fixed. Previously we could jump into bridge->hard_start_xmit with the TX lock of the actual device held.

Bug #129292.


Patch from Konstantin Khlebnikov <>
ms: backport utimensat systemcall and machinery

Step1: steal piece of code from mainstream (last commit 2d8f3038)

Bug #121508. OpenVZ Bug #970.


Patch from Konstantin Khlebnikov <>
ms: backport utimensat systemcall and machinery (p3)

Step3: inject sys_utimensat into syscall tables.

Bug #121508. OpenVZ Bug #970.
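
Once the syscall is wired up, user space reaches it through the standard POSIX utimensat() wrapper. A minimal usage sketch follows; demo_set_mtime() is a hypothetical helper name, and the example assumes a C library that exposes utimensat() and UTIME_OMIT.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>   /* utimensat(), UTIME_OMIT */
#include <time.h>

/*
 * Set only the modification time of path, with nanosecond precision.
 * The access time is left untouched through UTIME_OMIT.
 * Returns 0 on success, -1 with errno set on failure.
 */
int demo_set_mtime(const char *path, time_t sec, long nsec)
{
    struct timespec times[2];

    assert(path != NULL);
    times[0].tv_sec = 0;
    times[0].tv_nsec = UTIME_OMIT;  /* [0] is atime: do not change it */
    times[1].tv_sec = sec;          /* [1] is mtime */
    times[1].tv_nsec = nsec;
    return utimensat(AT_FDCWD, path, times, 0);
}
```

On a kernel without the syscall the call is expected to fail with ENOSYS, which is how the missing backport showed up to applications.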


Patch from Konstantin Khlebnikov <>
ms: backport utimensat systemcall and machinery (p2)

Step2: fixes wrt 2.6.18 kernel:

  • replace struct path usage with struct dentry and struct nameidata.
  • rename new do_utimes to __do_utimes and make it static.
  • rewrite permission checks to use existing calls.

Bug #121508. OpenVZ Bug #970.


Patch from Konstantin Khlebnikov <>
CPU hotplug: fix cpu_is_offline() on !CONFIG_HOTPLUG_CPU

Cherrypicked from mainstream commit a263898f (from Ingo Molnar <>).

Bug #126915.


Patch from Konstantin Khlebnikov <>
[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU

Backported patch from Avi Kivity <> (git:47e627bc)

The following patchset allows a host with running virtual machines to be suspended and, on at least a subset of the machines tested, resumed. Note that this is orthogonal to suspending and resuming an individual guest to a file.

A side effect of implementing suspend/resume is that cpu hotplug is now supported. This should please the owners of big iron.

This patch:

KVM wants the cpu hotplug notifications, both for cpu hotplug itself, but more commonly for host suspend/resume.

In order to avoid extensive #ifdefs, provide stubs when CONFIG_HOTPLUG_CPU is not defined.

In all, we have four cases:

  • UP: register and unregister stubbed out
  • SMP+hotplug: full register and unregister
  • SMP, no hotplug, core: register as __init, unregister stubbed (cpus are brought up during core initialization)
  • SMP, no hotplug, module: register and unregister stubbed out (cpus cannot be brought up during module lifetime)
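
The stub approach from the list above can be sketched as follows. This is a user-space model, not the kernel header: DEMO_HOTPLUG_CPU and every demo_ name are hypothetical. It shows how a caller can register and unregister without any #ifdef of its own, because the disabled configuration supplies inline stubs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical notifier type; not the kernel's struct notifier_block. */
struct demo_notifier {
    int (*call)(unsigned long action, void *cpu);
};

#ifdef DEMO_HOTPLUG_CPU
/* Full implementation: keep registered notifiers in a small table. */
#define DEMO_MAX_NOTIFIERS 8
static struct demo_notifier *demo_chain[DEMO_MAX_NOTIFIERS];
static int demo_chain_len;

int demo_register_cpu_notifier(struct demo_notifier *nb)
{
    if (demo_chain_len == DEMO_MAX_NOTIFIERS)
        return -1;
    demo_chain[demo_chain_len++] = nb;
    return 0;
}

void demo_unregister_cpu_notifier(struct demo_notifier *nb)
{
    int i;

    for (i = 0; i < demo_chain_len; i++) {
        if (demo_chain[i] == nb) {
            demo_chain[i] = demo_chain[--demo_chain_len];
            return;
        }
    }
}
#else
/* Stubs: CPUs cannot come or go, so there is nothing to notify about. */
static inline int demo_register_cpu_notifier(struct demo_notifier *nb)
{
    (void)nb;
    return 0;
}

static inline void demo_unregister_cpu_notifier(struct demo_notifier *nb)
{
    (void)nb;
}
#endif

/* A "module" uses the same calls in either configuration. */
int demo_module_init(struct demo_notifier *nb)
{
    assert(nb != NULL);
    return demo_register_cpu_notifier(nb);
}

void demo_module_exit(struct demo_notifier *nb)
{
    demo_unregister_cpu_notifier(nb);
}
```

Building with or without -DDEMO_HOTPLUG_CPU changes only what the two calls do, not the code that calls them.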

Signed-off-by: Avi Kivity <>
Cc: Ingo Molnar <>
Cc: Rusty Russell <>
Cc: Oleg Nesterov <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

OpenVZ Bug #1027.


Patch from Marat Stanichenko <>

Patch from Marat (mstanichenko@), acked-by Den (den@)
Another attempt.

The previous patch (diff-ms-rtnlcompat-20080711) doesn't fix the problem because at the end of rtnetlink_rcv_msg() "type" is not equal to RTM_NEWLINK. It is changed at the beginning of the function (see "type -= RTM_BASE"). So, we must take it into account.

Bug #115250.
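
The off-by-base comparison can be shown in isolation. In this sketch the DEMO_ constants mirror the values of RTM_BASE and RTM_NEWLINK from <linux/rtnetlink.h> (both 16), while the two helper functions are hypothetical and only model the comparison, not the real rtnetlink_rcv_msg().

```c
#include <assert.h>

#define DEMO_RTM_BASE    16
#define DEMO_RTM_NEWLINK 16
#define DEMO_RTM_DELLINK 17

/* Broken variant: compares the already shifted type with the raw constant. */
int demo_is_newlink_wrong(int type)
{
    assert(type >= DEMO_RTM_BASE);
    type -= DEMO_RTM_BASE;
    return type == DEMO_RTM_NEWLINK;
}

/* Fixed variant: takes the earlier "type -= RTM_BASE" into account. */
int demo_is_newlink_fixed(int type)
{
    assert(type >= DEMO_RTM_BASE);
    type -= DEMO_RTM_BASE;
    return type == DEMO_RTM_NEWLINK - DEMO_RTM_BASE;
}
```

For a message of type DEMO_RTM_NEWLINK the broken variant returns 0 and the fixed variant returns 1.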

Moved from 028stab059.stable specs to list.


Patch from Pavel Emelianov <>
utimes: compilation fix for x86_64 COMPAT=y case :\


Patch from Denis Lunev <>
nfs: warning into dmesg on vzquota/NFS server conflict

OpenVZ Bug #1086.


Patch from Marat Stanichenko <>
We should avoid writing to the EOI register during NMI, because the Intel specification says the opposite of what the code currently does.

Bug #132139.


Patch from Pavel Emelianov <>
x86_64: Compat system calls for UBC and fairsched

Required by PSBM

Bug #131966.


Patch from Konstantin Ozerkov <>
ubc: Fix compilation when CONFIG_UBC_DEBUG_KMEM enabled

This patch fixes broken kernel compilation when CONFIG_UBC_DEBUG_KMEM is enabled.

OpenVZ Bug #1048.


Patch from Konstantin Khlebnikov <>
ubc: Upgrade UB_SWAPPAGES to full-blooded resource.

The limit value will be used as configured CT swap size to show in /proc/swaps and /proc/meminfo. Default is UB_MAXVALUE

Bug #115800.


Patch from Pavel Emelianov <>

We neither have nor want (yet) it virtualized.


Patch from Pavel Emelianov <>
ve: Keep the CAP_SETVEID in container


That's OK - CAP_SETVEID checks are already removed.


Patch from Konstantin Khlebnikov <>
mounts: show /dev/xxx devices near ve root mounts, rather than just xxx

Required for fixing autofs in rhel5 container:


Patch from Konstantin Khlebnikov <>
ve: Fill swap size/usage with data from UB_SWAPPAGES in meminfo notifier.

Don't show swap if the limit is unlimited (default state).

Bug #115800.
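
A sketch of the rule described above, under stated assumptions: the swap size shown inside the CT comes from the UB_SWAPPAGES limit, and nothing is shown while the limit is still the unlimited default. The demo_ names, the 4 kB page size and the DEMO_UB_MAXVALUE definition are assumptions for illustration; clamping usage to the limit is also a choice made in this sketch, not something the patch states.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_UB_MAXVALUE ((unsigned long)LONG_MAX)  /* assumed "unlimited" marker */
#define DEMO_PAGE_KB     4UL                        /* assumed page size in kB */

struct demo_swapinfo {
    int visible;             /* 0: report no swap at all */
    unsigned long total_kb;
    unsigned long used_kb;
    unsigned long free_kb;
};

/* Fill swap figures from a swappages limit and current usage (in pages). */
void demo_fill_swap(struct demo_swapinfo *si,
                    unsigned long limit_pages, unsigned long held_pages)
{
    assert(si != NULL);
    if (limit_pages == DEMO_UB_MAXVALUE) {
        /* Default state: do not show swap. */
        si->visible = 0;
        si->total_kb = si->used_kb = si->free_kb = 0;
        return;
    }
    if (held_pages > limit_pages)
        held_pages = limit_pages;    /* sketch-only clamp */
    si->visible = 1;
    si->total_kb = limit_pages * DEMO_PAGE_KB;
    si->used_kb = held_pages * DEMO_PAGE_KB;
    si->free_kb = si->total_kb - si->used_kb;
}
```

A limit of 262144 pages gives a total of 1048576 kB under the 4 kB page assumption.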


Patch from Denis Lunev <>
ip: check for owner_env on bind bucket is extra

The reason: the bind bucket carries owner_env itself, and this check has just been performed above in inet_csk_get_port. Moreover, this check is bogus as sk2 can be a timewait bucket.

This check has been already removed in netns code by Pavel.

Bug #127484.


Patch from Pavel Emelianov <>
ve: Don't check for CAP_SETVEID - use more ... imagination

  • This patch:

The proposed check correctly detects the root in ve0. However, we lose the ability to create containers with some fancy tool that has only the CAP_SETVEID capability, but we don't have such a tool.

The cap itself is declared obsolete, but there's no need to rewrite vzctl in a rush - things will still work. If we want to manipulate audit caps from vzctl, we'll do it via features.

  • Overall history:

Don't ban CAP_AUDIT_XXX capabilities in container to make the dbus-daemon work.

After two (maybe three) days of brainstorming, Den and I finally gave birth to this solution. So...

First of all AUDIT will be banned in container. Since dbus refused not to set audit caps we don't want it to mess with it in any case.

Next step is to note, that CAP_AUDIT_CONTROL coincides with the CAP_VE_ADMIN, which is not that bad (besides, dbus doesn't try to set this one up) and we leave one alone.

And finally - the CAP_AUDIT_WRITE, which coincides with the most delicate one - CAP_SETVEID. The latter one is explicitly dropped on container start and there's no way to set one (dbus tries this and fails) back. Simple "don't clear it" solution is too dangerous.

To handle *this* case we

  1. replace all capable(CAP_SETVEID) checks with a more complicated one that still matches only ve0's root;
  2. don't ban the CAP_SETVEID (== CAP_AUDIT_WRITE == the_one_dbus_needs);
  3. remember, that this capability is present on ve startup and thus we automatically have the CAP_AUDIT_WRITE required by dbus;
  4. carefully handle the case, when we enter container in do_env_create and try to call fairsched system calls.

That's it. No fraud, just manual dexterity  ;)

Bug #117448.
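
The idea behind the steps above can be illustrated with a small model. Everything here is an assumption made for illustration: the text does not spell out the replacement check, so demo_new_check() (ve0 plus euid 0) is only one plausible shape of "matching ve0's root only", and the value 29 for the shared capability bit is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit number; the text only says the two capabilities coincide. */
#define DEMO_CAP_SETVEID     29
#define DEMO_CAP_AUDIT_WRITE 29

/* Hypothetical credentials of a task. */
struct demo_cred {
    unsigned int veid;            /* 0 means the host (ve0) */
    unsigned int euid;
    unsigned long cap_effective;  /* capability bitmask */
};

/* Old style: trust the capability bit, which dbus also needs in a CT. */
int demo_old_check(const struct demo_cred *c)
{
    assert(c != NULL);
    return (int)((c->cap_effective >> DEMO_CAP_SETVEID) & 1UL);
}

/* Hypothetical replacement: ignore the bit, match only root in ve0. */
int demo_new_check(const struct demo_cred *c)
{
    assert(c != NULL);
    return c->veid == 0 && c->euid == 0;
}
```

With the bit left set for a container's root, the old check would pass inside the container, while the sketched replacement passes only for root on the host.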


Patch from Vitaliy Gusev <>
Fix NULL dereference virtualized ip_nat variables via netlink

If a VE is allowed to use conntrack but not ip_nat, and ip_conntrack_netlink is loaded, then a user inside the VE can hang the host. The first oops occurs in ip_nat_core.c:ip_nat_proto_find_get(), the second in ip_nat_core.c:find_appropriate_src(), and the host then panics because read_lock_bh is held:

Unable to handle kernel NULL pointer dereference at 0000000000000030 RIP:
  [<ffffffff881636c1>] :ip_nat:ip_nat_proto_find_get+0x61/0xa0
Process lt-ctnl_test (pid: 10587, veid=1000, threadinfo ffff81000b8da000, task ffff810005e87040)
Stack:  ffff81000fb001f8 ffff810015f2fe98 ffff81000b8db888 ffffffff8819a362
  0000000000000000 0000000000000000 ffff81000b8db8a8 ffff81000fb001f8
  ffff81000b8dba48 ffff81000b8dba20 ffff81000b8db908 ffffffff8819a6f9
Call Trace:
  [<ffffffff8819a362>] :ip_conntrack_netlink:ctnetlink_parse_nat_proto+0x92/0xe0
  [<ffffffff8819a6f9>] :ip_conntrack_netlink:ctnetlink_create_conntrack+0x349/0x4e0
  [<ffffffff8819bcf7>] :ip_conntrack_netlink:ctnetlink_new_conntrack+0x367/0x9c0
  [<ffffffff8819bd28>] :ip_conntrack_netlink:ctnetlink_new_conntrack+0x398/0x9c0
  [<ffffffff8106061f>] __lock_acquire+0xcff/0xd50
  [<ffffffff8812d52b>] :nfnetlink:nfnetlink_rcv_msg+0x20b/0x230
  [<ffffffff8812d350>] :nfnetlink:nfnetlink_rcv_msg+0x30/0x230
  [<ffffffff8812d5c0>] :nfnetlink:nfnetlink_rcv+0x70/0x174
  [<ffffffff811fefaa>] netlink_data_ready+0x1a/0x60
  [<ffffffff811ffa3b>] netlink_sendmsg+0x51b/0x560
  [<ffffffff8102be10>] default_wake_function+0x0/0x10
  [<ffffffff811e1a5e>] sock_sendmsg+0xee/0x110
  [<ffffffff8104e9f0>] autoremove_wake_function+0x0/0x40
  [<ffffffff81253f29>] _spin_unlock_irqrestore+0x49/0x60
  [<ffffffff8105f33c>] mark_held_locks+0x7c/0xb0
  [<ffffffff8106061f>] __lock_acquire+0xcff/0xd50
  [<ffffffff811e1845>] move_addr_to_kernel+0x25/0x40
  [<ffffffff811ea714>] verify_iovec+0x54/0xb0
  [<ffffffff811e26a6>] sys_sendmsg+0x246/0x2c0
  [<ffffffff8111300b>] __up_read+0x9b/0xb0
  [<ffffffff81051cf6>] up_read+0x26/0x30
  [<ffffffff8101e791>] do_page_fault+0x4e1/0x8e0
  [<ffffffff81250e5b>] thread_return+0x98/0x1cd
  [<ffffffff8105f54b>] trace_hardirqs_on+0x11b/0x160
  [<ffffffff81250e5b>] thread_return+0x98/0x1cd
  [<ffffffff8105f54b>] trace_hardirqs_on+0x11b/0x160
  [<ffffffff812534d3>] trace_hardirqs_on_thunk+0x35/0x37
  [<ffffffff8100a006>] system_call+0x7e/0x83

Bug #127153.


Patch from Denis Lunev <>
lockd: do not attempt to shutdown lockd hosts from other environments

This code path is invoked during lockd stop, which in turn is per-VE. The consequence is simple and bad: timeouts on RPC operations. The user-visible symptom is the following message in dmesg:

lockd: couldn't shutdown host module!

Bug #126918.


Patch from Marat Stanichenko <>
ve: Use vpid in pi_futex code.

As we use tasks' vpid to own pi futex we should do it everywhere.

Bug #132768.
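
Why the pid must be consistent can be seen from how a PI futex word encodes its owner. In this sketch DEMO_FUTEX_TID_MASK mirrors FUTEX_TID_MASK (0x3fffffff) from <linux/futex.h>; struct demo_task and both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FUTEX_TID_MASK 0x3fffffffu  /* same value as FUTEX_TID_MASK */

/* Hypothetical task with a host-wide pid and a per-container vpid. */
struct demo_task {
    int pid;
    int vpid;
};

/* Value written into the futex word when the task takes the lock. */
uint32_t demo_futex_lock_word(const struct demo_task *t)
{
    assert(t != NULL);
    return (uint32_t)t->vpid & DEMO_FUTEX_TID_MASK;
}

/* Ownership test: compare with vpid, the same id the lock word stores. */
int demo_futex_owned_by(uint32_t uval, const struct demo_task *t)
{
    assert(t != NULL);
    return (uval & DEMO_FUTEX_TID_MASK) == (uint32_t)t->vpid;
}
```

If one code path compared the word against pid while another stored vpid, a task inside a container would fail to be recognized as the owner of its own lock.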


Patch from Vitaliy Gusev <>
printk: fix lockdep warnings if kernel compiled with CONFIG_LOCKDEP

vprintk() to VE causes:

   [ BUG: lock held at task exit time! ]
   iptables/8203 is exiting with locks still held!
   1 lock held by iptables/8203:
    #0: (sk_lock-AF_INET){--..}, at: [<ffffffff81213341>] ip_setsockopt+0x61/0xa0
   stack backtrace:
   Call Trace:
    [<ffffffff8100b78a>] show_trace+0xca/0x3b0
    [<ffffffff8100ba85>] dump_stack+0x15/0x20
    [<ffffffff8105e469>] debug_check_no_locks_held+0x89/0xa0
    [<ffffffff8103aa7e>] do_exit+0xe2e/0xe80
    [<ffffffff8103aba0>] sys_exit_group+0x0/0x20

Note: to reproduce this you can type in VE:

    iptables -A INPUT -m tcp --dport 22 -j DROP


Patch from Konstantin Khlebnikov <>
ve: Add /proc/swaps file inside CT.

Fill the size/used values with the ones from the meminfo virtinfo notifier.

Show one fake swap partition (/dev/null) with the same size/used as in /proc/meminfo. If --meminfo == none, show overall swap statistics from the HN.

Bug #115800.


Patch from Konstantin Ozerkov <>
vzquota: replace quota master block semaphore with mutex

Bug #120822.


Patch from Konstantin Ozerkov <>
vzquota: replace master lock semaphore with mutex

Bug #120822.