From OpenVZ Virtuozzo Containers Wiki
< Download‎ | kernel‎ | rhel4‎ | 023stab037.3
Revision as of 16:40, 20 March 2008 by Kir (talk | contribs) (created)



  • CPT fixes
  • compat iptables fixes
  • dead state tasks leak fixed
  • Fixes for broken teamspeak application
  • ext3 fixes
  • Microcode update fix
  • UBC dcache fix
  • OOM killer fix
  • Fixes for compilation with gcc4
  • Mainstream bridges security fix
  • /proc/cpuinfo MHz output fix.
  • Security fix for local port range.
  • Old vzdq detached inode warning fix.
  • Other fixes.




  • In-kernel sysfs/uevent layer is now updated to be compatible with FC5 and SLES10 userland.



Patch from Alexey Dobriyan: Debug patch for resolving OVZ bugs #341, #116, #177.

  • Extend ->origin into array for previous origin tracking.
  • Print last two origins in debugging message.
  • Also print i_mode, i_op, i_fop, ... of the offending inode.


Patch from Kirill: move_task_off_dead_cpu() requires interrupts to be disabled, while migrate_dead() calls it with enabled interrupts. Added appropriate comments to functions and added BUG_ON(!irqs_disabled()) into double_rq_lock() and double_lock_balance() which are the real source of such bugs.

Signed-Off-By: Kirill Korotaev <>


Patch from Denis:

This patch fixes the following problem: if a process is being killed by the OOM killer while inside __alloc_pages, it can't exit until it frees some memory, which is currently impossible. The patch allows such a process to dip into the memory reserves.

Bugs #71604, #71179.
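
The idea can be sketched in user-space C as a watermark check. This is only an illustrative model, not the kernel code: struct task_model, its oom_killed flag, and may_allocate() are hypothetical names, and the real __alloc_pages logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; not a kernel structure. */
struct task_model {
    bool oom_killed;   /* set once the OOM killer has selected this task */
};

/*
 * Decide whether an allocation of `pages` may proceed.
 * Ordinary tasks must leave `reserve` pages untouched; a task that is
 * already being killed may use them so that it can finish exiting.
 */
static bool may_allocate(const struct task_model *t,
                         unsigned long free_pages,
                         unsigned long reserve,
                         unsigned long pages)
{
    assert(t != NULL);
    unsigned long min_free = t->oom_killed ? 0 : reserve;

    if (free_pages < min_free)
        return false;
    return free_pages - min_free >= pages;
}
```

With 10 free pages and a reserve of 16, an ordinary task is refused while a task marked as OOM-killed is allowed through, which is the behaviour the patch description asks for.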


Patch from Alexey: gcc complains about scanned being used uninitialized and it's right.


Patch from Pavel: Fix negative vm_rss accounting.

Bug #71680.


Patch from Andrey:

Some file systems used inside a VE (nfs, fuse) do not have a mount point. We do not need to perform the checks and umount for them.

The existing check is relaxed and umount is performed conditionally.


Patch from Andrey:

When moving a network device from VE0, we must check that the target VE doesn't already have a device with the same name (e.g. one can create a veth device named eth0 inside a VE and then try to move the eth0 device from VE0 to this VE).
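
A minimal user-space sketch of such a check, assuming the target VE's device names are available as a plain array of strings. The helpers name_in_use() and check_move() are hypothetical and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `name` already appears in the list of device names. */
static bool name_in_use(const char *const *names, size_t n, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return true;
    return false;
}

/* Returns 0 if the move may proceed, -1 (think -EEXIST) on a name clash. */
static int check_move(const char *const *dst_names, size_t n,
                      const char *dev_name)
{
    return name_in_use(dst_names, n, dev_name) ? -1 : 0;
}
```

For a VE that already holds "lo" and "eth0", moving a device called "eth0" is refused while "eth1" is accepted.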


Patch from Roman Chechnev: Implementation of userspace events

Summary patches and changes:

  • export of SEQNUM to userspace (creates /sys/kernel)
  • kobject: adjust hotplug_seqnum increment to keep userspace and kernel agreeing.
  • kobject: fix build error if CONFIG_HOTPLUG is not enabled.
  • kobject: hotplug_seqnum is not 64 bits on all platforms, so fix it.
  • ksyms: don't implement /sys/kernel/hotplug_seqnum if CONFIG_HOTPLUG is not enabled.
  • Implementation of userspace events through a netlink socket
  • kobject_uevent warning fix
  • kobject_uevent: fix init ordering
  • kevent: standardize on the event types
  • kobject: add CONFIG_DEBUG_KOBJECT
  • kevent: add block mount and umount support
  • kobject: add add_hotplug_env_var()
  • kevent: add __bitwise kobject_action to help the compiler check for misusages
  • Make kobject_hotplug() work even if the kobject's kset doesn't implement any
  • kobject_uevent warning fix
  • hotplug: prevent skips in sequence number from happening
  • kobject: fix hotplug bug with seqnum
  • take me home, hotplug_path[]
  • Move hotplug_path[] out of kmod.[ch] to kobject_uevent.[ch]
  • kevent: fix build error if CONFIG_KOBJECT_UEVENT is not selected.
  • USB: use add_hotplug_env_var() in core/usb.c
  • Use add_hotplug_env_var() in firmware loader
  • fix unnecessary increment in firmware_class_hotplug() and USB core
  • drivers/usb/core/usb.c: add MODALIAS env var to hotplug
  • usb: class driver pass dev_t to the class core
  • PCI: add MODALIAS to hotplug event for pci devices
  • PCI: Remove newline from pci MODALIAS variable
  • PCI: move pci core to use add_hotplug_env_var()
  • add the physical device and the bus to the hotplug environment
  • add the driver name to the hotplug environment
  • driver core: allow struct bin_attributes in class devices
  • class_simple: pass dev_t to the class core
  • class core: export MAJOR/MINOR to the hotplug env
  • block: genhd: terminate, set to next free slot, shrink available space
  • avoid problems with kobject_set_name and name with %
  • Hotplug: Make dev->bus checking consistent
  • Driver core: add driver symlink to device
  • add the bus name to the hotplug environment
  • Driver core: add "bus" symlink to class/block devices
  • add sysfs attr to re-emit device hotplug event


Patch from Pavel Emelianov <>: [UBC] Fix UB_NUMFILE accounting optimisation leak

In 2.6.16 files are put via RCU, so ub_file_uncharge() is called in IRQ context. Thus non-atomic decrement of file_precharged must be done with IRQs disabled.


Patch from Pavel Emelianov <>: [UBC] Don't allow precharged files exhaust kmemsize

When a file is put, it may be added to the precharged value of some task, thus holding UB_NUMFILE and UB_KMEMSIZE resources.

The problem is that files do not start uncharging until ub_barrier_farnr() is hit for UB_NUMFILE. For ub0, ub_barrier_farnr() can happen only after hitting the kmemsize barrier. Thus the kmemsize resource gets completely exhausted.

On 2.6.9 this problem is not easily reproducible, as files are usually put in the context of the closing task. On 2.6.16 files are put via RCU and thus in another task's context.


Patch from Pavel Emelianov <>: [UBC] Set correct precharge values for init_task.

Otherwise file freeing will happen in "swapper" context and will spoil all statistics due to "negative" unsigned long value.

OpenVZ Bug #322.


Patch from mainstream: [PATCH] binfmt_elf: clearing bss may fail

So we discover that Borland's Kylix application builder emits weird elf files which describe a non-writeable bss segment.

So remove the clear_user() check at the place where we zero out the bss. I don't _think_ there are any security implications here (plus we've never checked that clear_user() return value, so whoops if it is a problem).

Signed-off-by: Pavel Machek <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

OpenVZ Bug #332.


Patch from Andrey:

The wrong structure with inode attributes was used in fixup_file_content(). We must be sure that we won't break the file system, so we do not set the inode mode bits covered by the S_IFMT mask.

Fix for bugs:
Bug #71135.
Bug #71161.


Patch from Andrey: Due to a silly mistake, the wrong inode mode was set on restore (cpt_mode was used instead of cpt_i_mode).

This patch should be used instead of diff-cpt-rst-deleted-owner-b-20061030.

Fix for bugs:
Bug #71135.
Bug #71161.


Patch from Kostja: don't create /proc/<PID> dentry/inode for threads that are not group leaders. Sets /proc/<TGID>/task/<PID> dentry as a task->proc_dentry for such threads.

Bug #69536.

This should eliminate the problem of a growing number of processes in the X state.


Patch from Pavel:

Optimized kmemsize accounting calls ub_slab_(ub)charge with IRQs disabled, but debugging code isn't aware of it...

Bug #70694.


Patch from Andrey: Context should be set from device (dev->owner_env) even for devices from VE0.


Patch from Alexey: [CPT] fail when VE refers to an invisible file

Checkpointing used to ignore EINVAL returned by d_path(). It was a workaround for tmpfs shmem files, which use detached mounts. But this means that genuinely invisible paths are detected too late: checkpointing succeeds and restore fails, which is not acceptable.

When d_path() fails, check whether it is a shmem file. If it is not, fail immediately.


Patch from Dmitry: Fixed the refcount of match modules in case of an error in compat_copy_from_user()


Patch from mainstream: [IPV4]: Fix lost routes in fn_hash netlink dumps.

Spotted by, the fn_hash_dump_bucket() main loop does not increment 'i' properly, and thus routes will not be listed, when the test 'i < s_i' passes.

The bug was added when the code was converted over to hlist_for_each_entry() by your's truly.

Signed-off-by: David S. Miller <>
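
The class of bug described here can be modelled in user-space C. The sketch below is not the fn_hash code: dump_from() and its parameters are hypothetical, and a plain array stands in for the hash bucket. It shows a resumable dump where skipped entries must still advance the running index 'i'.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy entries starting at index s_i into out (capacity cap), the way a
 * resumable dump skips entries that an earlier call already delivered.
 * Returns the index to resume from on the next call.
 */
static size_t dump_from(const int *entries, size_t n, size_t s_i,
                        int *out, size_t cap, size_t *written)
{
    size_t i = 0, w = 0;

    assert(entries && out && written);

    for (size_t pos = 0; pos < n; pos++) {
        if (i < s_i)
            goto next;      /* already dumped: skip, but still count it */
        if (w == cap)
            break;          /* output full: resume here next time */
        out[w++] = entries[pos];
next:
        i++;                /* a bare 'continue' above would skip this */
    }
    *written = w;
    return i;
}
```

If the skip path used 'continue' without incrementing 'i', the test 'i < s_i' would stay true for every remaining entry whenever s_i is non-zero, and nothing further would be listed.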



Patches from Kir Kolyshkin <>: Various gcc4-related compilation fixes.


Patch from Alexey: [CPT] do not checkpoint/restore global process groups

The patch is three-fold:

1. Do not try to allocate process groups/sessions, unless they are not virtual. This is the fix for bug #71825. However, it is too late to detect failure.

2. Do not checkpoint a VE if it contains references to external process group/session ids. This is the _destructive_ part. It will definitely prevent migration of some commonly used configurations, where some deficient daemon (qmail, for example) forgets to daemonize itself and is started by vzctl exec.

A workaround is possible in theory at the level of vzctl, if it performs a second fork and setsid() after VE_ENTER. This is feasible because the entered process is not required to be a child of vzctl; the actual reaping and waiting is done not by wait4() but through a control pipe. Another way is to use clone(CLONE_PARENT), but it is also tricky.

3. Do the same checks before migration starts, to prevent a failure due to #2 after the rsync phase.


Patch from Alexandr Andreev: fix misprint (error) in sysctl numbers.


Patch from Alexey: [CPT] checkpoint/restore conntrack on 2.6.9

Compared to 2.6.8, 2.6.9 tracks much more information about TCP connections, so we need to reserve additional space for it. Right now it still does not make much sense to add an additional attribute, because it is already known that in 2.6.16..19 conntrack differs even more.

So we just change the format of the whole record and try to fill the missing fields with reasonable values when migrating from 2.6.8.


Patch from mainstream:

bridge: fix possible overflow in get_fdb_entries (CVE-2006-5751)

Make sure to properly clamp maxnum to avoid overflow (CVE-2006-5751).

Signed-off-by: Chris Wright <>
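
Clamping a caller-supplied count before it is multiplied by an element size can be sketched as follows. The structure layout, the 4096-byte page size, and the choice of one page as the upper bound are assumptions made for this model; struct fdb_entry_model, clamp_maxnum(), and fdb_buffer_size() are hypothetical names rather than the bridge code itself.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumed page size for the model */

/* Hypothetical stand-in for a forwarding-database record. */
struct fdb_entry_model {
    unsigned char mac_addr[6];
    unsigned char port_no;
    unsigned char is_local;
    unsigned int  ageing_timer_value;
    unsigned int  unused;
};

/*
 * Clamp a caller-supplied entry count so that
 * maxnum * sizeof(entry) can neither overflow nor exceed one page.
 */
static size_t clamp_maxnum(unsigned long maxnum)
{
    const unsigned long limit =
        MODEL_PAGE_SIZE / sizeof(struct fdb_entry_model);

    return maxnum > limit ? limit : maxnum;
}

/* Buffer size in bytes for a (clamped) number of entries. */
static size_t fdb_buffer_size(unsigned long maxnum)
{
    size_t bytes = clamp_maxnum(maxnum) * sizeof(struct fdb_entry_model);

    assert(bytes <= MODEL_PAGE_SIZE);
    return bytes;
}
```

Without the clamp, a huge maxnum would wrap around in the multiplication and yield a buffer far smaller than the number of entries later copied into it.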


Patch from Andrey with fixes from Dmitry Monakhov:

In data=ordered or data=journal mode, the retry in ext3_prepare_write() breaks the requirements of journaling of data with respect to metadata. The fix is to call commit_write to commit the allocated zero blocks before the retry.

Author: Andrey Savochkin <>
Signed-Off-By: Kirill Korotaev <>


Patch from Dmitry (dmonakhov@), modified by Kirill:

This patch fixes issues introduced by diff-ext3-pgfault11 patch

  • remove unused variables
  • fix incorrect recursion detection

Bug #71881.


Patch from Alexandr Andreev:

SysRq debugger memory dumping enhancements:

  • 32/64 bit architectures support (80 chars per line);
  • skipping lines with zero bytes;


Patch from Alexey Dobriyan:

Fix of deadlock in ptrace_attach().


fix: commit f5b40e363ad6041a96e3da32281d8faa191597b9
Fix ptrace_attach()/ptrace_traceme()/de_thread() race

fix in fix: commit f358166a9405e4f1d8e50d8f415c26d95505b6de
ptrace_attach: fix possible deadlock schenario with irqs
ptrace_traceme cleanup:
commit 6b9c7ed84837753a436415097063232422e29a35

[PATCH] use ptrace_get_task_struct in various places

write_can_lock() part was dropped since it doesn't exist in 2.6.9. it was replaced with schedule().

PTRACE_TRACEME chunk was projected by hand on i386, x86_64, sparc, sparc64, ppc, ppc64 due to the ptrace_traceme() cleanup done after 2.6.9. If more archs need coverage, yell at me.

Bug #72235.
Bug #61233.


Patch from mainstream, ported by Kostja(khorenko@):

fix removes the microcode's size check on x86.

Should be applied to 2.6.9-x series, 2.6.18.

Bug #72356 (which applies after clean-up patch).

# ChangeSet
#   2006/09/27 08:26:18-07:00
#   [PATCH] x86 microcode: don't check the size
#   IA32 manual says if micorcode update's size is 0, then the size is
#   default size (2048 bytes). But this doesn't suggest all microcode
#   update's size should be above 2048 bytes to me. We actually had a
#   microcode update whose size is 1024 bytes. The patch just removed the
#   check.
#   Signed-off-by: Shaohua Li <>
#   Cc: Tigran Aivazian <>
#   Signed-off-by: Andrew Morton <>
#   Signed-off-by: Linus Torvalds <>


Patch from Denis (den@) based on idea from Pavel:

This patch fixes a dcache leak caused by a race between dcache_charge[_forced] and dcache_uncharge. The idea: do not trust dentry_bc after the count state changes.

Bug #72051.


Patch from Denis (den@):

This patch calculates OOM generations directly. The counter is increased when the MM of a process killed by OOM is finally destroyed.

Bug #71980.


Patch from Alexey Dobriyan:

At one place qlnk origin is not set which may hide useful info later.


Patch from Alexandr Andreev:

  • account tty structures to kmemsize
  • setup driver->refcount correctly. Doesn't affect anything.


Patch from Dmitry (dim@):

Issue found by Patrick McHardy. After the checks were reordered, the target and match checks that verify whether they may be used for this hook always return true, because the e->comefrom field is not yet initialized at that point. So the order is restored and the necessary checks are moved into mark_source_chains(). For compats this issue has existed from the beginning.


Patch from Alexey (alexey@):

[PATCH] VE_ENTER switches to virtual pid

When the PID of the process is used by other processes as their PGID/SID, we cannot do this. Otherwise, we can safely switch to a virtual pid.

The difference from the previous version is one line: do_env_enter() can be done when the process already has a virtual pid. (This sounds crazy, but this is what happens with checkpointing. :-)).


Patch from Alexey Dobriyan:

ve_scale_khz() ignores the number of virtual cpus in the node leading to strange results in /proc/cpuinfo:

	0.5 * 4 * 1000MHz
	------------------- => 500MHz (but it should be 666 MHz)

Also, initialize ->vcpus of fairsched init node to something sensible to avoid division by zero. ->vcpus was not explicitly initialized at startup.

Bug #71984.
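
The arithmetic can be checked with a small model. The denominator is not shown in the text above; the quoted 666 MHz is consistent with dividing by 3 virtual CPUs, and 500 MHz with dividing by the 4 physical CPUs, so the sketch assumes exactly that. scale_khz_model() and its parameters are hypothetical and are not the signature of ve_scale_khz().

```c
#include <assert.h>

/*
 * Hypothetical model: scale a physical clock by the node's share of the
 * machine and spread it across the node's virtual CPUs.
 *   khz           physical clock in kHz
 *   rate_permille node's share in 1/1000 units (500 == 0.5)
 *   phys_cpus     number of physical CPUs
 *   vcpus         number of virtual CPUs in the node
 */
static unsigned long long scale_khz_model(unsigned long long khz,
                                          unsigned long long rate_permille,
                                          unsigned long long phys_cpus,
                                          unsigned long long vcpus)
{
    assert(vcpus != 0);   /* an uninitialized ->vcpus would divide by zero */
    return khz * rate_permille * phys_cpus / (1000ULL * vcpus);
}
```

Under these assumptions, scale_khz_model(1000000, 500, 4, 3) yields 666666 kHz, while using 4 in place of the 3 virtual CPUs yields 500000 kHz, matching the two figures quoted in the patch description.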


Patch from Alexey Dobriyan:

VE has simple idle time collection logic (per VCPU ->strt_idle_time, ->idle_time). For ->idle_time to be incremented, ->strt_idle_time must not be 0. It becomes non-zero only when the very first task is scheduled on the VCPU. Before that, all VCPU statistics are zeroed out because of ve = kzalloc(sizeof(struct ve_struct)); including ->strt_idle_time.

All this leads to surprising /proc/stat and, as a consequence, top(1) output:

# vzctl exec 140 cat /proc/stat
cpu  83 0 150 65654 173 0 0
cpu0 66 0 98 64839 167 0 0
cpu1 15 0 47 369 6 0 0
cpu2 0 0 4 446 0 0 0
cpu3 0 0 0 0 0 0 0	<===
cpu4 0 0 0 0 0 0 0	<===
cpu5 0 0 0 0 0 0 0	<===
cpu6 0 0 0 0 0 0 0	<===
cpu7 0 0 0 0 0 0 0	<===

When user, system and nice times are 0%, it's OK. But when idle time is _also_ 0%, it's surprising.

The solution is to start idle_time collecting state machine when VCPU is added.

As a nice side effect, when you start a VE with 2 VCPUs and later add a 3rd, its idle time will start ticking from the moment of addition.

OpenVZ Bug #366.
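
A user-space sketch of such an idle-time state machine, assuming a start timestamp of 0 means "not collecting yet". The field names follow the text, but struct vcpu_stat_model and the three functions are hypothetical and do not reproduce the scheduler code.

```c
#include <assert.h>

/* Hypothetical per-VCPU statistics, named after the fields in the text. */
struct vcpu_stat_model {
    unsigned long long strt_idle_time;  /* 0 means "not collecting yet" */
    unsigned long long idle_time;
};

/* Called when a VCPU is added: start the idle clock immediately. */
static void vcpu_added(struct vcpu_stat_model *s, unsigned long long now)
{
    assert(now != 0);       /* 0 is reserved as the "not started" marker */
    s->idle_time = 0;
    s->strt_idle_time = now;
}

/* Called when a task is scheduled on the VCPU: close the idle interval. */
static void vcpu_becomes_busy(struct vcpu_stat_model *s,
                              unsigned long long now)
{
    if (s->strt_idle_time != 0) {
        s->idle_time += now - s->strt_idle_time;
        s->strt_idle_time = 0;
    }
}

/* Idle time as reported at time `now`, including a still-open interval. */
static unsigned long long vcpu_idle_total(const struct vcpu_stat_model *s,
                                          unsigned long long now)
{
    unsigned long long total = s->idle_time;

    if (s->strt_idle_time != 0)
        total += now - s->strt_idle_time;
    return total;
}
```

A structure that is merely zeroed reports no idle time however long it sits unused, which corresponds to the all-zero cpu3..cpu7 rows above; calling vcpu_added() first makes the idle counter tick from the moment of addition.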


Patch from Alexey Dobriyan, modified by Kirill:

When file is opened and unlinked before vzquotaon, but generic_delete_inode() is called after vzquotaon, scary message appears:

VZDQ: detached inode not in creation, orig 5, dev dm-0, inode 73828039, fs ext3
current 18761 (httpd), VE 102, time 77463.605075
[<ed79c296>] vzquota_det_qmblk_recalc+0x256/0x270 [vzdquota]
[<ed79c302>] vzquota_inode_qmblk_recalc+0x52/0x70 [vzdquota]
[<ed79c573>] vzquota_inode_data+0xb3/0xf0 [vzdquota]
[<ed79c449>] vzquota_inode_init_call+0x19/0x80 [vzdquota]
[<021fb940>] ext3_delete_inode+0x0/0x120
[<ed79e47f>] vzquota_initialize+0xf/0x20 [vzdquota]
[<0219d983>] generic_delete_inode+0x173/0x190
[<021996f6>] dput_recursive+0x56/0x230
[<0217f333>] __fput+0x123/0x1b0
[<0217d4c2>] filp_close+0x52/0xa0
[<0217d57a>] sys_close+0x6a/0xa0

However, there is no need to scare the admin:
1) inode is in I_FREEING state.
2) vzquota never saw inode before, thus it doesn't know what to do with it.
3) inode was unlinked outside of vzquota area of interest, otherwise -EBUSY would have returned on quotaon.

So, do nothing and let inode silently die.

Many thanks to Alexey Kuznetsov for spelling out a reliable testcase.

OpenVZ Bug #116
OpenVZ Bug #177
OpenVZ Bug #341
Bug #60532.
Bug #61431.
Bug #55275.


Patch from Denis:

Hide the content of /proc/net/sockstat rather than the file itself to keep sysstat from crashing.

Bug #72587.


Patch from Denis:

This patch fixes turning off dcache accounting. ub_dentry_walk does not guarantee the order in which it meets dentry leaves and nodes. So, just set d_inuse to -1 and uncharge all at once.

Bug #72730.


Patch from Denis:

tcp_v4_get_port fixed. Treat local_port_range[0] > local_port_range[1] as local_port_range[1] == local_port_range[0].

Bug #72736.
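
The normalization rule can be illustrated with a small model. port_span() and pick_port() are hypothetical helpers written for this sketch; they are not taken from tcp_v4_get_port(), and the rover parameter simply stands for whatever value the caller uses to choose a starting port.

```c
#include <assert.h>

/*
 * Number of candidate ports in [range[0], range[1]].
 * An inverted range is treated as if range[1] == range[0],
 * i.e. a single-port range, so the span is never zero or negative.
 */
static int port_span(const int range[2])
{
    int low = range[0];
    int high = range[1] < low ? low : range[1];
    int span = high - low + 1;

    assert(span >= 1);
    return span;
}

/* Map an arbitrary rover value onto a port inside the normalized range. */
static int pick_port(const int range[2], unsigned int rover)
{
    return range[0] + (int)(rover % (unsigned int)port_span(range));
}
```

With the range set to {61000, 32768}, every rover value maps to port 61000 instead of producing a negative span.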


Patch from Dmitry Monakhov:

ext3_journal_stop() should be called after ext3_prepare_failure() unconditionally. i.e. always stop.