From OpenVZ Virtuozzo Containers Wiki



  • Rebase to RHEL5 8.1.8 kernel
  • Critical fix in CPT
  • Minor fixes for bridge, Xen x86_64, CPT, 4GB split, NFS, vpids, etc.
  • Fix swsusp on SLES, CBQ fairness on low rates, and an NFS startup deadlock.

Config changes





Patch from Sergey Ya Korshunoff (seyko2@)
Fix TSS handling in vm86.c in Xen kernel.

A misprint meant that load_esp0() was never called in Xen kernels.


Patch from Andrey Mirkin <>
[PATCH] CPT: remove redundant kfree()

Remove the redundant kfree() call from the open_deleted() function. ii is now a static structure, and calling kfree() on it leads to an oops :/

Bug #84173.


Patch from Evgeny Kravtsunov <>
[PATCH] ebtables: ebtables_among fails on check() on x86-64

The ebtables module calls the checker ebt_among_check(), which verifies the size of the user-supplied data.

Userspace size is calculated in the following way (ebtables-2.0.8-1):

 EBT_ALIGN(EBT_ALIGN(sizeof(struct ebt_among_info)) + X)

While kernel calculates size as:

 EBT_ALIGN(sizeof(struct ebt_among_info) + X)

On x86_64 EBT_ALIGN aligns to 8 bytes, so the two results can differ and the check fails.
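The mismatch between the two formulas can be reproduced in userspace. This is a minimal sketch assuming 8-byte alignment; `align8`, `user_size` and `kernel_size` are hypothetical helpers, and the header size passed in is an arbitrary sample value, not the real sizeof(struct ebt_among_info).

```c
#include <stddef.h>

/* Round n up to a multiple of 8 (assumption: stands in for EBT_ALIGN
 * on x86_64; this is not the real macro). */
static size_t align8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Size as computed by userspace: the header is aligned first. */
static size_t user_size(size_t hdr, size_t x)
{
	return align8(align8(hdr) + x);
}

/* Size as computed by the kernel checker: only the sum is aligned. */
static size_t kernel_size(size_t hdr, size_t x)
{
	return align8(hdr + x);
}
```

With a sample header of 12 bytes and X = 4, the userspace formula yields 24 while the kernel formula yields 16; when the header size is already a multiple of 8 the two agree.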

OpenVZ Bug #576.


Patch from Dmitry Mishin <>
[PATCH] Fix bridge removal with active master device

Fix bridge removal with active master device: simple misprint.


Patch from Vitaliy Gusev <>
[PATCH] IA64: mmap returns EINVAL if len==0

mmap on IA64 architecture returns EINVAL when len==0, while old kernel behaviour is to return addr in this case.

Though POSIX requires EINVAL in this case and mainline fixed this around 2.6.16, we still have to keep compatibility for some time with old stupid apps like rpm, which did exactly this and expected success :/
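For reference, the POSIX-conforming behaviour can be observed from userspace on a current kernel. The helper name `mmap_zero_len_errno` is made up for this sketch, and an anonymous mapping is used only to keep it self-contained; it does not reproduce what rpm did.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try a zero-length anonymous mapping.
 * Returns 0 if mmap() succeeds, otherwise the errno it reported. */
static int mmap_zero_len_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	return 0;
}
```

On a kernel that follows POSIX here the helper returns EINVAL; the compatibility behaviour described above would make the call succeed instead.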

Bug #83938.


Patch from Dmitry Monakhov <>
[PATCH] 4gb split: fix broken suspend

The 4GB split patch set removed a piece of code, and suspend broke as a result. Restore it.

Bug #84909.


Patch from Andrey Mirkin <>
[PATCH] CPT: check ctx->file for NULL

We need to be sure that the dumpfile pointer (ctx->file) is not NULL, because we can't start a dump without it.

We also need to return an error like EINTR instead of ERESTART*, because we can't simply restart the dump ioctl: the dumpfile is already closed and has to be reopened before the dump ioctl is called a second time.

Bug #84412.


Patch from Andrey Mirkin <>
[PATCH] CPT: ignore user signals in kernel threads

Under ptrace, signals are not handled immediately and we have a non-zero shared_pending mask on the current task, so fork() returns -ERESTARTNOINTR and wait4() returns -ERESTARTSYS. We need to block the signals SIGCHLD, SIGWINCH, SIGCONT and SIGURG to be sure that these signals are ignored during kernel thread creation.

Bug #84412.


Patch from Kirill Korotaev <>
[PATCH] CPT: remove killing of external processes

External processes can't be easily detected. Even if a process has a virtual pid, that doesn't mean it has no connections to VE0, such as opened files, libraries, etc.

So remove this feature entirely and go back to how it was: external processes should prevent CPT.

Revert of the patches:

  • diff-cpt-kill-external-process-20070125
  • diff-cpt-kill-external-processes-b-20070515


Patch from Roman Chechnev <>
[PATCH] autofs4: compat layer for x86_64

autofs4 uses a platform-dependent protocol which has 'long' data types inside data structures that are passed to/from user-space via a pipe (sic!)...

Because of this, 32-bit autofs tools do not work with a 64-bit kernel.
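The effect of a 'long' inside a structure sent over a pipe can be seen by comparing two layouts. The structures below are hypothetical, not the real autofs4 packets; int32_t and int64_t stand in for 'long' as seen by a 32-bit tool and by a 64-bit kernel.

```c
#include <stdint.h>

/* Layout as a 32-bit userspace tool would see it (illustrative fields). */
struct demo_pkt_32 {
	int32_t version;
	int32_t type;
	int32_t token;		/* 'long' on a 32-bit build */
	int32_t len;
};

/* Layout as a 64-bit kernel would see it (illustrative fields). */
struct demo_pkt_64 {
	int32_t version;
	int32_t type;
	int64_t token;		/* 'long' on a 64-bit build */
	int32_t len;
};
```

Because the two sides disagree on the size and field offsets of the packet, a reader expecting the 32-bit layout misparses what the 64-bit side wrote, which is why a compat layer is needed.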

Bug #82040.


Patch from Jan Kara <>
[PATCH] jbd: remove_transaction fix

We have to check that also the second checkpoint list is non-empty before dropping the transaction.

Signed-off-by: Jan Kara <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

X-Git-Tag: v2.6.16-rc2~350
X-Git-Url: 43c3e6f5abdf6acac9b90c86bf03f995bf7d3d92


Patch from Konstantin Khorenko <>
[PATCH] bridge: race between br_del_if() and port_carrier_check()

This patch eliminates a race between br_del_if() and port_carrier_check() which leads to the oops in the latter function. This patch is a port of 2 mainstream patches:

[BRIDGE] br_if: Fix oops in port_carrier_check

Signed-off-by: Jarek Poplawski <>
Acked-by: Stephen Hemminger <>
Signed-off-by: David S. Miller <>
commit a10d567c89dfba90dde2e0515e25760fd74cde06


[BRIDGE]: eliminate workqueue for carrier check

Having a work queue for checking carrier leads to lots of race issues.
Simpler to just get the cost when data structure is created and
update on change.

Signed-off-by: Stephen Hemminger <>
Signed-off-by: David S. Miller <>
commit 269def7c505b4d229f9ad49bf88543d1e605533e

Bug #84789.


Patch from Konstantin Khorenko <>
[BRIDGE]: adding new device to bridge should enable if up

Port of mainstream patch:

[BRIDGE]: adding new device to bridge should enable if up
Aji Srinivas [Thu, 8 Mar 2007 00:10:53 +0000 (16:10 -0800)]
One change introduced by the workqueue removal patch is that adding an
interface that is up to a bridge which is also up does not ever call
br_stp_enable_port(), leaving the port in DISABLED state until we do
ifconfig down and up or link events occur.

The following patch to the br_add_if function fixes it.
This is a regression introduced in 2.6.21.

Signed-off-by: Stephen Hemminger <>
Signed-off-by: David S. Miller <>

commit de79059ecd7cd650f3788ece978a64586921d1f1

Bug #84789.


Patch from Kirill Korotaev <>
[PATCH] bridge: fix unaligned access to br->bridge_id

bridge_id is an unaligned structure of chars, which MUST be aligned on a 2-byte boundary for compare_ether_addr().

However, when we added

 unsigned char                   via_phys_dev;

field to struct net_bridge we broke this implicit alignment.

So move our field elsewhere in the structure, restoring the alignment of bridge_id.
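How a single char field can shift a char-only structure onto an odd offset is shown in the sketch below. The structures are simplified, hypothetical stand-ins (the `flags` field and the internal layout of the id are assumptions), not the real struct net_bridge.

```c
#include <stddef.h>

/* Stand-in for the bridge identifier: only char members, so the
 * compiler gives it 1-byte alignment. */
struct demo_bridge_id {
	unsigned char prio[2];
	unsigned char addr[6];
};

/* A char field placed right before the id pushes it to an odd offset. */
struct demo_bridge_bad {
	unsigned short flags;		/* hypothetical preceding field */
	unsigned char via_phys_dev;
	struct demo_bridge_id bridge_id;
};

/* Moving the char field after the id restores 2-byte alignment. */
struct demo_bridge_good {
	unsigned short flags;
	struct demo_bridge_id bridge_id;
	unsigned char via_phys_dev;
};
```

offsetof() on the two layouts gives an odd offset for bridge_id in the first and an even one in the second, which is the property compare_ether_addr() relies on.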

Bug #84852.


Patch from Vitaliy Gusev <>
Debug and workaround patch for "division by zero" in sch_cbq module

Debug and workaround patch for "division by zero" in sch_cbq module (in cbq_normalize_quanta() function). For some unknown reason "division by zero" occurs and this patch should help to understand the broken math.

Bug #83243.


Patch from Kirill Korotaev <>
[PATCH] reiserfs: fix key decrementing

This patch fixes a bug in function decrementing a key of stat data item.

Offsets of reiserfs keys are compared as signed values. To set a key offset to the maximal possible value, the maximal signed value has to be used.

This bug is responsible for severe reiserfs filesystem corruption which shows itself as warning vs-13060. reiserfsck fixes this corruption by filesystem tree rebuilding.
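Why the maximal signed value is the right "largest offset" under a signed comparison can be sketched as follows. The 64-bit width and the helper name `cmp_offset_signed` are assumptions made for illustration; this is not the reiserfs key comparison code.

```c
#include <stdint.h>

/* Compare two offsets as signed values.
 * Returns -1, 0 or 1 for a < b, a == b, a > b. */
static int cmp_offset_signed(uint64_t a, uint64_t b)
{
	int64_t sa = (int64_t)a, sb = (int64_t)b;

	if (sa < sb)
		return -1;
	return sa > sb;
}
```

Under this comparison an all-ones value is treated as -1 and sorts below every non-negative offset, whereas INT64_MAX sorts above all of them.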

Signed-off-by: Vladimir Saveliev <>
Cc: <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

X-Git-Tag: v2.6.21-rc7~16
X-Git-Url: 6d205f120547043de663315698dcf5f0eaa31b5c


Patch from Alexey Dobriyan <>
[PATCH] proc: remove pathetic ->deleted WARN_ON

WARN_ON(de && de->deleted); is sooo unreliable. Why?

proc_lookup				remove_proc_entry
===========				=================
[find proc entry]
					[find proc entry]

WARN_ON(de && de->deleted);			...

					if (!atomic_read(&de->count))
						de->deleted = 1;

So, if you have some strange oops [1], and doesn't see this WARN_ON it means nothing.

[1] try_module_get() of module which doesn't exist, two lines below should suffice, or not?

Signed-off-by: Alexey Dobriyan <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

X-Git-Tag: v2.6.22-rc1~756
X-Git-Url: 578c8183c116e623d53b05d4c79762d053c7090f


Patch from Alexey Dobriyan <>

Code implementing ptrace_attach() does ~1/3 of the job of attaching _before_ checking if the attaching process has permission to mess with the target task at all. Given the overall raciness of utrace, such code is a recipe for trouble. Do the ptrace_may_attach() check earlier.

NOTE: Right now

while (1)
	ptrace(PTRACE_ATTACH, pid, NULL, NULL);

reliably (and _quickly_) crashes the kernel if pid is the pid of a process, such as syslogd, that a normal user can't attach to:

Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: [<0000000000000000>]


Patch from Vasily Tarasov <>
[PATCH] netfilter: wrong debug assertion in nat code

Simple compilation fix if NETFILTER_DEBUG is on


Patch from Vasily Tarasov <>
[PATCH] netfilter: skb struct doesn't have nf_debug anymore

nf_debug field is missing in modern kernels, but in some places we still refer to it.

OpenVZ Bug #627.


Patch from Vasily Tarasov <>
[PATCH] venet: lots of spaces in /proc/vz/veinfo output

After introducing IPv6 support for the venet device, the field width for IP addresses in /proc/vz/veinfo was increased from 15 to 39 (commit ddb2b95ff38b528f5def1bd4ae87108bf3fa6b7a).

The output looks a bit ridiculous when a VE owns only IPv4 addresses: too many stray spaces.

This patch corrects it and fixes the bug:

OpenVZ Bug #625.


Patch from Evgeny Kravtsunov <>
[PATCH] Xen: x86_64 OVZ changes

x86_64 Xen OVZ changes according to x86_64 arch changes.


Patch from Andrey Mirkin <>
[PATCH] 4GB split: add KERNEL_DS handling to copy_mount_options()

On i386 arch with 4gb split kernel addresses can be more than TASK_SIZE (e.g. > 0xc0000000). That causes copy_mount_options() to return -EFAULT when called with kernel supplied buffers, i.e. when get_fs() == KERNEL_DS.

Bug #85041.


Patch from Alexandr Andreev <>
[PATCH]: small fix to compile kernel without VCPU support


Patch from Jing Min Zhao <>
[NETFILTER]: nf_conntrack_h323: add checking of out-of-range on choices' index values

Choices' index values may be out of range while still encoded in the fixed length bit-field. This bug may cause access to undefined types (NULL pointers) and thus crashes (Reported by Zhongling Wen).

This patch also adds checking of decode flag when decoding SEQUENCEs.

Signed-off-by: Jing Min Zhao <>
Signed-off-by: Patrick McHardy <>


Patch from Matt Mackall <>
[PATCH] PaX: wakeup threshold limits

If root raised the default wakeup threshold over the size of the output pool, the pool transfer function could overflow the stack with RNG bytes.

(Bug reported by the PaX Team <>)

Cc: Theodore Tso <>
Cc: Willy Tarreau <>
Signed-off-by: Matt Mackall <>
Signed-off-by: Chris Wright <>

drivers/char/random.c |    9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)


Patch from Pavel Emelianov <>
[PATCH] IPC: fix potential user leak

When a user locks an IPC shmem segment with the SHM_LOCK ctl and the segment is already locked, the shmem_lock() function returns 0. After this the subsequent code leaks the existing user struct:

== ipc/shm.c: sys_shmctl() ==
err = shmem_lock(shp->shm_file, 1, user);
if (!err) {
     shp->shm_perm.mode |= SHM_LOCKED;
     shp->mlock_user = user;

Other results of this are:

  1. the new shp->mlock_user is not get-ed and will point to freed memory when the task dies.
  2. the RLIMIT_MEMLOCK is screwed on both user structs.

The exploit looks like this:

  id = shmget(...);
  setresuid(uid, 0, 0);
  shmctl(id, SHM_LOCK, NULL);
  setresuid(uid + 1, 0, 0);
  shmctl(id, SHM_LOCK, NULL);

My solution is to return 0 to userspace without changing the segment's user.
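The idea of the fix can be modelled with a toy example. Everything here (`demo_user`, `demo_seg`, `demo_shm_lock`) is a hypothetical userspace model with a plain reference counter; it is not the code in ipc/shm.c.

```c
#include <stddef.h>

/* Toy model: a user with a reference count. */
struct demo_user {
	int refs;
};

/* Toy model: a segment that remembers who locked it. */
struct demo_seg {
	int locked;
	struct demo_user *mlock_user;
};

/* Lock the segment on behalf of 'user'. If it is already locked,
 * report success and leave the recorded user untouched. */
static int demo_shm_lock(struct demo_seg *seg, struct demo_user *user)
{
	if (seg->locked)
		return 0;
	user->refs++;			/* reference held by the segment */
	seg->mlock_user = user;
	seg->locked = 1;
	return 0;
}
```

Locking the same segment twice as two different users leaves the first user recorded and the second user's counter unchanged, so no reference is lost.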

Bug #78998.


Patch from David Moore <>
[PATCH] swiotlb: add missing phys_to_virt() call

Adds missing call to phys_to_virt() in the lib/swiotlb.c:swiotlb_sync_sg() function. Without this change, a kernel panic will always occur whenever a SWIOTLB bounce buffer from a scatter-gather list gets synced. Affected are especially Intel x86_64 machines with more than about 3 GB RAM.

Signed-off-by: David Moore <>
Signed-off-by: Stefan Richter <>
Signed-off-by: Greg Kroah-Hartman <>


OpenVZ Bug #645.


Patch from Dmitry Monakhov <>
[PATCH] BC: recharge vma if vm_flags changed after ->mmap() call

Several device drivers (sigh... ATI) can change vm_flags in their f_op->mmap method. Because of this, mm->locked_vm may change after f_op->mmap is called. If ->vm_flags has changed, we have to recharge UB memory.


Patch from Dmitry Monakhov <>
[PATCH] BC: aidband - uncharge UB pages before charging to PB

By design we assume that page may be accounted only in UB or only in PB counter.

Unfortunately this is not always true: the ATI driver does some strange things like mmaping pages with PTEs to user space (maybe it is even a security hole in the ATI driver, who knows?)

So the ATI driver exports pages via mmap(2) to userspace which were already accounted in UB (PTE pages are charged to kmemsize). In this case an accounting conflict happens and BUG_ON(head->pb_magic != PB_MAGIC) is triggered.

Solution: uncharge the page from the UB counter and account it in PB.

Changes from v1: add WARN_ON_ONCE according to Pavel's comments.


Patch from Denis Lunev <>
[PATCH] allow kthreads by default in VE (for NFS)

This patch allows kernel threads by default inside VE.


Patch from Evgeny Kravtsunov <>
When creating a socket within a VE, the following combinations are allowed:

  family             |       type                  |   protocol
  PF_UNIX            |                             |
  PF_LOCAL           |                             |
  PF_PACKET          |  Any existing*              |   Any existing
  PF_NETLINK         |                             |
  PF_INET            | SOCK_DGRAM  +   IPPROTO_UDP
                     | SOCK_STREAM +   IPPROTO_TCP
                     | SOCK_RAW    +   Any
                     |                             |   forced to
                     |                             |   IPPROTO_IP
  PF_INET6           | SOCK_DGRAM  +   IPPROTO_UDP
                     | SOCK_STREAM +   IPPROTO_TCP
                     | SOCK_RAW    +   Any
                     |                             |   forced to
                     |                             |   IPPROTO_IP

Here "any existing" means that only SOCK_RAW and SOCK_DGRAM will work: other types will be rejected by the corresponding ->create function (for example, netlink_create). This rejection is fine, as it does not provoke bugs.

Other families (PF_IPX, PF_X25, PF_AX25, PF_ATMPVC, PF_APPLETALK) are not allowed for sockets within VE as they are not virtualized.

The problem is that the function vz_security_proto_check prevents creating sockets with family=PF_INET/PF_INET6, type=SOCK_RAW and protocol=(anything except IP, UDP, TCP, ICMP, RAW), which are valid according to the source.

The patch splits vz_security_proto_check into 2 separate checks: 1) a family check, vz_security_family_check, and 2) a protocol check, vz_security_protocol_check. The first one checks whether the family value is allowed in __sock_create; the second one checks whether the created socket uses a correct (virtualized) protocol. vz_security_protocol_check is placed inside the create functions inet_create and inet6_create. This change allows creating, within a VE, a socket of type SOCK_RAW for any protocol that is not implemented in the kernel and encapsulates its packets into IP packets (for example, the VRRP protocol).

In the rtnetlink_dump_all and rtnetlink_rcv_msg functions, calls to vz_security_proto_check are replaced by calls to vz_security_family_check.

The patch implements a default-deny security policy.
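A default-deny family check of the kind described could look roughly like the sketch below. `demo_family_allowed` is a hypothetical userspace function built from the families named in the text; it is not the body of vz_security_family_check.

```c
#include <sys/socket.h>

/* Default-deny family check: returns 1 if the family is allowed
 * inside a VE, 0 for everything else. Illustrative only. */
static int demo_family_allowed(int family)
{
	switch (family) {
	case PF_UNIX:		/* PF_LOCAL has the same value */
	case PF_PACKET:
	case PF_NETLINK:
	case PF_INET:
	case PF_INET6:
		return 1;
	default:
		return 0;	/* e.g. PF_IPX, PF_X25: not virtualized */
	}
}
```

Listing the allowed families explicitly and rejecting the rest means a newly added, non-virtualized family stays unavailable inside a VE until someone deliberately allows it.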

OpenVZ Bug #611.


Patch from Vitaliy Gusev <>
[PATCH] net: excessive UDP loss on VE send path

When trying to send big UDP packets from a VE, the other side receives about 60% of all fragmented IP packets and about 10% of all UDP packets that were sent from the VE. Fragmented IP packets are dropped on the ethernet interface because the interface's queue is full.

The ethernet interface's queue gets full because the venet/veth device passes a fragmented IP packet on after calling the sk_buff's destructor (via skb_orphan): the socket's buffer becomes free although the IP packet has not yet passed through the ethernet device. As a result, far more IP packets are sent in bulk through the venet/veth interface than the real ethernet interface can transfer.

Decision: the venet/veth interface calls skb_orphan only for non-IP packets. For IP packets skb_orphan (actually the destructor) is called later: in IP local or when the skb is delivered to ethernet and __kfree_skb() is called.

Tested with venet, veth, veth + vlan (host-node).

Thanks to Denis Lunev and Alexey Kuznetsov for ideas and help.


Patch from Denis Lunev <>

This patch ensures that the VE is up and running during RPC connect. This stuff can be run via schedule_work when all tasks are already dead.

OpenVZ Bug #513.


Patch from Kirill Korotaev <>
[PATCH] VE: sys_getpgid/sid should depend on context

sys_getpgid/sid() should return the global pid of a VE task if the info is requested from a VE0 task. Not critical, but let's fix it anyway.

Bug #85662.


Patch from Evgeny Kravtsunov <>

The patch fixes a compilation error: XEN_CPUID is undefined in include/asm-x86_64/mach-xen/asm/msr.h. To define XEN_CPUID on x86_64, the patch makes msr.h include xen/interface/arch-x86_64.h.

Patch prepared by Evgeniy Kravtsunov:
DRBD driver update 8.0.3 -> 8.0.4

The attached patch updates the DRBD version from 8.0.3 to 8.0.4. According to the DRBD changelog, 8.0.4 fixes a set of oopses.

OpenVZ Bug #615.


Patch from Vasily (vvs@):

RHEL5 forgot to apply the last of our megaraid_mbox fixes:

From: Andrey Mirkin <>
Date: Mon, 16 Oct 2006 08:08:43 +0000 (+0400)
Subject: [PATCH] scsi: megaraid_{mm,mbox}: 64-bit DMA capability fix
X-Git-Tag: v2.6.19-rc3~208

[PATCH] scsi: megaraid_{mm,mbox}: 64-bit DMA capability fix

It is known that 2 LSI Logic MegaRAID SATA RAID Controllers (150-4 and
150-6) don't support 64-bit DMA.  Unfortunately currently this check is
wrong and driver sets 64-bit DMA mode for these devices.

Signed-off-by: Andrey Mirkin <>
Acked-by: Vasily Averin <>
Signed-off-by: Linus Torvalds <>


Patch from Dmitry Monakhov (dmonakhov@):
If gfs_blk2rgrpd() fails, bh is leaked on the error path in gfs_shrink().


Patch from Vagin Andrey (avagin@):

Device-Mapper's "delay" target delays reads and/or writes and maps them to different devices.

QA team needs this feature to do certain tests on top of a slow storage: vzabackup, filesystem tests, etc.

Backport from 2.6.22.


Patch from Alexandr Andreev <>
[PATCH] x86-64: do not use virt_to_page on kernel data address

  • virt_to_page() call should be used on kernel linear addresses and not on kernel text and data addresses. Swsusp code uses it on kernel data (statically allocated swsusp_header).
  • Allocate swsusp_header dynamically so that virt_to_page() can be used safely.
  • I am changing this because in next few patches, __pa() on x86_64 will no longer support kernel text and data addresses and hibernation breaks.

Signed-off-by: Vivek Goyal <>
Signed-off-by: Andi Kleen <>


[SWSUSP]: correct virt_to_page() usage in swsusp

Bug #86406.


Patch from Vitaliy Gusev <>
[PATCH] CBQ: fix unfairness when gettimeofday clock source is used

sch_cbq with the gettimeofday clock source has a limit of 2000000 usec for the idle (undertime) time. Therefore, when we try to set a bandwidth below 10000 bits/s, sch_cbq doesn't work (the idle time would need to be about 4000000 usec).

Triggered by RHEL5, which switched from the jiffies clocksource to gettimeofday(). BTW, why? According to ANK this should work poorly, since gettimeofday can take as much as 100 microseconds...

Bug #86375.


Patch from Pavel Emelianov <>
[PATCH] BC: fix several issues in /proc/bc

find /proc/bc doesn't work; several errors are reported:


  1. getdents() sometimes returns EOVERFLOW due to sign expansion in generated entries' inode numbers;
  2. bc and subbc have equal generated inode numbers;
  3. /proc/bc has broken (from find's POV) nlink count.

Fix it all.
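The first issue, sign expansion of a generated inode number, can be illustrated in isolation. How the kernel generated these numbers is not shown here; the helpers below are hypothetical and only demonstrate the general effect of widening a 32-bit value through a signed type.

```c
#include <stdint.h>

/* Widen a 32-bit inode number through a signed type: values with the
 * top bit set get their upper 32 bits filled with ones. */
static uint64_t widen_signed(uint32_t ino)
{
	return (uint64_t)(int64_t)(int32_t)ino;
}

/* Widen without sign extension: the value is preserved. */
static uint64_t widen_unsigned(uint32_t ino)
{
	return (uint64_t)ino;
}

/* Returns 1 if the widened value still fits a 32-bit inode field. */
static int fits_32bit(uint64_t ino)
{
	return ino <= UINT32_MAX;
}
```

A value such as 0x80000001 no longer fits a 32-bit field after signed widening, which is the kind of condition under which a directory read can report EOVERFLOW to a caller with a 32-bit inode type.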


Patch from Vitaliy Gusev <>
[PATCH] net: allow ethtool ops inside VE

This patch allows ethtool operations inside a VE with the CAP_VE_NET_ADMIN capability.


Patch from Vitaliy Gusev <>
[PATCH] venet: compilation warning fix

The label "out" is not used anymore. Fix the warning.


Patch from Denis Lunev <>
[PATCH] initialize ve0.op_sem earlier

ve0->op_sem used to be initialized when the vecalls module was loaded, but nowadays it can be used before vzmon, during NFS initialization...

Bug #86869.


Patch from Alexey Dobriyan <>
[PATCH] ptrace: fix task->mm dereference out of task_lock()

Utrace code removed task_lock() around ->mm checks in ptrace_attach(), but ->mm->vps_dumpable continued to be checked without task_lock().