From OpenVZ Virtuozzo Containers Wiki


  • Rebased on RH 42.0.8-EL
  • Compat fixes for x86_64 (inode numbers and sys_stime)
  • CPT: UBC preservation is fixed, and limits are now increased during restore
  • UBC othersockbuf over-optimization fix
  • Updated 3ware driver up to version
  • NFS in VE fix
  • No more panic on oops
  • sys_waitid virtualization fixes
  • Lots of other small fixes and debug patches

Config changes

Same as 023stab040.1, plus:




Patch from Andrey (saw@), modified by Evgeniy:
Fix for over-optimization of OTHERSOCKBUF accounting.

For those sockets there is no protection by the socket lock.

The bug was provoked by the optimization of charging/uncharging othersockbufs: diff-ubc-tcpsndopt-20060429

In brief, the idea is as follows: the optimization assumes that a socket is always locked by lock_sock and so is protected from simultaneous use by more than one user. That assumption is wrong for datagram sockets (for example PF_UNIX ones), which are not locked in the majority of cases. This provokes a race condition between two users of the same dgram socket. TCP sockets are always locked (or can be made so), which prevents such races.

Bug #70974.
Bug #74089.


Patch from Andrey:
Errors can occur during rst_swapoff() and sys_swapoff().

On -EINVAL no cleanup is needed. In all other cases cleanup must be performed.

Move the cleanup into a separate function and run it in a loop until it succeeds or returns -EINVAL. Clear the TIF_SIGPENDING flag if signals are pending, to make sure that sys_swapoff() won't be interrupted, and restore the flag on exit if it was cleared.

Bug #74725.
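
The retry rule described above (repeat the cleanup until it succeeds or reports -EINVAL) can be sketched in user-space C. This is only an illustration: fake_swapoff() and struct fake_swap are hypothetical stand-ins, not the kernel functions, and the signal-flag handling is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the real swapoff call: it fails with -EINTR
 * a given number of times, then returns final_result. */
struct fake_swap {
    int transient_failures;   /* how many times to fail with -EINTR */
    int final_result;         /* 0 or a negative errno value */
    int calls;                /* number of attempts made so far */
};

static int fake_swapoff(struct fake_swap *s)
{
    s->calls++;
    if (s->transient_failures > 0) {
        s->transient_failures--;
        return -EINTR;
    }
    return s->final_result;
}

/* Cleanup helper: keep retrying until success or -EINVAL.
 * -EINVAL means there is nothing to clean up, so we stop as well. */
static int swapoff_cleanup(struct fake_swap *s)
{
    int err;

    assert(s != NULL);
    do {
        err = fake_swapoff(s);
    } while (err != 0 && err != -EINVAL);
    return err;
}
```

With three transient failures followed by success, swapoff_cleanup() makes four attempts; when the first attempt returns -EINVAL it stops immediately.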


Patch from Andrey:

During restore we can exceed UBC limits, because the restore process itself uses more resources.

Bug #71159.


Patch from Andrey:
Change ubc image format to remove magic numbers like 6 and 12.


Patch from Andrey:

Change the order of ubc parameters in the image file. Resource pairs (ub_parms and ub_store) are now stored as one unit.

Previous format was:
KMEMSIZE parms, LOCKEDPAGES parms, ..., KMEMSIZE store, LOCKEDPAGES store, ...

With the new format it is simpler to increase the number of ubc resources.
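
Why the paired layout makes growth simpler can be seen from the slot arithmetic. The sketch below is an assumption about how such layouts are typically indexed, not the actual image code: in the old layout the position of a store entry depends on the total resource count, while in a paired layout it depends only on the resource index. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Old layout: all parms first, then all stores.
 * The store slot moves whenever nr_res changes. */
static size_t old_parms_slot(size_t res, size_t nr_res)
{
    assert(res < nr_res);
    return res;
}

static size_t old_store_slot(size_t res, size_t nr_res)
{
    assert(res < nr_res);
    return nr_res + res;
}

/* Paired layout (assumed): parms and store of one resource are adjacent.
 * Slots do not depend on the total number of resources. */
static size_t new_parms_slot(size_t res)
{
    return 2 * res;
}

static size_t new_store_slot(size_t res)
{
    return 2 * res + 1;
}
```

Adding a resource shifts every old_store_slot() by one, so older images no longer line up, whereas the paired slots of existing resources stay where they were.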


Patch from Kirill:

The original patch which was used in OVZ/VZ was diff-ms-fs-preparewrite-eh-20061005.

Unfortunately, it was broken by RH when committed to the RHEL4 update (linux-2.6.13-buffer.patch). __block_prepare_write() error handling is done incorrectly: IO initiated on some of the buffers must be waited for until it completes (wait_on_buffer).

Fix it with this incremental patch, which makes the VZ code the same as it has been for a long time.


Patch from Andrey:

1. Index of lazy page was checked incorrectly:

-               if (page_nr > PAGE_SIZE/sizeof(struct pgin_desc*)) {
+               if (page_nr >= PGINDIR_SIZE/sizeof(struct pagein_desc*)) {

so we could try to access outside of array boundaries and oops.

Bug #74455.
Bug #75539.

2. Current lazy migration is limited to 512MB on x86-64. Increase the table size to be able to store up to 2097152 lazy pages (8 GB).
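
Both points can be illustrated with a small user-space sketch. TABLE_BYTES and the helper names are hypothetical; the sketch only shows the off-by-one in the comparison and the page arithmetic, assuming 4096-byte pages as on x86-64.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_BYTES 4096u                       /* hypothetical table size */
#define TABLE_SLOTS (TABLE_BYTES / sizeof(void *))

/* Buggy form: idx == TABLE_SLOTS passes the check although it is
 * one past the last valid slot. */
static int index_ok_buggy(size_t idx)
{
    return !(idx > TABLE_SLOTS);
}

/* Fixed form: valid indexes are 0 .. TABLE_SLOTS - 1. */
static int index_ok_fixed(size_t idx)
{
    return !(idx >= TABLE_SLOTS);
}

/* Memory covered by a number of lazy pages of the given size. */
static unsigned long long lazy_bytes(unsigned long long pages,
                                     unsigned long long page_size)
{
    assert(page_size > 0);
    return pages * page_size;
}
```

With 4096-byte pages, 2097152 pages cover 2^21 * 2^12 = 2^33 bytes, which is the 8 GB quoted above.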


Patch from Vasily:
Fixes kmap PTE0 leakage: pte_unmap() was missed on the error path in install_page().

Bug #75560.


Patch from Kirill:

During the port of the 4GB split to 2.6.18 it was found that the 2.6.9 kernel incorrectly inserts a not yet initialized pgd into pgd_list. This is wrong; initialize it first.


Patch from Andrey:

The pfn index was checked incorrectly during lookup/alloc, so we could step outside the array boundaries and oops.

Related to the same lazy migration bugs:
Bug #74455.
Bug #75539.


Patch from Pavel, found by Vasiliy:

When porting to the new mm locking, one unmap+unlock was lost. Found while investigating (but does not fix):
Bug #75448.


Patch from Andrey:
UBC values were saved and restored incorrectly:

for (i = 0; i < UB_RESOURCES; i++)
	dump_one_bc_parm(v->cpt_parms, bc->ub_parms, 0);

Only KMEMSIZE values were saved and restored in this case.

1. Do not restore UBC if we get an image of a previous version.

2. cpt_parms has space for 32x2 resources; however, only the first UB_RESOURCES * 2 entries are used, contiguously, i.e. not as 24 of the first 32 plus 24 of the second 32. Keep this layout for compatibility.
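
The loop quoted above passes the same base pointers on every iteration, which is why only the first resource (KMEMSIZE) was written. A minimal user-space sketch of the difference follows; struct parm, dump_one() and the two-values-per-resource layout are hypothetical simplifications, not the real dump_one_bc_parm() interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified resource record. */
struct parm {
    unsigned long barrier;
    unsigned long limit;
};

/* Write one resource (two values) at dst. */
static void dump_one(unsigned long *dst, const struct parm *src)
{
    dst[0] = src->barrier;
    dst[1] = src->limit;
}

/* Buggy pattern: the loop index is never used, so resource 0 is
 * dumped n times into the same place. */
static void dump_buggy(unsigned long *dst, const struct parm *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dump_one(dst, src);
}

/* Fixed pattern: advance both source and destination with i. */
static void dump_fixed(unsigned long *dst, const struct parm *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dump_one(dst + 2 * i, src + i);
}
```

After dump_buggy() only the first two output slots are filled; dump_fixed() fills all 2 * n slots.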


Patch from Andrey:
Add check for image version.

Allow restoring only images from the 2.6.9 kernel. Only the following combinations are actually allowed now:
2.6.8/2.6.9 <-> 2.6.9+plus patch
2.6.9 -> 2.6.16+ (this patch disables this combination)


patch from Dmitry (dmonakhov@):
Backported from mainstream v2.6.13
[PATCH] ext3: drop quota references before releasing inode

We must drop references to quota structures before releasing the inode.

Signed-off-by: Jan Kara <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

commit ab6862e6dab813ecde9ae7da506188dc1e9f11bb


patch from Dmitry (dmonakhov@):
Backported from mainstream v2.6.13
[PATCH] ext2: drop quota reference before releasing inode

We must drop references to quota structures before releasing the inode.

Signed-off-by: Jan Kara <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

commit c7e9a52ef0089492bba457dfb8eba1a54e19f24a


Patch from Evgeny:

1. When DISK_QUOTA is switched off in /etc/vz/vz.config, sim_statfs takes kstatfs from the underlying fs. reiserfs does not initialize f_ffree (free inodes) and f_files in kstatfs, so we need to zero out the kstatfs structure before asking reiserfs.

2. reiserfs used to initialize f_ffree to -1 (in 2.4.x). It was an exception among filesystems, which could be used to determine that the fs is reiserfs. In 2.6.x f_ffree is not initialized by reiserfs at all, so reiserfs has to be distinguished another way. Use fsmagic.

OpenVZ Bug #199.
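
A user-space sketch of the two steps (zero the structure first, then identify the filesystem by its magic number) might look as follows. struct fake_kstatfs and fake_reiserfs_statfs() are hypothetical stand-ins for the kernel structures; the magic value is REISERFS_SUPER_MAGIC as defined in the Linux headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REISERFS_SUPER_MAGIC 0x52654973L   /* value from the Linux headers */

/* Hypothetical cut-down version of kstatfs. */
struct fake_kstatfs {
    long f_type;
    long f_files;
    long f_ffree;
};

/* Hypothetical lower-fs callback that, like reiserfs in 2.6.x as
 * described above, leaves f_files and f_ffree untouched. */
static void fake_reiserfs_statfs(struct fake_kstatfs *buf)
{
    buf->f_type = REISERFS_SUPER_MAGIC;
}

/* Identify reiserfs by magic instead of by f_ffree == -1. */
static int is_reiserfs(const struct fake_kstatfs *buf)
{
    return buf->f_type == REISERFS_SUPER_MAGIC;
}

/* Zero the buffer before asking the lower fs, so fields it does not
 * fill in read as 0 rather than stale garbage. */
static void query_lower_fs(struct fake_kstatfs *buf)
{
    assert(buf != NULL);
    memset(buf, 0, sizeof(*buf));
    fake_reiserfs_statfs(buf);
}
```

Even if the caller's buffer held stale values, f_files and f_ffree read as 0 after query_lower_fs(), and is_reiserfs() no longer depends on them.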


Patch from Alexey Dobriyan:

Proposed patch to fix #5 in

To reproduce, do

  • grab poc at the end of advisory.
  • add line "eph.p_memsz = 4096;" after "eph.p_filesz = 4096;"
    where first "4096" is something equal to or greater than 4096.
  • ./poc /usr/bin/sudo && ls -l

Here I get:

-rw------- 1 ad   ad   102400 2007-01-15 19:17 core
---s--x--x 2 root root 101820 2007-01-15 19:15 /usr/bin/sudo

Check for MAY_READ as binfmt_misc.c does.


Patch from Vasily:

Fixed initialization of the sysctl_vsyscall variable: currently vsyscall_init always overwrites the value zeroed in time_init_gtod().

Bug #73353.


Patch from Alexey:
[CPT] sigsuspend could hang forever after restore

Do not restart syscalls with TIF_RESTORE_SIGMASK in cpt.

It was a severe bug. First, we do not need to restart such syscalls: they are restarted by the core on exit from the syscall. Second, it was wrong to restart the syscall without clearing TIF_RESTORE_SIGMASK and restoring the mask. If a signal arrives at this point, it will be delivered, but the syscall is restarted and sigsuspend() will not exit, hanging forever.

To knowledge base: restart without checking for TIF_RESTORE_SIGMASK is not allowed.


Patch from Alexandr Andreev:
32bit compat_sys_stime is missing on x86_64

OpenVZ Bug #438.


Patch from mainstream:
[PATCH 1/3] make static counters in new_inode and iunique be 32 bits

From: Jeff Layton <>


When a 32-bit program that was not compiled with large file offsets does a stat and gets a st_ino value back that won't fit in the 32 bit field, glibc (correctly) generates an EOVERFLOW error. We can't do anything about fs's with larger permanent inode numbers, but when we generate them on the fly, we ought to try and have them fit within a 32 bit field.

This patch takes the first step toward this by making the static counters in these two functions be 32 bits.

Signed-off-by: Jeff Layton <>
Acked-By: Kirill Korotaev <>
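
The effect of the counter width can be shown with a tiny user-space sketch. The helper names are hypothetical and this is not the kernel code; it only shows that a wide counter eventually produces values that do not fit a 32-bit st_ino, while a 32-bit counter wraps around instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Would this inode number fit in a 32-bit st_ino field? */
static int fits_32bit_ino(uint64_t ino)
{
    return ino <= UINT32_MAX;
}

/* Wide counter: keeps growing past the 32-bit range. */
static uint64_t next_ino_wide(uint64_t *counter)
{
    assert(counter != NULL);
    return ++*counter;
}

/* 32-bit counter: wraps around, so every value fits in 32 bits. */
static uint32_t next_ino_32(uint32_t *counter)
{
    assert(counter != NULL);
    return ++*counter;
}
```

Note that wrapping means numbers can eventually repeat; how the kernel deals with that is outside this sketch.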


Patch from Vasily Tarasov (vtaras@), improves CFQ IO scheduler:

"CFQ in the 2.6.9 kernel creates a request queue for each process and performs a round robin procedure over these queues, selecting one request from each queue. This patch adds a time slice for each per-process queue: during the time slice, only requests from that queue are serviced. Such a mechanism is used in CFQ 2.6.18."

Bug #71929.
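
The difference between plain round robin and sliced service can be sketched with a toy dispatcher. This is a simplification under stated assumptions: the slice here is a request count rather than a time interval, and dispatch() is a hypothetical helper, not scheduler code.

```c
#include <assert.h>
#include <stddef.h>

/* Serve requests from nq queues. pending[q] is the number of outstanding
 * requests in queue q. slice is the maximum number of requests served
 * from one queue before moving to the next (1 = plain round robin).
 * The queue index of each served request is written to order[];
 * at most cap entries are written. Returns the number served. */
static size_t dispatch(unsigned pending[], size_t nq, unsigned slice,
                       size_t order[], size_t cap)
{
    size_t n = 0;
    int progress = 1;

    assert(slice > 0);
    while (progress) {
        progress = 0;
        for (size_t q = 0; q < nq; q++) {
            for (unsigned s = 0; s < slice && pending[q] > 0 && n < cap; s++) {
                pending[q]--;
                order[n++] = q;
                progress = 1;
            }
        }
    }
    return n;
}
```

With two queues holding two requests each, slice 1 serves them in the order 0, 1, 0, 1, while slice 2 serves 0, 0, 1, 1, so each queue gets a run of consecutive requests.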


Patch from Vasiliy:
3ware driver update to


Patch from Evgeny:

The svc_recvfrom function (net/sunrpc/svcsock.c) switches the context to ve0 and never returns to the VE context. This may cause an oops when the VE private area is placed on an NFS partition.

Bug #76354.


Patch from mainstream:
[PATCH] buffer: memorder fix

unlock_buffer(), like unlock_page(), must not clear the lock without ensuring that the critical section is closed.

Mingming later sent the same patch, saying:

We are running SDET benchmark and saw double free issue for ext3 extended attributes block, which complains the same xattr block already being freed (in ext3_xattr_release_block()). The problem could also been triggered by multiple threads loop untar/rm a kernel tree.

The race is caused by missing a memory barrier at unlock_buffer() before the lock bit being cleared, resulting in possible concurrent h_refcounter update. That causes a reference counter leak, then later leads to the double free that we have seen.

Inside unlock_buffer(), there is a memory barrier is placed *after* the lock bit is being cleared, however, there is no memory barrier *before* the bit is cleared. On some arch the h_refcount update instruction and the clear bit instruction could be reordered, thus leave the critical section re-entered.

The race is like this: For example, if the h_refcount is initialized as 1,

cpu 0:                                   cpu1
--------------------------------------   -----------------------------------
lock_buffer() /* test_and_set_bit */
                                         lock_buffer() /* test_and_set_bit */
h_refcount = h_refcount+1; /* = 2*/      h_refcount = h_refcount + 1; /*= 2 */
....                                     ......

We lost a h_refcount here. We need a memory barrier before the buffer head lock bit being cleared to force the order of the two writes. Please apply.

Signed-off-by: Nick Piggin <>
Signed-off-by: Mingming Cao <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 72ed3d035855841ad611ee48b20909e9619d4a79
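
The ordering requirement described in this commit message (writes inside the critical section must be visible before the lock bit is seen as clear) can be expressed in user space with C11 atomics. This is an analogue, not the kernel primitive used by the patch: struct buf and its helpers are hypothetical, and release ordering on the unlock plays the role of the barrier before the bit is cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical buffer with a lock bit and a reference counter. */
struct buf {
    atomic_flag locked;
    int refcount;
};

/* Acquire: spin until the flag was previously clear. */
static void buf_lock(struct buf *b)
{
    while (atomic_flag_test_and_set_explicit(&b->locked,
                                             memory_order_acquire))
        ;   /* busy-wait */
}

/* Release: all writes made while holding the lock are ordered before
 * the flag is cleared, so the next owner sees the updated refcount. */
static void buf_unlock(struct buf *b)
{
    atomic_flag_clear_explicit(&b->locked, memory_order_release);
}

/* Take a reference under the lock and return the new count. */
static int buf_get(struct buf *b)
{
    int r;

    assert(b != NULL);
    buf_lock(b);
    r = ++b->refcount;
    buf_unlock(b);
    return r;
}
```

If the unlock used relaxed ordering instead, the refcount update could be reordered past the clear on some architectures, which is the lost-update scenario shown in the table above.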


Patch from Evgeny Kravtsunov <>:
[SIMFS] get lower vfsmount on simfs mount

This prevents lower FS from being umounted while simfs is mounted.

OpenVZ Bug #451.

GIT: 63f1ecae912ee9614bcad23cc147ca5557f8b547


Patch from mainstream:
[PATCH] return ENOENT from ext3_link when racing with unlink

Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is 0. Doing otherwise has the potential to corrupt the orphan inode list, because we'd wind up with an inode with a non-zero link count on the list, and it will never get properly cleaned up & removed from the orphan list before it is freed.

[ build fix]
Signed-off-by: Eric Sandeen <>
Cc: <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 2988a7740dc0dd9a0cb56576e8fe1d777dff0db3

Bug #74302.
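
The check described above can be sketched in a few lines of user-space C. struct fake_inode and fake_link() are hypothetical; the point is only that a link count of zero means the inode has already been unlinked, so linking must fail rather than resurrect it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cut-down inode. */
struct fake_inode {
    unsigned int nlink;
};

/* Refuse to link an inode whose link count already dropped to zero
 * (we lost a race with unlink); otherwise take one more link. */
static int fake_link(struct fake_inode *inode)
{
    assert(inode != NULL);
    if (inode->nlink == 0)
        return -ENOENT;
    inode->nlink++;
    return 0;
}
```

An inode with nlink == 0 stays at zero and the caller gets -ENOENT, so it never reappears with a non-zero count while still on the orphan list.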


Patch from Denis:
This patch changes the default per-UB TW bucket limits.

OpenVZ Bug #460.


Patch from Kirill:
Print some debug info instead of BUG in pb_add_list_ref().

Actually there is a bug in the copy_page_range logic: if the page was reserved, it is never tied to a UB with PBC. Good. However, if the page is unreserved later, the next copy_page_range will blindly assume that it should have been tied already(!), and can be disappointed to find that it is not.

The bad thing is that packet_mmap() from net/packet/af_packet.c maps exactly such pages...

but I don't see message:
printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped));


Patch from Kirill:
Debug for valid_swaphandles() oops from Strato.

Check for correct swp_entry and print spinlock magic when doing BUG()


Patch from Alexey Kuznetsov:
Forgotten bits of pid virtualization in sys_wait*


Patch from Kirill:

RH has changed the default behaviour of the kernel: now it panics on oops :/ Change it back so that the kernel continues after an oops.


patch ported by Kostja (khorenko@):
Areca driver v1.20.0X.13-61107 added.

Sources from Areca site:

Bug #59933.