


  • Fixes/improvements in checkpointing, NFS in VE, IOPRIO, CPU scheduler
  • NMI watchdog is now disabled by default for i686 kernels.
  • Attansic L1 Gigabit Ethernet driver added.

Config changes


  • -CONFIG_NMI_WATCHDOG=y (i686 only)


  • +CONFIG_ATL1=m



Patch from Alexey Kuznetsov <>:
[CPT] bug in restore net routes

When the netroute section in the dump is padded, restore tries to interpret the padding as the next rtnetlink message and deadlocks, treating it as a message of zero length.
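A minimal user-space sketch of the guard this kind of parser needs, not the actual CPT code. The struct msg_hdr type, the MSG_ALIGN macro and count_messages() are hypothetical names; the only assumption borrowed from netlink is that each message starts with its own total length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a netlink-style header: the first field is the
 * total message length, header included. */
struct msg_hdr {
    uint32_t len;
    uint16_t type;
    uint16_t flags;
};

#define MSG_ALIGN(n) (((n) + 3u) & ~3u)

/* Walk a buffer of back-to-back messages and return how many were seen.
 * Trailing padding (zero bytes) ends the walk instead of being parsed as a
 * zero-length message. */
int count_messages(const unsigned char *buf, size_t size)
{
    size_t off = 0;
    int count = 0;

    while (size - off >= sizeof(struct msg_hdr)) {
        struct msg_hdr hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        /* A length shorter than the header, or one that runs past the
         * buffer, can never make progress: stop instead of looping on it. */
        if (hdr.len < sizeof(hdr) || hdr.len > size - off)
            break;
        count++;
        off += MSG_ALIGN(hdr.len);
        if (off > size)
            break;
    }
    return count;
}
```

Without the length check, a header read from zeroed padding would advance the offset by 0 and the loop would never terminate.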


Patch from Alexey Dobriyan <>:
[PATCH] Fix unlocked access to task list from /proc/pid/oom_score

Failing code was prefetch hidden in

list_for_each_entry(child, &p->children, sibling) {

in badness(). badness() is reachable from two points. One is proc_oom_score, another is out_of_memory() => oom_select_bad_process() => badness().

Second path grabs tasklist_lock, while first doesn't.


Patch from Alexey Dobriyan <>:
[IOPRIO] dereference after free

save queue pointer in order not to dereference freed cfq_bc structure.


Patch from Denis Lunev <>:
Removes warning about special pids (from NFS kernel thread spawning).

OpenVZ Bug #470.
Bug #77832.


Patch from Alexey Kuznetsov <>:
Replaced with version from Roland McGrath


Patch from Andrey Mirkin <>:
[CPT] Fix IPv6 addresses restore

All MAC-based IPv6 addresses are created with a valid lifetime of 0. We checkpoint them and try to restore them, but fail because inet6_addr_add() returns -EINVAL if valid_lft is zero.

We can use the ifaddr flags to find correct values for the preferred and valid lifetimes.

The kernel automatically creates a local IPv6 address based on the MAC address when an interface is brought up. This address can be removed manually. So, to be sure that the VE has exactly the same set of addresses after restore, we should remove all IPs and then add all IPs from the dump.


Patch from Andrey Mirkin <>:
[CPT] unlimit dcachesize on restore

We recently started adjusting 3 limits on restore so that it does not fail by hitting them. One more limit needs the same treatment: dcachesize.

Bug #77889.
Bug #77890.
Bug #77896.


Patch from Alexandr Andreev <>:
[SCHED] Improve vcpu scheduling taking into account cache hotness

In the original VZ kernel, schedule_vcpu() takes the next VCPU from the vsched->active list without taking vcpu->last_pcpu into account, so VCPUs can jump from PCPU to PCPU too often.

Try to skip 'hot' VCPUs, i.e. VCPUs that were recently running on some other PCPU. The time slice threshold is tunable via /proc/sys/kernel/vcpu_hot_timeslice
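The selection rule can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: struct vcpu, its last_ran field and pick_vcpu() are hypothetical, and the fallback to the head of the list when every candidate is hot is a guess at reasonable behaviour.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a virtual CPU on the active list. */
struct vcpu {
    int id;
    int last_pcpu;            /* physical CPU it last ran on */
    unsigned long last_ran;   /* tick at which it last stopped running */
};

/* Return the index of the first VCPU that either last ran on this_pcpu or
 * stopped running at least hot_timeslice ticks ago. If every candidate is
 * still hot on another PCPU, fall back to the head of the list (assumption)
 * so the physical CPU does not sit idle. Returns -1 for an empty list. */
int pick_vcpu(const struct vcpu *active, size_t n, int this_pcpu,
              unsigned long now, unsigned long hot_timeslice)
{
    size_t i;

    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        int hot_elsewhere = active[i].last_pcpu != this_pcpu &&
                            now - active[i].last_ran < hot_timeslice;
        if (!hot_elsewhere)
            return (int)i;
    }
    return 0;
}
```

A larger threshold keeps VCPUs on their last PCPU for longer, at the price of passing over more runnable candidates.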


Patch from Alexandr Andreev <>:
[SCHED] Improve idle load balance

Idle balance is called from an idle thread on rebalance_tick(). load_balance() tries to find the busiest group in idle_vsched, where no tasks are actually running.

With this patch, load_balance() first tries to find the busiest vsched and, if it succeeds, then looks for the busiest group inside that vsched, and so on...


Patch from Kirill Korotaev <>:
[PATCH] Compilation fix for idlebalance

Compilation fix for diff-fairsched-idlebalance-20070328


Patch from Alexey Dobriyan <>:
[PATCH] mainstream: fix sys_accept() error path

  • d_alloc() in sock_attach_fd() fails leaving ->f_dentry NULL
  • bail out to out_fd label, which does fput()/__fput() on new file
  • but __fput() assumes valid ->f_dentry

Bug #77930.


Patch from Dmitry Monakhov <> from mainstream:
[EXT3] "ext[34]: EA block reference count racing fix" performance fix

From: Andrew Morton <>

A little mistake in 8a2bfdcbfa441d8b0e5cb9c9a7f45f77f80da465 is making all transactions synchronous, which reduces ext3 performance to comical levels.

Cc: Mingming Cao <>
Signed-off-by: Andrew Morton <>


Patch from Kirill Korotaev <>:
[NMI] set default NMI watchdog timeout to 30 secs

Increase default NMI watchdog timeout to 30 seconds as it was in 2.6.9


Patch from Vasily Tarasov <>:
[IOPRIO] Call bc_findcreate_cfq_bc() out of q->queue_lock

Otherwise we may cause GFP_KERNEL allocation to happen with a spinlock held.

Bug #78000.
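The rule behind this [IOPRIO] fix (do anything that may sleep before taking the spinlock) can be sketched in user space as below. This is a generic pattern, not the patch itself: struct entry, find_or_create() and the toy spinlock built on C11 atomic_flag are all hypothetical stand-ins, with malloc() playing the role of a GFP_KERNEL allocation.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Toy spinlock standing in for q->queue_lock. */
static atomic_flag queue_lock = ATOMIC_FLAG_INIT;

static void queue_lock_acquire(void)
{
    while (atomic_flag_test_and_set(&queue_lock))
        ;   /* spin */
}

static void queue_lock_release(void)
{
    atomic_flag_clear(&queue_lock);
}

struct entry {
    int key;
    struct entry *next;
};

static struct entry *head;   /* protected by queue_lock */

static struct entry *lookup_locked(int key)
{
    struct entry *e;

    for (e = head; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

/* Find the entry for key, creating it if needed. The allocation, which may
 * block, happens before the lock is taken. */
struct entry *find_or_create(int key)
{
    struct entry *fresh = malloc(sizeof(*fresh));   /* no lock held here */
    struct entry *e;

    queue_lock_acquire();
    e = lookup_locked(key);
    if (!e && fresh) {
        fresh->key = key;
        fresh->next = head;
        head = fresh;
        e = fresh;
        fresh = NULL;
    }
    queue_lock_release();

    free(fresh);   /* unused if the entry already existed; free(NULL) is a no-op */
    return e;
}
```

The cost of this ordering is an occasional wasted allocation when the entry already exists; in exchange, nothing that can sleep runs while the lock is held.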


Patch from Pavel Emelianov <>:
[IOPRIO] Call bc_findcreate_cfq_bc() out of q->queue_lock (fix 2)

Follow-up to the fix that calls bc_findcreate_cfq_bc() out of q->queue_lock: iopriv should be initialized in both cases.


Patch from Pavel Emelianov <>:
[LOCKDEP] Another fix for virtualized filesystems lockdep

As described before, filesystems in our kernels are no longer static objects and thus lockdep refuses to work. This was (wrongly) fixed by setting one static class for all super block's semaphores and locks.

It turned out that different filesystems use different lock ordering for sb locks and some other ones, e.g. UDF may take inode->i_mutex under sb->s_lock, while ext3 takes sb->s_lock under inode->i_mutex. This is normal and doesn't create any deadlocks since the super blocks are different. But lockdep detects a circular dependency in this case, because all super blocks look the same to it.

This is solved by setting a class from filesystem type on super block like it was before, but for virtualized filesystems (e.g. procfs, devpts) the fs template is used.

Bug #78110.


Patch from Denis Lunev <>:
[NFS] fix lockd context when bind mounted from VE0 to VE

This patch fixes NFS locking support over partitions bind mounted to VE from VE0.


Patch from Konstantin Khorenko <>:
[PROC] mainstream: race between proc_lookup() and sys_delete_module()

Fix for the race between proc_lookup() and sys_delete_module(): proc_lookup() can find a PDE under proc_subdir_lock; on a second CPU sys_delete_module() then removes the PDE and the module, after which the first CPU tries to get the de and the module in proc_get_inode()... Boom...

Bug #77841.


Patch from Alexandr Andreev <>:
[VESTATS] use jiffies instead of cycles for mm stats

Use jiffies instead of cycles for mm stats about page allocation latency.

This implementation is very simple but not strictly accurate: we can account 10 000 000 (or more) cycles (about 1 jiffy) even if the actual allocation took < 10 000 cycles, if the jiffy counter happened to change during the allocation.
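The inaccuracy is easy to see with a little arithmetic. The sketch below assumes the figure quoted above (about 10 000 000 cycles per jiffy) and a jiffy counter that starts at 0; jiffies_at() and measured_cycles() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

#define CYCLES_PER_JIFFY 10000000ULL   /* figure taken from the text above */

/* Value of the jiffy counter at a given cycle count. */
uint64_t jiffies_at(uint64_t cycles)
{
    return cycles / CYCLES_PER_JIFFY;
}

/* Latency as a jiffies-based measurement would report it, converted back to
 * cycles for comparison with the real duration (end - start). */
uint64_t measured_cycles(uint64_t start, uint64_t end)
{
    return (jiffies_at(end) - jiffies_at(start)) * CYCLES_PER_JIFFY;
}
```

An allocation running from cycle 9 999 000 to cycle 10 004 000 really takes 5 000 cycles but is accounted as a full jiffy, while the same 5 000 cycles spent entirely inside one jiffy are accounted as 0.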


Patch from Evgeniy Kravtsunov <>:
[IOPRIO] Fix cfqq index calculation in async case

The ioprio field of task_struct consists of two numbers:

1) value of class (bits 14-16),
2) value of data (bits 0-13).
The value of data must lie in the range [0, 7].

In the current implementation of cfq_set_request, tsk->ioprio is used as an index into the *async_cfqq[8] array.

This is wrong because tsk->ioprio can be much greater than 8.

This can cause either memory corruption or reading a wrong value:

               if (!cfq_bc->async_cfqq[tsk->ioprio]) {
                       cfqq = cfq_get_queue(cfqd, key, tsk, gfp_mask);
                       if (!cfqq)
                               goto queue_fail;

                       cfq_bc->async_cfqq[tsk->ioprio] = cfqq;  <<<corruption
               } else
                       cfqq = cfq_bc->async_cfqq[tsk->ioprio]; <<<wrong value

Correct index should be calculated from tsk->ioprio by using corresponding functions and macros. Patch contains necessary updates.

Bug #78213.
probably fixes OpenVZ Bug #496.
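One way to derive a safe array index from a packed ioprio value is sketched below. The macros are local copies modelled on mainline include/linux/ioprio.h; the shift of 13 is an assumption that should be checked against the tree in use, and async_index() is a hypothetical helper, not the function the patch adds.

```c
#include <stddef.h>

/* Local copies of the ioprio helpers; the shift value is an assumption. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_MASK   ((1u << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(v)    ((v) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(v)     ((v) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))

#define ASYNC_QUEUES 8

/* Map a packed ioprio value to a slot in an 8-entry array, or return -1 if
 * the data part is out of range. */
int async_index(unsigned int ioprio)
{
    unsigned int data = IOPRIO_PRIO_DATA(ioprio);

    return data < ASYNC_QUEUES ? (int)data : -1;
}
```

With these definitions, class 2 and data 4 pack into the value 16388, which is far outside an 8-entry array when used directly; async_index() maps it to slot 4.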


Patch prepared by Roman (rchechnev@):
atl1 driver ver. was ported to the VZ kernel

This driver supports Attansic L1 gigabit ethernet cards. Sources were taken from:


Patch from Alexandr Andreev <>:
[SCHED] small cleanup of code

Remove unnecessary argument this_pcpu (=== smp_processor_id()) from find_idle_target() and find_busiest_vsched()


Patch from Alexandr Andreev <>:
[SCHED] remove debug hunk from previous balance patch

My previous patch for load_balance() contained a wrong condition that I forgot to remove after debugging.

In 028stab025.1, load_balance() will not pull tasks from the busiest VCPUs if there are < 2 tasks running on the current VCPU. The attached patch removes this incorrect check and fixes the problem.


Patch from Alexandr Andreev <>:
[SCHED] find_busiest_queue() should select VCPUs from given vsched only

In the new scheme, we choose the vsched in find_busiest_vsched(), i.e. before find_busiest_queue(), so when we look for the busiest queue we must consider only this vsched's VCPUs.

Bug #78385.
and maybe this:
Bug #78383.


Patch from Alexandr Andreev <>:
[SCHED] Cleanup: use vcpu_last_pcpu macro instead of vcpu->last_pcpu

Replace vcpu->last_pcpu by vcpu_last_pcpu(vcpu), to fix compilation without CONFIG_VSCHED_VCPU


Patch from Alexey Kuznetsov <>:
[IA64] strace -f does not work with utrace

The patch is submitted to with the following note:

ptrace implements -f flag catching clone() syscall and adjusting clone flags to set CLONE_PTRACE. utrace patch breaks this.

Older ptrace used to simulate peek/poke to top of user RBS, so that from user viewpoint registers stored in kernel RBS looked like registers stored in user RBS.

utrace patch tried to improve this (to be honest, it does not look as an improvement, but apparently author of those changes knows this better). It forces _real_ writeback of kernel RBS to user space (why?). The bug is that it never reads those registers back, so that all the changes to this area of user RBS are lost.

One variant of fix is enclosed. Not quite self-consistent, because the result of PTRACE_POKEDATA is never dumped back to real userspace. But at least it works.


Patch from Vasily Tarasov <>:
[IOPRIO] elevator switch oops fix

When an elevator switch happens and UBs persist, an async cfqq can be put a second time because of a non-NULL value left in the array.

OpenVZ Bug #526.
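The double put described in this entry is the classic stale-pointer problem. A minimal sketch of one way to avoid it, clearing each slot as it is put, follows; struct queue, put_queue() and put_async_queues() are hypothetical names, and the source does not say that the actual patch works this way.

```c
#include <stddef.h>

#define ASYNC_QUEUES 8

/* Hypothetical refcounted queue. */
struct queue {
    int ref;
};

void put_queue(struct queue *q)
{
    q->ref--;
}

/* Drop every cached async queue and clear its slot, so that a second call
 * (for example after another elevator switch) finds nothing left to put. */
void put_async_queues(struct queue *slots[ASYNC_QUEUES])
{
    int i;

    for (i = 0; i < ASYNC_QUEUES; i++) {
        if (slots[i]) {
            put_queue(slots[i]);
            slots[i] = NULL;
        }
    }
}
```

Calling put_async_queues() twice on the same array drops each reference exactly once.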


Patch from Vasily Tarasov <>:
[IOPRIO] new cfq queue putting mechanism

It's better to use the original cfqq put function from CFQ than to rewrite it. Use the elevator_ops structure for exporting it.

Bug #78358.


Patch from Alexandr Andreev <>:
[SCHED] Fix for cpu_of()

In the new scheme, i.e. when the physical cpu mask is used wherever possible (in find_busiest_vsched(), find_busiest_queue() and so on), cpu_of() must also return the physical cpu id for a given vcpu.

We have to use virtual ids (vcpu->id) only for vsched maps and for the process cpus allowed mask. In all other cases we need to use physical masks to account for the physical CPU topology.

Bug #78679.
Bug #78676.


Patch from Alexandr Andreev <>:
[SCHED] VCPU should be initialized completely before deletion

There is a race in vsched_del_vcpu(): we can kill the migration thread before it has started, i.e. before the migration_thread() function has been called at all. In that case migrate_live_tasks() and migrate_dead_tasks() are never called on this vcpu when the migration thread is killed. But some tasks may already have migrated to this vcpu, because it is already marked as online.

This bug can be easily reproduced. On a busy host with many running tasks, run:

# vzctl set NODE --cpus 1
# vzctl set NODE --cpus 4
# vzctl set NODE --cpus 1

In this case, after the second vzctl call, the migration thread on VCPU 2 is created and woken up, but it may not actually have started (been scheduled) yet if many higher-priority tasks are running on the host. If it is not scheduled before the third vzctl call, a kernel BUG triggers in vsched_del_vcpu():

 * also, since this moment VCPU is offline, so migration_thread
 * won't accept any new tasks...
vmigration_call(&migration_notifier, CPU_DEAD, vcpu);
BUG_ON(rq->nr_running != 0);

Bug #78487.


Patch from Alexandr Andreev <>:
[SCHED] find_busiest_group() should use pcpu mask

VCPUs should be skipped according to pcpu mask


Patch from Denis Lunev <>:
This patch fixes unprotected use of parent->sighand.

It should be:

  • guarded with tasklist_lock
  • checked for NULL inside the lock

Bug #78657.


Patch from Vasily Tarasov <>:
[IOPRIO] compilation fix in case UBC_IO_ACCT is off

Compilation fix in case UBC_IO_ACCT is off.

OpenVZ Bug #527.


Patch from Pavel Emelianov <>:
[BC] Don't make pre-created INDEX_AC and INDEX_L3 caches UBC

This made size-32 and size-64 caches on i386 be the same capacity as size-X(UBC) ones.


Patch from Andrey Mirkin <>:
[BC] Fix potential beancounter refcount leak

On some error paths we forget to put the beancounter. This patch fixes two such places:

  • sys_setluid()
  • bc_entry_open()

Bug #77231.
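The kind of leak fixed in this entry usually comes from an early return between a get and its matching put. The sketch below shows the usual single-exit shape that avoids it; struct counter, get_counter(), put_counter() and do_with_counter() are hypothetical stand-ins, not the beancounter API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a beancounter. */
struct counter {
    int ref;
};

struct counter *get_counter(struct counter *c)
{
    c->ref++;
    return c;
}

void put_counter(struct counter *c)
{
    c->ref--;
}

/* Take a reference, do some work that can fail, and drop the reference on
 * the error path as well as on success. */
int do_with_counter(struct counter *c, int fail)
{
    int err = 0;

    get_counter(c);
    if (fail) {
        err = -EINVAL;
        goto out_put;   /* an early "return err;" here would leak the reference */
    }
    /* ... successful work would go here ... */
out_put:
    put_counter(c);
    return err;
}
```

Whether the call succeeds or fails, the reference count ends where it started.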