OpenVZ Virtuozzo Containers Wiki β


< Download‎ | kernel‎ | rhel5‎ | 028stab035.1



  • Rebase to RHEL5 8.1.4 kernel.
  • Mainstream security fixes.
  • Improvements, optimizations, fixes in most subsystems.
  • DRBD update to 8.0.3.
  • Xen/OpenVZ fixes to run RHEL5 in Dom0/U.
  • Fixes for SPARC and PPC.

Config changes







Patch from Kirill Korotaev <>
4GB split LDT reload fix from RHEL4u5


Patch from Andrey Mirkin <>
[CPT] 2.6.9 <-> 2.6.18 features mask compatibility issue

Use VE_FEATURES_OLD mask for old (< 2.6.18 kernel) CPT images.

Bug #81468


Patch from Alexey Kuznetsov <>
[CPT] too aggressive sys_futex() restart

Checkpointing used to enforce restart of sys_futex even when
it returns -EINTR, to work around the sick return value of FUTEX_WAIT.
Of course, this is wrong (f.e. it means restart of timed FUTEX_WAIT
with original timeout :-(), but we do not have much choice if we do
not want to break everything.

At least one case can be relaxed. If a signal is pending
when we restore, we must not restart. This pending signal would
interrupt FUTEX_WAIT in any case. This fixes sem_wait()


Patch from Andrey Mirkin <>

We have a problem with external processes.
If someone enters a VE, forks, and does some work w/o exec,
then the process is not considered as external (pids are virtual),
but some of the files (e.g. libs) can be from HN, i.e. external.

Temporary and quick fix for this bug:
On suspend kill processes which have mm->vps_dumpable == 0.

Bug #81722


Patch from Alexey Kuznetsov <>
[CPT] prevent changes of VM after VE was checkpointed

It is possible that processes' VM is changed after VE is checkpointed
and killed. At the moment it will happen when a process has set clear_parent_tid
or robust list pointers. It was not considered a problem, because
VM is about to be destroyed in any case.

But one case was missed: corresponding VM areas could be mapped
to a file. If it is not deleted, the change will reach file system
and migrate. Oops. F.e. shared locked futex will be unlocked after
migration. (glibc tst-robust8 test)


Patch from Alexey Kuznetsov <>
[CPT] VE suspend cleanups

The patch fixes one bug. Sometimes one process sleeps
in an uninterruptible state waiting for some event depending
on another process, which could be suspended.

I know three such cases:

1. Process did vfork() and waits for the child to exec().
2. Thread did exec() and waits for its siblings to die.
3. Thread makes a coredump and waits for its siblings to stop.

We detected case #1 directly by looking at tsk->vfork_done.
In the other cases suspend timed out and failed, which is obviously
incorrect. It is possible to handle cases #2,3 like we did with vfork,
but it is not necessary. The patch suggests a universal solution:
we split suspend into several shorter rounds: the first round
tries to suspend for 200msec; if it fails, the VE is unfrozen
and suspend is retried after some time. We repeat the attempts
with increasing timeout until the VE is frozen or the major timeout (10sec)
expires.
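
The round-based suspend described above can be sketched in user-space C. This is only an illustration, not the kernel code: the names suspend_in_rounds, freeze_fn and freeze_if_at_least are hypothetical, the doubling of the timeout between rounds is an assumption (the text only says the timeout increases), and the unfreeze and the pause between rounds are left as a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: try to freeze the VE within timeout_ms. */
typedef bool (*freeze_fn)(unsigned int timeout_ms, void *arg);

enum { FIRST_ROUND_MS = 200, MAJOR_TIMEOUT_MS = 10000 };

/* Toy callback for demonstration: freezing succeeds once the round is
 * at least as long as the number of milliseconds stored in *arg. */
bool freeze_if_at_least(unsigned int timeout_ms, void *arg)
{
    const unsigned int *needed_ms = arg;
    return timeout_ms >= *needed_ms;
}

/* Returns the number of the round that succeeded (1-based),
 * or -1 if the major timeout was used up without freezing the VE. */
int suspend_in_rounds(freeze_fn try_freeze, void *arg)
{
    unsigned int timeout = FIRST_ROUND_MS;
    unsigned int spent = 0;
    int round = 0;

    assert(try_freeze != NULL);
    while (spent < MAJOR_TIMEOUT_MS) {
        /* Never let a round run past the major timeout. */
        if (timeout > MAJOR_TIMEOUT_MS - spent)
            timeout = MAJOR_TIMEOUT_MS - spent;
        round++;
        if (try_freeze(timeout, arg))
            return round;
        /* Real code would unfreeze the VE here and wait before retrying. */
        spent += timeout;
        timeout *= 2;   /* assumption: the timeout doubles each round */
    }
    return -1;
}
```

With the toy callback, a VE that needs 800 msec to freeze is suspended in the third round (200, 400, 800), while one that never freezes fails once the 10 sec budget is spent.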

Besides that, the patch reorders suspend code, so that it becomes
more or less readable.


Patch from Alexandr Andreev <>
[SCHED] optimization: dynamic vcpu_timeslice

vcpu_timeslice == -1 now has special meaning (and -1 is default value
now). In this case, actual vcpu_timeslice value will depend on number of
VCPUs ready to run:

assume N = ready_vcpus / nr_pcpus

N <= 1:     vcpu_timeslice = 8
1 < N <= 2: vcpu_timeslice = 4
2 < N <= 3: vcpu_timeslice = 2
3 < N <= 4: vcpu_timeslice = 1
N > 4:      vcpu_timeslice = 0
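
The mapping from load to timeslice can be written as a small function. This is a sketch, not the scheduler code: the function name is hypothetical, and it assumes N = ready_vcpus / nr_pcpus is meant as a real-valued ratio, so the comparisons are done by cross-multiplying rather than by integer division.

```c
#include <assert.h>

/* Sketch of the dynamic vcpu_timeslice table.
 * Assumption: N = ready_vcpus / nr_pcpus is a real-valued ratio,
 * so "N <= k" is checked as "ready_vcpus <= k * nr_pcpus". */
int dynamic_vcpu_timeslice(unsigned int ready_vcpus, unsigned int nr_pcpus)
{
    assert(nr_pcpus > 0);
    if (ready_vcpus <= 1 * nr_pcpus)
        return 8;               /* N <= 1 */
    if (ready_vcpus <= 2 * nr_pcpus)
        return 4;               /* 1 < N <= 2 */
    if (ready_vcpus <= 3 * nr_pcpus)
        return 2;               /* 2 < N <= 3 */
    if (ready_vcpus <= 4 * nr_pcpus)
        return 1;               /* 3 < N <= 4 */
    return 0;                   /* N > 4 */
}
```

On a 16-CPU host, 16 ready VCPUs give a timeslice of 8, 17 to 32 give 4, and more than 64 give 0.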

This patch significantly improves the performance of the 'context switch'
test from unixbench-4.1.0-wht-1, when several instances of this test is

On a host with 16 CPUs:

# cd unixbench-4.1.0-wht-1
# echo 0 > /proc/sys/kernel/vcpu_timeslice
# ./Run context1 16
# echo -1 > /proc/sys/kernel/vcpu_timeslice
# ./Run context1 16


Patch from Alexandr Andreev <>
[SCHED] compilation fix in case CONFIG_SCHED_VCPU=n

This patch fixes compilation of OVZ kernel with CONFIG_SCHED_VCPU=n

 Note: VE can't be started in any case because the fairsched syscalls return
 ENOSYS, but I fixed fairsched and checked that VE can be started/stopped
  - it looks like it works )).


Patch from Vasily Tarasov <>
[PATCH] merging of async requests was a bit incorrectly backported

patch diff-ms-cfq-allow-merge-b-20070424 was ported a bit incorrectly.
It resulted in wrong async requests merging.

Bug #80857


Patch from Alexey Dobriyan <>

Backport of
    commit d52b908646b88cb1952ab8c9b2d4423908a23f11
    Author: Miklos Szeredi <>
    Date:   Tue May 8 00:23:46 2007 -0700

    fix quadratic behavior of shrink_dcache_parent()

    The time shrink_dcache_parent() takes, grows quadratically with the depth
    of the tree under 'parent'.  This starts to get noticable at about 10,000.

    These kinds of depths don't occur normally, and filesystems which invoke
    shrink_dcache_parent() via d_invalidate() seem to have other depth
    dependent timings, so it's not even easy to expose this problem.

    However with FUSE it's easy to create a deep tree and d_invalidate()
    will also get called.  This can make a syscall hang for a very long

    This is the original discovery of the problem by Russ Cox:

    The following patch fixes the quadratic behavior, by optionally allowing
    prune_dcache() to prune ancestors of a dentry in one go, instead of doing
    it one at a time.

    Common code in dput() and prune_one_dentry() is extracted into a new helper
    function d_kill().

    shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use
    the ancestry-pruner option.  Only for shrink_dcache_memory() is this
    behavior not desirable, so it keeps using the old algorithm.

    Signed-off-by: Miklos Szeredi <>
    Cc: Al Viro <>

    Cc: Maneesh Soni <>
    Acked-by: "Paul E. McKenney" <>
    Cc: Dipankar Sarma <>
    Cc: Neil Brown <>

    Cc: Trond Myklebust <>

    Cc: Christoph Hellwig <>
    Signed-off-by: Andrew Morton <>
    Signed-off-by: Linus Torvalds <>

Additionally merged:
    commit 24c32d733dd44dbc5b9dcd0b8de58e16fdbeac7
    From: Andrew Morton <>
    Date: Tue, 8 May 2007 07:23:49 +0000 (-0700)
    Subject: mm: shrink parent dentries when shrinking slab
    X-Git-Tag: v2.6.22-rc1~799

    mm: shrink parent dentries when shrinking slab

    Teach the dentry slab shrinker to aggressively shrink parent dentries when
    shrinking the dentry cache.

    This is done to attempt to improve the situation where the dentry slab cache
    gets a lot of internal fragmentation due to pages containing directory
    dentries.  It is expected that this change will cause some of those dentries
    to be reaped earlier, and with less scanning.

    Needs careful testing.

    Cc: Miklos Szeredi <>

    Signed-off-by: Andrew Morton <>
    Signed-off-by: Linus Torvalds <>

Typical numbers after mkdir("foo")/chdir("foo") done N times and
immediate "time vzctl stop"


	real    1m14.529s	1m16.602s	1m16.143s
	user    0m0.009s	0m0.014s	0m0.007s
	sys     1m4.569s	1m6.638s	1m7.187s
	real    0m10.078s	0m10.080s	0m10.079s
	user    0m0.007s	0m0.012s	0m0.012s
	sys     0m0.055s	0m0.053s	0m0.054s

Less easy case for this patch is the following configuration

	*--*--*--* ...
	 \  \  \  \
	  *  *  *  *

Speedup for this case is less rosy but significant anyway:

	L	before		after

	4096	11.40s		9.75s
	8192	24.00s		16.80s
	65536	15m39.897s	5m29.738s
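
The idea of pruning ancestors "in one go", as described in the backported commit, can be illustrated with a toy tree. This is a hedged sketch and not the dcache code: struct node, kill_node and prune_with_ancestors are hypothetical names, a plain integer stands in for the dentry reference count, and locking is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry. */
struct node {
    struct node *parent;
    int count;   /* live children plus external references */
    int dead;    /* set once the node has been pruned */
};

/* Kill one unused node, drop its reference on the parent and
 * return the parent (playing the role of the d_kill() helper). */
struct node *kill_node(struct node *n)
{
    struct node *parent = n->parent;

    assert(n->count == 0 && !n->dead);
    n->dead = 1;
    if (parent != NULL)
        parent->count--;
    return parent;
}

/* Prune a leaf and every ancestor that becomes unused as a result,
 * in a single upward walk. Returns the number of nodes killed. */
int prune_with_ancestors(struct node *leaf)
{
    struct node *n = leaf;
    int killed = 0;

    while (n != NULL && n->count == 0) {
        n = kill_node(n);
        killed++;
    }
    return killed;
}
```

For a chain of depth L this releases the whole chain in one pass from the leaf, instead of rescanning the tree from 'parent' after each leaf is removed.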

Bug #73640


Patch from Ingo Molnar <>
[PATCH] futex: PI state locking fix

commit 21778867b1c8e0feb567addb6dc0a7e2ca6ecdec
Author: Ingo Molnar <>
Date:   Fri Mar 16 13:38:31 2007 -0800

[PATCH] futex: PI state locking fix

Testing of -rt by IBM uncovered a locking bug in wake_futex_pi(): the PI
state needs to be locked before we access it.

Signed-off-by: Ingo Molnar <>
Acked-by: Thomas Gleixner <>

Cc: Chuck Ebbert <>
Cc: <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from Alexey Kuznetsov <>
[PATCH] PI futex oops (mainstream)

Serialization in PI futexes is severely broken, lots of bugs, lots.
But only one is known which crashes kernel.

It is possible that a new pi state is added to pi_state_list
after the task has already done its exit cleanup, so that when the task
struct is released, pi_state_list remains in a corrupted state.

Locally exploitable.


Patch from Neil Brown <>
[NFS] Remove warning: VFS is out of sync with lock manager

But keep it as a dprintk

The message can be generated in a quite normal situation:
 If a 'lock' request is interrupted, then the lock client needs to
  record that the server has the lock, in case it does.
 When we come to unlock, the server might say it doesn't, even
  though we think it does (or might) and this generates the message.

Signed-off-by: Neil Brown <>

Acked-by: Trond Myklebust <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>



Patch from Alexandr Andreev <>
[SLAB] cache_reap() function must be bound, or take into account vcpus

It must pass the actual (up-to-date) NUMA node id to drain_cache() after cond_resched().

Bug #81234


Patch from Alexandr Andreev <>
[SYSRQ] show correct sysrq help message.

Fix sysrq-h help message which was broken by SysRq-debugger patch.

Bug #81612


Patch from Denis Lunev <>
[BC] cleanup: remove unused functions

Cleanup: remove a bit of unused code


Patch from Alexandr Andreev <>
[BC] dcache: new style of array_cache entries access

cosmetics: new style of array_cache entries access


Patch from Alexandr Andreev <>
[BC] dcache: drain alien caches on nodes on dcache acct on

Drain alien caches on nodes while walking through the
dentry slab lists when turning dcache accounting on/off.

Bug #81116


Patch from Alexandr Andreev <>
[BC] dcache: pass correct node number to kmem_cache_free_block()

Fix warning in slab_put_obj() in debug kernels due to incorrect
node number passed to kmem_cache_free_block().

Bug #81116


Patch from Alexey Dobriyan <>
[PATCH] dcache: move might_sleep() from under preempt_disable()

dput() had might_sleep() check in the very beginning.
After renaming to dput_recursive() and adding preempt_disable() call
this check ended up in region with disabled preemption, so with
CONFIG_DEBUG_SPINLOCK_SLEEP=y and preemption on dmesg gets heavily

So, do might_sleep() check earlier.


Patch from Pavel Emelianov <>
[BC] rework /proc/bc code

Proc files creation suffered from two disadvantages:
1. It was racy with respect to bc remove/create
2. It couldn't correctly show hierarchical beancounters

Rework showing BC info by overriding readdir and lookup
methods for inodes under /proc/bc. The entry layout itself
is kept unchanged.

Also create a new entry named /proc/bc/<id>/debug to show
the BC's id, parent credentials and some in-kernel memory
pointers purely for debugging.

Signed-off-by: Pavel Emelianov <>


Patch from Pavel Emelianov <>
[BC] cleanup: introduce top_beancounter helper

Many resources are accounted to top beancounter only.
Introduce a helper to make code look nicer.
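
A helper of this kind presumably just walks up the beancounter hierarchy. The following is a minimal sketch under that assumption: struct beancounter here is a hypothetical, stripped-down type with only a parent link, not the kernel's user_beancounter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down beancounter: only the parent link matters. */
struct beancounter {
    struct beancounter *parent;   /* NULL for a top-level beancounter */
    unsigned int id;
};

/* Walk up the hierarchy and return the top-level beancounter. */
struct beancounter *top_beancounter(struct beancounter *bc)
{
    assert(bc != NULL);
    while (bc->parent != NULL)
        bc = bc->parent;
    return bc;
}
```

Callers that account a resource to the top beancounter only can then write top_beancounter(bc) instead of open-coding the loop.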

When saving the mapped page's ub, we need to save the top beancounter,
as is done for non-mapped writes.

Bug #81224

Signed-off-by: Pavel Emelianov <>
Signed-off-by: Kirill Korotaev <>


Patch from Vasily Tarasov <>
[PATCH] sysctl: add VE capability boundary set sysctl

The user wishes to have virtualized kernel.cap-bound sysctl in order to
use lcap tool to observe capabilities allowed in VE. We have cap_default
field on ve_struct, that can be modified to be used as virtualized
cap_bset. This patch:
- renames cap_default -> ve_cap_bset
- virtualizes cap_bset
- introduces new proc and strategy routines to handle appropriate sysctl

OpenVZ Bug #524


Patch from Denis Lunev <>
[VE] VE init signal delivery reworked to be similar to host

Prevent VE init from receiving unexpected signals sent from VE
including *fatal* ones. Signals sent from VE0 are still allowed,
e.g. for fast VE stop.
Fix for sys_reboot called from VE to force VE death
(SIGKILL is sent in the context of VE).

OpenVZ Bug #533


Patch from Dmitry Mishin <>
[BRIDGE] bridge deliver to original eth0 device

- now packets are input to the local system as they are coming from phys
  device only;
- fixed bunch of bugs with VE <-> HN communications.


Patch from Alexey Dobriyan <>
[VE] unalias IPv6 iptables bit mask from IPv4

VE_IP_MANGLE flag is used as mask for both IPv4 and IPv6 modules which
is no-no, because ip6table_mangle can be loaded after VE start.

Choose numbers to not contradict with vzctl header.

Temporarily mirror VE_IP_IPTABLES into IPv6 mask.
When vzctl starts doing the right thing, this mirroring can be dropped.

OpenVZ Bug #561


Patch from Denis Lunev <>
[NFS] fix NFS auto-umount timeout sysctl

This patch:
- changes the timeout units from jiffies to seconds
- fixes assignment from userspace (was impossible, since UINT_MAX was treated as negative)


Patch from Denis Lunev <>
[NFS] virtualize NLM hosts cache

This patch virtualizes NLM hosts cache

Bug #74374


Patch from Denis Lunev <>
[NFS] shutdown NFS properly if hung

This patch properly shuts down NFS if it is stalled.


Patch from Kirill Korotaev <>
[NFS] Compilation fix for diff-ve-nfs-stop-b-20070502 when built as module

diff-ve-nfs-stop-20070502 requires some symbols to be exported.

Signed-Off-By: Kirill Korotaev <>


Patch from Pavel Emelianov <>
When setting explicit vpid into ve's pidmap we need to

When setting explicit vpid into ve's pidmap we need to
dec nr_free counter by one.

This does not fix any BUG, it just makes pidmap information
consistent and helps to work faster when pidmap is full.
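
Why the counter matters can be seen in a toy pid bitmap. This is a sketch only: struct pidmap, its 64-bit size and the function names are hypothetical and much simpler than the kernel's pidmap pages.

```c
#include <assert.h>
#include <stdbool.h>

#define PIDMAP_BITS 64

/* Hypothetical one-word pid bitmap with a free counter. */
struct pidmap {
    unsigned long long bits;   /* bit i set means pid i is in use */
    int nr_free;               /* must always match the number of clear bits */
};

void pidmap_init(struct pidmap *m)
{
    m->bits = 0;
    m->nr_free = PIDMAP_BITS;
}

/* Reserve an explicitly requested pid. Returns false if it is taken. */
bool pidmap_set_explicit(struct pidmap *m, unsigned int pid)
{
    unsigned long long mask;

    assert(pid < PIDMAP_BITS);
    mask = 1ULL << pid;
    if (m->bits & mask)
        return false;
    m->bits |= mask;
    m->nr_free--;   /* the decrement that keeps the counter consistent */
    return true;
}

/* A full map can be skipped without scanning its bits. */
bool pidmap_is_full(const struct pidmap *m)
{
    return m->nr_free == 0;
}
```

If the decrement is skipped for explicit pids, nr_free overstates the free space and an allocator scans maps that are actually full.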


Patch from Pavel Emelianov <>
[PATCH] proc: don't hash task dentries in VE0

When a task dies, its proc dentries that may be hashed are
shrunk with shrink_dcache_parent(). The problem is that
this routine doesn't guarantee that all the entries will
be flushed, and thus the pid may still have a reference from the
appropriate inode.

When we have such dentries in VE0 holding pids from ve
this leads to pid leakage and inability to release the
beancounter after ve stop.

So don't hash such dentries - remove them immediately.

Bug #80025

Signed-off-by: Pavel Emelianov <>


Patch from Denis Lunev <>
[PATCH] vmalloc info during OOM locking

vmlist_lock can't be held under any spin_lock which is held with IRQ.
This assumption is always broken for __alloc_pages.

Modified by Kirill: drop vprintstat() from show_mem() altogether

Bug #81199


Patch from Vasily Tarasov <>
[PATCH] vzdquota: compilation fix for ppc32

While compiling on ppc32 the following error appears:

  Building modules, stage 2.
WARNING: "__cmpdi2" [fs/vzdquota.ko] undefined!
make[2]: *** [__modpost] Error 1

The problem is that switch((long_long_var)) is not
a primitive for ppc32 gcc: libgcc.a is needed,
which is out of the kernel.

The problem was noticed by forum user mbaranzak,
who also found the reason for it.


Patch from Denis Lunev <>
[VZDQ] sb->put_super can be NULL in valid cases

put_super() superblock operation was not checked for NULL in vzquota
leading to NULL dereference.

OpenVZ Bug #541
Bug #81936


Patch from Vasiliy (vvs@):
Fixes oops on read from some i2o proc files.

Fixes oops on read from some i2o proc files.
Minor issue because i2o_proc module is not used currently.


Patch from Evgeny Kravtsunov <>
[4GB-SPLIT] Fixes required for Xen kernel compilation


Patch from Alexey Kuznetsov <>
[CPT] Fix possible deadlock in checkpointing of mm

Learned it wrong once and did not relearn. anon_vma lock
cannot be taken under page table lock. And it is taken and
should be taken in reversed order, cpt_mm even has a special
hack due to wrong understanding: look at chunk converting ugly
spin_trylock to spin_lock.

Difference from previous version: in one case (does not happen normally,
but yet), page table lock could remain locked.

Bug #82785


Patch from Andrey Mirkin <>
While checkpointing due to memory shortage CPT processes

While checkpointing due to memory shortage CPT processes
can be killed and tmpfs will not be saved.

During restore we will see such errors:

CPT ERR: e0000002ef9c5000,111 :-2 mounting /dev/pts devpts 40000000
CPT ERR: e0000002ef9c5000,111 :rst_namespace: -2

Bug #79854

This happens as /dev is tmpfs now and its content was not saved during

We need to check exit status of tar and iptables-save to be sure that they
exited normally.

Changes from v1:
- return -EINVAL in case of error


Patch from Evgeny Kravtsunov <>
[XEN] Fix LDT handling - There is one chunk LDT data only


Patch from Vasily Tarasov <>
[PATCH] cfq: remove redundant cfq_find_next_crq() function call

Mainstream fix. cfq_find_next_crq() will be called later.
This fix is incorporated in commit 21183b07ee4be405362af8454f3647781c77df1b


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 5/6] fuse: fix bug in control filesystem mount

The BUG in fuse_ctl_add_dentry() could be triggered if the control
filesystem was unmounted and mounted again while one or more fuse
filesystems were present.

The fix is to reset the dentry counter in fuse_ctl_kill_sb().

Bug reported by Florent Mertens.

Signed-off-by: Miklos Szeredi <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 3/6] fuse: fix dereferencing dentry parent

There's no locking for ->d_revalidate, so fuse_dentry_revalidate() should use
dget_parent() instead of simply dereferencing ->d_parent.

Due to topology changes in the directory tree the parent could become negative
or be destroyed while being used.  There hasn't been any reports about this

Signed-off-by: Miklos Szeredi <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH] fuse: fix mknod of regular file

The wrong lookup flag was tested in ->create() causing havoc (error or
Oops) when a regular file was created with mknod() in a fuse

Thanks to J. Cameijo Cerdeira for the report.

Kernels 2.6.18 onward are affected.  Please apply to -stable as well.

Signed-off-by: Miklos Szeredi <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 1/6] fuse: locking fix for nlookup

An inode could be returned by independent parallel lookups, in this case an
update of the lookup counter could be lost resulting in a memory leak in

Signed-off-by: Miklos Szeredi <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 4/6] fuse: fix Oops in lookup

Fix bug in certain error paths of lookup routines.  The request object was
reused for sending FORGET, which is illegal.  This bug could cause an Oops
in 2.6.18.  In earlier versions it might silently corrupt memory, but this
is very unlikely.

These error paths are never triggered by libfuse, so this wasn't noticed
even with the 2.6.18 kernel, only with a filesystem using the raw kernel

Thanks to Russ Cox for the bug report and test filesystem.

Signed-off-by: Miklos Szeredi <>
Cc: <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 2/6] fuse: fix spurious BUG

Fix a spurious BUG in an unlikely race, where at least three parallel lookups
return the same inode, but with different file type.  This has not yet been
observed in real life.

Allowing unlimited retries could delay fuse_iget() indefinitely, but this is
really for the broken userspace filesystem to worry about.

Signed-off-by: Miklos Szeredi <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream, prepared by Dmitry Monakhov <>
[PATCH 6/6] [PATCH] fuse: validate rootmode mount option

If rootmode isn't valid, we hit the BUG() in fuse_init_inode.  Now
EINVAL is returned.

Signed-off-by: Timo Savola <>
Signed-off-by: Miklos Szeredi <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from Kirill Korotaev <>
[PATCH] Fix powerpc compilation which was broken in

Christian Kaiser reported broken powerpc compilation due to fix: commit f102c840f7f72492a83c93fa65396fe0edcf1df6

In file included from drivers/media/video/pwc/pwc-uncompress.c:29:
include/asm/current.h: In function get_current:
include/asm/current.h:23: warning: implicit declaration of function offsetof
include/asm/current.h:23: error: expected expression before struct
make[4]: *** [drivers/media/video/pwc/pwc-uncompress.o] Error 1
make[3]: *** [drivers/media/video/pwc] Error 2
make[2]: *** [drivers/media/video] Error 2
make[1]: *** [drivers/media] Error 2
make: *** [drivers] Error 2


Patch from Kirill Korotaev <>
Fix SBUS compilation on SPARC64.

asm/unistd.h should be included for _syscallX() define usage.


Patch from Alexandr Andreev <>
[BC] dentry_cache: drain alien caches only when CONFIG_NUMA is set

Fix oops introduced by previous patch diff-ubc-dentry-free_alien-20070510:
We should deal with NUMA only when CONFIG_NUMA is set.

Bug #82721


Patch from Vasily Tarasov <>
[BC] ioprio: account for requests in driver

Previously we didn't take into account beancounter's requests that are
already in driver. Now we consider beancounter as empty
(no requests in it) only if there are no requests in CFQ _and_ in driver.

This patch improves fairness dramatically and fixes bug #81508

Bug #81508


Patch from Vasily Tarasov <>
[BC] ioprio: range of ioprios were checked incorrectly

Fix ioprio range check: ioprio range is 0..7
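
A range check for the 0..7 interval might look like the sketch below. The function name, the constants and the choice of -EINVAL as the error code are assumptions made for illustration; the changelog does not show the actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { BC_IOPRIO_MIN = 0, BC_IOPRIO_MAX = 7 };   /* valid range is 0..7 */

/* Store ioprio in *slot if it lies in 0..7; otherwise leave *slot
 * untouched and return -EINVAL (error code chosen for illustration). */
int bc_set_ioprio(int *slot, int ioprio)
{
    assert(slot != NULL);
    if (ioprio < BC_IOPRIO_MIN || ioprio > BC_IOPRIO_MAX)
        return -EINVAL;
    *slot = ioprio;
    return 0;
}
```

Both bounds are inclusive, so 0 and 7 are accepted while -1 and 8 are rejected.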


Patch from Vasily Tarasov <>
[BC] ioprio: BC slice scaling fix

Now the CFQ BC timeslice is scaled from X to 2*X ms.


Patch from Evgeny Kravtsunov <>
[VE] rename local macro ADDR() to avoid conflict with Xen

Fix redefinition of ADDR in veth.c.
A macro named ADDR is also used by Xen in include/asm/mach-xen/asm/synch_bitops.h.


Patch from Denis Lunev <>
Unfortunately, counter on RPC client counts only clones, but not real

Unfortunately, counter on RPC client counts only clones, but not real
usage. cl_dead flag leads to client destruction in rpc_release_client
while it is really in use. So, we have to introduce a new flag with the
meaning similar to cl_dead in all places except rpc_release_client.
True recounting is too expensive.

Based on idea from Dmitry Monakhov.

Bug #82764
Bug #82875


Patch from Denis Lunev <>
Add NFS timeout handle for TCP transport


Patch from Evgeny Kravtsunov <>
Fixes compilation error in Xen driver


Patch from Alexey Dobriyan <>
[VZDQ] Trivial fix for lookup

"" was memcmp'd for only 11 symbols,
allowing names like "vzaquota.grouX" to be looked up successfully.
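
The class of bug is easy to reproduce in user space. In the sketch below the file name "aquota.group" is a hypothetical 12-character name chosen only for illustration (it is not taken from the patch), and both function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical 12-character file name, used only for illustration. */
static const char QUOTA_NAME[] = "aquota.group";

/* Buggy variant: only the first 11 bytes are compared,
 * so the last character of the name is never checked. */
bool lookup_matches_buggy(const char *name, size_t len)
{
    assert(name != NULL);
    return len == sizeof(QUOTA_NAME) - 1 &&
           memcmp(name, QUOTA_NAME, 11) == 0;
}

/* Fixed variant: compare the whole name. */
bool lookup_matches_fixed(const char *name, size_t len)
{
    assert(name != NULL);
    return len == sizeof(QUOTA_NAME) - 1 &&
           memcmp(name, QUOTA_NAME, sizeof(QUOTA_NAME) - 1) == 0;
}
```

Deriving the length from sizeof instead of a hard-coded number keeps the comparison correct if the name ever changes.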


Patch from Alexey Dobriyan <>
[VZDQ] Fix GFP_KERNEL allocation under spin lock in vzdq_aquot_buildmntlist

Switch allocation from GFP_KERNEL to GFP_ATOMIC under vfsmount_lock
in vzdq_aquot_buildmntlist()


Patch from Kirill Korotaev <>
[PATCH] Cleanup 4GB split/Xen modification regarding init_tss


Patch from Alexey Kuznetsov <>
[CPT] additional protected VE task list

Nobody wants this. But no other ideas have come up so far,
which could mean it is just impossible to do painlessly.

Each task is enlisted on one more list (vetask_auxlist),
which is protected with tasklist_lock; that is its only difference
from the normal VE task list, which is accessed by RCU rules.


Patch from Andrey Mirkin <>
[PATCH] CPT: check if ip_tables are enabled before dumping them

iptables-save returns error if module ip_tables is not loaded.
So, we just do not need to dump iptables at all if this module is not loaded in VE.
Don't try to dump iptables if they are not enabled in VE.


Patch from Alexey Kuznetsov <>
[CPT] fix compilation warning due to cast u64 -> pointer


Patch from Alexey Kuznetsov <>
[CPT] one more compiler warning


Patch from Andrey Mirkin <>
[PATCH] CPT: checkpoint inodes with deleted reference

Consider the following scenario:
1. Create file (file1) and hard link to it (file2)
2. Open file2
3. Unlink file2

After that during checkpointing we will have the following error:
Can not dump VE: Device or resource busy deleted reference to existing inode,
  checkpointing is impossible

The inode in question is not deleted, but it cannot be found
from inside the checkpointed process group and is not easy to find on the disk :/

So we are trying to find another dentry with the same inode in 2 common places:
1. In inode->i_dentry alias list
2. In dir in which deleted dentry itself is located

Bug #72540


Patch from Alexey Kuznetsov <>
[CPT][IA64] save/restore NaT values

This patch closes two remaining holes in IA64 cpt implementation,
both are of no immediate practical value, but nearly impossible to fix
after we freeze layout of cpt image structs.

1. Migration between hosts with different layout of struct thread_info,
   which is possible, if some new bits are added to thread_info in newer
2. NaT bits are accurately saved and restored. This is required only
   when some application uses control speculative loads, current compilers
   are not able to do this, but this can change.


Patch from Alexey Kuznetsov <>
[CPT][IA64] some prctl flags are forgotten

Some apps do lots of unaligned accesses, know about this
and use prctl(). We did not save/restore those flags and
got flood of warnings after checkpointing.


Patch from Alexey Kuznetsov <>
[CPT] cosmetic fix to match 2.6.9 text

In 2.6.9 it was a critical bug, image would be corrupted
because of broken alignment if we did not do this.
In 2.6.18 it is just nice.


Patch from Vasily Tarasov <>
Rework inotify changes, so that old API is still available

Rework inotify changes, so that old API is still available
for 3rd party code like aufs.


Patch from Alexey Kuznetsov <>
[CPT] namespace semaphore possible deadlock

Old bug. To my shame I knew about this, but ignored it.

 Deadlock is possible in two cases:
 1. tar is not a tar, but something maliciously doing mount/unmount
 2. tar is a good tar, but it takes namespace semaphore for read
 f.e. to read /proc/mounts. If someone in the system does mount/unmount
 and is blocked taking the write semaphore, the second read semaphore deadlocks.

The fix is to drop namespace semaphore. We make mntget() on current
mnt, so that it will not disappear from under us. It still can be
unmounted with MNT_DETACH, in this case we cannot proceed with
scanning mnt list and we must not: unmounting something inside
VE while checkpointing is an obvious good reason to fail.


Patch from Pavel Emelianov <>
[PATCH] CPT: fix potential pid leak

When restoring the VE restore_one_signal_struct() can
occasionally leak some pids.

Bug #82895


Patch from Andrey Mirkin <>
[PATCH] CPT: restore deleted files (hardlink case)

The bug was here all the time, but it was never triggered as we never entered
the following path on checkpointing:

if (!IS_ROOT(d) && d_unhashed(d)) {
        struct file *parent;
        parent = iobj->o_parent;
        if (!parent ||
                        (!IS_ROOT(parent->f_dentry) &&
                         d_unhashed(parent->f_dentry))) {
                /* Inode is not deleted, but it does not
                 * have references from inside checkpointed
                 * process group. We have options:
                 * A. Fail, abort checkpointing
                 * B. Proceed. File will be cloned.
                 * A is correct, B is more complicated */
                /* Just as a hint where to create deleted file */
                if (ino->i_nlink != 0) {
                        eprintk_ctx("deleted reference to existing inode, "
                                    "checkpointing is impossible\n");
                        return -EBUSY;
                }
        } else {
<<< HERE
                /* Refer to _another_ file name. */
                err = cpt_dump_filename(parent, 0, ctx);
                if (err)
                        return err;
                if (S_ISREG(ino->i_mode) || S_ISDIR(ino->i_mode))
                        dump_it = 0;

So, in image file for deleted file we always had its content and never
a reference to another file.

The fix is straightforward: check the type of the object in the image and
restore file content if needed.


Patch from Alexey Kuznetsov <>
[CPT] restore last pid

Sigh. I hoped it is not necessary. It is. bash goes insane
when its children get non-monotonic pids.

The place where we store saved last pid is unusual.
Violating tradition I extend one of cpt image structs.
This should be ok: migration to older kernels will be prohibited,
migration from old to new ones is OK.


Patch from Alexey Kuznetsov <>
[CPT] restore packet socket

Binding of packet socket was skipped.

One tricky bit: getsockname returns "real" sockaddr length
and bind() does not accept real name, it wants sizeof(struct sockaddr_ll).

Missing bits:
- multicast list, including promisc status
- statistics is not restored

Enough for beginning, the rest requires surgery in core.


Patch from Andrey Mirkin <>
[PATCH] CPT: print file name when fail to open it

Print file name if we failed to open it.
This information will be useful for resolving problems.

Bug #83180


Patch from Andrey Mirkin <>
[PATCH] CPT: adjust UBC limits before restoring processes

Move UBC limit adjustments to a more appropriate place,
where it is actually needed.


Patch from Andrey Mirkin <>
[PATCH] CPT: fix kernel_thread error code checks

Some versions of tar return non-zero error code if it was not possible to
write warning message to stderr. So, we need to open /dev/null for it.
But during restore we will face another problem - /dev is stored on tmpfs, so
we are not able to open /dev/null and we need to create it.

Also there is another bug which came into the CPT code from the mainstream kernel_thread
helper. If our function returns an error (e.g. exec failed) it doesn't place
the correct exit code in the edi register before calling do_exit.

Bug #83183


Patch from Kirill Korotaev <>
[PATCH] sched: boot CPU can have non-zero ID (sparc)

I was blindly assuming that boot processor ID is always 0,
which was not true on SUN4U machine where boot CPU has ID 1
and 2nd CPU has ID 0. Strange, but it is.
So replace 0 with real processor id in the code.


Patch from Vasily Tarasov <>
[PATCH] Fix iowait stats in VE0

2.6.18 OVZ kernels don't account iowait time,
this value is always displayed as zero:

$ cat /proc/stat  | grep cpu
cpu  1700 5 1818 11790110 0 60 204 0
cpu0 893 4 1020 5894818 0 56 174 0
cpu1 807 1 797 5895291 0 4 29 0

This happens since calculations usually happen in idle context.
Actually there is no good definition of iowait for global VE0 context.
And the whole iowait concept is arguable, but still, let's try to account
as good as possible.

OpenVZ Bug #588


Patch from Kirill Korotaev <>
[PATCH] ppc: fix screwed OVZ syscall numbers

Fix enumeration of OVZ syscall numbers on powerpc

Thanks to Christian Kaiser for noticing this.


Patch from Vasily Tarasov <>
[PATCH] Fix debug messages when CONFIG_DEBUG_PREEMPT is used

OpenVZ kernels produce a lot of similar messages:

BUG: using smp_processor_id() in preemptible [00000001] code:
caller is io_schedule+0x22/0x53
Call Trace: ...

Two reasons for these messages:
1) we call smp_processor_id() from io_schedule/io_schedule_timeout
   without preemption disabled. minor, raw_smp_processor_id() should be used.
2) task_struct->cpus_allowed has mask of vcpus instead of pcpus.
   Therefore debug_smp_processor_id() function fails to check that the
   process can run only on one current cpu.

The patch fixes both issues.

OpenVZ Bug #577


Patch from Evgeny Kravtsunov <>
[PATCH] rename vcpu_info to vcpu_struct due to conflict with Xen

Rename vcpu_info to vcpu_struct due to conflict with Xen
which uses the same name for its data structure (sigh... globally...)

Thanks to seyko2 for testing OVZ-Xen kernel.


Patch from Alexandr Andreev <>
[PATCH] sched: fix up fairsched tick duration according to VCPU timeslice

With latest 'vcpu dynamic timeslice' patch we broke
fairsched scheduler logic a bit, which assumed,
that fairsched_schedule() must be called on each timer tick.

New bigger fairsched timeslice was introduced:
this value must be always >= vcpu timeslice

Bug #82969


Patch from Kirill Korotaev <>
[PATCH] fairsched: fix VCPU info in show regs on x86-64


Patch from Matt Mackall <>
Add data from zero-entropy random_writes directly to output pools to

Add data from zero-entropy random_writes directly to output pools to
avoid accounting difficulties on machines without entropy sources.

Tested on lguest with all entropy sources disabled.

Signed-off-by: Matt Mackall <>
Acked-by: "Theodore Ts'o" <>


Patch from Matt Mackall <>
random: fix seeding with zero entropy

Add data from zero-entropy random_writes directly to output pools to
avoid accounting difficulties on machines without entropy sources.

Tested on lguest with all entropy sources disabled.

Signed-off-by: Matt Mackall <>
Acked-by: "Theodore Ts'o" <>
Signed-off-by: Linus Torvalds <>



Patch from Kirill Korotaev <>
[PATCH] ext3: lost brelse in ext3_read_inode()

One of the error paths in ext3_read_inode() leaks bh,
since brelse is forgotten.
Move brelse under the bad_inode label, so that bh is freed.

Signed-Off-By: Kirill Korotaev <>


Patch from Vasily Averin <>
[PATCH] ext3: orphan list corruption on bad inodes

This patch fixes ext3 orphan list corruption due to bad inodes
created in ext3_read_inode().

Trying to catch orphan list corruption in OpenVZ we found
the following debug messages in the logs:

May 30 10:39:38 df-rs-l24 kernel:
 EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
 Inode 00000101a15b7840: orphan list check failed!
 00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
 Call Trace: [<ffffffff80211ea9>] ext3_destroy_inode+0x79/0x90
  [<ffffffff801a2b16>] sys_unlink+0x126/0x1a0
  [<ffffffff80111479>] error_exit+0x0/0x81
  [<ffffffff80110aba>] system_call+0x7e/0x83

The first message says that the unlinked inode has i_nlink=0;
ext3_unlink() then adds this inode to the orphan list.

The second message means that this inode has not been removed from the orphan list, and
the inode dump shows that i_fop = &bad_file_ops.

Then I've discovered that bad_file_ops can be set only in make_bad_inode().
ext3_read_inode() can call make_bad_inode() without any
error/warning messages in the following case:
        if (inode->i_nlink == 0) {
                if (inode->i_mode == 0 ||
                    !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
                        /* this inode is deleted */
                        brelse (bh);
                        goto bad_inode;

i.e when
   inode->i_nlink == 0
   !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)

A bad inode can live for some time and ext3_unlink() can then add it to the orphan list,
but ext3_delete_inode() doesn't delete this inode from the orphan list,
since the inode is bad. As a result we have orphan list corruption
detected in ext3_destroy_inode().

This issue is present in rhel4/rhel5/mainstream kernels too.

Bug #83419


Patch from Vasily Averin <>
[PATCH] ext3: orphan list corruption on bad inodes (v2)

Changes to previous patch:
  instead of fixing ext3_unlink(), it is better to fix all the paths
  where a bad inode can be found and used, i.e. lookup() and get_parent()

After ext3 orphan list check has been added into ext3_destroy_inode()
(please see my previous patch) the following situation has been detected:
 EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
 Inode 00000101a15b7840: orphan list check failed!
 00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
 Call Trace: [<ffffffff80211ea9>] ext3_destroy_inode+0x79/0x90
  [<ffffffff801a2b16>] sys_unlink+0x126/0x1a0
  [<ffffffff80111479>] error_exit+0x0/0x81
  [<ffffffff80110aba>] system_call+0x7e/0x83

The first message says that the unlinked inode has i_nlink=0;
ext3_unlink() then adds this inode to the orphan list.

The second message means that this inode has not been removed from the orphan list.
The inode dump showed that i_fop = &bad_file_ops, which can be set in
make_bad_inode() only. Then I found that ext3_read_inode() can call
make_bad_inode() without any error/warning messages,
for example in the following case:
        if (inode->i_nlink == 0) {
                if (inode->i_mode == 0 ||
                    !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
                        /* this inode is deleted */
                        brelse (bh);
                        goto bad_inode;

A bad inode can live for some time and ext3_unlink() can add it to the orphan list, but
ext3_delete_inode() does not delete this inode from the orphan list. As a
result we can have orphan list corruption detected in ext3_destroy_inode().

However, it is not clear to me how to fix this issue correctly.

As far as I can see, is_bad_inode() is called after iget() in all places except
ext3_lookup() and ext3_get_parent(). I believe it makes sense to add a bad inode
check to these functions too and call iput() if a bad inode is detected.

Signed-off-by:  Vasily Averin <>


Patch from Alexey Kuznetsov <>
[IA64] ptrace returns garbage for NaT bits

An old bug. Nobody needed those NaT bits, so that it was not noticed.


Patch from Vasily Averin <>
[PATCH] disable "Disabled Privacy Extensions" messages

Hide annoying, useless messages about disabled
IPv6 privacy extensions, which are always triggered by loopback:
"lo: Disabled Privacy Extensions"

Bug #83651


Patch from Vasily Averin <>
[NET]: "wrong timeout value" in sk_wait_data() v2

sys_setsockopt() does not properly check timeout values for
SO_RCVTIMEO/SO_SNDTIMEO; for example, it is possible to set negative timeout
values. POSIX does not define the behaviour of sys_setsockopt for negative
timeouts, but requires that setsockopt() shall fail with -EDOM if the send and
receive timeout values are too big to fit into the timeout fields in the socket
structure. In the current implementation a negative timeout can lead to error
messages like "schedule_timeout: wrong timeout value".

Proposed patch:
- checks tv_usec and returns -EDOM if it is wrong
- does not allow setting negative timeout values (sets 0 instead) and outputs
a ratelimited informational message about such attempts.
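
The two checks can be sketched in user space roughly as follows. This is a simplified model rather than the kernel code: struct toy_timeval, TOY_USEC_PER_SEC and toy_check_timeout() are hypothetical names, and the ratelimited log message is only indicated by a comment.

```c
#include <errno.h>

#define TOY_USEC_PER_SEC 1000000L   /* microseconds in one second */

struct toy_timeval {
    long tv_sec;
    long tv_usec;
};

/* Returns 0 and stores the timeout that would be applied,
 * or -EDOM when tv_usec is outside [0, 1000000). */
int toy_check_timeout(struct toy_timeval in, struct toy_timeval *out)
{
    if (in.tv_usec < 0 || in.tv_usec >= TOY_USEC_PER_SEC)
        return -EDOM;

    if (in.tv_sec < 0) {
        /* Negative timeout: use 0 instead. The real patch also prints
         * a ratelimited message here. */
        out->tv_sec = 0;
        out->tv_usec = 0;
        return 0;
    }

    *out = in;
    return 0;
}
```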

Signed-off-By: Vasily Averin <>
Signed-off-by: David S. Miller <>

X-Git-Tag: v2.6.22-rc3


Patch from Alexandr Andreev <>
[PATCH] NFS: Fix race in nfs_release_page()

invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.

Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.

Signed-off-by: Trond Myklebust <>

Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.

[ cleanup]
Signed-off-by: Peter Zijlstra <>

Cc: Trond Myklebust <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

From Alexandr:
This 'new' invalidate/release logic also fixes our problem with
mmap/write/read data corruption when several processes use the same
mmaped file on NFS

Bug #81896
Mainstream commit: e3db7691e9f3dff3289f64e3d98583e28afe03db


Patch from Denis Lunev <>
[PATCH] nfs: oops during LTP over NFS (direct io)

Problem reported by Denis Lunev and QA, fix from mainstream

An incorrect comparison of "int" and "unsigned int" variables is fixed in
nfs_direct_read_schedule and nfs_direct_write_schedule.
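
As a general illustration of why mixing "int" and "unsigned int" in a comparison is dangerous (this is not the NFS code itself, and both function names are hypothetical): C converts the signed operand to unsigned first, so a negative value such as an error code turns into a huge positive number.

```c
/* What `a < b` means in C when a is int and b is unsigned int:
 * a is converted to unsigned int before comparing. The cast below
 * just makes that implicit conversion visible. */
int less_after_conversion(int a, unsigned int b)
{
    return (unsigned int)a < b;
}

/* Comparison by mathematical value: any negative a is smaller than b. */
int less_by_value(int a, unsigned int b)
{
    return a < 0 || (unsigned int)a < b;
}
```

With a = -1 and b = 1, the first function reports "not less" while the second reports "less".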

Bug #81589


Patch from Denis Lunev <>
[PATCH] nfs: AB-BA deadlock on rpc_sched_lock/queue->lock locks

This patch fixes possible AB-BA deadlock for rpc_sched_lock/queue->lock
in rpc_run_child().

The normal sequence is presented in rpc_set_active:
 - rpc_sched_lock goes first
 - queue->lock is nested.

Bug #82518


Patch from Trond Myklebust <>
[PATCH] nfs: fix req refcnt leak preventing umount

Original analysis by Denis Lunev:
- nfs_direct_req_alloc creates dreq with dreq->kref->refcount == 2
- on the success path kref_put is called in
      nfs_direct_read_schedule ->  nfs_direct_complete
   and in nfs_direct_wait
- on the error path only the first put occurs
The same problem occurs on the direct_write path
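
The leak pattern from this analysis can be modelled with a toy reference counter. This is a sketch under assumptions, not the NFS code: struct toy_req and all toy_* functions are hypothetical, and the "fixed" variant only shows that every path must drop both references, not how the mainstream patch arranges it.

```c
#include <stdbool.h>

struct toy_req {
    int refcount;   /* starts at 2, as in the analysis above */
    bool freed;     /* set when the last reference is dropped */
};

void toy_req_init(struct toy_req *r)
{
    r->refcount = 2;
    r->freed = false;
}

void toy_req_put(struct toy_req *r)
{
    if (--r->refcount == 0)
        r->freed = true;
}

/* Leaky variant: the error path drops only one of the two references. */
void toy_submit_leaky(struct toy_req *r, bool schedule_fails)
{
    toy_req_init(r);
    toy_req_put(r);          /* first put happens on both paths */
    if (schedule_fails)
        return;              /* second reference is never dropped */
    toy_req_put(r);          /* second put, success path only */
}

/* Balanced variant: both references are dropped on every path. */
void toy_submit_balanced(struct toy_req *r, bool schedule_fails)
{
    (void)schedule_fails;    /* outcome no longer affects the put count */
    toy_req_init(r);
    toy_req_put(r);
    toy_req_put(r);
}
```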

Mainstream patch version from Trond Myklebust <>:
The current code is leaking a reference to dreq->kref when the calls to
nfs_direct_read_schedule() and nfs_direct_write_schedule() return an

Thanks to Denis V. Lunev for spotting the bug and proposing the original

Signed-off-by: Trond Myklebust <>


Patch from Akinobu Mita <>
use simple_read_from_buffer in kernel/

Cleanup using simple_read_from_buffer() for /dev/cpuset/tasks and

Cc: Paul Jackson <>
Cc: Randy Dunlap <>

Signed-off-by: Akinobu Mita <>

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

X-Git-Tag: v2.6.22-rc1


Patch from Patrick McHardy <>
[NETFILTER]: {ip,nf}_conntrack_sctp: fix remotely triggerable NULL ptr dereference

When creating a new connection by sending an unknown chunk type, we don't
transition to a valid state, causing a NULL pointer dereference in
sctp_packet when accessing sctp_timeouts[SCTP_CONNTRACK_NONE].

Fix by not creating a new conntrack entry if the initial state is invalid.

Noticed by Vilmos Nebehaj <>

CC: Kiran Kumar Immidi <>
Signed-off-by: Patrick McHardy <>


Patch from Alexey Dobriyan <>
[PATCH] seqfile: bash can hang in a loop reading from proc file

Original problem: in some circumstances seq_file interface can present
infinite proc file to the following script when normally said proc file
is finite:

	while read line; do
		[do something with $line]
	done </proc/$FILE

To implement such a loop, bash essentially does

	read(0, buf, 128);
	[find \n]
	lseek(0, -difference, SEEK_CUR);
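
Spelled out as compilable C, the read-then-seek-back pattern described above might look like the sketch below. It follows the description in this commit message rather than bash's actual source; read_line_seek_back() is a hypothetical name and the 128-byte chunk size is taken from the text.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one chunk, keep the first line, and seek back over the rest.
 * Returns the line length (without '\n'), or -1 on EOF or error. */
ssize_t read_line_seek_back(int fd, char *line, size_t cap)
{
    char buf[128];
    ssize_t n;
    char *nl;
    size_t len, used;

    if (cap == 0)
        return -1;

    n = read(fd, buf, sizeof(buf));
    if (n <= 0)
        return -1;

    nl = memchr(buf, '\n', (size_t)n);
    len = nl ? (size_t)(nl - buf) : (size_t)n;   /* bytes in the line */
    used = nl ? len + 1 : len;                   /* line plus newline */

    if (len >= cap)
        len = cap - 1;                           /* truncate to fit */
    memcpy(line, buf, len);
    line[len] = '\0';

    /* Give back what was read past the newline: this is the negative
     * relative lseek() that triggers traverse() in seq_file. */
    if ((size_t)n > used)
        lseek(fd, -(off_t)((size_t)n - used), SEEK_CUR);

    return (ssize_t)len;
}
```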

Consider a proc file that prints a list of objects, each of which consists of many
lines, each line shorter than 128 bytes.

There are two objects in the list, with ->index'es being 0 and 1. The current one is 1,
as bash prints the second object line by line.

Imagine first object being removed right before lseek().
traverse() will be called, because there is negative offset.
traverse() will reset ->index to 0 (!).
traverse() will call ->next() and get NULL in any usual iterate-over-list
code using list_for_each_entry_continue() and such. There is one object in
list now after all...
traverse() will return 0, lseek() will update file position and pretend
everything is OK.

So, what we have now: ->f_pos points to the place where the second object would be
printed, but ->index is 0. seq_read(), instead of returning EOF, will start
printing the first line of the first object every time it's called, until enough
objects are added for ->f_pos to return in bounds.

The fix is to update ->index only after we're sure we saw enough objects down
the road.

Signed-off-by: Alexey Dobriyan <>

Bug #82819


Patch from Kirill Korotaev <>
[PATCH] ubc: fix compilation with CONFIG_UBC_DEBUG_IO=y

During the rework of UBC /proc, compilation with UBC_DEBUG_IO was
broken a bit.


Patch from Kirill Korotaev <>
[PATCH] ubc: export ubc helpers for case CONFIG_UNIX=m

Export ub_sock_getwres_other, since unix sockets can
call it from the module (unix.ko) when CONFIG_UNIX=m.

Thanks to Rafael Isturiz for having a non-standard config  :)  and reporting this.


Patch from Kirill Korotaev <>
[PATCH] VE cpu stats should be exported to user space in clocks

VE cpu stats should be exported to user space in clocks instead of jiffies.
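
The conversion amounts to rescaling from the kernel tick rate to the user-visible one. A minimal sketch, assuming HZ = 1000 and USER_HZ = 100 (both values, and the toy_* names, are assumptions made for this example; the kernel has its own helper for this):

```c
#define TOY_HZ      1000ULL   /* assumed kernel tick rate */
#define TOY_USER_HZ 100ULL    /* assumed user-visible clock tick rate */

/* Rescale a jiffies count to clock ticks (integer division rounds down). */
unsigned long long toy_jiffies_to_clocks(unsigned long long jiffies)
{
    return jiffies * TOY_USER_HZ / TOY_HZ;
}
```

Under these assumptions one second of CPU time is 1000 jiffies but 100 clocks, so exporting raw jiffies would overstate usage tenfold for tools that expect clocks.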


Patch from Alexey Dobriyan <>
[PATCH] Unalias VE_IP_NAT for ip_nat and iptable_nat modules

If ip_nat and ip_tables modules are loaded before VE start, and
iptable_nat after VE start, on VE stop kernel will crash in
ipt_unregister_table() attempting to unregister NULL table.

Split the VE_IP_NAT flag, which was responsible for both modules.

OpenVZ Bug #607


Patch from Vasily Tarasov <>
[PATCH] arp: allow set arp cache entries from VE

It is secure since later we use the __dev_get_by_name() function, which is
aware of the current context.


Patch from Andrey Mirkin <>
[PATCH] veth: rework VE traffic filtering

MAC filtering in veth_xmit() was a bit incorrect:
broadcasts and multicasts were allowed from VE.
Rearrange the code, make it clearer and asymmetric :/


Patch from Kirill Korotaev <>
[PATCH] veth: multicasts should be forwarded as well

Right now veth_xmit passes broadcasts only.
It is a bug. Multicasts should be allowed as well.
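
For context, broadcast and multicast frames are told apart by the destination MAC address: multicast addresses have the lowest bit of the first octet set, and broadcast is the all-ones address (so it also counts as multicast). The helpers below are a user-space sketch with hypothetical names, not the veth code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Multicast (including broadcast): group bit of the first octet is set. */
bool toy_is_multicast(const uint8_t mac[6])
{
    return (mac[0] & 0x01) != 0;
}

/* Broadcast: all six octets are 0xff. */
bool toy_is_broadcast(const uint8_t mac[6])
{
    for (int i = 0; i < 6; i++)
        if (mac[i] != 0xff)
            return false;
    return true;
}
```

A filter that tests only for broadcast therefore drops frames sent to addresses such as 01:00:5e:00:00:01, which matches the bug described here.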

Thanks to Daniel Pittman for noticing this.


Patch from Denis Lunev <>
[PATCH] disable OOM_DISABLE inside VE

Prevent disabling of OOM from inside a VE. Basically, it is safe to
allow priority changes inside a VE, as in the normal case we select a UB
first and then a process inside that UB.


Patch from Alexey Kuznetsov <>
[PATCH] VE: reparent threaded init correctly

If init is multithreaded (yes, imagine, this happens :-)),
its threads are reparented to VE init, so that we get parents
in the same thread group. Nothing especially bad happens,
only checkpointing cannot restore such sick configuration.


Patch from Alexey Dobriyan <>
[PATCH 1/2] VE: allow proc setattr on local proc entries

If PDE is local to VE, there is no reason to not allow setattr on it --
changes won't affect corresponding global PDE and other VEs.

OpenVZ Bug #509


Patch from Alexey Dobriyan <>
[PATCH] proc: brown paper bag bug in proc's ->setattr

->setattr is called for something innocent like mtime updates, so
outright banning of ->setattr on global proc entries was sadistic.

Check if ->setattr is called with mask indicating MODE, UID, GID change
and check for globalness only in this case.
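
The check described above can be sketched as a mask test. The TOY_ATTR_* values mirror the kernel's ATTR_* bits as far as I recall them, and the function name, the entry_is_global parameter and the -EPERM return value are assumptions made for this example:

```c
#include <errno.h>

#define TOY_ATTR_MODE  1u    /* permission bits change */
#define TOY_ATTR_UID   2u    /* owner change */
#define TOY_ATTR_GID   4u    /* group change */
#define TOY_ATTR_MTIME 32u   /* modification time update */

/* Refuse only MODE/UID/GID changes on global entries; everything else,
 * such as an mtime update, is let through. */
int toy_proc_setattr_check(unsigned int ia_valid, int entry_is_global)
{
    const unsigned int guarded = TOY_ATTR_MODE | TOY_ATTR_UID | TOY_ATTR_GID;

    if ((ia_valid & guarded) && entry_is_global)
        return -EPERM;
    return 0;
}
```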

OpenVZ Bug #604
OpenVZ Bug #509


Patch from Alexey Dobriyan <>
[PATCH] VE: make /proc/kmsg to be VE local

Some people are used to doing "chmod g+r /proc/kmsg". Make PDE corresponding
to /proc/kmsg local to VE, so it's possible to setattr it.

OpenVZ Bug #509


Patch from Vitaliy Gusev <>
Fix LTP test failure in syslog test.

The LTP failure is minor and simple: the test calls syslog(2) with wrong arguments
and expects an error. But syslog() returns 0 since a VE doesn't
have a real console and console loglevel.

Thanks Christian Kaiser2 <> for noticing this.


Patch from Kirill Korotaev <>
[PATCH] init vps_dumpable early on exec

Since CPT uses vps_dumpable flag now for determining
external processes on checkpointing, we need to initialize
it earlier on mm creation on exec. Otherwise it can race.


Patch from Alexey Dobriyan <>
[PATCH] VZDQ: Fix lockdep warning about s_umount dependency

Lockdep learns false dependency due to vz_restore_symlink()
and later complains about possible circular locking when quotaon is

Temporarily up ->s_umount semaphore to workaround this.

OpenVZ Bug #585


Patch from Evgeny Kravtsunov <>
[PATCH] Fixes for Xen arch compilation / work


Patch from Alexandr Andreev <>
[PATCH] VE: ve0 processes initialization

VE0 processes were initialized twice:

  • in copy_process()
  • in prepare_ve0_process() from init_ve_system()

This is redundant and unneeded. Leading to wrong ve0.pcounter