Download/kernel/rhel5/028stab035.1/changes

65,859 bytes added, 01:17, 21 March 2008
created
== Changes ==
* Rebase to RHEL5 8.1.4 kernel.
* Mainstream security fixes.
* Improvements, optimizations, fixes in most subsystems.
* DRBD update to 8.0.3.
* Xen/OpenVZ fixes to run RHEL5 in Dom0/U.
* Fixes for SPARC and PPC.
=== Config changes ===
Added:
* +<code>CONFIG_LEGACY_PTYS=y</code>
* +<code>CONFIG_LEGACY_PTY_COUNT=256</code>
* +<code>CONFIG_GFS_FS=m</code>
* +<code>CONFIG_PREEMPT_VOLUNTARY=y</code>
* +<code>CONFIG_PREEMPT_BKL=y</code>
Removed:
* CONFIG_UBC_DEBUG_KMEM
<includeonly>[[{{PAGENAME}}/changes#Patches|{{Long changelog message}}]]</includeonly><noinclude>
=== Patches ===

==== diff-arch-4gb-ldt-irqs-20070515 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
4GB split LDT reload fix from RHEL4u5

</div>

==== diff-cpt-features-known-mask-20070514 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[CPT] 2.6.9 &lt;-&gt; 2.6.18 features mask compatibility issue

<pre class="simple">
Use VE_FEATURES_OLD mask for old (&lt; 2.6.18 kernel) CPT images.
</pre>

Bug #81468

</div>

==== diff-cpt-futex-eintr-20070510 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] too aggressive sys_futex() restart

<pre class="simple">
Checkpointing used to enforce a restart of sys_futex() even when
it returns -EINTR, to work around the sick return value of FUTEX_WAIT.
Of course, this is wrong (e.g. it means a timed FUTEX_WAIT is restarted
with its original timeout :-(), but we do not have much choice if we do
not want to break everything.

At least one case can be relaxed. If we have a signal pending
when we restore, we must not restart. This pending signal would
interrupt FUTEX_WAIT in any case. This fixes sem_wait().
</pre>
</div>

==== diff-cpt-kill-external-processes-b-20070515 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;

<pre class="simple">
We have a problem with external processes.
If someone enters a VE, forks and does some work without exec,
the process is not considered external (its pids are virtual),
but some of its files (e.g. libs) can be from the HN, i.e. external.

Temporary and quick fix for this bug:
On suspend kill processes which have mm-&gt;vps_dumpable == 0.
</pre>

Bug #81722

</div>

==== diff-cpt-prevent-vm-changes-20070510 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] prevent changes of VM after VE was checkpointed

<pre class="simple">
It is possible that a process's VM is changed after the VE is checkpointed
and killed. At the moment this happens when a process has set clear_parent_tid
or robust list pointers. It was not considered a problem, because the
VM is about to be destroyed in any case.

But one case was missed: the corresponding VM areas could be mapped
to a file. If it is not deleted, the change will reach the file system
and migrate. Oops. E.g. a shared locked futex will be unlocked after
migration (glibc tst-robust8 test).
</pre>
</div>

==== diff-cpt-suspend-cleanup-20070510 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] VE suspend cleanups

<pre class="simple">
The patch fixes one bug. Sometimes one process sleeps
in an uninterruptible state waiting for some event depending
on another process, which could be suspended.

I know three such cases:

1. A process did vfork() and waits until the child calls exec().
2. A thread did exec() and waits until its siblings die.
3. A thread is dumping core and waits until its siblings stop.

We detected case #1 directly by looking at tsk-&gt;vfork_done.
In the other cases suspend timed out and failed, which is obviously
incorrect. It is possible to handle cases #2 and #3 like we did with vfork,
but it is not necessary. The patch suggests a universal solution:
we split suspend into several shorter rounds. The first round
tries to suspend for 200 msec; if it fails, the VE is unfrozen
and suspend is retried after some time. We repeat the attempts
with increasing timeout until the VE is frozen or the major timeout (10 sec)
expires.

Besides that, the patch reorders suspend code, so that it becomes
more or less readable.
</pre>
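The round-based retry can be modelled roughly as follows. This is an illustrative Python sketch, not the kernel code: the 200 msec first round and the 10 sec major timeout come from the description, while the doubling factor, the function names and the callbacks are assumptions, and the pause between rounds is left out.

```python
def plan_suspend_rounds(first_ms=200, major_ms=10_000, growth=2):
    """Return per-round timeouts (msec) whose sum never exceeds major_ms."""
    if first_ms <= 0 or growth < 1:
        raise ValueError("first_ms must be positive and growth at least 1")
    rounds = []
    elapsed = 0
    timeout = first_ms
    while elapsed < major_ms:
        # Never let a round run past the major timeout.
        this_round = min(timeout, major_ms - elapsed)
        rounds.append(this_round)
        elapsed += this_round
        # Assumed policy: the description only says "increasing timeout".
        timeout *= growth
    return rounds


def suspend_in_rounds(try_freeze, unfreeze, rounds):
    """Try to freeze the VE once per round; thaw it after each failed round."""
    for timeout_ms in rounds:
        if try_freeze(timeout_ms):
            return True
        # Thawing lets tasks stuck behind a frozen sibling make progress.
        unfreeze()
    return False
```

With the assumed doubling, the plan is 200, 400, 800, 1600, 3200 and a final 3800 msec round that is cut short by the 10 sec limit.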
</div>

==== diff-fairsched-dyn-vcpu-timeslice-20070518 ====
<div class="change">

Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[SCHED] optimization: dynamic vcpu_timeslice

<pre class="simple">
vcpu_timeslice == -1 now has special meaning (and -1 is default value
now). In this case, actual vcpu_timeslice value will depend on number of
VCPU's ready to run:

assume N = ready_vcpus / nr_pcpus

for N &lt;= 1, vcpu_timeslice will be 8
1 &lt; N &lt;= 2, vcpu_timeslice = 4
2 &lt; N &lt;= 3, vcpu_timeslice = 2
3 &lt; N &lt;= 4, vcpu_timeslice = 1
N &gt; 4, vcpu_timeslice = 0

This patch significantly improves performance of the 'context switch'
test from unixbench-4.1.0-wht-1 when several instances of the test are
running.

On a host with 16 CPU's:

# cd unixbench-4.1.0-wht-1
# echo 0 &gt; /proc/sys/kernel/vcpu_timeslice
# ./Run context1 16
108.4
# echo -1 &gt; /proc/sys/kernel/vcpu_timeslice
# ./Run context1 16
435.3
</pre>
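The mapping from load to timeslice in the table above can be written out as a small function. This is a Python sketch for illustration only; the function name is hypothetical, and treating N as an exact ratio (rather than an integer quotient) is an assumption based on the ranges given.

```python
def dynamic_vcpu_timeslice(ready_vcpus, nr_pcpus):
    """Return the timeslice used when vcpu_timeslice == -1."""
    if nr_pcpus <= 0:
        raise ValueError("nr_pcpus must be positive")
    # N <= limit is tested as ready_vcpus <= limit * nr_pcpus to stay in integers.
    for limit, timeslice in ((1, 8), (2, 4), (3, 2), (4, 1)):
        if ready_vcpus <= limit * nr_pcpus:
            return timeslice
    return 0  # N > 4
```

For example, on the 16-CPU host from the benchmark, 16 ready VCPUs give N = 1 and a timeslice of 8, while 17 ready VCPUs already drop it to 4.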
</div>

==== diff-fairsched-vcpuoff-comp-20070426 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[SCHED] compilation fix in case CONFIG_SCHED_VCPU=n

<pre class="simple">
This patch fixes compilation of OVZ kernel with CONFIG_SCHED_VCPU=n

Note: a VE can't be started in any case because the fairsched syscalls return
ENOSYS, but I fixed fairsched and checked that a VE can be started/stopped
- it looks like it works )).
</pre>
</div>

==== diff-ms-cfq-allow-merge-c-20070507 ====

<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[PATCH] merging of async requests was backported a bit incorrectly

<pre class="simple">
patch diff-ms-cfq-allow-merge-b-20070424 was ported a bit incorrectly.
It resulted in wrong async requests merging.
</pre>

Bug #80857

</div>

==== diff-ms-dcache-fix-quadratic-shrink ====

<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;

<pre class="simple">
Backport of
commit d52b908646b88cb1952ab8c9b2d4423908a23f11
Author: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Date: Tue May 8 00:23:46 2007 -0700

fix quadratic behavior of shrink_dcache_parent()

The time shrink_dcache_parent() takes, grows quadratically with the depth
of the tree under 'parent'. This starts to get noticeable at about 10,000.

These kinds of depths don't occur normally, and filesystems which invoke
shrink_dcache_parent() via d_invalidate() seem to have other depth
dependent timings, so it's not even easy to expose this problem.

However with FUSE it's easy to create a deep tree and d_invalidate()
will also get called. This can make a syscall hang for a very long
time.

This is the original discovery of the problem by Russ Cox:

http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826

The following patch fixes the quadratic behavior, by optionally allowing
prune_dcache() to prune ancestors of a dentry in one go, instead of doing
it one at a time.

Common code in dput() and prune_one_dentry() is extracted into a new helper
function d_kill().

shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use
the ancestry-pruner option. Only for shrink_dcache_memory() is this
behavior not desirable, so it keeps using the old algorithm.

Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;

Cc: Maneesh Soni &lt;maneesh@in.ibm.com&gt;
Acked-by: "Paul E. McKenney" &lt;paulmck@us.ibm.com&gt;
Cc: Dipankar Sarma &lt;dipankar@in.ibm.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;

Cc: Trond Myklebust &lt;trond.myklebust@fys.uio.no&gt;

Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

Additionally merged:
commit 24c32d733dd44dbc5b9dcd0b8de58e16fdbeac76
From: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Date: Tue, 8 May 2007 07:23:49 +0000 (-0700)
Subject: mm: shrink parent dentries when shrinking slab
X-Git-Tag: v2.6.22-rc1~799
X-Git-Url:
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=24c32d733dd44dbc5b9dcd0b8de58e16fdbeac76

mm: shrink parent dentries when shrinking slab

Teach the dentry slab shrinker to aggressively shrink parent dentries when
shrinking the dentry cache.

This is done to attempt to improve the situation where the dentry slab cache
gets a lot of internal fragmentation due to pages containing directory
dentries. It is expected that this change will cause some of those dentries
to be reaped earlier, and with less scanning.

Needs careful testing.

Cc: Miklos Szeredi &lt;mszeredi@suse.cz&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

Typical numbers after mkdir("foo")/chdir("foo") done N times and
immediate "time vzctl stop"

Before:
N=32768

real    1m14.529s   1m16.602s   1m16.143s
user    0m0.009s    0m0.014s    0m0.007s
sys     1m4.569s    1m6.638s    1m7.187s

After:
real    0m10.078s   0m10.080s   0m10.079s
user    0m0.007s    0m0.012s    0m0.012s
sys     0m0.055s    0m0.053s    0m0.054s

Less easy case for this patch is the following configuration

*--*--*--* ...
 \  \  \  \
  *  *  *  *

Speedup for this case is less rosy but significant anyway:

L       before       after

4096    11.40s       9.75s
8192    24.00s       16.80s
65536   15m39.897s   5m29.738s
</pre>
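The deep-tree setup used for the first set of timings (mkdir("foo")/chdir("foo") repeated N times) can be reproduced with a short script. This is a minimal Python sketch; the function name and the `root` argument are assumptions, and the script only builds the tree, it does not run "vzctl stop".

```python
import os


def make_deep_tree(root, depth, name="foo"):
    """Create `depth` nested directories under root via mkdir + chdir."""
    start = os.getcwd()
    try:
        os.chdir(root)
        for _ in range(depth):
            os.mkdir(name)
            # Relative steps avoid building one path longer than PATH_MAX.
            os.chdir(name)
    finally:
        # Restore the caller's working directory even if mkdir fails.
        os.chdir(start)
```

Running it inside a container with depth 32768 and then timing the container stop should correspond to the "Before"/"After" measurement, assuming the same test procedure.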

Bug #73640

</div>

==== diff-ms-futex-locking-bug-20070510 ====
<div class="change">
Patch from Ingo Molnar &lt;mingo@elte.hu&gt;<br/>
[PATCH] futex: PI state locking fix

<pre class="simple">
commit 21778867b1c8e0feb567addb6dc0a7e2ca6ecdec
Author: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Fri Mar 16 13:38:31 2007 -0800

[PATCH] futex: PI state locking fix

Testing of -rt by IBM uncovered a locking bug in wake_futex_pi(): the PI
state needs to be locked before we access it.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;

Cc: Chuck Ebbert &lt;cebbert@redhat.com&gt;
Cc: &lt;stable@kernel.org&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>

==== diff-ms-futex-oops-20070510 ====

<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[PATCH] PI futex oops (mainstream)

<pre class="simple">
Serialization in PI futexes is severely broken, lots of bugs, lots.
But only one is known to crash the kernel.

It is possible that a new pi state is added to pi_state_list
after the task has already done its exit cleanup. As a result, when the task
struct is released, pi_state_list remains in a corrupted state.

Locally exploitable.
</pre>
</div>

==== diff-ms-nfs-rm-warn-20070515 ====
<div class="change">
Patch from Neil Brown &lt;neilb@suse.de&gt;<br/>
[NFS] Remove warning: VFS is out of sync with lock manager

<pre class="simple">
But keep it as a dprintk

The message can be generated in a quite normal situation:
If a 'lock' request is interrupted, then the lock client needs to
record that the server has the lock, in case it does.
When we come to unlock, the server might say it doesn't, even
though we think it does (or might) and this generates the message.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;

Acked-by: Trond Myklebust &lt;trond.myklebust@fys.uio.no&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=46bae1a9a767f3ae8e636d96f9b95703df34b398

</pre>
</div>

==== diff-ms-slab-numa-bind-20070514 ====
<div class="change">

Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[SLAB] cache_reap() function must be bound, or take vcpus into account

<pre class="simple">
It must pass the actual (up-to-date) NUMA node id to drain_cache() after cond_resched().
</pre>

Bug #81234

</div>

==== diff-sysrq-debug-b-20070511 ====

<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;
<br/>
[SYSRQ] show correct sysrq help message.

<pre class="simple">
Fix sysrq-h help message which was broken by SysRq-debugger patch.
</pre>

Bug #81612

</div>

==== diff-ubc-cleanup-20070517 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;
<br/>
[BC] cleanup: remove unused functions

<pre class="simple">
Cleanup: remove a bit of unused code
</pre>
</div>

==== diff-ubc-dentry-acentry-20070510 ====

<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;
<br/>
[BC] dcache: new style of array_cache entries access

<pre class="simple">
cosmetics: new style of array_cache entries access
</pre>
</div>

==== diff-ubc-dentry-free_alien-20070510 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[BC] dcache: drain alien caches on nodes on dcache acct on

<pre class="simple">
drain alien caches on nodes on dcache walking through
dentry slabs lists when turning dcache accounting on/off
</pre>

Bug #81116

</div>

==== diff-ubc-dentry-free_block-20070510 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[BC] dcache: pass correct node number to kmem_cache_free_block()

<pre class="simple">
Fix warning in slab_put_obj() in debug kernels due to incorrect
node number passed to kmem_cache_free_block().
</pre>

Bug #81116

</div>

==== diff-ubc-move-might-sleep-20070517 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] dcache: move might_sleep() from under preempt_disable()

<pre class="simple">
dput() had might_sleep() check in the very beginning.
After renaming to dput_recursive() and adding preempt_disable() call
this check ended up in region with disabled preemption, so with
CONFIG_DEBUG_SPINLOCK_SLEEP=y and preemption on dmesg gets heavily
spammed.

So, do might_sleep() check earlier.
</pre>
</div>

==== diff-ubc-proc-rework-b-20070515 ====
<div class="change">
Patch from Pavel Emelianov &lt;xemul@openvz.org&gt;<br/>
[BC] rework /proc/bc code

<pre class="simple">
Proc files creation suffered from two disadvantages:
1. It was racy in respect to bc remove/create
2. It couldn't correctly show hierarchical beancounters

Rework showing BC info by overriding readdir and lookup
methods for inodes under /proc/bc. The entry layout itself
is kept unchanged.

Plus create new entry names /proc/bc/&lt;id&gt;/debug to show
BC's id, parent credentials and some in-kernel memory
pointers purely for debugging.

Signed-off-by: Pavel Emelianov &lt;xemul@sw.ru&gt;
</pre>
</div>

==== diff-ubc-top-beancounter-20070515 ====
<div class="change">
Patch from Pavel Emelianov &lt;xemul@openvz.org&gt;<br/>
[BC] cleanup: introduce top_beancounter helper

<pre class="simple">
Many resources are accounted to top beancounter only.
Introduce a helper to make code look nicer.

When saving the ub of a mapped page, the top beancounter must be saved,
as is done for non-mapped writes.
</pre>
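The idea of such a helper can be sketched as a walk up the beancounter hierarchy. This is an illustrative Python model, not the kernel implementation: the `Beancounter` class and its `parent` field are assumptions about how the hierarchy is linked.

```python
class Beancounter:
    """Toy beancounter: a name plus an optional parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def top_beancounter(bc):
    """Follow parent links until the top-level beancounter is reached."""
    while bc.parent is not None:
        bc = bc.parent
    return bc
```

Accounting code can then charge `top_beancounter(bc)` instead of repeating the loop at every call site.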

Bug #81224

<pre class="simple">
Signed-off-by: Pavel Emelianov &lt;xemul@sw.ru&gt;
Signed-off-by: Kirill Korotaev &lt;dev@sw.ru&gt;
</pre>
</div>

==== diff-ve-cap-bset-20070504 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;
<br/>
[PATCH] sysctl: add VE capability boundary set sysctl

<pre class="simple">
The user wishes to have virtualized kernel.cap-bound sysctl in order to
use lcap tool to observe capabilities allowed in VE. We have cap_default
field on ve_struct, that can be modified to be used as virtualized
cap_bset. This patch:
- renames cap_default -&gt; ve_cap_bset
- virtualizes cap_bset
- introduces new proc and strategy routines to handle appropriate sysctl
</pre>

{{bug|524}}

</div>

==== diff-ve-init-signals-20070514 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[VE] VE init signal delivery reworked to be similar to host

<pre class="simple">
Prevent VE init from receiving unexpected signals sent from VE
including *fatal* ones. Signals sent from VE0 are still allowed,
e.g. for fast VE stop.
Fix for sys_reboot called from VE to force VE death
(SIGKILL is sent in the context of VE).
</pre>

{{bug|533}}

</div>

==== diff-ve-net-bridge-via-phys-dev2-20070514 ====
<div class="change">
Patch from Dmitry Mishin &lt;dim@openvz.org&gt;<br/>
[BRIDGE] bridge deliver to original eth0 device

<pre class="simple">
- now packets are input to the local system as they are coming from phys
device only;
- fixed bunch of bugs with VE &lt;-&gt; HN communications.
</pre>
</div>

==== diff-ve-nf-ipt6-aliasing-20070515 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[VE] unalias IPv6 iptables bit mask from IPv4

<pre class="simple">
VE_IP_MANGLE flag is used as mask for both IPv4 and IPv6 modules which
is no-no, because ip6table_mangle can be loaded after VE start.

Split VE_IP_IPTABLES into VE_IP_IPTABLES and VE_IP_IPTABLES6.
Same for VE_IP_FILTER and VE_IP_MANGLE.
Choose numbers so as not to conflict with the vzctl header.

Temporarily mirror VE_IP_IPTABLES into the IPv6 mask.
Once vzctl starts doing the right thing, this mirroring can be dropped.
</pre>
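The split and the temporary mirroring can be illustrated with plain bit masks. This is a Python sketch under loud assumptions: the bit positions are invented for the example, the name VE_IP_MANGLE6 is assumed by analogy with VE_IP_IPTABLES6, and both helper functions are hypothetical.

```python
# Hypothetical bit positions, for illustration only. The real values are
# defined in the kernel and must not conflict with the vzctl header.
VE_IP_IPTABLES = 1 << 0
VE_IP_MANGLE = 1 << 1
VE_IP_IPTABLES6 = 1 << 8
VE_IP_MANGLE6 = 1 << 9  # assumed name; the description does not spell it out


def mirror_iptables_mask(requested):
    """Temporarily copy the IPv4 VE_IP_IPTABLES bit into the IPv6 mask."""
    if requested & VE_IP_IPTABLES:
        requested |= VE_IP_IPTABLES6
    return requested


def module_allowed(ve_mask, module_bit):
    """A module is permitted in the VE only if its own bit is set."""
    return (ve_mask & module_bit) == module_bit
```

With separate bits, enabling IPv4 mangle no longer implies anything about ip6table_mangle, which is the aliasing the patch removes.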

{{bug|561}}

</div>

==== diff-ve-nfs-abortset-20070504 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[NFS] fix NFS auto-umount timeout sysctl

<pre class="simple">
This patch:
- changes the timeout units from jiffies to seconds
- fixes assignment from userspace (was impossible, since UINT_MAX was treated as negative)
</pre>
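The second point is the usual signed/unsigned pitfall. The Python sketch below only illustrates why UINT_MAX looks negative when the value is read as a signed 32-bit integer; the function name is hypothetical and the actual sysctl handler is not shown in the source.

```python
UINT_MAX = 0xFFFFFFFF


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= UINT_MAX
    # Flip the sign bit, then subtract its weight: two's complement decoding.
    return (value ^ 0x80000000) - 0x80000000
```

Here `as_signed_32(UINT_MAX)` evaluates to -1, so a handler that rejects negative input would refuse the value.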
</div>

==== diff-ve-nfs-hostcache-20070510 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[NFS] virtualize NLM hosts cache

<pre class="simple">
This patch virtualizes NLM hosts cache
</pre>

Bug #74374

</div>

==== diff-ve-nfs-stop-20070502 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[NFS] shutdown NFS properly if hung

<pre class="simple">
This patch properly shuts down NFS if it is stalled.
</pre>
</div>

==== diff-ve-nfs-stop-b-20070502 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[NFS] Compilation fix for diff-ve-nfs-stop-20070502 when built as module

<pre class="simple">
diff-ve-nfs-stop-20070502 requires some symbols to be exported.

Signed-Off-By: Kirill Korotaev &lt;dev@sw.ru&gt;
</pre>
</div>

==== diff-ve-pid-nr-fix-20070503 ====
<div class="change">
Patch from Pavel Emelianov &lt;xemul@openvz.org&gt;<br/>
Decrement nr_free counter when setting explicit vpid into ve's pidmap

<pre class="simple">

When setting explicit vpid into ve's pidmap we need to
dec nr_free counter by one.

This does not fix any BUG, it just makes pidmap information
consistent and helps to work faster when the pidmap is full.
</pre>
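A toy model shows why the counter matters. This Python sketch is illustrative only; the `PidMap` class, its fields other than nr_free, and the method names are assumptions, not the kernel's pidmap structure.

```python
class PidMap:
    """Toy pid bitmap with a free-slot counter."""

    def __init__(self, size):
        self.bits = [False] * size
        self.nr_free = size

    def set_explicit(self, pid):
        """Reserve a caller-chosen pid; return False if it is already taken."""
        if self.bits[pid]:
            return False
        self.bits[pid] = True
        self.nr_free -= 1  # the bookkeeping step the patch adds
        return True

    def is_full(self):
        # A correct nr_free lets a full map be detected without scanning bits.
        return self.nr_free == 0
```

Without the decrement, `is_full()` would keep returning False for a map filled through explicit vpids, and every allocation would scan the whole bitmap before failing.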
</div>

==== diff-ve-proc-hash-pid-dentries-20070516 ====
<div class="change">
Patch from Pavel Emelianov &lt;xemul@openvz.org&gt;<br/>
[PATCH] proc: don't hash task dentries in VE0

<pre class="simple">
When a task dies, its proc dentries, which may be hashed, are
shrunk with shrink_dcache_parent(). The problem is that
this routine doesn't guarantee that all the entries will
be flushed, and thus the pid may still be referenced from the
appropriate inode.

When we have such dentries in VE0 holding pids from a VE,
this leads to pid leakage and an inability to release the
beancounter after VE stop.

So don't hash such dentries - remove them immediately.
</pre>

Bug #80025

<pre class="simple">
Signed-off-by: Pavel Emelianov &lt;xemul@sw.ru&gt;
</pre>
</div>

==== diff-ve-showmem-locking-20070414 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[PATCH] vmalloc info during OOM locking

<pre class="simple">
vmlist_lock can't be taken under any spin_lock which is held with IRQs disabled.
This assumption is always broken for __alloc_pages.

Modified by Kirill: drop vprintstat() from show_mem() altogether
</pre>

Bug #81199

</div>

==== diff-vzdq-ppc32-comp-20070518 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[PATCH] vzdquota: compilation fix for ppc32

<pre class="simple">
While compiling on ppc32 the following error appears:

Building modules, stage 2.
MODPOST
WARNING: "__cmpdi2" [fs/vzdquota.ko] undefined!
make[2]: *** [__modpost] Error 1

The problem is that switch((long_long_var)) is not
a primitive for ppc32 gcc: libgcc.a is needed,
which is not part of the kernel.

The problem was noticed by forum user mbaranzak,
who also found its cause.
</pre>
</div>

==== diff-vzdq-putsuper-20070518 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[VZDQ] sb-&gt;put_super can be NULL in valid cases

<pre class="simple">
put_super() superblock operation was not checked for NULL in vzquota
leading to NULL dereference.
</pre>

{{bug|541}}
<br/>
Bug #81936

</div>

==== diff-i2o-procread-20070509 ====
<div class="change">
Patch from Vasiliy (vvs@):<br/>
Fixes oops on read from some i2o proc files.

<pre class="simple">
Fixes oops on read from some i2o proc files.
Minor issue because i2o_proc module is not used currently.
</pre>
</div>

==== diff-arch-4gb-xen-20070523 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
[4GB-SPLIT] Fixes required for Xen kernel compilation

</div>

==== diff-cpt-mm-deadlock-20070523 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] Fix possible deadlock in checkpointing of mm

<pre class="simple">
Learned it wrong once and did not relearn. anon_vma lock
cannot be taken under page table lock. And it is taken and
should be taken in reversed order, cpt_mm even has a special
hack due to wrong understanding: look at chunk converting ugly
spin_trylock to spin_lock.

Difference from the previous version: in one case (which does not happen normally,
but still), the page table lock could remain locked.
</pre>

Bug #82785

</div>

==== diff-cpt-wait-fix-20070518 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
While checkpointing, CPT processes can be killed due to memory shortage and tmpfs will not be saved

<pre class="simple">
While checkpointing, CPT processes can be killed due to memory
shortage, and tmpfs will then not be saved.

During restore we will see such errors:

CPT ERR: e0000002ef9c5000,111 :-2 mounting /dev/pts devpts 40000000
CPT ERR: e0000002ef9c5000,111 :rst_namespace: -2
</pre>

Bug #79854

<pre class="simple">
This happens as /dev is tmpfs now and its content was not saved during
checkpointing.

We need to check exit status of tar and iptables-save to be sure that they
exited normally.

Changes from v1:
- return -EINVAL in case of error
</pre>
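The check itself amounts to treating any abnormal exit of a helper as a checkpoint error. The Python sketch below illustrates the idea in userspace terms; it is not the CPT code, and the function name is hypothetical. Only the -EINVAL result comes from the description.

```python
import errno
import subprocess


def run_helper(argv):
    """Run a helper; return 0 on a clean exit and -EINVAL otherwise."""
    try:
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The helper could not be started at all.
        return -errno.EINVAL
    # returncode is non-zero on failure and negative when the child was
    # killed by a signal, e.g. under memory shortage.
    if result.returncode != 0:
        return -errno.EINVAL
    return 0
```

A caller would abort the checkpoint as soon as `run_helper()` returns a negative value, instead of producing an image with missing tmpfs content.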
</div>

==== diff-cpt-xen-ldt-20070523 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
[XEN] Fix LDT handling - There is one chunk LDT data only

</div>

==== diff-ms-cfq-rm-redundant-find-next-req-20070509 ====

<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;
<br/>
[PATCH] cfq: remove redundant cfq_find_next_crq() function call

<pre class="simple">
Mainstream fix. cfq_find_next_crq() will be called later.
This fix is incorporated in
http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=21183b07ee4be405362af8454f3647781c77df1b
</pre>
</div>

==== diff-ms-fuse-bug-control-fs-20070521 ====
<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 5/6] fuse: fix bug in control filesystem mount

<pre class="simple">
The BUG in fuse_ctl_add_dentry() could be triggered if the control
filesystem was unmounted and mounted again while one or more fuse
filesystems were present.

The fix is to reset the dentry counter in fuse_ctl_kill_sb().

Bug reported by Florent Mertens.

Signed-off-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;

Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>

==== diff-ms-fuse-dentry-parent-20070521 ====
<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 3/6] fuse: fix dereferencing dentry parent

<pre class="simple">
There's no locking for -&gt;d_revalidate, so fuse_dentry_revalidate() should use
dget_parent() instead of simply dereferencing -&gt;d_parent.

Due to topology changes in the directory tree the parent could become negative
or be destroyed while being used. There hasn't been any reports about this
yet.

Signed-off-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</pre>
</div>

==== diff-ms-fuse-mknod-of-regular-file-20070521 ====

<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH] fuse: fix mknod of regular file

<pre class="simple">
The wrong lookup flag was tested in -&gt;create() causing havoc (error or
Oops) when a regular file was created with mknod() in a fuse
filesystem.

Thanks to J. Cameijo Cerdeira for the report.

Kernels 2.6.18 onward are affected. Please apply to -stable as well.

Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
</pre>
</div>

==== diff-ms-fuse-nlookup-20070521 ====
<div class="change">

Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 1/6] fuse: locking fix for nlookup

<pre class="simple">
An inode could be returned by independent parallel lookups, in this case an
update of the lookup counter could be lost resulting in a memory leak in
userspace.

Signed-off-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;

Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</pre>
</div>

==== diff-ms-fuse-oops-in-lookup-20070521 ====
<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 4/6] fuse: fix Oops in lookup

<pre class="simple">
Fix bug in certain error paths of lookup routines. The request object was
reused for sending FORGET, which is illegal. This bug could cause an Oops
in 2.6.18. In earlier versions it might silently corrupt memory, but this
is very unlikely.

These error paths are never triggered by libfuse, so this wasn't noticed
even with the 2.6.18 kernel, only with a filesystem using the raw kernel
interface.

Thanks to Russ Cox for the bug report and test filesystem.

Signed-off-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Cc: &lt;stable@kernel.org&gt;

Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</pre>
</div>

==== diff-ms-fuse-spurious-bug-20070521 ====
<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 2/6] fuse: fix spurious BUG

<pre class="simple">
Fix a spurious BUG in an unlikely race, where at least three parallel lookups
return the same inode, but with different file type. This has not yet been
observed in real life.

Allowing unlimited retries could delay fuse_iget() indefinitely, but this is
really for the broken userspace filesystem to worry about.

Signed-off-by: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</pre>
</div>

==== diff-ms-fuse-validate-rootmode-20070521 ====
<div class="change">
Patch from mainstream, prepared by Dmitry Monakhov &lt;dmonakhov@openvz.org&gt;<br/>
[PATCH 6/6] fuse: validate rootmode mount option

<pre class="simple">
If rootmode isn't valid, we hit the BUG() in fuse_init_inode. Now
EINVAL is returned.

Signed-off-by: Timo Savola &lt;tsavola@movial.fi&gt;
Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
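The validation can be pictured as a file-type check on the mode bits. This Python sketch is an illustration, not the FUSE code: the function name is hypothetical and the exact set of accepted file types is an assumption. Only the EINVAL result is taken from the description.

```python
import errno
import stat

# Assumed set of file types a root inode may legitimately have.
_VALID_TYPES = (
    stat.S_IFREG, stat.S_IFDIR, stat.S_IFLNK, stat.S_IFCHR,
    stat.S_IFBLK, stat.S_IFIFO, stat.S_IFSOCK,
)


def validate_rootmode(rootmode):
    """Return 0 for a recognised file type, -EINVAL otherwise."""
    if stat.S_IFMT(rootmode) in _VALID_TYPES:
        return 0
    return -errno.EINVAL
```

A typical directory mode such as 040000 (octal) passes, while a mode with no type bits set is rejected up front instead of reaching BUG().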
</div>

==== diff-ms-powerpc-compilation-20070523 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] Fix powerpc compilation which was broken in 2.6.18.8

<pre class="simple">
Christian Kaiser reported broken powerpc compilation due to 2.6.18.8 fix:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.18.y.git;a=commitdiff;h=f102c840f7f72492a83c93fa65396fe0edcf1df6

In file included from drivers/media/video/pwc/pwc-uncompress.c:29:
include/asm/current.h: In function get_current:
include/asm/current.h:23: warning: implicit declaration of function offsetof
include/asm/current.h:23: error: expected expression before struct
make[4]: *** [drivers/media/video/pwc/pwc-uncompress.o] Error 1
make[3]: *** [drivers/media/video/pwc] Error 2
make[2]: *** [drivers/media/video] Error 2
make[1]: *** [drivers/media] Error 2
make: *** [drivers] Error 2
</pre>
</div>

==== diff-ms-sparc64-compilation-20070523 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
Fix SBUS compilation on SPARC64.

<pre class="simple">
asm/unistd.h should be included for _syscallX() define usage.
</pre>
</div>

==== diff-ubc-dentry-free_alien-b-20070521 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[BC] dentry_cache: drain alien caches only when CONFIG_NUMA is set

<pre class="simple">
Fix oops introduced by previous patch diff-ubc-dentry-free_alien-20070510:
We should deal with NUMA only when CONFIG_NUMA is set.
</pre>

Bug #82721

</div>

==== diff-ubc-ioprio-on-dispatch-check-20070523 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[BC] ioprio: account for requests in driver

<pre class="simple">
Previously we didn't take into account beancounter's requests that are
already in driver. Now we consider beancounter as empty
(no requests in it) only if there are no requests in CFQ _and_ in driver.

This patch improves fairness dramatically and fixes bug #81508
</pre>
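The changed emptiness test can be stated in a few lines. This is a Python model for illustration; the class, field and method names are assumptions and do not correspond to kernel structures.

```python
from dataclasses import dataclass


@dataclass
class BcIoState:
    """Per-beancounter request counts as seen by the I/O scheduler."""

    queued: int = 0     # requests still held in CFQ
    in_driver: int = 0  # requests dispatched to the driver, not yet completed

    def empty_old(self):
        """Previous rule: only the CFQ queues are considered."""
        return self.queued == 0

    def empty_new(self):
        """New rule: empty only if nothing is in CFQ and nothing is in the driver."""
        return self.queued == 0 and self.in_driver == 0
```

A beancounter with two requests in the driver and none queued counts as empty under the old rule but busy under the new one.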

Bug #81508

</div>

==== diff-ubc-ioprio-range-prio-20070623 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[BC] ioprio: ioprio range was checked incorrectly

<pre class="simple">
Fix ioprio range check: ioprio range is 0..7
</pre>
</div>

==== diff-ubc-ioprio-slice-scaling-20070623 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[BC] ioprio: BC slice scaling fix

<pre class="simple">
Now the CFQ BC timeslice is scaled from X to 2*X ms.
</pre>
</div>

==== diff-ve-net-veth-addr-macro-20070514 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
[VE] rename local macro ADDR() to avoid conflict with Xen

<pre class="simple">
Fix redefinition of ADDR in veth.c.
macro with name ADDR is also used by Xen in include/asm/mach-xen/asm/synch_bitops.h.
</pre>
</div>

==== diff-ve-nfs-abortcorrupt-20070523 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
Counter on RPC client counts only clones, but not real usage

<pre class="simple">
Unfortunately, the counter on an RPC client counts only clones, not real
usage. The cl_dead flag leads to client destruction in rpc_release_client
while the client is really in use. So we have to introduce a new flag with a
meaning similar to cl_dead in all places except rpc_release_client.
True recounting is too expensive.

Based on idea from Dmitry Monakhov.
</pre>

Bug #82764<br/>
Bug #82875

</div>

==== diff-ve-nfs-tcpabort-20070522 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
Add NFS timeout handle for TCP transport

</div>

==== diff-ve-xen-blktapmain-20070514 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
Fixes compilation error in Xen driver

</div>

==== diff-vzdq-aquota-group-len-20070521 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[VZDQ] Trivial fix for vzaquota.group lookup

<pre class="simple">
"vzaquota.group" was memcmp'd for only 11 symbols,
allowing names like "vzaquota.grouX" to be looked up successfully.
</pre>
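
The effect of comparing too few bytes can be illustrated with a short sketch. The function names are hypothetical and the code is not the actual lookup routine; it only shows why an 11-byte <code>memcmp</code> accepts "vzaquota.grouX" while a full-length comparison does not.

```c
#include <stddef.h>
#include <string.h>

/* Too-short comparison: only the first 11 bytes ("vzaquota.gr") are checked. */
static int is_group_file_prefix(const char *name, size_t len)
{
    return len >= 11 && memcmp(name, "vzaquota.group", 11) == 0;
}

/* Strict comparison: the length must match and every byte is compared. */
static int is_group_file_exact(const char *name, size_t len)
{
    static const char want[] = "vzaquota.group";

    return len == sizeof(want) - 1 && memcmp(name, want, len) == 0;
}
```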
</div>

==== diff-vzdq-atomic-in-buildmntlist-20070521 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[VZDQ] Fix GFP_KERNEL allocation under spin lock in vzdq_aquot_buildmntlist

<pre class="simple">
Switch allocation from GFP_KERNEL to GFP_ATOMIC under vfsmount_lock
in vzdq_aquot_buildmntlist()
</pre>
</div>

==== diff-arch-4gb-xen-cleanup-20070528 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] Cleanup 4GB split/Xen modification regarding init_tss

</div>

==== diff-cpt-aux-task-list-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] additional protected VE task list

<pre class="simple">
Nobody wants this. But no other ideas have come up so far,
which could mean it is just impossible to do painlessly.

Each task is put on one more list (vetask_auxlist),
which is protected with tasklist_lock; that is its only difference
from the normal VE task list, which is accessed by RCU rules.
</pre>
</div>

==== diff-cpt-check-iptables-modules-20070604 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: check if ip_tables are enabled before dumping them

<pre class="simple">
iptables-save returns an error if the ip_tables module is not loaded.
So do not try to dump iptables at all if this module is not loaded in the VE.
</pre>
</div>

==== diff-cpt-compilation-warning-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] fix compilation warning due to cast u64 -&gt; pointer

</div>

==== diff-cpt-compile-warning-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] one more compiler warning

</div>

==== diff-cpt-deleted-ref-20070502 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: checkpoint inodes with deleted reference

<pre class="simple">
Consider the following scenario:
1. Create file (file1) and hard link to it (file2)
2. Open file2
3. Unlink file2

After that, checkpointing fails with the following error:
Can not dump VE: Device or resource busy deleted reference to existing inode,
checkpointing is impossible

The inode in question is not deleted, but it cannot be found
from inside the checkpointed process group and is not easy to find on the disk :/

So we are trying to find another dentry with the same inode in 2 common places:
1. In inode-&gt;i_dentry alias list
2. In dir in which deleted dentry itself is located
</pre>
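
The three-step scenario above can be reproduced from user space. The following is a sketch only, assuming a POSIX system and a writable <code>/tmp</code>; the function name and the scratch directory are hypothetical. After the unlink, the descriptor still refers to an inode whose link count is non-zero, which is the case the checkpointing code has to handle.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Reproduces steps 1-3 in a scratch directory and returns the link count
 * seen through the descriptor that was opened via the removed name,
 * or -1 on any error.
 */
static long nlink_after_unlink(void)
{
    char dir[] = "/tmp/cpt-demo-XXXXXX";
    char file1[64], file2[64];
    struct stat st;
    long result = -1;
    int fd;

    if (!mkdtemp(dir))
        return -1;
    snprintf(file1, sizeof(file1), "%s/file1", dir);
    snprintf(file2, sizeof(file2), "%s/file2", dir);

    fd = open(file1, O_CREAT | O_WRONLY, 0600);   /* 1. create file1 ... */
    if (fd < 0)
        goto out_dir;
    close(fd);
    if (link(file1, file2) != 0)                  /* ... and hard link file2 */
        goto out_file1;

    fd = open(file2, O_RDONLY);                   /* 2. open file2 */
    if (fd < 0) {
        unlink(file2);
        goto out_file1;
    }
    unlink(file2);                                /* 3. unlink file2 */

    /* The inode is still alive: file1 keeps one link to it. */
    if (fstat(fd, &st) == 0)
        result = (long)st.st_nlink;
    close(fd);
out_file1:
    unlink(file1);
out_dir:
    rmdir(dir);
    return result;
}
```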

Bug #72540

</div>

==== diff-cpt-ia64-nat-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT][IA64] save/restore NaT values

<pre class="simple">
This patch closes two remaining holes in IA64 cpt implementation,
both are of no immediate practical value, but nearly impossible to fix
after we freeze layout of cpt image structs.

1. Migration between hosts with different layout of struct thread_info,
which is possible, if some new bits are added to thread_info in newer
kernels.
2. NaT bits are accurately saved and restored. This is required only
when some application uses control speculative loads, current compilers
are not able to do this, but this can change.
</pre>
</div>

==== diff-cpt-ia64-unaligned-suppress-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT][IA64] some prctl flags are forgotten

<pre class="simple">
Some apps do lots of unaligned accesses, know about this
and use prctl(). We did not save/restore those flags and
got flood of warnings after checkpointing.
</pre>
</div>

==== diff-cpt-improve-align-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] cosmetic fix to match 2.6.9 text

<pre class="simple">
In 2.6.9 it was a critical bug, image would be corrupted
because of broken alignment if we did not make this.
In 2.6.18 it is just nice.
</pre>
</div>

==== diff-cpt-inotify-core-b-20070529 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
Rework inotify changes, so that old API is still available

<pre class="simple">
Rework inotify changes, so that old API is still available
for 3rd party code like aufs.
</pre>
</div>

==== diff-cpt-namespace-deadlock-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] namespace semaphore possible deadlock

<pre class="simple">

Old bug. To my shame I knew about this, but ignored it.

Deadlock is possible in two cases:
1. tar is not a tar, but something maliciously doing mount/unmount
2. tar is a good tar, but it takes namespace semaphore for read
f.e. to read /proc/mounts. If someone in the system does mount/unmount
and is blocked taking the write semaphore, the second read semaphore deadlocks.

The fix is to drop namespace semaphore. We make mntget() on current
mnt, so that it will not disappear from under us. It still can be
unmounted with MNT_DETACH, in this case we cannot proceed with
scanning mnt list and we must not: unmounting something inside
VE while checkpointing is an obvious good reason to fail.

</pre>
</div>

==== diff-cpt-pids-leak-20070525 ====
<div class="change">
Patch from Pavel Emelianov &lt;xemul@openvz.org&gt;<br/>
[PATCH] CPT: fix potential pid leak

<pre class="simple">
When restoring the VE restore_one_signal_struct() can
occasionally leak some pids.
</pre>

Bug #82895

</div>

==== diff-cpt-restore-deleted-files-20070502 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: restore deleted files (hardlink case)

<pre class="simple">
The bug was here all the time, but it was never triggered as we never entered
the following path on checkpointing:

if (!IS_ROOT(d) &amp;&amp; d_unhashed(d)) {
    struct file *parent;
    parent = iobj-&gt;o_parent;
    if (!parent ||
        (!IS_ROOT(parent-&gt;f_dentry) &amp;&amp;
         d_unhashed(parent-&gt;f_dentry))) {
        /* Inode is not deleted, but it does not
         * have references from inside checkpointed
         * process group. We have options:
         * A. Fail, abort checkpointing
         * B. Proceed. File will be cloned.
         * A is correct, B is more complicated */
        /* Just as a hint where to create deleted file */
        if (ino-&gt;i_nlink != 0) {
            eprintk_ctx("deleted reference to existing inode, checkpointing is impossible\n");
            return -EBUSY;
        }
    } else {
&lt;&lt;&lt; HERE
        /* Refer to _another_ file name. */
        err = cpt_dump_filename(parent, 0, ctx);
        if (err)
            return err;
        if (S_ISREG(ino-&gt;i_mode) || S_ISDIR(ino-&gt;i_mode))
            dump_it = 0;
    }
}

So, in image file for deleted file we always had its content and never
a reference to another file.

The fix is straightforward: check the type of the object in the image and
restore file content if needed.
</pre>
</div>

==== diff-cpt-restore-lastpid-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] restore last pid

<pre class="simple">
Sigh. I hoped it is not necessary. It is. bash goes insane
when its children get non-monotonic pids.

The place where we store saved last pid is unusual.
Violating tradition I extend one of cpt image structs.
This should be ok: migration to older kernels will be prohibited,
migration from old to new ones is OK.
</pre>
</div>

==== diff-cpt-restore-packet-socket-20070606 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[CPT] restore packet socket

<pre class="simple">
Binding of packet socket was skipped.

One tricky bit: getsockname returns "real" sockaddr length
and bind() does not accept real name, it wants sizeof(struct sockaddr_ll).

Missing bits:
- multicast list, including promisc status
- statistics are not restored

Enough for beginning, the rest requires surgery in core.
</pre>
</div>

==== diff-cpt-rst-file-error-msg-20070601 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: print file name when fail to open it

<pre class="simple">
Print file name if we failed to open it.
This information will be useful for resolving problems.
</pre>

Bug #83180

</div>

==== diff-cpt-ubc-adjust-on-restore-c-20070601 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: adjust UBC limits before restoring processes

<pre class="simple">
Move UBC limit adjustments to a more appropriate place,
where they are actually needed.
</pre>
</div>

==== diff-cpt-wait-fix-c-20070601 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] CPT: fix kernel_thread error code checks

<pre class="simple">
Some versions of tar return non-zero error code if it was not possible to
write warning message to stderr. So, we need to open /dev/null for it.
But during restore we will face another problem - /dev is stored on tmpfs, so
we are not able to open /dev/null and we need to create it.

Also there is another bug which came into the CPT code from the mainstream kernel_thread
helper. If our function returns an error (e.g. exec failed) it doesn't place
the correct exit code into the edi register before calling do_exit.
</pre>

Bug #83183

</div>

==== diff-fairsched-boot-cpu-20070530 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] sched: boot CPU can have non-zero ID (sparc)

<pre class="simple">
I was blindly assuming that boot processor ID is always 0,
which was not true on SUN4U machine where boot CPU has ID 1
and 2nd CPU has ID 0. Strange, but it is.
So replace 0 with real processor id in the code.
</pre>
</div>

==== diff-fairsched-iowait-fix-20070624 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[PATCH] Fix iowait stats in VE0

<pre class="simple">
2.6.18 OVZ kernels don't account iowait time,
this value is always displayed as zero:

$ cat /proc/stat | grep cpu
cpu 1700 5 1818 11790110 0 60 204 0
cpu0 893 4 1020 5894818 0 56 174 0
cpu1 807 1 797 5895291 0 4 29 0

This happens since calculations usually happen in idle context.
Actually there is no good definition of iowait for global VE0 context.
And the whole iowait concept is arguable, but still, let's try to account
as well as possible.
</pre>

{{bug|588}}

</div>

==== diff-fairsched-ppc-syscalls-fix3-20070525 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] ppc: fix screwed OVZ syscall numbers

<pre class="simple">
Fix enumeration of OVZ syscall numbers on powerpc

Thanks to Christian Kaiser for noticing this.
</pre>
</div>

==== diff-fairsched-preempt-20070529 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[PATCH] Fix debug messages when CONFIG_DEBUG_PREEMPT is used

<pre class="simple">
If CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT are turned on,
OpenVZ kernels produce a lot of similar messages:

BUG: using smp_processor_id() in preemptible [00000001] code:
&lt;process&gt;/&lt;pid&gt;
caller is io_schedule+0x22/0x53
Call Trace: ...

There are two reasons for these messages:
1) we call smp_processor_id() from io_schedule/io_schedule_timeout
without preemption disabled. This is minor; raw_smp_processor_id() should be used.
2) task_struct-&gt;cpus_allowed has mask of vcpus instead of pcpus.
Therefore debug_smp_processor_id() function fails to check that the
process can run only on one current cpu.

The patch fixes both issues.
</pre>

{{bug|577}}

</div>

==== diff-fairsched-rename-vcpu-info-20070517 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
[PATCH] rename vcpu_info to vcpu_struct due to conflict with Xen

<pre class="simple">
Rename vcpu_info to vcpu_struct due to conflict with Xen
which uses the same name for its data structure (sigh... globally...)

Thanks to seyko2 for testing OVZ-Xen kernel.
</pre>
</div>

==== diff-fairsched-tickduration-20070528 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[PATCH] sched: fix up fairsched tick duration according to VCPU timeslice

<pre class="simple">
With latest 'vcpu dynamic timeslice' patch we broke
fairsched scheduler logic a bit, which assumed,
that fairsched_schedule() must be called on each timer tick.

New bigger fairsched timeslice was introduced:
this value must be always &gt;= vcpu timeslice
</pre>

Bug #82969

</div>

==== diff-fairsched-x8664-show-regs-20070528 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] fairsched: fix VCPU info in show regs on x86-64

</div>

==== diff-ms-entropy-fix-a-20070530 ====
<div class="change">
Patch from Matt Mackall &lt;mpm@selenic.com&gt;<br/>
Add data from zero-entropy random_writes directly to output pools to

<pre class="simple">
Add data from zero-entropy random_writes directly to output pools to
avoid accounting difficulties on machines without entropy sources.

Tested on lguest with all entropy sources disabled.

Signed-off-by: Matt Mackall &lt;mpm@selenic.com&gt;
Acked-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
</pre>
</div>

==== diff-ms-entropy-fix-b-20070530 ====
<div class="change">
Patch from Matt Mackall &lt;mpm@selenic.com&gt;<br/>
random: fix seeding with zero entropy

<pre class="simple">
Add data from zero-entropy random_writes directly to output pools to
avoid accounting difficulties on machines without entropy sources.

Tested on lguest with all entropy sources disabled.

Signed-off-by: Matt Mackall &lt;mpm@selenic.com&gt;
Acked-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=7f397dcdb78d699a20d96bfcfb595a2411a5bbd2
</pre>
</div>

==== diff-ms-ext3-iread-brelse-20070601 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] ext3: lost brelse in ext3_read_inode()

<pre class="simple">
One of the error paths in ext3_read_inode() leaks bh,
since brelse is forgotten.
Move brelse under the bad_inode label, so that bh is released.

Signed-Off-By: Kirill Korotaev &lt;dev@openvz.org&gt;
</pre>
</div>

==== diff-ms-ext3-orhpan-list-corruption-20070531 ====
<div class="change">
Patch from Vasily Averin &lt;vvs@openvz.org&gt;<br/>
[PATCH] ext3: orphan list corruption on bad inodes

<pre class="simple">
This patch fixes ext3 orphan list corruption due to bad inodes
created in ext3_read_inode().

Trying to catch orphan list corruption in OpenVZ we found
the following debug messages in the logs:

May 30 10:39:38 df-rs-l24 kernel:
EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
Inode 00000101a15b7840: orphan list check failed!
00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
...
Call Trace: [&lt;ffffffff80211ea9&gt;] ext3_destroy_inode+0x79/0x90
[&lt;ffffffff801a2b16&gt;] sys_unlink+0x126/0x1a0
[&lt;ffffffff80111479&gt;] error_exit+0x0/0x81
[&lt;ffffffff80110aba&gt;] system_call+0x7e/0x83

The first message says that the unlinked inode has i_nlink=0,
then ext3_unlink() adds this inode into the orphan list.

The second message means that this inode has not been removed from the orphan list, and
the inode dump shows that i_fop = &amp;bad_file_ops

Then I've discovered that bad_file_ops can be set only in make_bad_inode().
ext3_read_inode() can call make_bad_inode() without any
error/warning messages in the following case:
...
if (inode-&gt;i_nlink == 0) {
if (inode-&gt;i_mode == 0 ||
!(EXT3_SB(inode-&gt;i_sb)-&gt;s_mount_state &amp; EXT3_ORPHAN_FS)) {
/* this inode is deleted */
brelse (bh);
goto bad_inode;
...

i.e. when
inode-&gt;i_nlink == 0
and
!(EXT3_SB(inode-&gt;i_sb)-&gt;s_mount_state &amp; EXT3_ORPHAN_FS)

A bad inode can live for some time, and ext3_unlink can then add it to the orphan list,
but ext3_delete_inode() doesn't delete this inode from the orphan list,
since the inode is bad. As a result we have orphan list corruption
detected in ext3_destroy_inode().

This issue is present in rhel4/rhel5/mainstream kernels too.
</pre>

Bug #83419

</div>

==== diff-ms-ext3-orhpan-list-corruption-b-20070603 ====
<div class="change">
Patch from Vasily Averin &lt;vvs@openvz.org&gt;<br/>
[PATCH] ext3: orphan list corruption on bad inodes (v2)

<pre class="simple">
Changes from the previous patch:
instead of fixing ext3_unlink(), better fix all the paths
where a bad inode can be found and used, i.e. lookup() and get_parent()

After ext3 orphan list check has been added into ext3_destroy_inode()
(please see my previous patch) the following situation has been detected:
EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
Inode 00000101a15b7840: orphan list check failed!
00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
..
Call Trace: [&lt;ffffffff80211ea9&gt;] ext3_destroy_inode+0x79/0x90
[&lt;ffffffff801a2b16&gt;] sys_unlink+0x126/0x1a0
[&lt;ffffffff80111479&gt;] error_exit+0x0/0x81
[&lt;ffffffff80110aba&gt;] system_call+0x7e/0x83

The first message says that the unlinked inode has i_nlink=0,
then ext3_unlink() adds this inode into the orphan list.

The second message means that this inode has not been removed from the orphan list.
The inode dump showed that i_fop = &amp;bad_file_ops, and it can be set in
make_bad_inode() only. Then I've found that ext3_read_inode() can call
make_bad_inode() without any error/warning messages,
for example in the following case:
..
if (inode-&gt;i_nlink == 0) {
if (inode-&gt;i_mode == 0 ||
!(EXT3_SB(inode-&gt;i_sb)-&gt;s_mount_state &amp; EXT3_ORPHAN_FS)) {
/* this inode is deleted */
brelse (bh);
goto bad_inode;
..

A bad inode can live for some time, and ext3_unlink can add it to the orphan list, but
ext3_delete_inode() does not delete this inode from the orphan list. As
a result we can have orphan list corruption detected in ext3_destroy_inode().

However, it is not clear to me how to fix this issue correctly.

As far as I can see, is_bad_inode() is called after iget() in all places except
ext3_lookup() and ext3_get_parent(). I believe it makes sense to add a bad inode
check to these functions too and call iput if a bad inode is detected.

Signed-off-by: Vasily Averin &lt;vvs@sw.ru&gt;
</pre>
</div>

==== diff-ms-ia64-nat-ptrace-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[IA64] ptrace returns garbage for NaT bits

<pre class="simple">
An old bug. Nobody needed those NaT bits, so that it was not noticed.
</pre>
</div>

==== diff-ms-net-ipv6-privacy-msg-20070605 ====
<div class="change">
Patch from Vasily Averin &lt;vvs@openvz.org&gt;<br/>
[PATCH] disable "Disabled Privacy Extensions" messages

<pre class="simple">
Hide annoying useless messages about disabled
IPv6 privacy extensions, which are always triggered by loopback:
"lo: Disabled Privacy Extensions"
</pre>

Bug #83651

</div>

==== diff-ms-net-settimeout-20070525 ====
<div class="change">
Patch from Vasily Averin &lt;vvs@openvz.org&gt;<br/>
[NET]: "wrong timeout value" in sk_wait_data() v2

<pre class="simple">
sys_setsockopt() does not properly check timeout values for
SO_RCVTIMEO/SO_SNDTIMEO; for example, it is possible to set negative timeout
values. POSIX does not define the behaviour of sys_setsockopt for negative
timeouts, but requires that setsockopt() shall fail with -EDOM if the send and
receive timeout values are too big to fit into the timeout fields in the socket
structure.
In the current implementation a negative timeout can lead to error messages like
"schedule_timeout: wrong timeout value".

Proposed patch:
- checks tv_usec and returns -EDOM if it is wrong
- does not allow setting negative timeout values (sets 0 instead) and outputs
a ratelimited informational message about such attempts.

Signed-off-By: Vasily Averin &lt;vvs@sw.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;

X-Git-Tag: v2.6.22-rc3
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=ba78073e6f70cd9c64a478a9bd901d7c8736cfbc;hp=c883f215a23a9352097b8d17fb8dae22ff134a14
</pre>
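
The two checks listed under "Proposed patch" can be sketched as follows. This is a simplified user-space illustration, not the kernel code: <code>struct tv_sketch</code> and <code>check_sock_timeout()</code> are hypothetical names, the 0..999999 bound for <code>tv_usec</code> is an assumption about what "wrong" means, and the ratelimited message is left out.

```c
#include <errno.h>

#define USEC_PER_SEC_SKETCH 1000000L

/* Hypothetical stand-in for the timeval passed to setsockopt(). */
struct tv_sketch {
    long tv_sec;
    long tv_usec;
};

/*
 * Validates a timeout in the spirit of the description above:
 *  - tv_usec outside 0..999999 is rejected with -EDOM;
 *  - a negative tv_sec is not accepted as is, the timeout becomes 0.
 * Returns 0 on success.
 */
static int check_sock_timeout(struct tv_sketch *tv)
{
    if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC_SKETCH)
        return -EDOM;
    if (tv->tv_sec < 0) {
        tv->tv_sec = 0;
        tv->tv_usec = 0;
    }
    return 0;
}
```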
</div>

==== diff-ms-nfs-launder-20070530 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[PATCH] NFS: Fix race in nfs_release_page()

<pre class="simple">
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.

Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.

Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;

Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.

[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;

Cc: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;

From Alexandr:
This 'new' invalidate/release logic also fixes our problem with
mmap/write/read data corruption when several processes use the same
mmaped file on NFS
</pre>

Bug #81896

<pre class="simple">
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e3db7691e9f3dff3289f64e3d98583e28afe03db
</pre>
</div>

==== diff-ms-nfs-odirect-20070529 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[PATCH] nfs: oops during LTP over NFS (direct io)

<pre class="simple">
Problem reported by Denis Lunev and QA, fix from mainstream

incorrect comparison of "int" and "unsigned int" variables is fixed in
nfs_direct_read_schedule and nfs_direct_write_schedule.
</pre>
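
The patch itself is not reproduced here. As a generic illustration of this class of bug (hypothetical function names, not the NFS code): when an <code>int</code> is compared with an <code>unsigned int</code>, C converts the signed operand to unsigned, so a negative error code turns into a very large value and the comparison gives the wrong answer.

```c
/*
 * Mimics what the implicit conversion does when "int < unsigned int"
 * is evaluated: the signed value is converted to unsigned first.
 */
static int is_short_buggy(int result, unsigned int requested)
{
    return (unsigned int)result < requested;
}

/* Handles negative (error) values before comparing as unsigned. */
static int is_short_fixed(int result, unsigned int requested)
{
    return result < 0 || (unsigned int)result < requested;
}
```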

Bug #81589

</div>

==== diff-ms-nfs-schedlock-20070530 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[PATCH] nfs: AB-BA deadlock on rpc_sched_lock/queue-&gt;lock locks

<pre class="simple">
This patch fixes possible AB-BA deadlock for rpc_sched_lock/queue-&gt;lock
in rpc_run_child().

Normal sequence is presented in rpc_set_active:
- rpc_sched_lock goes first
- queue-&gt;lock is nested.
</pre>

Bug #82518

</div>

==== diff-ms-nfs-umount-refcnt-leak-20070530 ====
<div class="change">
Patch from Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;<br/>
[PATCH] nfs: fix req refcnt leak preventing umount

<pre class="simple">
Original analysis by Denis Lunev:
- nfs_direct_req_alloc creates dreq with dreq-&gt;kref-&gt;refcount == 2
- on success path the kref_put is called in
nfs_direct_read_schedule -&gt; nfs_direct_complete
and in nfs_direct_wait
- on error path only the first put occurs
The same problem occurs on the direct_write path

Mainstream patch version from Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;:
The current code is leaking a reference to dreq-&gt;kref when the calls to
nfs_direct_read_schedule() and nfs_direct_write_schedule() return an
error.

Thanks to Denis V. Lunev for spotting the bug and proposing the original
fix.

Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
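
The reference counting problem from the analysis above can be modelled with a toy counter. This is a simplified sketch with hypothetical names (<code>dreq_sketch</code>, <code>direct_io_sketch</code>), not the NFS implementation: an object created with two references is only freed when both are dropped, so an error path that drops just one leaves it alive.

```c
/* Hypothetical, simplified stand-in for a reference-counted request. */
struct dreq_sketch {
    int refcount;
    int freed;
};

static void dreq_init(struct dreq_sketch *d)
{
    d->refcount = 2;    /* created with two references, as in the analysis */
    d->freed = 0;
}

static void dreq_put(struct dreq_sketch *d)
{
    if (--d->refcount == 0)
        d->freed = 1;
}

/*
 * schedule_ok simulates the outcome of scheduling the I/O.
 * drop_on_error selects between the leaky and the fixed error path.
 */
static int direct_io_sketch(struct dreq_sketch *d, int schedule_ok,
                            int drop_on_error)
{
    dreq_init(d);
    if (!schedule_ok) {
        dreq_put(d);            /* the only put on the old error path */
        if (drop_on_error)
            dreq_put(d);        /* the put that was missing */
        return -1;
    }
    dreq_put(d);                /* completion path */
    dreq_put(d);                /* wait path */
    return 0;
}
```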
</div>

==== diff-ms-security-cpuset-20070605 ====
<div class="change">
Patch from Akinobu Mita &lt;akinobu.mita@gmail.com&gt;<br/>
use simple_read_from_buffer in kernel/

<pre class="simple">
Cleanup using simple_read_from_buffer() for /dev/cpuset/tasks and
/proc/config.gz.

Cc: Paul Jackson &lt;pj@sgi.com&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;

Signed-off-by: Akinobu Mita &lt;akinobu.mita@gmail.com&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

X-Git-Tag: v2.6.22-rc1
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=85badbdf5120d246ce2bb3f1a7689a805f9c9006
</pre>
</div>

==== diff-ms-security-sctp-20070604 ====
<div class="change">
Patch from Patrick McHardy &lt;kaber@trash.net&gt;<br/>
[NETFILTER]: {ip,nf}_conntrack_sctp: fix remotely triggerable NULL ptr dereference

<pre class="simple">
When creating a new connection by sending an unknown chunk type, we don't
transition to a valid state, causing a NULL pointer dereference in
sctp_packet when accessing sctp_timeouts[SCTP_CONNTRACK_NONE].

Fix by not creating a new conntrack entry if the initial state is invalid.

Noticed by Vilmos Nebehaj &lt;vilmos.nebehaj@ramsys.hu&gt;

CC: Kiran Kumar Immidi &lt;immidi_kiran@yahoo.com&gt;
Signed-off-by: Patrick McHardy &lt;kaber@trash.net&gt;
</pre>
</div>

==== diff-ms-seqfile-seek-20070601 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] seqfile: bash can hang in a loop reading from proc file

<pre class="simple">
Original problem: in some circumstances seq_file interface can present
infinite proc file to the following script when normally said proc file
is finite:

while read line; do
[do something with $line]
done &lt;/proc/$FILE

bash, to implement such loop does essentially

read(0, buf, 128);
[find \n]
lseek(0, -difference, SEEK_CUR);

Consider, proc file prints list of objects each of them consists of many
lines, each line is shorter than 128 bytes.

Two objects in list, with -&gt;index'es being 0 and 1. Current one is 1,
as bash prints second object line by line.

Imagine first object being removed right before lseek().
traverse() will be called, because there is negative offset.
traverse() will reset -&gt;index to 0 (!).
traverse() will call -&gt;next() and get NULL in any usual iterate-over-list
code using list_for_each_entry_continue() and such. There is one object in
list now after all...
traverse() will return 0, lseek() will update file position and pretend
everything is OK.

So, what we have now: -&gt;f_pos points to place where second object will be
printed, but -&gt;index is 0. seq_read(), instead of returning EOF, will start
printing first line of first object every time it's called, until enough
objects are added for -&gt;f_pos to return in bounds.

Fix is to update -&gt;index only after we're sure we saw enough objects down
the road.

Signed-off-by: Alexey Dobriyan &lt;adobriyan@sw.ru&gt;
</pre>

Bug #82819

</div>

==== diff-ubc-proc-rework-c-20070604 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] ubc: fix compilation with CONFIG_UBC_DEBUG_IO=y

<pre class="simple">
During rework of UBC /proc compilation with UBC_DEBUG_IO was
broken a bit.
</pre>
</div>

==== diff-ubc-unix-exports-20070604 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] ubc: export ubc helpers for case CONFIG_UNIX=m

<pre class="simple">
Export ub_sock_getwres_other, since unix sockets can
call it from the module (unix.ko) when CONFIG_UNIX=m.

Thanks to Rafael Isturiz for having a non-standard config :) and reporting this.
</pre>
</div>

==== diff-ve-cpustats-20070528 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] VE cpu stats should be exported to user space in clocks

<pre class="simple">
VE cpu stats should be exported to user space in clocks instead of jiffies.
</pre>
</div>

==== diff-ve-ip-nat-aliasing-20070605 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] Unalias VE_IP_NAT for ip_nat and iptable_nat modules

<pre class="simple">
If ip_nat and ip_tables modules are loaded before VE start, and
iptable_nat after VE start, on VE stop kernel will crash in
ipt_unregister_table() attempting to unregister NULL table.

Split the VE_IP_NAT flag, which was responsible for two modules.
</pre>

{{bug|607}}

</div>

==== diff-ve-net-arp-set-perms-20070625 ====
<div class="change">
Patch from Vasily Tarasov &lt;vtaras@openvz.org&gt;<br/>
[PATCH] arp: allow set arp cache entries from VE

<pre class="simple">
It is secure since later we use the __dev_get_by_name() function, which is
aware of the current context.

http://forum.openvz.org/index.php?t=tree&amp;th=2570&amp;mid=13209&amp;&amp;rev=&amp;reveal=
</pre>

</div>

==== diff-ve-net-veth-filtering-b-20070605 ====
<div class="change">
Patch from Andrey Mirkin &lt;major@openvz.org&gt;<br/>
[PATCH] veth: rework VE traffic filtering

<pre class="simple">
Mac filtering in veth_xmit() was a bit incorrect:
broadcasts and multicasts were allowed from VE.
Rearrange code, make it clearer and asymmetric :/
</pre>
</div>

==== diff-ve-net-veth-multicast-20070604 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] veth: multicasts should be forwarded as well

<pre class="simple">
Right now veth_xmit passes broadcasts only.
It is a bug. Multicasts should be allowed as well.

Thanks to Daniel Pittman for noticing this.
</pre>
</div>

==== diff-ve-oom-adjust-20070604 ====
<div class="change">
Patch from Denis Lunev &lt;den@openvz.org&gt;<br/>
[PATCH] disable OOM_DISABLE inside VE

<pre class="simple">
Prevent disabling of OOM from inside VE. Basically, it is safe to
allow priority changes inside VE, as in normal case we select UB and a
process inside UB then.
</pre>
</div>

==== diff-ve-reparent-threaded-init-20070604 ====
<div class="change">
Patch from Alexey Kuznetsov &lt;alexey@openvz.org&gt;<br/>
[PATCH] VE: reparent threaded init correctly

<pre class="simple">
If init is multithreaded (yes, imagine, this happens :-)),
its threads are reparented to VE init, so that we get parents
in the same thread group. Nothing especially bad happens,
only checkpointing cannot restore such sick configuration.
</pre>
</div>

==== diff-ve-setattr-proc-20070524 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH 1/2] VE: allow proc setattr on local proc entries

<pre class="simple">
If PDE is local to VE, there is no reason to not allow setattr on it --
changes won't affect corresponding global PDE and other VEs.
</pre>

{{bug|509}}

</div>

==== diff-ve-setattr-proc-b-20070604 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] proc: brown paper bag bug in proc's -&gt;setattr

<pre class="simple">
-&gt;setattr is called for something innocent like mtime updates, so
outright banning of -&gt;setattr on global proc entries was sadistic.

Check if -&gt;setattr is called with mask indicating MODE, UID, GID change
and check for globalness only in this case.
</pre>
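
The check described above can be sketched like this. The flag names and values below are hypothetical stand-ins for the kernel's <code>ATTR_*</code> bits, and <code>proc_setattr_allowed()</code> is an illustrative name, not a function from the patch.

```c
/* Hypothetical stand-ins for the kernel's ATTR_* bits. */
#define SK_ATTR_MODE   (1u << 0)
#define SK_ATTR_UID    (1u << 1)
#define SK_ATTR_GID    (1u << 2)
#define SK_ATTR_MTIME  (1u << 5)

/*
 * Returns 1 if the attribute change may proceed, 0 if it must be refused.
 * Only MODE/UID/GID changes are subject to the "global entry" check;
 * anything else (e.g. an mtime update) is always let through.
 */
static int proc_setattr_allowed(unsigned int ia_valid, int pde_is_global)
{
    if (!(ia_valid & (SK_ATTR_MODE | SK_ATTR_UID | SK_ATTR_GID)))
        return 1;
    return !pde_is_global;
}
```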

{{bug|604}}
<br/>
{{bug|509}}

</div>

==== diff-ve-setattr-proc-kmsg-20070524 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] VE: make /proc/kmsg to be VE local

<pre class="simple">
Some people are used to doing "chmod g+r /proc/kmsg". Make PDE corresponding
to /proc/kmsg local to VE, so it's possible to setattr it.
</pre>

{{bug|509}}

</div>

==== diff-ve-syslog-20070601 ====
<div class="change">
Patch from Vitaliy Gusev &lt;vgusev@openvz.org&gt;<br/>
Fix LTP test failure in syslog test.

<pre class="simple">
LTP failure is minor and simple: it calls syslog(2) with wrong arguments
and expects an error. But syslog() returns 0 since VE doesn't
have a real console and console loglevel.

Thanks to Christian Kaiser2 &lt;CKAISER2@de.ibm.com&gt; for noticing this.
</pre>
</div>

==== diff-ve-vpsdumpable-early-20070604 ====
<div class="change">
Patch from Kirill Korotaev &lt;dev@openvz.org&gt;<br/>
[PATCH] init vps_dumpable early on exec

<pre class="simple">
Since CPT uses vps_dumpable flag now for determining
external processes on checkpointing, we need to initialize
it earlier on mm creation on exec. Otherwise it can race.
</pre>
</div>

==== diff-vzdq-restore-symlinks-under-sem-20070524 ====
<div class="change">
Patch from Alexey Dobriyan &lt;adobriyan@openvz.org&gt;<br/>
[PATCH] VZDQ: Fix lockdep warning about s_umount dependency

<pre class="simple">
Lockdep learns false dependency due to vz_restore_symlink()
and later complains about possible circular locking when quotaon is
done.

Temporarily up -&gt;s_umount semaphore to work around this.
</pre>

{{bug|585}}

</div>

==== diff-xen-subarch-changes-20070528 ====
<div class="change">
Patch from Evgeny Kravtsunov &lt;emkravts@openvz.org&gt;<br/>
[PATCH] Fixes for Xen arch compilation / work

</div>

==== diff-ve-prepare-ve0-tasks-20070608 ====
<div class="change">
Patch from Alexandr Andreev &lt;aandreev@openvz.org&gt;<br/>
[PATCH] VE: ve0 processes initialization

VE0 processes were initialized twice:
* in copy_process()
* in prepare_ve0_process() from init_ve_system()

This is redundant and leads to a wrong ve0.pcounter.
</div>
</noinclude>