Download/kernel/rhel4/023stab043.2/changes

From OpenVZ Virtuozzo Containers Wiki
Revision as of 14:45, 20 March 2008 by Kir (talk | contribs) (created)

Changes

  • Rebased on RH 42.0.8-EL
  • Compat fixes for x86_64 (inode numbers and sys_stime)
  • CPT: UBC preservation is fixed, and limits are now increased during restore
  • UBC othersockbuf over-optimization fix
  • Updated 3ware driver up to 2.26.05.006 version
  • NFS in VE fix
  • No more panic on oops
  • sys_waitid virtualization fixes
  • Lots of other small fixes and debug patches

Config changes

Same as 023stab040.1, plus:

  • +CONFIG_SCSI_QLA4XXX_FAILOVER=y
  • +CONFIG_NETPOLL_TRAP=y
  • +CONFIG_NETDUMP=m
  • +CONFIG_EXPORTFS=m
  • +CONFIG_NFS_ACL_SUPPORT=m

diff-ubc-nootheropt-20070206

Patch from Andrey (saw@), modified by Evgeniy:
Fix for over-optimization of OTHERSOCKBUF accounting.

For those sockets there is no protection by the socket lock.

Bug was provoked by optimization of charging/uncharging othersockbufs: diff-ubc-tcpsndopt-20060429

In brief, the idea is the following: the optimization is based on the assumption that a socket is always locked by lock_sock and is thus protected from being used by more than one user simultaneously. This assumption is wrong for datagram sockets (for example, PF_UNIX ones), which are not locked in the majority of cases. This provokes a race condition between two users of the same dgram socket. TCP sockets are always locked (or can be made so), which prevents races.

Bug #70974.
Bug #74089.

diff-cpt-pagein-swapoff-fix

Patch from Andrey:
Errors can occur during rst_swapoff() and sys_swapoff().

In case of -EINVAL we do not need to perform cleanup. In all other cases we should do it.

Move the cleanup into a separate function and perform it in a loop until success or -EINVAL. If signals are pending, clear the TIF_SIGPENDING flag to make sure that sys_swapoff() won't be interrupted, and restore the flag on exit if it was cleared.

Bug #74725.
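
The retry logic described above can be sketched in user-space C. This is only one reading of the description: every name below (demo_swap, demo_swapoff, demo_cleanup, demo_swapoff_retry) is hypothetical, the retry bound max_tries is an assumption of the sketch, and the TIF_SIGPENDING handling is not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical state for a fake swapoff that fails a few times first. */
struct demo_swap {
    int failures_left;  /* how many calls still fail */
    int fail_err;       /* error returned while failing, e.g. -EINTR */
    int cleanups;       /* how many times the cleanup ran */
};

/* Stand-in for sys_swapoff(): fails until failures_left reaches zero. */
int demo_swapoff(struct demo_swap *s)
{
    if (s->failures_left > 0) {
        s->failures_left--;
        return s->fail_err;
    }
    return 0;
}

/* Stand-in for the cleanup that was moved into its own function. */
void demo_cleanup(struct demo_swap *s)
{
    s->cleanups++;
}

/*
 * Retry until success or -EINVAL.
 * -EINVAL means there is nothing to clean up, so stop at once.
 * Any other error triggers the cleanup and another attempt.
 */
int demo_swapoff_retry(struct demo_swap *s, int max_tries)
{
    int err = -EINVAL;

    for (int i = 0; i < max_tries; i++) {
        err = demo_swapoff(s);
        if (err == 0 || err == -EINVAL)
            break;          /* success, or nothing to clean up */
        demo_cleanup(s);    /* any other error: clean up, then retry */
    }
    return err;
}
```

With a fake swapoff that fails twice with -EINTR, the loop runs the cleanup twice and then reports success; with -EINVAL it stops immediately without any cleanup.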

diff-cpt-ubc-adjust-on-restore

Patch from Andrey:

During the restore process we can exceed UBC limits, because restoring uses more resources.

Bug #71159.

diff-cpt-ubc-change-image-format-b

Patch from Andrey:
Change ubc image format to remove magic numbers like 6 and 12.

diff-cpt-ubc-change-image-format

Patch from Andrey:

Change order of ubc parameters in image file. Now we are storing resource pairs (ub_parms and ub_store) as one unit:
KMEMSIZE parms, KMEMSIZE store, LOCKEDPAGES parms, LOCKEDPAGES store, ...

Previous format was:
KMEMSIZE parms, LOCKEDPAGES parms, ..., KMEMSIZE store, LOCKEDPAGES store, ...

With the new format it is simpler to increase the number of UBC resources.
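
To illustrate why the paired layout is easier to extend, here is a small sketch of the two slot calculations. The helper names old_slot and new_slot and the PARMS/STORE constants are made up for this example; the real image code is not shown in this changelog.

```c
#include <assert.h>

/* Which half of a resource pair a slot holds (illustrative constants). */
enum { PARMS = 0, STORE = 1 };

/*
 * Previous layout: all parms first, then all stores.
 * The slot of a store entry depends on the total number of resources.
 */
int old_slot(int res, int kind, int nres)
{
    return kind * nres + res;
}

/*
 * New layout: parms and store of one resource sit next to each other.
 * The slot does not depend on the total number of resources.
 */
int new_slot(int res, int kind)
{
    return 2 * res + kind;
}
```

In the old layout, adding one resource shifts every store entry (old_slot(0, STORE, 24) is 24, but old_slot(0, STORE, 25) is 25). In the new layout existing slots stay where they are and new resources are simply appended.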

diff-ms-fs-preparewrite-eh-20070202

Patch from Kirill:

The original patch which was used in OVZ/VZ was diff-ms-fs-preparewrite-eh-20061005.

It is a pity, but it was broken by RH when committed to the RHEL4 update (linux-2.6.13-buffer.patch). The __block_prepare_write() error handling is done incorrectly: IO initiated on some of the buffers must be waited for until it completes (wait_on_buffer).

Fix it with this incremental patch, which makes the VZ code the same as it has been for a long time.

diff-cpt-pgin-alloc-index-fix

Patch from Andrey:

1. Index of lazy page was checked incorrectly:

-               if (page_nr > PAGE_SIZE/sizeof(struct pgin_desc*)) {
+               if (page_nr >= PGINDIR_SIZE/sizeof(struct pagein_desc*)) {

so we could try to access outside of array boundaries and oops.

Bug #74455.
Bug #75539.

2. Current lazy migration is limited to 512 MB on x86-64. Increase the table size to be able to store up to 2097152 lazy pages (8 GB).
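
Both points can be checked with a small sketch. The table size DEMO_TABLE_BYTES and all demo_ names are illustrative and are not the kernel's PGINDIR_SIZE or its structures; the 4096-byte page size is assumed for x86-64 base pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TABLE_BYTES 4096u  /* illustrative size, not PGINDIR_SIZE */

struct demo_desc;               /* opaque; only pointers to it are stored */

/* Number of pointer slots in the table; valid indices are 0..slots-1. */
size_t demo_slots(void)
{
    return DEMO_TABLE_BYTES / sizeof(struct demo_desc *);
}

/* Old-style check: lets page_nr == slots through, one past the end. */
int demo_rejected_old(size_t page_nr)
{
    return page_nr > demo_slots();
}

/* Fixed check: rejects every index that is not a valid slot. */
int demo_rejected_new(size_t page_nr)
{
    return page_nr >= demo_slots();
}

/* Memory covered by a given number of lazy pages. */
uint64_t demo_lazy_bytes(uint64_t pages, uint64_t page_size)
{
    return pages * page_size;
}
```

The index equal to the slot count passes the old check but is rejected by the fixed one, and 2097152 pages of 4096 bytes come to exactly 8 GiB.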

diff-ms-kmap-pte0-20070207

Patch from Vasily:
Fixes kmap PTE0 leakage: pte_unmap() was missing on the error path in install_page().

Bug #75560.

diff-arch-4gb-pgdctor

Patch from Kirill:

During the port of the 4GB split to 2.6.18 it was found that the 2.6.9 kernel incorrectly inserts a not yet initialized pgd into pgd_list. This is wrong; initialize it first.

diff-cpt-iter-pfn-fix

Patch from Andrey:

The pfn index was checked incorrectly during lookup/alloc, so we could go outside the array boundaries and oops.

Related to the same lazy migration bugs:
Bug #74455.
Bug #75539.

diff-cpt-pte-unmap-lost-20070205

Patch from Pavel, found by Vasiliy:

When porting to the new mm locking, one unmap+unlock was lost. Found while investigating (but does not fix):
Bug #75448.

diff-cpt-ubc-save-restore-fix

Patch from Andrey:
UBC parameters were saved and restored incorrectly:

for (i = 0; i < UB_RESOURCES; i++)
	dump_one_bc_parm(v->cpt_parms, bc->ub_parms, 0);

Only KMEMSIZE values were saved and restored in this case.

1. Do not restore UBC if the image was produced by a previous version.

2. cpt_parms has space for 32x2 resources; however, only the first UB_RESOURCES * 2 entries are used, contiguously (that is, not 24 of the first 32 plus 24 of the second 32). Keep this layout for compatibility.
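
The effect of the bug can be reproduced with a simplified model. The signature of dump_one_bc_parm() is not shown here, so the sketch assumes the problem was simply that the loop index never reached the helper; demo_parm, demo_dump_one and DEMO_RESOURCES are hypothetical stand-ins.

```c
#include <assert.h>

#define DEMO_RESOURCES 4    /* stand-in for UB_RESOURCES */

/* Simplified stand-in for one UBC parameter. */
struct demo_parm {
    unsigned long barrier;
    unsigned long limit;
};

/* Copy the parameter at index idx from the live set into the image. */
void demo_dump_one(struct demo_parm *img, const struct demo_parm *live, int idx)
{
    img[idx] = live[idx];
}

/* Buggy loop: the index is ignored, so only entry 0 is ever saved. */
void demo_save_buggy(struct demo_parm *img, const struct demo_parm *live)
{
    for (int i = 0; i < DEMO_RESOURCES; i++)
        demo_dump_one(img, live, 0);
}

/* Corrected loop: every resource is saved at its own index. */
void demo_save_fixed(struct demo_parm *img, const struct demo_parm *live)
{
    for (int i = 0; i < DEMO_RESOURCES; i++)
        demo_dump_one(img, live, i);
}
```

After the buggy loop only the first entry of the image is filled in, which matches the observation that only KMEMSIZE survived a save and restore.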

diff-cpt-check-image-version

Patch from Andrey:
Add check for image version.

Allow restoring only images from a 2.6.9 kernel. Currently only the following combinations are allowed:
2.6.8/2.6.9 <-> 2.6.9+plus patch
2.6.9 -> 2.6.16+ (this patch disables this combination)

diff-ms-ext3-quota-drop

Patch from Dmitry (dmonakhov@):
Backported from mainstream v2.6.13
[PATCH] ext3: drop quota references before releasing inode

We must drop references to quota structures before releasing the inode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

commit ab6862e6dab813ecde9ae7da506188dc1e9f11bb

diff-ms-ext2-quota-drop

Patch from Dmitry (dmonakhov@):
Backported from mainstream v2.6.13
[PATCH] ext2: drop quota reference before releasing inode

We must drop references to quota structures before releasing the inode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

commit c7e9a52ef0089492bba457dfb8eba1a54e19f24a

diff-simfs-reiserfs-statfs-20070117

Patch from Evgeny:

1. When DISK_QUOTA is switched off in /etc/vz/vz.config, sim_statfs takes kstatfs from the underlying fs. reiserfs does not initialize f_ffree (free inodes) or f_files in kstatfs, so we need to zero out the kstatfs structure before asking reiserfs.

2. In 2.4.x, reiserfs initialized f_ffree to -1. This was an exception among filesystems, so it could be used to determine that the fs is reiserfs. In 2.6.x, reiserfs does not initialize f_ffree at all, so reiserfs has to be distinguished another way: use fsmagic.

OpenVZ Bug #199.
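
Both steps can be sketched in user-space C. The demo_kstatfs structure and the demo_ functions are simplified, hypothetical stand-ins for the kernel's kstatfs and for sim_statfs; the magic value is the one the Linux headers define for reiserfs, repeated here under a demo name.

```c
#include <assert.h>
#include <string.h>

/* Value of REISERFS_SUPER_MAGIC in the Linux headers. */
#define DEMO_REISERFS_MAGIC 0x52654973L

/* Simplified stand-in for struct kstatfs. */
struct demo_kstatfs {
    long f_type;
    unsigned long long f_files;
    unsigned long long f_ffree;
};

/* Mimics a lower fs that sets f_type but leaves inode counters untouched. */
void demo_reiserfs_statfs(struct demo_kstatfs *buf)
{
    buf->f_type = DEMO_REISERFS_MAGIC;
}

/*
 * Zero the buffer before asking the lower filesystem, then report
 * whether the lower filesystem identified itself as reiserfs.
 */
int demo_sim_statfs(struct demo_kstatfs *buf,
                    void (*lower)(struct demo_kstatfs *))
{
    memset(buf, 0, sizeof(*buf));   /* fields the fs skips read as 0 */
    lower(buf);
    return buf->f_type == DEMO_REISERFS_MAGIC;
}
```

Even if the caller's buffer held garbage beforehand, f_files and f_ffree come back as zero, and the filesystem is recognised by its magic rather than by the value of f_ffree.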

diff-ms-security-pt-interp-20070124

Patch from Alexey Dobriyan:

Proposed patch to fix #5 in
http://www.isec.pl/vulnerabilities/isec-0017-binfmt_elf.txt
aka
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2004-1073

To reproduce, do

  • grab poc at the end of advisory.
  • add line "eph.p_memsz = 4096;" after "eph.p_filesz = 4096;"
    where first "4096" is something equal to or greater than 4096.
  • ./poc /usr/bin/sudo && ls -l

Here I get:

-rw------- 1 ad   ad   102400 2007-01-15 19:17 core
---s--x--x 2 root root 101820 2007-01-15 19:15 /usr/bin/sudo

Check for MAY_READ as binfmt_misc.c does.

diff-emt64-vsyscall-b-20060718

Patch from Vasily:

Fixed initialization of the sysctl_vsyscall variable: currently vsyscall_init always overwrites the value zeroed in time_init_gtod().

Bug #73353.

diff-cpt-sigsuspend-lockup

Patch from Alexey:
[CPT] sigsuspend could hang forever after restore

Do not restart syscalls with TIF_RESTORE_SIGMASK in cpt.

This was a severe bug. First, we do not need to restart such syscalls: they are restarted by the core on exit from the syscall. Second, it was wrong to restart the syscall without clearing TIF_RESTORE_SIGMASK and without restoring the mask. If a signal arrives at this point, it is delivered, but the syscall is restarted and sigsuspend() never returns, hanging forever.

For the knowledge base: restarting without checking for TIF_RESTORE_SIGMASK is not allowed.

diff-ms-compat-emt64-stime-20070131

Patch from Alexandr Andreev:
32bit compat_sys_stime is missing on x86_64

OpenVZ Bug #438.

diff-ms-compat-stat32-ino-20061229

Patch from mainstream:
[PATCH 1/3] make static counters in new_inode and iunique be 32 bits

From: Jeff Layton <jlayton@redhat.com>

To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

When a 32-bit program that was not compiled with large file offsets does a stat and gets a st_ino value back that won't fit in the 32 bit field, glibc (correctly) generates an EOVERFLOW error. We can't do anything about fs's with larger permanent inode numbers, but when we generate them on the fly, we ought to try and have them fit within a 32 bit field.

This patch takes the first step toward this by making the static counters in these two functions be 32 bits.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-By: Kirill Korotaev <dev@openvz.org>
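
The overflow described above can be sketched as follows. The function names are hypothetical and the code only models the arithmetic, not the kernel's new_inode()/iunique() or glibc's stat wrappers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * What a 32-bit stat path has to do with a 64-bit inode number:
 * either it fits into the 32-bit field, or the call fails.
 */
int demo_store_ino32(uint64_t ino, uint32_t *out)
{
    if (ino > UINT32_MAX)
        return -EOVERFLOW;
    *out = (uint32_t)ino;
    return 0;
}

/*
 * A counter kept in 32 bits wraps around instead of growing past
 * UINT32_MAX, so numbers generated from it always fit the field.
 */
uint32_t demo_next_ino32(uint32_t *counter)
{
    return ++*counter;
}
```

A number generated from a 64-bit counter can exceed the field and produce EOVERFLOW, while a 32-bit counter wraps around and stays representable.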

diff-cfq-timeslice-20070129

Patch from Vasily Tarasov (vtaras@), improves the CFQ IO scheduler:

"CFQ in the 2.6.9 kernel creates a request queue for each process and performs a round robin procedure over these queues, selecting one request from each queue. This patch adds a time-slice for each per-process queue: it means that during the timeslice only requests from that queue are serviced. Such a mechanism is used in CFQ 2.6.18."

Bug #71929.
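
The difference between the two dispatch orders can be shown with a toy model. This is not the CFQ code: demo_dispatch is a hypothetical function, and the slice is counted in requests rather than in time, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Serve pending requests from nq per-process queues and record, for
 * each served request, which queue it came from.
 * quantum == 1 models the old one-request-per-queue round robin.
 * quantum > 1 models a time slice: up to quantum requests are served
 * from one queue before moving on to the next.
 * Returns the number of requests served (entries written to order).
 */
size_t demo_dispatch(int *pending, size_t nq, int quantum,
                     int *order, size_t max_out)
{
    size_t n = 0;
    int busy = 1;

    while (busy && n < max_out) {
        busy = 0;
        for (size_t q = 0; q < nq; q++) {
            for (int k = 0;
                 k < quantum && pending[q] > 0 && n < max_out; k++) {
                pending[q]--;
                order[n++] = (int)q;
                busy = 1;
            }
        }
    }
    return n;
}
```

For two queues with three requests each, a quantum of 1 interleaves them (0, 1, 0, 1, 0, 1), while a quantum of 3 serves each queue in one run (0, 0, 0, 1, 1, 1).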

linux-2.6.9-3w-9xxx-2.26.05.006.patch

Patch from Vasiliy:
3ware driver update to 2.26.05.006

diff-ve-nfs-execenv-20070221

Patch from Evgeny:

The svc_recvfrom() function (net/sunrpc/svcsock.c) switches the context to ve0 and never switches back to the VE context. This may cause an oops when the VE private area is placed on an NFS partition.

Bug #76354.
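
The general pattern behind this kind of fix is to remember the previous context and restore it before returning. A minimal sketch follows; all demo_ names are hypothetical and only loosely model the idea of switching the execution environment.

```c
#include <assert.h>

/* Hypothetical stand-in for a VE (container) context. */
struct demo_ve {
    int id;
};

static struct demo_ve demo_ve0 = { 0 };             /* host context */
static struct demo_ve *demo_current = &demo_ve0;    /* active context */

/* Switch to ve and return the context that was active before. */
struct demo_ve *demo_set_env(struct demo_ve *ve)
{
    struct demo_ve *old = demo_current;

    demo_current = ve;
    return old;
}

/* Return the currently active context. */
struct demo_ve *demo_get_env(void)
{
    return demo_current;
}

/* Do some work in ve0, then return to the caller's context. */
int demo_recv_in_ve0(void)
{
    struct demo_ve *old = demo_set_env(&demo_ve0);
    int seen = demo_current->id;    /* work done while in ve0 */

    demo_set_env(old);              /* the step that must not be lost */
    return seen;
}
```

After demo_recv_in_ve0() returns, the caller is back in the context it started in, which is the property the original code lacked.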

diff-ms-unlock-buffer-barrier

Patch from mainstream:
[PATCH] buffer: memorder fix

unlock_buffer(), like unlock_page(), must not clear the lock without ensuring that the critical section is closed.

Mingming later sent the same patch, saying:

We are running the SDET benchmark and saw a double free issue for an ext3 extended attributes block, which complains that the same xattr block is already being freed (in ext3_xattr_release_block()). The problem can also be triggered by multiple threads looping untar/rm of a kernel tree.

The race is caused by a missing memory barrier in unlock_buffer() before the lock bit is cleared, resulting in a possible concurrent h_refcount update. That causes a reference counter leak, which later leads to the double free that we have seen.

Inside unlock_buffer(), a memory barrier is placed *after* the lock bit is cleared; however, there is no memory barrier *before* the bit is cleared. On some architectures the h_refcount update instruction and the clear bit instruction could be reordered, thus leaving the critical section re-entered.

The race is like this: For example, if the h_refcount is initialized as 1,

cpu 0:                                   cpu1
--------------------------------------   -----------------------------------
lock_buffer() /* test_and_set_bit */
clear_buffer_locked(bh);
                                         lock_buffer() /* test_and_set_bit */
h_refcount = h_refcount+1; /* = 2*/      h_refcount = h_refcount + 1; /*= 2 */
                                         clear_buffer_locked(bh);
....                                     ......

We lost a h_refcount here. We need a memory barrier before the buffer head lock bit being cleared to force the order of the two writes. Please apply.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

GIT: 72ed3d035855841ad611ee48b20909e9619d4a79
http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.5353.22.215
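
The ordering requirement can be illustrated with C11 atomics. This is a user-space analogy, not the kernel patch: the kernel uses its own barrier primitives, while this sketch substitutes acquire/release ordering, and demo_bh, demo_trylock and demo_unlock are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_LOCK_BIT 1u

/* Simplified stand-in for a buffer head with a lock bit. */
struct demo_bh {
    atomic_uint state;  /* bit 0 plays the role of the buffer lock bit */
    int refcount;       /* protected by the lock bit, like h_refcount */
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
int demo_trylock(struct demo_bh *bh)
{
    unsigned old = atomic_fetch_or_explicit(&bh->state, DEMO_LOCK_BIT,
                                            memory_order_acquire);
    return !(old & DEMO_LOCK_BIT);
}

/*
 * Clear the lock bit with release ordering: every write made inside
 * the critical section is ordered before the bit is seen as clear.
 */
void demo_unlock(struct demo_bh *bh)
{
    atomic_fetch_and_explicit(&bh->state, ~DEMO_LOCK_BIT,
                              memory_order_release);
}
```

A thread that takes the lock after demo_unlock() is guaranteed to see the refcount update made under the previous lock holder, which is exactly what the unordered clear in the race above failed to guarantee.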

diff-simfs-mntcount-20070208

Patch from Evgeny Kravtsunov <emkravts@openvz.org>:
[SIMFS] get lower vfsmount on simfs mount

This prevents lower FS from being umounted while simfs is mounted.

OpenVZ Bug #451.
http://git.openvz.org/?p=kernel-028;a=commit;h=63f1ecae912ee9614bcad23cc147ca5557f8b547

diff-ms-ext3-unlink-race

Patch from mainstream:
[PATCH] return ENOENT from ext3_link when racing with unlink

Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is 0. Doing otherwise has the potential to corrupt the orphan inode list, because we'd wind up with an inode with a non-zero link count on the list, and it will never get properly cleaned up & removed from the orphan list before it is freed.

[akpm@osdl.org: build fix]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

GIT: 2988a7740dc0dd9a0cb56576e8fe1d777dff0db3
http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.5353.22.208

Bug #74302.
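
The check itself is small. A simplified sketch, with a hypothetical demo_inode in place of the kernel's struct inode and without the directory and journal handling of the real ext3_link():

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for an inode; only the link count is modelled. */
struct demo_inode {
    unsigned int i_nlink;
};

/*
 * Add a hard link unless the inode has already lost its last link.
 * A zero link count means we raced with unlink, so refuse with
 * -ENOENT instead of bringing the inode back to life.
 */
int demo_link(struct demo_inode *inode)
{
    if (inode->i_nlink == 0)
        return -ENOENT;
    inode->i_nlink++;
    return 0;
}
```

An inode whose link count already dropped to zero is left untouched, so it can still be cleaned up and removed from the orphan list normally.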

diff-ubc-twcountlimit-20070213

Patch from Denis:
This patch changes the default per-UB limit on TW buckets.

OpenVZ Bug #460.

diff-dbg-pb-add-list-ref

Patch from Kirill:
Print some debug info instead of BUG in pb_add_list_ref().

Actually there is a bug in the copy_page_range logic: if the page was reserved, it is never tied to a UB with PBC. Good. However, if the page is unreserved later, the next copy_page_range will blindly assume that it should have been tied already(!), and can be disappointed by the fact that it is not.

The bad thing is that packet_mmap() from net/packet/af_packet.c maps exactly such pages...

but I don't see message:
printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped));
Sigh...

diff-dbg-spinlock

Patch from Kirill:
Debug for valid_swaphandles() oops from Strato.

Check for correct swp_entry and print spinlock magic when doing BUG()

diff-ve-wait-vpids-20070228

Patch from Alexey Kuznetsov:
Forgotten bits of pid virtualization in sys_wait*

diff-rh-panic-on-oops-20070228

Patch from Kirill:

RH has changed the default behaviour of the kernel: now it panics on oops :/ Change it back so that the kernel continues after an oops.

linux-2.6.9-arcmsr-1.20.0X.13-61107.patch

Patch ported by Kostja (khorenko@):
Areca driver v1.20.0X.13-61107 added.

Sources from Areca site: ftp://ftp.areca.com.tw/RaidCards/AP_Drivers/Linux/DRIVER/SourceCode/

Bug #59933.