From OpenVZ Virtuozzo Containers Wiki



  • Added ext3 online resize
  • sysfs in VPS support (required for SUSE templates)
  • IA64 fixes/updates
  • HPET, autofs in VPS
  • Memory leak in bounce buffers
  • /proc/modules in VPS
  • uname virtualization
  • Minor security fixes
  • Drivers and mainstream updates.


Introduced new env create parameters allowing:

  • turning on SYSFS inside a VPS
  • configuring the number of VCPUs inside a VPS


Same as 022stab064.1, plus

  • +CONFIG_BINFMT_MISC=y (on IA64 only)
  • +CONFIG_I2C=m
  • +CONFIG_I2C_ALI1535=m
  • +CONFIG_I2C_ALI1563=m
  • +CONFIG_I2C_ALI15X3=m
  • +CONFIG_I2C_AMD756=m
  • +CONFIG_I2C_AMD8111=m
  • +CONFIG_I2C_I801=m
  • +CONFIG_I2C_I810=m
  • +CONFIG_I2C_SIS5595=m
  • +CONFIG_I2C_SIS630=m
  • +CONFIG_I2C_SIS96X=m


  • -CONFIG_VE_SYSFS=n (now runtime configurable)



Patch from Vasiliy:

This patch adds support for 3ware 9550SX RAID controllers, 3w-9xxx driver updated up to version Sources are taken from

Bug 38702.


Patch from Kostya:

This patch adds support for Dell OpenManage. dcdbas version 5.6.0-1 was taken from the mainstream 2.6.15 kernel.

Bug 55618.


Patch from Vasiliy:

This patch adds support for new AMD and NForce in-chipset IDE controllers taken from 2.6.15 mainstream kernel

Bug 33606.


Prepared by Vasiliy:

added hunks required for:

  • cciss driver,
  • intel ich7/esb2 ide driver update,
  • ide amd74xx driver


Patch from Pavel:

When CONFIG_HUGETLB is on, the follow_huge_addr argument is not defined (2.6.15 does not have this problem). OpenVZ Bug #88.


Patch from Pavel:

Hide /proc/config.gz from ve proc tree

backported from 2.6.15

OpenVZ Bug #92.


Patch from Kirill:

fixed compilation with CONFIG_DEBUG_STACKOVERFLOW=n

OpenVZ Bug #96.


Patch from Pavel:

Because the ret variable was unsigned, the return value from ub_memory_charge was ignored. This then led to over-uncharging.

2.6.15 does not have this problem...

OpenVZ Bug #104.
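
The failure mode described above can be sketched in user-space C. This is an illustration only, not the OpenVZ code: sk_charge, sk_alloc_buggy and sk_alloc_fixed are hypothetical names, and the simple limit check is an assumed stand-in for whatever ub_memory_charge does internally.

```c
#include <errno.h>

/* Hypothetical stand-in for a charge routine:
 * returns 0 on success, -ENOMEM when the limit would be exceeded. */
static int sk_charge(long *held, long limit, long amount)
{
    if (*held + amount > limit)
        return -ENOMEM;
    *held += amount;
    return 0;
}

/* Buggy caller: storing the result in an unsigned variable makes the
 * "ret < 0" test always false, so a failed charge goes unnoticed. */
static int sk_alloc_buggy(long *held, long limit, long amount)
{
    unsigned int ret = sk_charge(held, limit, amount);

    if (ret < 0)            /* never true for an unsigned type */
        return -ENOMEM;
    return 0;               /* caller believes the memory was charged */
}

/* Fixed caller: a signed variable lets the error propagate. */
static int sk_alloc_fixed(long *held, long limit, long amount)
{
    int ret = sk_charge(held, limit, amount);

    if (ret < 0)
        return ret;
    return 0;
}
```

With the buggy variant, a caller that later uncharges the amount it believes it holds would subtract memory that was never accounted, which matches the over-uncharging described in the entry.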


Patch from mainstream:

[PATCH] fix bio_uncopy_user() mem leak

Bug 58180.


Patch from Kirill:

migration_thread_stop() should call yield() instead of cpu_relax(). On UP machines it can relax forever, since migration_thread won't be able to make any progress: it will never be scheduled.

Bug 58372.


Patch from Denis:

This patch fixes netstat output listing TW buckets from other VPSs.

Bug 58839.


Patch from Denis:

This is a kludge for 'ifconfig venet0', which falls back to the UNSPEC link type, where all 16 bytes of hwaddr are reported and the buffer was not initialized in user space.

Bug 58834.


Patch from mainstream:

This patch adds kzalloc(), mostly required by driver updates and future patches. Originally required by dcdbas.


Patch from mainstream:

This patch adds "gfp_t" type, mostly required by driver updates and future patches. Originally required by dcdbas.


Patch from mainstream:
[IGMP]: workaround for IGMP v1/v2 bug (minor)

With IGMP version 1 and 2 it is possible to inject a unicast report to a client which will make it ignore multicast reports sent later by the router.

The fix is to only accept the report if it was sent to a multicast or broadcast address.

Signed-off-by: David S. Miller <>
GIT: 24c6927505ca77ee4ac25fb31dcd56f6506979ed

RHEL4u2: linux-2.6.9-CVE-2005-2185-igmp-dos.patch


Patch from mainstream:

[PATCH] x86_64: missing lock prefix in switch_to

Add the missing "lock" prefix in switch_to macro.

Signed-off-by: Suresh Siddha <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-x86_64-switch_to-missinglock.patch


Patch from mainstream:

[PATCH] x86_64: Fix signal FPU leak on i386 and x86-64

On i386, if a signal handler is started, the kernel saves the fpu-state of the interrupted routine in the sigcontext on the stack. Calling unlazy_fpu() and setting current->used_math=0, the kernel supplies the signal-handler with a cleared virtual fpu. On sigreturn(), the old fpu-state of the interrupted routine is restored.

If a process never used the fpu, it virtually has a cleared fpu. If such a process is interrupted by a signal handler, no fpu-context is saved and sigcontext->fpstate is set to NULL.

Assume that the signal handler uses the fpu. Then, AFAICS, on sigreturn current->used_math will be 1. Since sigcontext->fpstate still is NULL, restore_sigcontext() doesn't call restore_i387(). Thus, no clear_fpu() is done, and current->used_math is not reset.

Now, the interrupted process's fpu is no longer cleared!

Fix by AK. Just clear the FPU again when this happens.

patch for i386 and x86-64.

Signed-off-by: Andi Kleen <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream:

[PATCH] APIC/LAPIC hanging problems on nForce2 system

current state:
Systems with nForce2 could freeze on high disk I/O activity in APIC mode when CPU Disconnect is enabled. If the BIOS doesn't fix this, the current kernel fix changes the registers according to the following table:

       Chip  Old value   New value
       C17   0x1F0FFF01  0x1F01FF01
       C18D  0x9F0FFF01  0x9F01FF01

But this is only done if CPU disconnect has been enabled in the BIOS.

why change this:
If CPU disconnect is not enabled in bios, and bios is broken (some manufacturers like Abit don't care about their customers and even the latest bios doesn't fix this; I have an Abit mainboard), the kernel doesn't apply the fix, so if cpu disconnect is enabled at a later stage (in userspace), the system will be unstable and most likely freeze.

new behaviour:
The fix is now applied regardless of whether CPU disconnect is enabled at boot time. As you only have to change byte 3 to 0x01, reading out the chipset version isn't needed, so the patch simplifies the fix. Now turning CPU disconnect on at a later stage won't break the system, and if it was already enabled, it gets fixed, as the old version did.

Signed-off-by: Prakash Punnoor <>
Acked-by: Bartlomiej Zolnierkiewicz <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>
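
The "change byte 3 to 0x01" step in the entry above can be checked against the table with a small sketch. The function name is hypothetical, and reading or writing the actual chipset register is deliberately left out; only the value transformation is shown, counting bytes from the least significant end starting at 1.

```c
#include <stdint.h>

/* Force byte 3 (bits 16..23) of the register value to 0x01 and leave
 * the other three bytes untouched. Hypothetical helper, not the
 * kernel's own quirk code. */
static uint32_t sk_nforce2_fixed_value(uint32_t reg)
{
    return (reg & 0xFF00FFFFu) | 0x00010000u;
}
```

Because the same mask works for both rows of the table, no per-chip case is needed, which is the simplification the entry mentions.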


Patch from mainstream:

[NET]: Fix memory leak in sys_{send,recv}msg() w/compat

From: Dave Johnson <>

sendmsg()/recvmsg() syscalls from o32/n32 apps to a 64bit kernel will cause a kernel memory leak if iov_len > UIO_FASTIOV for each syscall!

This is because both sys_sendmsg() and verify_compat_iovec() kmalloc a new iovec structure. Only the one from sys_sendmsg() is freed.

I wrote a simple test program to confirm this after identifying the problem:

Signed-off-by: Andrew Morton <>
Signed-off-by: David S. Miller <>

GIT: d64d3873721cfe870d49d73c3744f06260779ce7


Patch from mainstream:

[PATCH] x86_64: fix bug in csum_partial_copy_generic()

I was observing reproducible crashes on the "movw %bx,(%rsi)" instruction below while a process in a recvfrom() system call was copying packet data to user space. The patch below fixes the exception table and causes the crash to no longer reproduce. Please apply.

Acked-by: Andi Kleen <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 92ed0223aefa795d1873427e25599cb70b2148ee

RHEL4u2: linux-2.6.9-x8664-csum-copy.patch


Patch from mainstream:
[PATCH] x86_64: Don't allow accesses below register frame in ptrace

There was an "off by one quad word" error in there. I don't think it is exploitable because it will only store into an unused area, but better to plug it.

Found and fixed by John Blackwood

Signed-off-by: Andi Kleen <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: c4d1fcf3a2ea89b6d6221fa8b4588c77aff50995

RHEL4u2: linux-2.6.9-CAN-2005-1765-x8664-ptrace-overflow.patch


Patch from mainstream:
[PATCH] x86_64: check if ptrace RIP is canonical

This works around an AMD Erratum.

Signed-off-by: Andi Kleen <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: d1099e8a18960693c04507bdd7b9403db70bfd97
RHEL4u2: linux-2.6.9-CAN-2005-1762-x86_64-ptrace-canonical-addr.patch
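
The entry above does not say how the patch performs the check, so the following is only background on what "canonical" means, assuming the common 48-bit virtual address layout: bits 63 through 47 must all be equal. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if addr is canonical under the assumed 48-bit layout:
 * bits 63..47 must be all zeros or all ones. */
static int sk_is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;      /* the upper 17 bits */

    return top == 0 || top == 0x1FFFFu;
}
```

A ptrace request that sets the instruction pointer could reject any value for which such a test fails, instead of loading it into the register.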


Patch from mainstream:
[PATCH] add TCSBRKP to compat_ioctl.h

Move ioctl TCSBRKP support to compat layer. Same rationale as TCSBRK.

RHEL4u2: linux-2.6.9-x86_64-missing-compat-ioctls.patch
(part of it went to diff-tty-compatioctls-20050905)


Patch from mainstream:
[PATCH] x86-64: Fix missing TLB flushes in change_page_attr

Fix bug in change_page_attr - with multiple pages it would not flush correctly. Also add a small optimization of not flushing when not needed.

Found and fixed by Andrea.

Signed-off-by: Andi Kleen <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-x86_64-change_page_attr-flush-fix.patch


Patch from mainstream:
[PATCH] x86: no interrupts from secondary CPUs until officially online

Andi Kleen reported a problem where a very slow boot caused the timer interrupt on a secondary CPU to go off before the CPU was actually brought up by the core code, so the CPU_PREPARE notifier hadn't been called, so the per-cpu timer code wasn't set up.

This was caused by enabling interrupts around calibrate_delay() on secondary CPUs, which is not actually necessary (interrupts on CPU 0 increment jiffies, which is all that is required). So delay enabling interrupts until the actual __cpu_up() call for that CPU.

Signed-off-by: Rusty Russell <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-x86-irq-boot-disable-dualcore.patch


Patch from mainstream:
[NET]: Check for SOL_SOCKET in compat_sys_getsockopt

Some 32bit apps failed due to incorrect fixups

Signed-off-by: David S. Miller <>

RHEL4u2: linux-2.6.10-ac-selected-bits.patch


Patch from mainstream:
minor security issues in VE0 in /proc/scsi.



Patch from mainstream:
[PATCH] Unwind information fix for the vsyscall DSO

When working on GDB support I found a typo. I assume the comment is correct. If you step to this particular instruction and backtrace, GDB gets lost.


Patch from mainstream:
[PATCH] faster signal handling on x86

Optimize away the unconditional write to debug registers on signal delivery path. This is already done on x86_64.


Patch from mainstream:
[PATCH] del_timer() vs. mod_timer() SMP race

We just spent some days fighting a rare race in one of the distros that backported some of timer.c from 2.6 to 2.4 (though they missed a bit).

The actual race we found didn't happen in 2.6 _but_ code inspection showed that a similar race is still present in 2.6, explanation below:

Code removing a timer from a list (run_timers or del_timer) takes that CPU list lock, does list_del, then timer->base = NULL.

It is mandatory that this timer->base = NULL is visible to other CPUs only after the list_del() is complete. If not, then mod_timer() could see it NULL, thus take its own CPU list lock and not the one for the CPU the timer was being removed from, and thus the list_add in mod_timer() could race with the list_del() from run_timers() or del_timer().

Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the right spot in 2.6, but didn't in the "backport" we were fighting with.

However, del_timer() doesn't have such a barrier, and thus is subject to this race in 2.6 as well. This patch fixes it.

Signed-off-by: Benjamin Herrenschmidt <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-timer-barrier.patch


Patch from mainstream:
[PATCH] epoll: handle timeout overflow

Handle the timeout upper boundary for epoll.

Bug 58718.
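
One way to handle an upper boundary of this kind is to clamp the millisecond timeout before converting it to ticks. The sketch below is an assumption about the general technique, not the patch itself: SK_HZ, SK_MAX_SCHEDULE_TIMEOUT and the function name are hypothetical, with the tick rate assumed to be 1000 and "wait forever" represented by LONG_MAX.

```c
#include <limits.h>

#define SK_HZ 1000L                          /* assumed tick rate */
#define SK_MAX_SCHEDULE_TIMEOUT LONG_MAX     /* assumed "no timeout" value */

/* Largest millisecond value whose conversion below cannot overflow a long. */
#define SK_MAX_MSTIMEO ((LONG_MAX - 999L) / SK_HZ)

/* Convert milliseconds to ticks, rounding up. Negative or too-large
 * inputs are treated as "wait forever" instead of being multiplied. */
static long sk_ms_to_ticks_clamped(long timeout_ms)
{
    if (timeout_ms < 0 || timeout_ms >= SK_MAX_MSTIMEO)
        return SK_MAX_SCHEDULE_TIMEOUT;
    return (timeout_ms * SK_HZ + 999L) / 1000L;
}
```

The bound is checked before the multiplication, so the overflow cannot happen in the first place.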


Patch from Pavel:

Since the vsyscall page may not be mapped on EM64T, elf_core_dump may oops when fetching info from it.

Bug 58677.



Patch from Kirill:
[PATCH] make per-VPS sysfs to be tunable on VPS start

This patch makes it possible to control whether sysfs is enabled in a VPS at VPS start, rather than by a kernel compilation option as before.

It extends the env_create_data ioctl(), introducing a new "features" bitmask, and stays binary compatible with old vzctl.

As a cleanup/optimization this patch also removes ve_hook_init_data, which was never used.

New vzctl should call env_create_data with the new env_create_param2 parameter. If ioctl() returns -EINVAL, it should retry with the old env_create_param parameter. To enable sysfs in a VPS, set env_create_param2->features_mask |= VE_FEATURE_SYSFS;
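
The retry rule for vzctl can be sketched as follows. Everything here is a user-space mock under stated assumptions: the struct layouts, the SK_VE_FEATURE_SYSFS bit value and the two fake "kernels" are hypothetical, and the real ioctl() call is replaced by a function pointer that receives the parameter block and its length.

```c
#include <errno.h>
#include <string.h>

#define SK_VE_FEATURE_SYSFS (1UL << 0)    /* hypothetical bit value */

/* Hypothetical layouts; the real structures are not shown in this page. */
struct sk_env_create_param  { unsigned int veid; unsigned int flags; };
struct sk_env_create_param2 {
    unsigned int  veid;
    unsigned int  flags;
    unsigned long features_mask;
    unsigned long total_vcpus;
};

/* Stand-in for the env_create_data ioctl: 0 on success, -EINVAL if the
 * kernel does not understand a parameter block of this size. */
typedef int (*sk_ioctl_fn)(const void *arg, size_t len);

static int sk_old_kernel(const void *arg, size_t len)
{
    (void)arg;
    return len == sizeof(struct sk_env_create_param) ? 0 : -EINVAL;
}

static int sk_new_kernel(const void *arg, size_t len)
{
    (void)arg;
    return (len == sizeof(struct sk_env_create_param) ||
            len == sizeof(struct sk_env_create_param2)) ? 0 : -EINVAL;
}

/* Try the new parameter block first, fall back to the old one on -EINVAL. */
static int sk_env_create(sk_ioctl_fn do_ioctl, unsigned int veid, int want_sysfs)
{
    struct sk_env_create_param2 p2;
    struct sk_env_create_param p1;
    int ret;

    memset(&p2, 0, sizeof(p2));
    p2.veid = veid;
    if (want_sysfs)
        p2.features_mask |= SK_VE_FEATURE_SYSFS;
    ret = do_ioctl(&p2, sizeof(p2));
    if (ret != -EINVAL)
        return ret;

    /* Older kernel: retry with the original, shorter parameter block. */
    memset(&p1, 0, sizeof(p1));
    p1.veid = veid;
    return do_ioctl(&p1, sizeof(p1));
}
```

Note that on the fallback path the sysfs request is silently dropped, since the old parameter block has no field to carry it.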


Patch from Pavel:

If alloc_vpid() for init fails ipcs are not cleaned up in do_env_create().


Patch from mainstream:
[PATCH] VFS: local denial-of-service with file leases

[CVE-2005-3857, RedHat Bug #174337]


Patch from mainstream:
[PATCH] NLS: Fix overflow of nls_ascii (minor)

The nls_ascii conversion table is just for 128 entries, but should be 256.

Signed-off-by: OGAWA Hirofumi <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream:
[PATCH] Fix EDID_INFO in zero-page

(faced during Intel certification)

EDID_INFO is encroaching on the space meant for E820 map in zero-page. This will result in E820 map corruption on any system that has more than 18 E820 entries and CONFIG_VIDEO_SELECT. Not sure how this bug managed to hide for more than a year.

Attached patch should fix the bug.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream:

This patch fixes an error in the UHCI driver that caused messages like "host controller process error, something bad happened!"

Bug 55401.


Patch from mainstream:
[PATCH] make sync_dirty_buffer() return something useful

Make sync_dirty_buffer() return the result of its syncing.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

added submit_bh() return 0, see diff-ms-ide-writebarrier


Patch from mainstream:
[TIME]: Put jiffies_to_usecs in time.h

Move the local version put in tcp_diag.c into time.h where it belongs. Also, make it smarter about HZ value math.

Based upon suggestions from Joe Perches <>

Signed-off-by: Stephen Hemminger <>
Signed-off-by: David S. Miller <>
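
What "smarter about HZ" could mean is sketched below: when the tick rate divides one million evenly, the conversion can multiply by a compile-time constant instead of multiplying first and dividing afterwards. This is an assumed simplification, not the code from the patch; SK_HZ is a hypothetical tick rate of 250.

```c
#define SK_HZ 250UL   /* assumed tick rate */

/* Convert ticks to microseconds. The exact-division branch avoids the
 * large intermediate product of the generic branch. */
static unsigned long sk_jiffies_to_usecs(unsigned long j)
{
#if (1000000UL % SK_HZ) == 0
    return (1000000UL / SK_HZ) * j;
#else
    return (j * 1000000UL) / SK_HZ;
#endif
}
```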


Patch from mainstream (was in kernel-ve already):
[PATCH] beginning of endianness annotations

This adds the types and annotates conversion functions. I've converted the ...p() versions to inlines; AFAICS, everything's still happy...

Signed-off-by: Al Viro <>
Signed-off-by: Linus Torvalds <>




IA64 bug fixes and RPM build fix.


Patch from mainstream:

[IA64] page_not_present fault in region 5 is normal

When copying data from user-space to kernel-space by __copy_user(), a page_not_present fault sometimes occurs at vmalloced kernel address because of VHPT pre-fetching.

Ignore the page_not_present fault in ia64_do_page_fault() before jumping into exception handlers.

Signed-off-by: Kiyoshi Ueda <>
Signed-off-by: Jun'ichi Nomura <>
Signed-off-by: Tony Luck <>

GIT: 63028aa7f581d9d4e6889f9dc06ded2534250a76
RHEL4u2: linux-2.6.9-ia64-handle-page-not-present.patch


Patch from mainstream:
[IA64] Fill holes in FIXADDR_USER space with zero pages.

This fixes an oops reported by Jason Baron.

Signed-off-by: David Mosberger-Tang <>
Signed-off-by: Tony Luck <>

GIT: ad597bd518559f59ede8d01262cdf4467e13282e
RHEL4u2: linux-2.6.9-ia64-map-gate-page.patch


Patch from mainstream:
[IA64] Fix race condition in the rt_sigprocmask fastcall

current->blocked will be set to the value of current->thread_info->flags if the cmpxchg to update thread_info->flags fails. For performance reasons the store into current->blocked was placed in the cmpxchg loop. However, the cmpxchg overwrites the register holding the value to be stored. In the rare case of a retry the value of thread_info->flags will be written into current->blocked.

The fix is to use another register so that the register containing the current->blocked value is not overwritten.

Signed-off-by: Christoph Lameter <>
Signed-off-by: Tony Luck <>

RHEL4u2: linux-2.6.9-ia64-sigprocmask-race.patch
Bug 58363.


Patch from mainstream:
[PATCH] arch hook for notifying changes in PTE protections bits

Recently on IA-64, we have found an issue where old data could be used by apps. The sequence of operations, which includes a few mprotects from user space (glibc), goes like this:

  1. The text region of an executable is mmaped using PROT_READ|PROT_EXEC. As a result, a shared page is allocated to user.
  2. User then requests the text region to be mprotected with PROT_READ|PROT_WRITE. Kernel removes the execute permission and leaves the read permission on the text region.
  3. Subsequent write operation by user results in a page fault, eventually resulting in a COW break. User gets a new private copy of the page. At this point kernel marks the new page for deferred flush.
  4. User then requests the text region to be mprotected back with PROT_READ|PROT_EXEC. The mprotect support code in the kernel flushes the caches, updates the PTEs and then flushes the TLBs. However, after updating the PTEs with new permissions, we don't let the arch-specific code know about the new mappings (through an update_mmu_cache-like routine). IA-64 typically uses update_mmu_cache to check for the deferred flush flag (that got set in step 3) to maintain cache coherency lazily (the local I and D caches on IA-64 are incoherent).

DavidM suggested that we would need to add a hook in the function change_pte_range in mm/mprotect.c. This would let the architecture-specific code look at the new PTEs to decide if it needs to update any other architectural/kernel state based on the updated (new permissions) PTE values.

We have added a new hook lazy_mmu_prot_update(pte_t) that gets called when protection bits in PTEs change. This hook provides an opportunity for arch-specific code to do the needful. On IA-64 this will be used for lazily making the I and D caches coherent.

Signed-off-by: David Mosberger <>
Signed-off-by: Rohit Seth <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ia64-update-mmu-cache.patch


Patch from mainstream:
[IA64-HP] Fix for bits_wanted in sba_iommu.c

bits_wanted is expanded to bytes using the wrong shift value (when iovp_shift != PAGE_SHIFT), resulting in an explosion of used iommu resources.

This potentially results in mistakenly running out of DMA mapping resources when the system is under *heavy* i/o load.

Signed-off-by: Nigel Croxon <>
Signed-off-by: Alex Williamson <>
Signed-off-by: Tony Luck <>

RHEL4u2: linux-2.6.9-ia64-sba_iommu-size.patch


Patch from mainstream:
[IA64] When we exhaust the supply of records to read, clear the event status.

Patch written by Ben Woodard. Sanity checked by Jesse Barnes.

Signed-off-by: Tony Luck <>
Bug 58393.


Patch from mainstream, modified by Kirill:
[PATCH] ps shows wrong ppid

/proc shows the wrong PID as parent in the following case

Process A creates Threads 1 & 2 (using pthread_create). Thread 2 then forks and execs process B. getppid() for Process B shows Process A (rightly) as parent, however /proc/B/status shows Thread 3 as PPid (incorrect).

Signed-off-by: Dinakar Guniguntala <>
Acked-by: Ingo Molnar <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-procfs-getpid-fix.patch

diff-2.6.9-ext2, diff-2.6.9-ext3, diff-2.6.9-jbd

Patch from mainstream:
some trivial changes from 2.6.8 to 2.6.9 in ext2/ext3/jbd


Patch from mainstream:
[PATCH] fix for prune_icache()/forced final iput() races

Based on analysis and a patch from Russ Weight <>

There is a race condition that can occur if an inode is allocated and then released (using iput) during the ->fill_super functions. The race condition is between kswapd and mount.

For most filesystems this can only happen in an error path when kswapd is running concurrently. For isofs, however, the error can occur in a more common code path (which is how the bug was found).

The logic here is "we want final iput() to free inode *now* instead of letting it sit in cache if fs is going down or had not quite come up". The problem is with kswapd seeing such inodes in the middle of being killed and happily taking over.

The clean solution would be to tell kswapd to leave those inodes alone and let our final iput deal with them. I.e. add a new flag (I_FORCED_FREEING), set it before write_inode_now() there and make prune_icache() leave those alone.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 991114c6fa6a21d1fa4d544abe78592352860c82
RHEL4u2: linux-2.6.9-prune-icache-vs-iput.patch


Patch from Vasiliy:
Simple virtualization of vpids in autofs/autofs4


Patch from Vasiliy:
Simple virtualization of autofs4


Patch from Vasiliy:
Simple virtualization of autofs


Patch from mainstream:
[PATCH] Adjust alignment of pagevec structure

We can shrink the pagevec structure to cacheline align it. It is used all over VM reclaiming and mpage pagecache read code.

Right now it is 140 bytes on 64-bit and 72 bytes on 32-bit. That's just a little bit more than a power of 2 (which will cacheline align), so shrink it to be aligned: 64 bytes on 32-bit and 128 bytes on 64-bit.

It now occupies two cachelines most of the time instead of three.
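
The size arithmetic can be checked with a sketch. The layout below is an assumption, not taken from this page: two unsigned long fields followed by 14 page pointers, which adds up to 16 machine words when long and pointers have the same width. That is 64 bytes on a 32-bit target and 128 bytes on a typical LP64 target. All names are hypothetical.

```c
#include <stddef.h>

#define SK_PAGEVEC_SIZE 14      /* assumed number of page slots */

struct sk_page;                 /* opaque; only pointers to it are stored */

/* Assumed layout: 2 longs + 14 pointers = 16 machine words. */
struct sk_pagevec {
    unsigned long nr;
    unsigned long cold;
    struct sk_page *pages[SK_PAGEVEC_SIZE];
};

/* Size of the structure expressed in pointer-sized words. */
static size_t sk_pagevec_words(void)
{
    return sizeof(struct sk_pagevec) / sizeof(void *);
}
```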


Patch from Kirill:

This patch adds total_vcpus parameter to env_create_param2 structure. It will be used to setup max number of VCPUs available in VPS. Right now should be simply zero. Added for binary compatibility with future vzctl's.


Patch from mainstream:
[PATCH] ext3: online resizing

The patch below adds online resize capability to ext3 based on Andreas patch for 2.4 and fixed up by Stephen.

The patch also removes s_debts:

s_debts is currently not used by ext3 (it is created, destroyed and checked but never set). Remove it for now.

Resurrecting this will require adding it back in changed form. In existing form it's already unsafe wrt. byte-tearing as it performs unlocked byte increment/decrement on words which may be being accessed simultaneously on other CPUs. It is also the only in-memory dynamic table which needs to be extended by online-resize, so locking it will require care.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.5-ext3-online-resize.patch


Patch from RHEL4/mainstream:
[PATCH] Sync in core time granularity with filesystems

This patch corrects a problem that was originally added with the nanosecond timestamps in stat patch. The problem is that some file systems don't have enough space in their on-disk inode to save nanosecond timestamps, so they truncate the c/a/mtime to seconds when flushing a dirty inode. In core the inode would have full jiffies granularity.

This can be observed by programs as a timestamp that jumps backwards under specific loads when an inode is flushed and then reloaded from disk.

The problem was already known when the original patch went in, but it wasn't deemed important enough at that time. So far there has been only one report of it causing problems. Now Tridge is worried that it will break running Excel over samba4 because Excel seems to do very anal timestamp checking and samba4 will supply 100ns timestamps over the network.

This patch solves it by putting the time resolution into the superblock of a fs and always rounding the in core timestamps to that granularity.

This also supersedes some previous ext2/3 hacks to flush the inode less often when only the subsecond timestamp changes.

I tried to keep the overhead low, in particular it tries to keep divisions out of fast paths as far as possible.

The patch is quite big but 99% of it is just relatively straightforward search'n'replace in a lot of fs. Unconverted filesystems will default to a 1ns granularity, but may still show the problem if they continue to use CURRENT_TIME. I converted all in tree fs.

One possible future extension of this would be to have two time granularities per superblock: one that specifies the visible resolution, and the other to specify how often timestamps should be flushed to disk, which could be tuned with a mount option per fs (e.g. often m/atimes don't need to be flushed every second). Would be easy to do as an addon if someone is interested.

Signed-off-by: Andi Kleen <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ext3-sub-second-timestamp.patch


Patch from mainstream:
[PATCH] ext2/3 file limits to avoid overflowing i_blocks

As discussed before, we can overflow i_blocks in ext2/ext3 inodes by growing a file up to 2TB. That gives us 2^32 sectors of data in the file; but once you add on the indirect tree and possible EA/ACL metadata, i_blocks will wrap beyond 2^32. Consensus seemed to be that the best way to avoid this was simply to stop files getting so large that this was a problem in the first place; anything else would lead to complications if a sparse file tried to overflow that 2^32 sector limit while filling in holes.

I wrote a small program to calculate the total indirect tree overhead for any given file size, and 0x1ff7fffe000 turned out to be the largest file we can get without the total i_blocks overflowing 2^32.

But in testing, that *just* wrapped --- we need to limit the file to be one page smaller than that to deal with the possibility of an EA/ACL block being accounted against i_blocks.

So this patch has been tested, at least on ext3, by letting a file grow densely to its maximum size permitted by the kernel; at 0x1ff7fffe000, stat shows the file to have wrapped back exactly to 0 st_blocks, but with the limit at 0x1ff7fffd000, du shows it occupying the expected 2TB-blocksize bytes.

Signed-off-by: Stephen Tweedie <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ext3-file-limit.patch
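
The i_blocks arithmetic in this entry can be reproduced with a short calculation. This is not the author's program; it is a sketch that assumes 4 KiB blocks, 4-byte block pointers (1024 per indirect block), 12 direct pointers, 512-byte sectors and a densely written file. All names are hypothetical.

```c
#include <stdint.h>

#define SK_BLOCK_SIZE        4096ULL
#define SK_PTRS_PER_BLOCK    (SK_BLOCK_SIZE / 4)      /* 1024 */
#define SK_DIRECT_BLOCKS     12ULL
#define SK_SECTORS_PER_BLOCK (SK_BLOCK_SIZE / 512)    /* 8 */

static uint64_t sk_div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/* Indirect (metadata) blocks needed to map data_blocks of a dense file. */
static uint64_t sk_indirect_blocks(uint64_t data_blocks)
{
    const uint64_t p = SK_PTRS_PER_BLOCK;
    uint64_t meta = 0, left = data_blocks, dbl;

    if (left <= SK_DIRECT_BLOCKS)
        return 0;
    left -= SK_DIRECT_BLOCKS;

    meta += 1;                          /* single indirect block */
    if (left <= p)
        return meta;
    left -= p;

    dbl = left < p * p ? left : p * p;  /* double indirect tree */
    meta += 1 + sk_div_round_up(dbl, p);
    left -= dbl;
    if (left == 0)
        return meta;

    /* triple indirect tree: top block, second level, third level */
    meta += 1 + sk_div_round_up(left, p * p) + sk_div_round_up(left, p);
    return meta;
}

/* Total 512-byte sectors accounted for a file of file_size bytes plus
 * extra_blocks additional blocks (for example one EA/ACL block). */
static uint64_t sk_total_sectors(uint64_t file_size, uint64_t extra_blocks)
{
    uint64_t data = sk_div_round_up(file_size, SK_BLOCK_SIZE);

    return (data + sk_indirect_blocks(data) + extra_blocks)
           * SK_SECTORS_PER_BLOCK;
}
```

Under these assumptions a file of 0x1ff7fffe000 bytes accounts for 2^32 - 8 sectors on its own, and one extra EA/ACL block brings it to exactly 2^32, which a 32-bit counter stores as 0. With the limit at 0x1ff7fffd000 the same extra block still fits.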


Patch from mainstream:
[PATCH] ext3: handle attempted double-delete of metadata.

This patch improves ext3's ability to deal with corruption on-disk. If we try to delete a metadata block twice, we confuse ext3's internal revoke error-checking, resulting in a BUG(). But this can occur in practice due to a corrupt indirect block, so we should attempt to fail gracefully.

Downgrade the assert failure to a JH_EXPECT_BH failure, and return EIO when it occurs.

This is easily reproduced with a sample ext3 fs image containing an inode which references the same indirect block more than once. Deleting that inode will BUG() an unfixed kernel with:

Assertion failure in journal_revoke() at fs/jbd/revoke.c:379:

With the fix, ext3 recovers gracefully.

Signed-off-by: Stephen Tweedie <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ext3-handle-double-revoke.patch


Patch from mainstream:
[PATCH] ext3: handle attempted delete of bitmap blocks.

This patch improves ext3's ability to deal with corruption on-disk. If we ever get a corrupt inode or indirect block, then an attempt to delete it can end up trying to remove any block on the fs, including bitmap blocks. This can cause ext3 to assert-fail as we end up trying to do an ext3_forget on a buffer with b_committed_data set.

The fix is to downgrade this to an IO error and journal abort, so that we take the filesystem readonly but don't bring down the whole kernel.

Make J_EXPECT_JH() return a value so it can be easily tested and yet still retained as an assert failure if we build ext3 with full internal debugging enabled. Make journal_forget() return an error code so that in this case the error can be passed up to the caller.

This is easily reproduced with a sample ext3 fs image containing an inode whose direct and indirect blocks refer to a block bitmap block. Allocating new blocks and then deleting that inode will BUG() with:

Assertion failure in journal_forget() at fs/jbd/transaction.c:1228:

With the fix, ext3 recovers gracefully.

Signed-off-by: Stephen Tweedie <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ext3-handle-bitmapdel.patch


Patch from mainstream:
[PATCH] ext3: cleanup handling of aborted transactions.

This patch improves ext3's error logging when we encounter an on-disk corruption. Previously, a transaction (such as a truncate) which encountered many corruptions (eg. a single highly-corrupt indirect block) would emit copious "aborting transaction" errors to the log.

Even worse, encountering an aborted journal can count as such an error, leading to a flood of spurious "aborting transaction: Journal has aborted" errors.

With the fix, only emit that message on the first error. The patch also restores a missing \n in that printk path.

Signed-off-by: Stephen Tweedie <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

RHEL4u2: linux-2.6.9-ext3-cleanup-abort.patch


Patch from mainstream:
[PATCH] i386: fix hpet for systems that don't support legacy replacement

Currently the i386 HPET code assumes the entire HPET implementation from the spec is present. This breaks on boxes that do not implement the optional legacy timer replacement functionality portion of the spec.

This patch, which is very similar to my x86-64 patch for the same issue, fixes the problem, allowing i386 systems that cannot use the HPET for the timer interrupt and RTC to still use the HPET as a time source. I've tested this patch on systems without HPET, with HPET but without legacy timer replacement, as well as HPET with legacy timer replacement.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 35492df5ae0f36f717448b2aea908d3a8891d1c4
RHEL4u2: linux-2.6.9-hpet-legacy.patch

Bug 58958.


Patch from Kirill:

This patch fixes error path in do_env_create(). It doesn't set current vpid=1 until the final enter.

Bug 59123.


Patch from Kirill:

This patch fixes an unkillable vzctl caused by a bug in the error path on VPS create. vzctl changed its vpid to 1, but failed to enter the VPS, so it was ignoring SIGKILL.

Error path in do_env_create() should be fixed separately.

Bug 59123.


Patch from mainstream:
[PATCH] i386: fix hpet for systems that don't support legacy replacement

Currently the i386 HPET code assumes the entire HPET implementation from the spec is present. This breaks on boxes that do not implement the optional legacy timer replacement functionality portion of the spec.

This patch, which is very similar to my x86-64 patch for the same issue, fixes the problem, allowing i386 systems that cannot use the HPET for the timer interrupt and RTC to still use the HPET as a time source. I've tested this patch on systems without HPET, with HPET but without legacy timer replacement, as well as HPET with legacy timer replacement.

Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

GIT: 35492df5ae0f36f717448b2aea908d3a8891d1c4

Bug 58958.


Patch from mainstream:
[PATCH] x86: HPET setup, duplicate HPET_T0_CMP needed for some platforms

This patch fixes the issue with HPET on some platforms.

According to Vojtech Pavlik:
The first write after writing TN_SETVAL to the config register sets the counter value, the second write sets the threshold.

When you only do the first write you never set the threshold and interrupts won't be generated properly.

Thanks to John Stultz and Andrew Walrond for reporting, root causing the issue and verifying this fix.

Signed-off-by: Venkatesh Pallipadi <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>

Bug 58958.


Patch from mainstream:

This patch fixes an infinite loop in __find_get_block_slow() and the messages "__find_get_block_slow() failed".

Bug 58971.


Patch from mainstream:
[PATCH] Fix bounced bio and dm panic

Make sure that a bio doesn't contain NULL pages in the front of its vec, if a device bounces a bio that doesn't start from 0.

Problem noted by Mark Haverkamp.

Signed-off-by: Jens Axboe <>
Signed-off-by: Linus Torvalds <>


Patch from mainstream:
[PATCH] fix highmem bouncing leaking pages

In highmem end_io handling, we need to iterate over the completed bio from 0, not bio->bi_idx. If not we leak N-1 pages for any bio with N pages where N > 1.

Signed-off-by: Jens Axboe <>
Signed-off-by: Linus Torvalds <>

Bug 58893.


Patch from Andrey:

This patch fixes an illegal __GFP_FS allocation inside an ext3 transaction in ext3_symlink. Such an allocation may re-enter ext3 code from try_to_free_pages. But JBD/ext3 code keeps a pointer to the current journal handle in task_struct and, hence, is not reentrant. Patch not tested.

Bug 59062.


Patch from Vasiliy Tarasov:
Add empty /proc/modules file inside VPS

Bug 59053.


Patch from mainstream:
[PATCH] Make sure interleave masks have at least one node set (minor)

Otherwise a bad mem policy system call can confuse the interleaving code into referencing undefined nodes.

Originally reported by Doug Chapman

I was told it's CVE-2005-3358 (one has to love these security people - they make everything sound important)

Signed-off-by: Andi Kleen <>
Signed-off-by: Linus Torvalds <>
GIT: 8f493d797bc1fe470377adc9d8775845427e240e

RHEL4u2: linux-2.6.9-CVE-2005-3358-mempolicy.patch


Patch from Alexander, modified by Denis:

This patch virtualizes the kernel version in `uname -a` and in the VPS /proc/version. It is inherited on VPS start from /proc/sys/kernel/virt_osrelease

Bug 59227.