When you have an oops

From OpenVZ Virtuozzo Containers Wiki

What is an oops?

Sometimes, because of a bug or bad hardware, an oops occurs in the Linux kernel. It means that something the kernel did not expect has happened. A typical example is a pointer inside the kernel that equals NULL while the kernel code is written on the assumption that the pointer is always valid, so the code uses it without checking. The CPU tries to dereference the pointer, cannot do so, and raises a fault to the kernel, which then prints an error message. Once a system has experienced an oops, various internal resources may no longer be accounted for. Memory leaks may have occurred, as well as other undesirable side effects from the active task being killed.

Have I had an oops?

You can tell that an oops has happened by the error message, which is displayed on the system console. Usually a log daemon (klogd, syslogd, etc.) is also running on your system, so the error message can be found in the logs as well, commonly in /var/log/messages. Below is an example of a real error message caused by a real oops:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c018c3c8
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: vznetdev vzmon af_packet simfs vfat fat loop vzdquota
ipt_length ipt_ttl ipt_tcpmss ipt_TCPMSS iptable_mangle ipt_multiport ipt_limit
ipt_tos ipt_REJECT iptable_filter ip_tables i2c_dev i2c_core sunrpc vzdev
thermal processor fan button battery asus_acpi ac uhci_hcd ehci_hcd usbcore
e100 mii floppy
CPU:    0, VCPU: 0:0
EIP:    0060:[<c018c3c8>]    Not tainted
EFLAGS: 00010203   (2.6.8-022stab077.1) 
EIP is at vfs_quota_on_file+0x1f8/0x350
eax: 00000000   ebx: f3fd6ca8   ecx: f3f8b124   edx: 00000000
esi: f3fd6c00   edi: c6500e40   ebp: f4c2a19c   esp: c0c9fe40
ds: 007b   es: 007b   ss: 0068
Process quotaon (pid: 2802, veid=0, threadinfo=c0c9f000 task=ec39ecc0)
Stack: f3fd6c00 00000000 00000101 f3fd6ccc f3f8b124 00000022 ffffffea f3f8b0b4 
       f3d8e89c 00000010 c6500e40 f32ac000 00800002 00000002 c018c578 c6500e40 
       00000000 00000002 f3fd6c00 f32ac000 c01d8681 f3fd6c00 00000000 00000002 
Call Trace:
 [<c018c578>] vfs_quota_on+0x58/0x80
 [<c01d8681>] ext3_quota_on+0xb1/0x100
 [<c016d79b>] link_path_walk+0x76b/0xd30
 [<c016c8a6>] getname+0x76/0xc0
 [<c018e9c2>] do_quotactl+0x292/0x520
 [<c0176df5>] dput+0x25/0x30
 [<c016cb75>] path_release+0x15/0x50
 [<c0167eeb>] lookup_bdev+0x6b/0xc0
 [<c01345dc>] uncharge_dcache+0x2c/0x40
 [<c016c8a6>] getname+0x76/0xc0
 [<c018ecc0>] quota_get_sb+0x70/0x80
 [<c018f56d>] sys_quotactl+0x8d/0xd9
 [<c03fc2ef>] syscall_call+0x7/0xb
Code: ff 10 85 c0 0f 84 20 01 00 00 8b 4c 24 1c ba 01 00 ff ff 8b

The error message contains information that helps determine the cause of the oops: the contents of the registers, information about the process that caused the oops, and the contents of the stack. The call trace is a decoded stack that lets developers understand how the kernel arrived at the oops.
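
To make those fields easier to spot, here is a minimal sketch that pulls the faulting location, the process, and the call trace symbols out of an oops message saved to a text file. The function name oops_summary is hypothetical, and the patterns are modelled only on the i386 example above; other kernels and architectures may format these lines differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize an oops message stored in a text file.
# The patterns follow the i386 example in this article and are an
# assumption for any other kernel version or architecture.
oops_summary() {
    file=$1

    # Faulting location: the symbol+offset/length after "EIP is at".
    sed -n 's/.*EIP is at \([^ ]*\).*/fault: \1/p' "$file"

    # Process that was running: name and pid from the "Process" line.
    sed -n 's/.*Process \([^ ]*\) (pid: \([0-9]*\).*/process: \1 (pid \2)/p' "$file"

    # Call trace entries look like "[<address>] symbol+0xoff/0xlen";
    # print only the symbol name, in the order the kernel listed them.
    sed -n 's/.*\[<[0-9a-f]*>] \([^ ][^ ]*\)+0x.*/trace: \1/p' "$file"
}
```

Calling oops_summary on a file holding the example above would report vfs_quota_on_file+0x1f8/0x350 as the fault, quotaon (pid 2802) as the process, and one trace line per call trace entry, starting with vfs_quota_on.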

To determine whether your system has had an oops or not, grep your logs:

grep -E "Call Trace|Code" /var/log/messages*
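
The pattern above also matches any unrelated log line that happens to contain "Code". As a narrower variant, the sketch below counts oops header lines per log file. The function name count_oopses is hypothetical, the pattern is modelled on the "Oops: 0000 [#1]" header from the example above, and the log files are assumed to be uncompressed text.

```shell
#!/bin/sh
# Hypothetical helper: print each readable log file that contains at
# least one oops header line, together with the number of headers found.
# Assumption: the header looks like "Oops: 0000 [#1]", as in the example.
count_oopses() {
    for f in "$@"; do
        # Skip files that are missing or not readable by this user.
        [ -r "$f" ] || continue
        # grep -c prints the number of matching lines (0 if none).
        n=$(grep -c -E 'Oops: [0-9a-f]+ \[#?[0-9]+]' "$f")
        if [ "$n" -gt 0 ]; then
            echo "$f: $n"
        fi
    done
    return 0
}
```

A typical call would be count_oopses /var/log/messages*. No output means no oops header was found in the files given.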

If you have had an oops

If you have had an oops, the first thing to do is check your hardware, as described in the article Hardware testing. If all tests pass, then this is unfortunately a kernel bug, and we ask you to report it in our bug tracker: https://bugs.openvz.org. The report must contain:

  • Kernel version and architecture (output of uname -a command on the kernel that caused a problem)
  • In case you compiled your kernel yourself — your .config file
  • In case you have used some additional kernel patches — a link to those patches
  • Full text of kernel oops message
  • Description of how to reproduce the oops.

Note: some oopses are so fatal that they cannot be written to a log file. In that case, you should set up a remote console to catch the oops.
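
Part of the report can be gathered automatically. The following is a minimal sketch, not an official OpenVZ tool: the function name collect_oops_report, the default output file name, and the number of context lines kept around the oops header are all assumptions. It collects only the uname -a output and the oops text from the log. The .config file, links to extra patches, and the steps to reproduce still have to be added by hand.

```shell
#!/bin/sh
# Hypothetical helper: gather the machine-readable parts of an oops report.
# Usage: collect_oops_report [logfile] [outputfile]
collect_oops_report() {
    log=${1:-/var/log/messages}      # log location named in this article
    out=${2:-oops-report.txt}        # assumed output file name

    {
        echo "== Kernel version and architecture =="
        uname -a
        echo
        echo "== Kernel oops message =="
        if [ -r "$log" ]; then
            # Keep a few lines before the "Oops:" header (the fault
            # description) and enough lines after it for the registers,
            # stack, and call trace. The counts 5 and 45 are a guess
            # sized to the example in this article.
            grep -B 5 -A 45 'Oops: ' "$log" ||
                echo "(no oops header found in $log)"
        else
            echo "(cannot read $log)"
        fi
    } > "$out"

    echo "Report written to $out"
}
```

For example, collect_oops_report /var/log/messages oops-report.txt writes the collected text to oops-report.txt in the current directory. Run it while booted into the kernel that caused the problem, so that uname -a describes the right kernel, and check the result by eye before attaching it.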