When you have an oops

From OpenVZ Virtuozzo Containers Wiki

Revision as of 08:06, 8 June 2006

What is an oops?

Sometimes, because of a bug or bad hardware, an oops occurs in the Linux kernel. It means that an event the kernel did not expect has happened. A typical example is a pointer inside the kernel that equals NULL while the kernel code is written so that it uses the pointer without checking it. The CPU tries to dereference the pointer, cannot do so, and signals a fault to the kernel, which then prints an error message. Once a system has experienced an oops, various internal resources may no longer be accounted for. Memory leaks may have occurred, as well as other undesirable side effects from the active task being killed.

Have I had an oops?

You can tell that an oops has happened by its error message, which is displayed on the system console. Usually a log daemon (klogd, syslogd, etc.) is also running on the system, so the message can be found in the logs as well, commonly in /var/log/messages. Below is an example of a real error message caused by a real oops:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c018c3c8
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: vznetdev vzmon af_packet simfs vfat fat loop vzdquota
ipt_length ipt_ttl ipt_tcpmss ipt_TCPMSS iptable_mangle ipt_multiport ipt_limit
ipt_tos ipt_REJECT iptable_filter ip_tables i2c_dev i2c_core sunrpc vzdev
thermal processor fan button battery asus_acpi ac uhci_hcd ehci_hcd usbcore
e100 mii floppy
CPU:    0, VCPU: 0:0
EIP:    0060:[<c018c3c8>]    Not tainted
EFLAGS: 00010203   (2.6.8-022stab077.1) 
EIP is at vfs_quota_on_file+0x1f8/0x350
eax: 00000000   ebx: f3fd6ca8   ecx: f3f8b124   edx: 00000000
esi: f3fd6c00   edi: c6500e40   ebp: f4c2a19c   esp: c0c9fe40
ds: 007b   es: 007b   ss: 0068
Process quotaon (pid: 2802, veid=0, threadinfo=c0c9f000 task=ec39ecc0)
Stack: f3fd6c00 00000000 00000101 f3fd6ccc f3f8b124 00000022 ffffffea f3f8b0b4 
       f3d8e89c 00000010 c6500e40 f32ac000 00800002 00000002 c018c578 c6500e40 
       00000000 00000002 f3fd6c00 f32ac000 c01d8681 f3fd6c00 00000000 00000002 
Call Trace:
 [<c018c578>] vfs_quota_on+0x58/0x80
 [<c01d8681>] ext3_quota_on+0xb1/0x100
 [<c016d79b>] link_path_walk+0x76b/0xd30
 [<c016c8a6>] getname+0x76/0xc0
 [<c018e9c2>] do_quotactl+0x292/0x520
 [<c0176df5>] dput+0x25/0x30
 [<c016cb75>] path_release+0x15/0x50
 [<c0167eeb>] lookup_bdev+0x6b/0xc0
 [<c01345dc>] uncharge_dcache+0x2c/0x40
 [<c016c8a6>] getname+0x76/0xc0
 [<c018ecc0>] quota_get_sb+0x70/0x80
 [<c018f56d>] sys_quotactl+0x8d/0xd9
 [<c03fc2ef>] syscall_call+0x7/0xb
Code: ff 10 85 c0 0f 84 20 01 00 00 8b 4c 24 1c ba 01 00 ff ff 8b

The error message contains useful information for determining the cause of the oops: the contents of the registers, information about the process that caused the oops, and the contents of the stack. The call trace is the decoded stack, which lets developers understand how the kernel arrived at the oops. So, to determine whether your system has had an oops, you can grep the logs:

grep -E "Call Trace|Code" /var/log/messages*
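
Once the search finds an oops, it can help to list only the faulting locations. The following is a minimal sketch, assuming the i386 oops format shown in the example above (the "EIP is at" line) and log files readable by the current user. Other architectures format this line differently. The function name oops_sites is hypothetical and is not part of any OpenVZ tool.

```shell
#!/bin/sh
# oops_sites: hypothetical helper, not part of any OpenVZ tool.
# Prints "symbol+offset/size" for every oops found in the given log files,
# relying on the "EIP is at ..." line of the i386 oops format.
oops_sites() {
    # -h suppresses file names so only the matching text is printed;
    # sed then strips the syslog prefix and the "EIP is at " label.
    grep -h "EIP is at" "$@" 2>/dev/null |
        sed -e 's/.*EIP is at //'
}

# Usage: oops_sites /var/log/messages*
```

For the example message above, this would print the function in which the kernel failed together with the offset into it, which is usually the first thing a developer looks at.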

If you have had an oops

If you have had an oops, the first thing to do is to check your hardware, as described in the article Hardware testing. If all tests pass, then this is unfortunately a kernel bug, and we ask you to report it in our bugzilla: http://bugzilla.openvz.org. The report must contain:

  • Kernel version and architecture (output of the "uname -a" command on the kernel that caused the problem)
  • If you compiled the kernel yourself, your config file
  • The kernel message caused by the oops
  • Is it reproducible on your node? How?
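
Part of the list above can be gathered with a short script. This is a minimal sketch, not an official OpenVZ tool: the function name collect_oops_report, the default log path, and the number of context lines are all assumptions chosen for illustration. It must be run on the kernel that produced the oops, and it relies on the -A and -B options of grep, which GNU grep provides.

```shell
#!/bin/sh
# collect_oops_report: hypothetical helper that writes the kernel version
# and any logged oops messages to one file for attaching to a bug report.
#   $1 - output file
#   $2 - log file to search (assumed default: /var/log/messages)
collect_oops_report() {
    out=$1
    log=${2:-/var/log/messages}
    {
        echo "== Kernel version and architecture =="
        uname -a
        echo
        echo "== Oops messages from $log =="
        # Print each "Oops:" line with 5 lines before and 40 lines after it.
        # Both counts are arbitrary guesses meant to cover the whole message.
        grep -B 5 -A 40 "Oops:" "$log" 2>/dev/null || echo "(none found)"
    } > "$out"
}

# Usage: collect_oops_report /tmp/oops-report.txt /var/log/messages
```

The kernel config file is left out because its location depends on how the kernel was built, so attach it separately, along with a description of how to reproduce the problem.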