When you have an oops
What is an oops?
Sometimes, because of a bug or faulty hardware, an oops occurs in the Linux kernel. An oops means that the kernel encountered an event it did not expect. A typical example is a pointer inside the kernel that is NULL, while the surrounding code assumes the pointer is valid and uses it without checking. When the CPU tries to dereference such a pointer, the access faults and the CPU raises an exception; the kernel cannot handle it and produces an error message. Once a system has experienced an oops, some internal resources may no longer be accounted for. Memory may have leaked, and killing the task that was active at the time may have other undesirable side effects.
Have I had an oops?
You can tell that an oops has happened from the error message it produces, which is displayed on the system console. Most systems also run a logging daemon (klogd, syslogd, etc.), so the message can usually be found in the logs as well, commonly in /var/log/messages. Below is an example of a real error message produced by a real oops:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c018c3c8
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: vznetdev vzmon af_packet simfs vfat fat loop vzdquota ipt_length ipt_ttl ipt_tcpmss ipt_TCPMSS iptable_mangle ipt_multiport ipt_limit ipt_tos ipt_REJECT iptable_filter ip_tables i2c_dev i2c_core sunrpc vzdev thermal processor fan button battery asus_acpi ac uhci_hcd ehci_hcd usbcore e100 mii floppy
CPU: 0, VCPU: 0:0
EIP: 0060:[<c018c3c8>] Not tainted
EFLAGS: 00010203 (2.6.8-022stab077.1)
EIP is at vfs_quota_on_file+0x1f8/0x350
eax: 00000000 ebx: f3fd6ca8 ecx: f3f8b124 edx: 00000000
esi: f3fd6c00 edi: c6500e40 ebp: f4c2a19c esp: c0c9fe40
ds: 007b es: 007b ss: 0068
Process quotaon (pid: 2802, veid=0, threadinfo=c0c9f000 task=ec39ecc0)
Stack: f3fd6c00 00000000 00000101 f3fd6ccc f3f8b124 00000022 ffffffea f3f8b0b4
       f3d8e89c 00000010 c6500e40 f32ac000 00800002 00000002 c018c578 c6500e40
       00000000 00000002 f3fd6c00 f32ac000 c01d8681 f3fd6c00 00000000 00000002
Call Trace:
 [<c018c578>] vfs_quota_on+0x58/0x80
 [<c01d8681>] ext3_quota_on+0xb1/0x100
 [<c016d79b>] link_path_walk+0x76b/0xd30
 [<c016c8a6>] getname+0x76/0xc0
 [<c018e9c2>] do_quotactl+0x292/0x520
 [<c0176df5>] dput+0x25/0x30
 [<c016cb75>] path_release+0x15/0x50
 [<c0167eeb>] lookup_bdev+0x6b/0xc0
 [<c01345dc>] uncharge_dcache+0x2c/0x40
 [<c016c8a6>] getname+0x76/0xc0
 [<c018ecc0>] quota_get_sb+0x70/0x80
 [<c018f56d>] sys_quotactl+0x8d/0xd9
 [<c03fc2ef>] syscall_call+0x7/0xb
Code: ff 10 85 c0 0f 84 20 01 00 00 8b 4c 24 1c ba 01 00 ff ff 8b
The error message contains information that helps determine the cause of the oops: the contents of the CPU registers, the process that caused the oops, and the contents of the stack. The call trace is the decoded stack; it lets developers see how the kernel arrived at the oops.
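Each call trace entry, like the "EIP is at" line, has the form function+offset/size: vfs_quota_on_file+0x1f8/0x350 means the address lies 0x1f8 bytes into vfs_quota_on_file, whose code is 0x350 bytes long. The sketch below pulls these entries out of an oops so the chain of calls is easier to read, starting with the faulting function. It assumes the i386 layout shown above and a grep that supports -o (GNU and BSD grep do); call_trace_functions is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list "function+offset/size" entries from an oops read on stdin.
# Assumes the i386 oops layout ("EIP is at ..." and "[<address>] ..." entries)
# and a grep that supports -o (a GNU/BSD extension, not POSIX).

# call_trace_functions: hypothetical helper; prints the faulting function
# (from the "EIP is at" line) followed by each Call Trace entry, one per line.
call_trace_functions() {
    grep -oE '(EIP is at |\[<[0-9a-f]+>\] )[A-Za-z0-9_.]+\+0x[0-9a-f]+/0x[0-9a-f]+' |
        sed -E 's/^(EIP is at |\[<[0-9a-f]+>\] )//'   # drop the prefix, keep the entry
}
```

For example, after saving the oops text to a file, call_trace_functions < oops.txt prints vfs_quota_on_file+0x1f8/0x350, then vfs_quota_on+0x58/0x80, and so on. Because grep -o reports every match separately, this also works on an oops that was logged as one long line.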
To determine whether your system has had an oops, grep your logs:
grep -E "Call Trace|Code" /var/log/messages*
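The grep above reads rotated log files only if they are uncompressed. Many systems rotate old logs into gzip archives such as messages.1.gz (an assumption about your log rotation setup). A minimal sketch that runs the same search over both kinds of files and prefixes each match with its file name; find_oops is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: run the same search over current and rotated syslog files,
# including gzip-compressed ones (assumed names such as messages.1.gz).

OOPS_PATTERN='Call Trace|Code'

# find_oops [FILE...]: hypothetical helper; prints "file: matching line".
find_oops() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/messages*            # default to the usual location
    fi
    for f in "$@"; do
        [ -r "$f" ] || continue              # skip missing or unreadable files
        case "$f" in
            *.gz) gzip -dc "$f" ;;           # decompress rotated archives
            *)    cat "$f" ;;
        esac | grep -E "$OOPS_PATTERN" | while IFS= read -r line; do
            printf '%s: %s\n' "$f" "$line"
        done
    done
}
```

Run find_oops with no arguments to search /var/log/messages*, or pass other log files explicitly. If the oops happened since the last boot, dmesg | grep -E "Call Trace|Code" searches the kernel message buffer directly.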
If you have had an oops
If you have had an oops, the first thing to do is check your hardware, as described in the article Hardware testing. If all tests pass, then unfortunately this is a kernel bug, and we kindly ask you to report it in our bug tracker: https://bugs.openvz.org. The report must contain:
- Kernel version and architecture (the output of the uname -a command, run on the kernel that caused the problem)
- If you compiled the kernel yourself: your .config file
- If you applied any additional kernel patches: a link to those patches
- The full text of the kernel oops message
- A description of how to reproduce the oops
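Most items in the list above can be gathered with a short script. The sketch below is one possible approach, not an OpenVZ tool: collect_report and extract_oops are hypothetical names, and it assumes the kernel config is installed as /boot/config-<release> (common on many distributions, but not guaranteed), that the oops was written to a plain-text log file, and that each oops begins with an "Unable to handle kernel" or "Oops:" line and ends with the "Code:" line. Other oops formats may not match, so check the extracted text by hand.

```shell
#!/bin/sh
# Sketch of a helper that gathers the bug-report items listed above.
# Assumptions (not part of the OpenVZ instructions): the kernel config lives
# at /boot/config-<release>, the oops was logged to a plain-text syslog file,
# and an oops starts with "Unable to handle kernel" or "Oops:" and ends
# with the "Code:" line.

# extract_oops LOGFILE: print each oops block, from its first header line
# through the "Code:" line that closes it.
extract_oops() {
    awk '/Unable to handle kernel|Oops:/ { inside = 1 }
         inside                          { print }
         inside && /Code:/               { inside = 0 }' "$1"
}

# collect_report [OUTDIR] [LOGFILE]: hypothetical helper; run it while booted
# into the kernel that oopsed, so that uname -a describes that kernel.
collect_report() {
    outdir=${1:-oops-report}
    log=${2:-/var/log/messages}
    mkdir -p "$outdir" || return 1

    uname -a > "$outdir/uname.txt"           # kernel version and architecture

    cfg="/boot/config-$(uname -r)"
    if [ -r "$cfg" ]; then                   # only present on some systems
        cp "$cfg" "$outdir/config"
    fi

    extract_oops "$log" > "$outdir/oops.txt" # full oops text
    echo "Wrote $outdir; add patch links and reproduction steps by hand."
}
```

Links to any extra patches and the reproduction steps still have to be written by hand; the script only collects what the system itself can report.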
Note: Some oopses are so severe that they cannot be written to a log file. In that case, set up a remote console to capture the oops.