Sometimes a kernel panic, oops, or other fatal crash is not the programmers' fault. This article describes how to test your hardware properly to make sure it is in good shape.
Random Access Memory (RAM) is sometimes faulty, which leads to some very strange system crashes. It is therefore highly recommended to test your system RAM. Several approaches and tools can be used.
Memtest86 and Memtest86+
Memtest86 is a stand-alone RAM tester. It can either be booted from a CD, or from your normal Linux bootloader, such as GRUB or LILO.
Memtest86+ is a forked version of Memtest86 with some features added.
You can either download and install one of these programs from the sites above, or it may already be part of your Linux distribution.
For Fedora Core, memtest86+ is available:
yum install memtest86+
For Gentoo, both programs are available:
emerge memtest86
emerge memtest86+
To test your system for faulty RAM, install either program and reboot into it. Run it for at least a few hours (at least 2-3 iterations). If even a single error is reported, you have to replace your RAM modules (or, if your system is overclocked, return it to its normal speed).
Memtester is a userspace utility for testing the memory subsystem for faults. It is a part of some distributions.
For Fedora Core:
yum install memtester
The good thing is that you can test your memory without rebooting the server, and you can run other programs alongside it. The bad thing is that not all of the memory is tested.
Invoke memtester as root, giving the amount of memory to test as an argument, e.g.:
# /usr/sbin/memtester 512M
The more memory you specify, the better.
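Since the advice is to test as much memory as possible, the size can be derived from what is currently free rather than guessed. The following is only a sketch: pick_test_mb is a hypothetical helper, the 512 MB reserve and 64 MB minimum are arbitrary assumptions, and the commented usage assumes a kernel that reports MemAvailable in /proc/meminfo. The trailing 3 is memtester's optional loop count, chosen here as an example.

```shell
#!/bin/sh
# Sketch only: pick_test_mb is a hypothetical helper, not part of memtester.
# It converts an "available kB" figure into a test size in MB, keeping a
# reserve (default 512 MB, an assumption) for the rest of the system.
pick_test_mb() {
    avail_kb=$1
    reserve_mb=${2:-512}
    test_mb=$(( avail_kb / 1024 - reserve_mb ))
    # Fall back to a small minimum if very little memory is available.
    if [ "$test_mb" -lt 64 ]; then
        test_mb=64
    fi
    echo "$test_mb"
}

# Usage (as root), assuming /proc/meminfo reports MemAvailable:
#   avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
#   /usr/sbin/memtester "$(pick_test_mb "$avail_kb")M" 3
```

Leaving a reserve keeps the other programs on the server running while the test holds its memory.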
CPU cooling tests
Cpuburn is a utility that heats up your CPU as much as possible. It tests system stability by checking how the CPU and the whole system work under high temperatures.
For Fedora Core you can get the RPMs from DAG.
For other systems: download the tarball from the home page, untar it, and run.
It is recommended to switch to single-user mode and remount all partitions read-only, in case the system hangs. Then run
burnBX || echo $? &
for at least 15 minutes. If burnBX stops because of an error, its exit code is printed. If you have more than one physical CPU, repeat the command once for each additional CPU. If nothing happens within 15-20 minutes and your system is still responding, you can conclude that the test has passed and kill the process(es):
killall -TERM burnBX
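The procedure above (start one process per CPU, wait, then kill) can be wrapped in a small script. This is a rough sketch, not something shipped with cpuburn: run_burn is a hypothetical helper, and it assumes a shell such as bash or dash that reports a process killed by SIGTERM with exit status 143. It prints how many instances stopped on their own before the time was up, which would point to a hardware problem.

```shell
#!/bin/sh
# Sketch only: run_burn is a hypothetical helper, not part of cpuburn.
# Usage: run_burn COUNT SECONDS COMMAND [ARGS...]
run_burn() {
    count=$1      # number of instances to start (one per CPU)
    seconds=$2    # how long to let them run
    shift 2       # the remaining arguments are the command to run
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Send the command's output to stderr so that stdout carries
        # only the final count.
        "$@" >&2 &
        pids="$pids $!"
        i=$(( i + 1 ))
    done
    sleep "$seconds"
    failed=0
    for pid in $pids; do
        kill -TERM "$pid" 2>/dev/null || true
        status=0
        wait "$pid" 2>/dev/null || status=$?
        # 143 = 128 + SIGTERM: the process was still running when we
        # killed it. Anything else means it had already exited.
        if [ "$status" -ne 143 ]; then
            failed=$(( failed + 1 ))
        fi
    done
    echo "$failed"
}

# Example: two CPUs, 15 minutes (900 seconds):
#   run_burn 2 900 burnBX
```

A result of 0 means every instance was still running when it was killed.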
You can also use the burnMMX utility:
burnMMX J || echo $? &
The cpuburn author says burnMMX is not optimal for AMD processors; use burnBX if you have an AMD CPU.
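The choice between the two programs can be made from the CPU vendor string. The sketch below is illustrative only: pick_burn is a hypothetical helper, defaulting to burnMMX for non-AMD CPUs is an assumption rather than a recommendation from the cpuburn author, and the commented usage assumes an x86 system whose /proc/cpuinfo has a vendor_id field.

```shell
#!/bin/sh
# Sketch only: pick_burn is a hypothetical helper, not part of cpuburn.
# It maps a CPU vendor string to one of the two programs discussed above.
pick_burn() {
    case "$1" in
        AuthenticAMD) echo burnBX ;;   # burnMMX is not optimal on AMD
        *)            echo burnMMX ;;  # assumption: use burnMMX otherwise
    esac
}

# Usage, assuming an x86 /proc/cpuinfo:
#   vendor=$(awk -F': ' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo)
#   prog=$(pick_burn "$vendor")
```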